### Methods for Reducing Bias in Sampling

#### Randomization

The best technique for reducing bias in sampling is **randomization**. When a **simple random sample** of size \begin{align*}n\end{align*} (commonly referred to as an SRS) is taken from a population, all possible samples of size \begin{align*}n\end{align*} in the population have an equal probability of being selected for the sample.

If your statistics teacher wants to choose a student at random for a special prize, he or she could simply place the names of all the students in the class in a hat, mix them up, and choose one. More scientifically, your teacher could assign each student in the class a number from 1 to 25 (assuming there are 25 students in the class) and then use a computer or calculator to generate a random number to choose one student. This would be a simple random sample of size 1.
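The numbered-student approach can be sketched in a few lines of Python. This is only an illustration, not part of the calculator procedure: the function name `choose_one_student` is hypothetical, and the class size of 25 comes from the example above.

```python
import random

def choose_one_student(class_size, rng=None):
    """Return one student number from 1 to class_size, each equally likely."""
    rng = rng or random.Random()
    # randint includes both endpoints, so every number 1..class_size can occur
    return rng.randint(1, class_size)

winner = choose_one_student(25)
print(f"Student number {winner} wins the prize")
```

Because every student number has the same chance of being returned, this is a simple random sample of size 1.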

#### Systematic Sampling

There are other types of samples that are not simple random samples, and one of these is a systematic sample. In **systematic sampling**, after choosing a starting point at random, subjects are selected using a jump number. If you have ever chosen teams or groups in gym class by counting off by threes or fours, you were engaged in systematic sampling. The jump number is determined by dividing the population size by the desired sample size to ensure that the sample combs through the entire population. If we had a list of everyone in your class of 25 students in alphabetical order, and we wanted to choose 5 of them, the jump number would be 25 ÷ 5 = 5, so we would choose every \begin{align*}5^{\text{th}}\end{align*} student. Let's choose a starting point at random by generating a random number from 1 to 25. Suppose the result is 14.

In this case, we would start with student number 14 and then select every \begin{align*}5^{\text{th}}\end{align*} student until we had 5 in all. When we came to the end of the list, we would continue the count at number 1. Thus, our chosen students would be: 14, 19, 24, 4, and 9. It is important to note that this is not a simple random sample, as not every possible sample of 5 students has an equal chance of being chosen. For example, it is impossible to have a sample consisting of students 5, 6, 7, 8, and 9.
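The wrap-around counting described above can be sketched in Python. The function name `systematic_sample` is hypothetical, and the sketch assumes the population size divides evenly by the sample size, as it does in the class of 25.

```python
def systematic_sample(population_size, sample_size, start):
    """Select sample_size members by jumping through a numbered list.

    Members are numbered 1..population_size. Counting wraps around to 1
    after the end of the list is reached.
    """
    jump = population_size // sample_size  # the jump number
    return [
        (start - 1 + i * jump) % population_size + 1
        for i in range(sample_size)
    ]

# Starting point 14, as in the example above
print(systematic_sample(25, 5, start=14))  # [14, 19, 24, 4, 9]
```

In practice the starting point would itself be chosen at random, for example with `random.randint(1, 25)`. Whatever the start, the selected students are always 5 apart, which is why a sample such as 5, 6, 7, 8, and 9 can never occur.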

#### Cluster Sampling

**Cluster sampling** is when a naturally occurring group is selected at random, and then either all of that group, or randomly selected individuals from that group, are used for the sample. If we sample in more than one step, for example by selecting clusters at random and then selecting smaller subgroups or individuals at random from within those clusters, this is referred to as **multi-stage sampling**.

To survey student opinions or study their performance, we could choose 5 schools at random from your state and then use an SRS (simple random sample) from each school. If we wanted a national survey of urban schools, we might first choose 5 major urban areas from around the country at random, and then select 5 schools at random from each of these cities. This would be both cluster and multi-stage sampling. Cluster sampling is often done by selecting a particular block or street at random from within a town or city. It is also used at large public gatherings or rallies. If officials take a picture of a small, representative area of the crowd and count the individuals in just that area, they can use that count to estimate the total crowd in attendance.
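The two-step selection of urban areas and then schools can be sketched as follows. The city and school names are made-up placeholders, and `multi_stage_sample` is a hypothetical helper, so treat this as one possible illustration rather than a standard procedure.

```python
import random

def multi_stage_sample(clusters, n_clusters, n_per_cluster, rng=None):
    """Pick n_clusters clusters at random, then n_per_cluster units from each."""
    rng = rng or random.Random()
    # Stage 1: choose the clusters (urban areas) at random
    chosen = rng.sample(sorted(clusters), n_clusters)
    # Stage 2: choose units (schools) at random within each chosen cluster
    return {name: rng.sample(clusters[name], n_per_cluster) for name in chosen}

# Hypothetical data: 8 urban areas with 20 schools each
urban_areas = {
    f"City {c}": [f"City {c} School {s}" for s in range(1, 21)]
    for c in "ABCDEFGH"
}

sample = multi_stage_sample(urban_areas, n_clusters=5, n_per_cluster=5)
for city, schools in sample.items():
    print(city, schools)
```

Only schools from the 5 selected urban areas can appear in the result, which is what distinguishes this design from a simple random sample of all 160 schools.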

#### Stratified Sampling

In **stratified sampling**, the population is divided into groups, called strata (the singular term is 'stratum'), that have some meaningful relationship. Very often, different groups in a population respond differently to a survey, even when the members within each group are similar to one another. In order to help reflect the population, we stratify to ensure that each group is represented in the sample.

We often stratify by gender or race in order to make sure that the often divergent views of these different groups are represented. In a survey of high school students, we might choose to stratify by school to be sure that the opinions of different communities are included. If each school has an approximately equal number of students, then we could simply choose to take an SRS of size 25 from each school. If the numbers in each stratum are different, then it would be more appropriate to fix the total sample size (100 students, for example) and take a number from each school proportionate to that school's share of the total student population.
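The proportional split of a fixed total sample can be worked out with a short Python sketch. The school names and enrollments below are invented for illustration, and `proportional_allocation` is a hypothetical helper. When the shares are not whole numbers, this sketch rounds down and hands the leftover places to the strata with the largest fractional parts, which is one reasonable choice among several.

```python
def proportional_allocation(stratum_sizes, total_sample):
    """Split total_sample across strata in proportion to stratum size."""
    population = sum(stratum_sizes.values())
    exact = {k: total_sample * n / population for k, n in stratum_sizes.items()}
    allocation = {k: int(share) for k, share in exact.items()}  # round down
    leftover = total_sample - sum(allocation.values())
    # Give any remaining places to the largest fractional parts
    by_remainder = sorted(exact, key=lambda k: exact[k] - allocation[k], reverse=True)
    for k in by_remainder[:leftover]:
        allocation[k] += 1
    return allocation

# Hypothetical enrollments for four schools (4000 students in total)
school_sizes = {"School A": 1200, "School B": 800, "School C": 600, "School D": 1400}
print(proportional_allocation(school_sizes, 100))
# {'School A': 30, 'School B': 20, 'School C': 15, 'School D': 35}
```

An SRS of the allocated size would then be taken within each school.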

### Technology Note: Generating Random Numbers on the TI-83/84 Calculator

Your graphing calculator has a random number generator. Press **[MATH]** and move over to the **PRB** menu, which stands for probability. (Note: Instead of pressing the right arrow three times, you can just use the left arrow once!) Choose '1:rand' for the random number generator and press **[ENTER]** twice to produce a random number between 0 and 1. Press **[ENTER]** a few more times to see more results.

It is important that you understand that a calculator or computer following a fixed formula cannot produce true randomness. When you choose the 'rand' function, the calculator has been programmed to return a ten-digit decimal that, using a very complicated mathematical formula, simulates randomness. Each digit, in theory, is equally likely to occur in any of the individual decimal places. What this means in practice is that if you had the patience (and the time!) to generate a million of these on your calculator and keep track of the frequencies in a table, you would find there would be an approximately equal number of each digit.

However, two brand-new calculators will give the exact same sequences of random numbers! This is because the function that simulates randomness has to start at some number, called a *seed value*. All the calculators are programmed from the factory (or when the memory is reset) to use a seed value of zero. If you want to be sure that your sequence of random digits is different from everyone else’s, you need to seed your random number function using a number different from theirs. Type a unique sequence of digits on the home screen, press **[STO]**, enter the 'rand' function, and press **[ENTER]**. As long as the number you chose to seed the function is different from everyone else's, you will get different results.
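The effect of a seed value is easy to see in software as well. The sketch below uses Python's built-in generator, which uses a different formula from the calculator, so the actual numbers will not match a TI-83/84; the function name `first_draws` and the seed 12345 are arbitrary choices for illustration.

```python
import random

def first_draws(seed, count=5):
    """Return the first `count` numbers produced from a given seed."""
    rng = random.Random(seed)  # a generator started from this seed
    return [rng.random() for _ in range(count)]

# Two generators with the same seed produce identical sequences
print(first_draws(0) == first_draws(0))      # True

# A different seed leads to a different sequence
print(first_draws(0) == first_draws(12345))  # False
```

This mirrors the situation of two brand-new calculators: identical seeds give identical "random" numbers until one of them is reseeded.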

Now, back to our example. If we want to choose a student at random between 1 and 25, we need to generate a random integer between 1 and 25. To do this, press **[MATH][PRB]** and choose the 'randInt(' function.

The syntax for this command is as follows:

'randInt(starting value, ending value, number of random integers)'

The default for the last field is 1, so if you only need a single random integer, you can enter 'randInt(1,25)'.

In this example, the student chosen would be student number 7. If we wanted to choose 5 students at random, we could enter 'randInt(1,25,5)'.

However, because the probability of any number being chosen each time is independent from all other times, it is possible that the same student could get chosen twice, as student number 10 did in our example.

What we can do in this case is ignore any repeated numbers. Since student number 10 has already been chosen, we will ignore the second 10. Press **[ENTER]** again to generate 5 new random numbers, and choose the first one that is not in your original set.

In this example, student number 4 has also already been chosen, so we would select student number 14 as our fifth student.
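The "ignore repeats and draw again" procedure can be sketched in Python as a loop that keeps drawing until enough distinct students have been chosen. The function name `draw_without_repeats` is hypothetical; the range 1 to 25 and the sample size of 5 come from the example above.

```python
import random

def draw_without_repeats(low, high, count, rng=None):
    """Draw `count` distinct integers from low..high, skipping any repeats."""
    if count > high - low + 1:
        raise ValueError("cannot draw more distinct values than the range holds")
    rng = rng or random.Random()
    chosen = []
    while len(chosen) < count:
        candidate = rng.randint(low, high)
        if candidate not in chosen:  # ignore a number that was already chosen
            chosen.append(candidate)
    return chosen

students = draw_without_repeats(1, 25, 5)
print(students)
```

Python's standard library can also do this in one step with `random.sample(range(1, 26), 5)`, which never returns the same number twice.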

### Example

#### Example 1

In San Francisco, there are 5 Math Circle math clubs, each with a different number of students. If we wanted to do a study to determine whether these clubs improve the students' math performance, how would you design the study to reduce bias?

If you took an SRS of all students, you might get many students from one club. This might bias your results, depending on how different the clubs are from each other. In order to avoid bias from the differences between the clubs, you should take a stratified random sample of students, where the clubs are the strata. If one club has one tenth of the students in the total population of students in all math clubs, then approximately one tenth of your sample should come from that club.

### Review

For questions 1-5, an amusement park wants to know if its new ride, The Pukeinator, is too scary. Explain the type(s) of bias most evident in each sampling technique and/or what sampling method is most evident. Be sure to justify your choice.

1. The first 30 riders on a particular day are asked their opinions of the ride.
2. The name of a color is selected at random, and only riders wearing that particular color are asked their opinion of the ride.
3. A flier is passed out inviting interested riders to complete a survey about the ride at 5 pm that evening.
4. Every \begin{align*}12^{\text{th}}\end{align*} teenager exiting the ride is asked in front of his friends: “You didn’t think that ride was scary, did you?”
5. Five riders are selected at random during each hour of the day, from 9 AM until closing at 5 PM.

For questions 6-10, there are 35 students taking statistics in your school, and you want to choose some of them for a survey about their impressions of the course. Assume the students are assigned numbers from 1 to 35, and decide which students are chosen for each sample. Use your calculator to select a simple random sample of the size specified. Make sure to start with a different random seed each time.

6. An SRS of 10 students. (Seed your random number generator with the number 10 before starting.)
7. An SRS of 6 students. (Seed your random number generator with a different number before starting.)
8. An SRS of 5 students. (Seed your random number generator with a different number before starting.)
9. An SRS of 11 students. (Seed your random number generator with a different number before starting.)
10. An SRS of 3 students. (Seed your random number generator with a different number before starting.)

### Review (Answers)

To view the Review answers, open this PDF file and look for section 6.2.