12.8: Surveys and Samples
Learning Objectives
- Classify sampling methods.
- Identify biased samples.
- Identify biased questions.
- Design and conduct a survey.
- Display, analyze, and interpret statistical survey data.
Introduction
One of the most important applications of statistics is collecting information. Statistical studies are done for many purposes: A government agency may want to collect data on weather patterns. An advertisement company might seek information about what people buy. A consumer group could conduct a statistical study on gas consumption of cars, and a biologist may study primates to find out more about animal behaviors. All of these applications and many more rely on the collection and analysis of information.
One method to collect information is to conduct a census. In a census, information is collected on all the members of the population of interest. For example, when voting for a class president at school every person in the class votes, so this represents a census. With this method, the whole population is polled.
When the population is small (as in the case of voting for a class president) it is sensible to include everyone’s opinion. Conducting a census on a very large population can be very time-consuming and expensive. In many cases, a census is impractical. An alternate method for collecting information is by using a sampling method. This means that information is collected from a small sample that represents the population with which the study is concerned. The information from the sample is then extrapolated to the population.
Classify Sampling Methods
When a statistical study is conducted through a sampling method, we must first decide how to choose the sample. In statistics, the word population means the entire group that we wish to study, and it is essential that the sample be representative of that population. For example, if we are trying to determine the effect of a drug on teenage girls, it would make no sense to include males in our sample, nor would it make sense to include women who are not teenagers.
There are several methods for choosing a population sample from a larger group. The two main types of sampling are random sampling and stratified sampling.
Random sampling
This method simply involves picking people at random from the population we wish to poll. However, this does not mean we can simply ask the first 50 people to walk by in the street. For instance, if you are conducting a survey on people’s eating habits, you will get different results if you are standing in front of a fast-food restaurant than if you are standing in front of a health food store. In a true random sample, everyone in the population must have the same chance of being chosen.
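The idea of giving everyone the same chance can be sketched in a few lines of Python. This is only an illustration: the roster of 200 numbered students is hypothetical, and `simple_random_sample` is a helper name chosen for this example. The standard-library function `random.sample` draws without replacement, so no one is picked twice.

```python
import random

def simple_random_sample(population, sample_size, seed=None):
    """Draw sample_size distinct members; each member is equally likely to be chosen."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable for checking
    return rng.sample(population, sample_size)

# Hypothetical roster of 200 students, labeled by number.
roster = [f"student_{i}" for i in range(1, 201)]
chosen = simple_random_sample(roster, 50, seed=1)
print(len(chosen), "students chosen, for example:", chosen[:3])
```

Compare this with asking the first 50 people who walk by: there, people who never pass that spot have no chance of being chosen at all.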
Stratified Sampling
This method of sampling actively seeks to poll people from many different backgrounds. The population is first divided into different categories (or strata) and the number of members in each category is determined. Gender and age groups are commonly used strata, but others could include salary, education level or even hair color. Each person in a given stratum must share that same characteristic. The sample is made up of members from each category in the same proportion as they are in the population. Imagine you are conducting a survey that calls for a sample size of 100 people. If it is determined that 10% of the population are males between the ages of 10 and 25, then you would seek 10 males in that age group to respond. Once those 10 have responded no more males between 10 and 25 may take part in the survey.
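The quota for each stratum is simply its share of the population multiplied by the sample size. Here is a minimal Python sketch of that calculation. Only the 10% figure for males aged 10 to 25 comes from the text; the other strata and shares are made up for illustration, and `stratum_quotas` is a hypothetical helper name.

```python
def stratum_quotas(shares, sample_size):
    """Return how many respondents to seek from each stratum.

    shares maps each stratum to its fraction of the population.
    Rounding can make the quotas add up to slightly more or less
    than sample_size, so the total should be checked afterwards.
    """
    return {name: round(share * sample_size) for name, share in shares.items()}

# Hypothetical shares; only the first value is taken from the text.
shares = {
    "males 10-25": 0.10,
    "females 10-25": 0.11,
    "males over 25": 0.38,
    "females over 25": 0.41,
}
quotas = stratum_quotas(shares, 100)
print(quotas)
print("total:", sum(quotas.values()))
```

Once a stratum's quota has been filled, further volunteers from that stratum are turned away, as described above.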
Sample Size
In order for sampling to work well, the sample size must be large enough so as to lessen the effect of a biased sample. For example, if you randomly sample 6 children, there is a chance that many or all of them will be boys. If you randomly sample 6,000 children it is far more likely that they will be approximately equally spread between boys and girls. Even in stratified sampling (when we would poll equal numbers of boys and girls) it is important to have a large enough sample to include the entire spectrum of people and viewpoints. The sample size is determined by the precision desired for the population. The larger the sample size is, the more precise the estimate is. However, the larger the sample size, the more expensive and time consuming the statistical study becomes. In more advanced statistics classes you will learn how to use statistical methods to determine the best sample size for a desired precision on the population.
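To see how much sample size matters, we can compute the chance of a lopsided sample directly. The sketch below assumes that each child chosen is equally likely to be a boy or a girl and that the choices are independent; that is a simplifying assumption, not something stated in the text. The function name `prob_boys_at_least` is made up for this example.

```python
from math import comb

def prob_boys_at_least(n, k):
    """Probability that at least k of n randomly chosen children are boys,
    assuming each child is a boy with probability 1/2, independently."""
    favorable = sum(comb(n, j) for j in range(k, n + 1))
    return favorable / 2 ** n

print(prob_boys_at_least(6, 6))        # all 6 are boys: 1/64 = 0.015625
print(prob_boys_at_least(6, 5))        # at least 5 of 6 are boys: 7/64 = 0.109375
print(prob_boys_at_least(6000, 3300))  # at least 55% boys among 6,000 children
```

With 6 children there is about an 11% chance that at least 5 are boys. With 6,000 children, even a 55% share of boys is so unlikely that the probability printed on the last line is practically zero.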
Example 1
For a class assignment you have been asked to find out whether students in your school are planning to attend university after graduating from high school. Students can respond with “yes”, “no” or “undecided”. How will you choose those you wish to interview if you want your results to be reliable?
Solution
The best method for obtaining a representative sample would be to apply stratified sampling. An appropriate category for stratifying the population would be grade level since students in the upper grades might be more sure of their post-graduation plans than students in the lower grades.
You will need to find out what proportion of the total student population is included in each grade, then interview the same proportion of students from each grade when conducting the survey.
Identify Biased Samples
Once we have identified our population, it is important that the sample we choose accurately reflect the spread of people present in the population. If the sample we choose ends up with one or more sub-groups that are either over-represented or under-represented, then we say the sample is biased. We would not expect the results of a biased sample to represent the entire population, so it is important to avoid selecting a biased sample. Stratified sampling helps, but does not always eliminate bias in a sample. Even with a large sample size, we may be consistently picking one group over another.
Some surveys may deliberately seek a biased sample in order to bolster a particular viewpoint. For example, if a group of students were trying to petition the school to allow eating candy in the classroom, they might survey students only immediately before lunchtime, when students are hungry. The practice of polling only those who you believe will support your cause is sometimes referred to as cherry picking.
Many surveys may have a non-response bias. In this case, a survey that is simply handed out en masse elicits few responses compared to the number of surveys given out. People who are either too busy or simply not interested will be excluded from the results. Non-response bias may be reduced by conducting face-to-face interviews.
Self-selected respondents tend to have stronger opinions on subjects than others and are more motivated to respond. For this reason, phone-in and online polls also tend to be poor representations of the overall population. Even though it appears that both sides are responding, the poll may disproportionately represent extreme viewpoints from both sides, while ignoring more moderate opinions which may, in fact, be the majority view. Self-selected polls are generally regarded as unscientific.
Examples of biased samples.
The following text is adapted from Wikipedia http://www.wikipedia.org/wiki/Biased_sample
A classic example of a biased sample occurred in the 1948 Presidential Election. On Election night, the Chicago Tribune printed the headline DEWEY DEFEATS TRUMAN, which turned out to be mistaken. In the morning, the grinning President-Elect, Harry S. Truman, was photographed holding a newspaper bearing this headline. The reason the Tribune was mistaken is that their editor trusted the results of a phone survey. Survey research was then in its infancy, and few academics realized that a sample of telephone users was not representative of the general population. Telephones were not yet widespread, and those who had them tended to be prosperous and have stable addresses.
Example 2
Identify each sample as biased or unbiased. If the sample is biased explain how you would improve your sampling method.
a. Asking people shopping at a farmer’s market if they think locally grown fruit and vegetables are healthier than supermarket fruits and vegetables.
b. You want to find out public opinion on whether teachers get paid a sufficient salary by interviewing the teachers in your school.
c. You want to find out if your school needs to improve its communications with parents by sending home a survey written in English.
Solution
a. This would be a biased sample because people shopping at a farmer’s market are generally interested in buying fresher fruits and vegetables than a regular supermarket provides. The study can be improved by interviewing an equal number of people coming out of a supermarket, or by interviewing people in a more neutral environment such as the post office.
b. This is a biased sample because teachers generally feel like they should get a higher salary. A better sample could be obtained by constructing a stratified sample with people in different income categories.
c. This is a biased sample because only English-speaking parents would understand the survey. This group of parents would be generally more satisfied with the school’s communications. The study could be improved by sending different versions of the survey written in languages spoken at the students’ homes.
Identify Biased Questions
When you are creating a survey, you must think very carefully about the questions you should ask, how many questions are appropriate and even the order in which the questions should be asked. A biased question is a question that is worded in such a way (whether intentionally or not) that it causes a swing in the way people answer it. Biased questions can lead a representative, non-biased population sample to answer in a way that does not accurately reflect the larger population. While biased questions are a bad way to judge the overall mood of a population, they are sometimes used by advertising companies conducting surveys to suggest that one product performs better than others. They could also be used by political campaigners to give the impression that some policies are more popular than is actually the case.
There are several ways to spot biased questions.
They may use polarizing language, words and phrases that people associate with emotions.
- Is it right that farmers murder animals to feed people?
- How much of your time do you waste on TV every week?
- Should we be able to remove a person’s freedom of choice over cigarette smoking?
They may refer to a majority or to a supposed authority.
- Would you agree with the American Heart and Lung Association that smoking is bad for your health?
- The president believes that criminals should serve longer prison sentences. Do you agree?
- Do you agree with 90% of the public that the car on the right looks better?
The question may be phrased so as to suggest the person asking the question already knows the answer to be true, or to be false.
- It’s OK to smoke so long as you do it on your own, right?
- You shouldn’t be forced to give your money to the government, should you?
- You wouldn’t want criminals free to roam the streets, would you?
The question may be phrased in an ambiguous way (often with double negatives) which may confuse people.
- Do you reject the possibility that the moon landings never took place?
- Do you disagree with people who oppose the ban on smoking in public places?
In addition to biased questions, a survey may exhibit bias from other aspects of how it is designed. In particular, question order can play a role. For example, a survey may contain several questions on people’s attitudes to cigarette smoking. If the question “What, in your opinion, are the three biggest threats to public health today?” is asked at the end of the survey, it is likely that the answer “smoking” is given more often than if the same question is asked at the start of the survey.
Design and Conduct a Survey
One way of collecting information from a population is to conduct a survey. A survey is a way to ask a lot of people a few well-constructed questions. The survey is a series of unbiased questions that the subject must answer. Some advantages of surveys are that they are efficient ways of collecting information from a large number of people, they are relatively easy to administer, a wide variety of information can be collected and they can be focused (only questions of interest to the researcher are asked, recorded, codified and analyzed). Some disadvantages of surveys arise from the fact that they depend on the subjects’ motivation, honesty, memory and ability to respond. In addition, even if the sample chosen to be surveyed is unbiased, there might be errors because the people who choose to respond to the survey might not form an unbiased sample. Moreover, answer choices to survey questions could lead to vague data. For example, the choice “moderately agree” may mean different things to different people or to anyone interpreting the data.
Conducting a Survey
There are various methods for administering a survey. It can be done as a face-to-face interview or a phone interview where the researcher is questioning the subject. A different option is to have a self-administered survey where the subject can complete a survey on paper and mail it back, or complete the survey online. There are advantages and disadvantages to each of these methods.
Face-to-face interviews
The advantages of face-to-face interviews are that there are fewer misunderstood questions, fewer incomplete responses, higher response rates, greater control over the environment in which the survey is administered and the fact that additional information can be collected from the respondent. The disadvantages of face-to-face interviews are that they can be expensive and time-consuming and may require a large staff of trained interviewers. In addition, the response can be biased by the appearance or attitude of the interviewer.
Self-administered surveys
The advantages of self-administered surveys are that they are less expensive than interviews, do not require a large staff of experienced interviewers and they can be administered in large numbers. In addition, anonymity and privacy encourage more candid and honest responses and there is less pressure on respondents. The disadvantages of self-administered surveys are that respondents are more likely to stop participating midway through the survey and cannot ask for clarification. In addition, response rates are lower than in personal interviews, and the respondents who return the survey often represent the extremes of the population: those people who care strongly about the issue on either side.
Design a Survey
Surveys can take different forms. They can be used to ask only one question or they can ask a series of questions. We use surveys to test out people’s opinions or to test a hypothesis.
When designing a survey, we must keep the following guidelines in mind.
- Determine the goal of your survey. What question do you want to answer?
- Identify the sample population. Who will you interview?
- Choose an interviewing method: face-to-face interview, phone interview, self-administered paper survey or internet survey.
- Conduct the interview and collect the information.
- Analyze the results by making graphs and drawing conclusions.
Example 3
Martha wants to construct a survey that shows which sports students at her school like to play the most.
- List the goal of the survey.
- What population sample should she interview?
- How should she administer the survey?
- Create a data collection sheet that she can use to record her results.
Solution
- The goal of the survey is to find the answer to the question: “Which sports do students at Martha’s school like to play the most?”
- A sample of the population would include a random sample of the student population in Martha’s school. A good strategy would be to randomly select students (using dice or a random number generator) as they walk into an all-school assembly.
- Face-to-face interviews are a good choice in this case since the survey consists of only one question which can be quickly answered and recorded.
- In order to collect the data for this simple survey Martha can design a data collection sheet such as the one below:
Sport | Tally |
---|---|
baseball | |
basketball | |
football | |
soccer | |
volleyball | |
swimming | |
This is a good, simple data collection sheet because:
- Plenty of space is left for the tally marks.
- Only one question is being asked.
- Many possibilities are included, but space is left at the bottom for choices that students give that were not originally included in the data collection sheet.
- The answer from each interviewee can be quickly collected and then the data collector can move on to the next person.
Once the data has been collected, suitable graphs can be made to display the results.
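If the answers are typed in rather than tallied by hand, the same data collection sheet can be built with Python's `collections.Counter`. The list of answers below is hypothetical and is only meant to show the mechanics. Note that a write-in answer such as "fencing" simply gets its own row.

```python
from collections import Counter

# Hypothetical answers, one per student interviewed.
answers = ["soccer", "baseball", "soccer", "swimming", "fencing", "baseball", "soccer"]

tally = Counter(answers)  # counts how many times each sport was named

# Print the sheet with the most popular sports first.
for sport, count in tally.most_common():
    print(f"{sport:<10} | {'|' * count} ({count})")
```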
Example 4
Raoul wants to construct a survey that shows how many hours per week the average student at his school works.
- List the goal of the survey.
- What population sample will he interview?
- How would he administer the survey?
- Create a data collection sheet that Raoul can use to record his results.
Solution
- The goal of the survey is to find the answer to the question “How many hours per week does the average student at Raoul’s school work?”
- Raoul suspects that older students might work more hours per week than younger students. He decides that a stratified sample of the student population would be appropriate in this case. The strata are grade levels \begin{align*}9^{th}\end{align*} through \begin{align*}12^{th}\end{align*}. He would need to find what proportion of the students in his school are in each grade level, and then include the same proportions in his sample.
- Face-to-face interviews are a good choice in this case since the survey consists of two short questions which can be quickly answered and recorded.
- In order to collect the data for this survey Raoul designed the data collection sheet shown below:
Grade Level | Record Number of Hours Worked | Total Number of Students |
---|---|---|
\begin{align*}9^{th}\end{align*} grade | | |
\begin{align*}10^{th}\end{align*} grade | | |
\begin{align*}11^{th}\end{align*} grade | | |
\begin{align*}12^{th}\end{align*} grade | | |
This data collection sheet allows for the collection of the actual numbers of hours worked per week by students as opposed to just collecting tally marks for several categories.
Display, Analyze, and Interpret Statistical Survey Data
In the previous section we considered two examples of surveys you might conduct in your school. The first one was designed to find the sport that students like to play the most. The second survey was designed to find out how many hours per week students worked.
For the first survey, students’ choices fit neatly into separate categories. Appropriate ways to display the data would be a pie-chart or a bar-graph. Let us now revisit this example.
Example 5
In Example 3 Martha interviewed 112 students and obtained the following results.
Sport | Tally |
---|---|
Baseball | 31 |
Basketball | 17 |
Football | 14 |
Soccer | 28 |
Volleyball | 9 |
Swimming | 8 |
Gymnastics | 3 |
Fencing | 2 |
Total | 112 |
a. Make a bar graph of the results showing the percentage of students in each category.
b. Make a pie chart of the collected information, showing the percentage of students in each category.
Solution
a. To make a bar graph, we list the sport categories on the \begin{align*}x\end{align*}-axis and let the percentage of students be represented by the \begin{align*}y\end{align*}-axis.
To find the percentage of students in each category, we divide the number of students in each category by the total number of students surveyed.
The height of each bar represents the percentage of students in each category. Here are those percentages.
Sport | Percentage |
---|---|
Baseball | \begin{align*} \frac {31}{112}=.28=28\%\end{align*} |
Basketball | \begin{align*} \frac {17}{112}=.15=15\%\end{align*} |
Football | \begin{align*} \frac {14}{112}=.125=12.5\%\end{align*} |
Soccer | \begin{align*} \frac {28}{112}=.25=25\%\end{align*} |
Volleyball | \begin{align*} \frac {9}{112}=.08=8\%\end{align*} |
Swimming | \begin{align*} \frac {8}{112}=.07=7\%\end{align*} |
Gymnastics | \begin{align*} \frac {3}{112}=.027=2.7\%\end{align*} |
Fencing | \begin{align*} \frac {2}{112}=.02=2\%\end{align*} |
b. To make a pie chart, we find the percentage of the students in each category by dividing the number of students in each category by the total number of students surveyed, as in part a. The central angle of each slice of the pie is found by multiplying the percentage of students in each category by 360 degrees (the total number of degrees in a circle). To draw a pie chart by hand, you can use a protractor to measure the central angles that you find for each category. Because the percentages are rounded, the central angles may add up to a degree more or less than 360 degrees.
Sport | Percentage | Central Angle |
---|---|---|
Baseball | \begin{align*}\frac {31}{112}=.28=28\%\end{align*} | \begin{align*}.28\times 360^\circ =101^\circ \end{align*} |
Basketball | \begin{align*} \frac {17}{112}=.15=15\%\end{align*} | \begin{align*} .15\times 360^\circ =54^\circ \end{align*} |
Football | \begin{align*} \frac {14}{112}=.125=12.5\%\end{align*} | \begin{align*} .125\times 360^\circ =45^\circ \end{align*} |
Soccer | \begin{align*} \frac {28}{112}=.25=25\%\end{align*} | \begin{align*} .25\times 360^\circ =90^\circ \end{align*} |
Volleyball | \begin{align*}\frac {9}{112}=.08=8\%\end{align*} | \begin{align*}.08\times 360^\circ =29^\circ \end{align*} |
Swimming | \begin{align*}\frac {8}{112}=.07=7\%\end{align*} | \begin{align*}.07\times 360^\circ =25^\circ \end{align*} |
Gymnastics | \begin{align*}\frac {3}{112}=.027=2.7\%\end{align*} | \begin{align*}.027\times 360^\circ =10^\circ \end{align*} |
Fencing | \begin{align*} \frac {2}{112}=.02=2\%\end{align*} | \begin{align*}.02\times 360^\circ =7^\circ \end{align*} |
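The percentage and central angle calculations can also be checked with a short Python sketch. It uses Martha's tallies from this example; the function name `share_and_angle` is made up for illustration. Because the sketch works from the unrounded fractions rather than from rounded percentages, some of its angles may differ by about a degree from a table built from rounded values.

```python
tallies = {
    "Baseball": 31, "Basketball": 17, "Football": 14, "Soccer": 28,
    "Volleyball": 9, "Swimming": 8, "Gymnastics": 3, "Fencing": 2,
}
total = sum(tallies.values())  # 112 students were interviewed

def share_and_angle(count, total):
    """Return (percentage, central angle in degrees) for one category."""
    fraction = count / total
    return 100 * fraction, 360 * fraction

for sport, count in tallies.items():
    percent, angle = share_and_angle(count, total)
    print(f"{sport:<10} {percent:5.1f}%  {angle:5.1f} degrees")
```

Working from the exact fractions guarantees that the angles add up to a full circle, which is a useful check before drawing the chart.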
Here is the pie-chart that represents the percentage of students in each category:
For the second survey, actual numerical data can be collected from each student. In this case we can display the data using a stem-and-leaf plot, a frequency table, a histogram, and a box-and-whisker plot.
Example 6
In Example 4, Raoul found that 30% of the students at his school are in \begin{align*}9^{th}\end{align*} grade, 26% of the students are in \begin{align*}10^{th}\end{align*} grade, 24% of the students are in \begin{align*}11^{th}\end{align*} grade and 20% of the students are in \begin{align*}12^{th}\end{align*} grade. He surveyed a total of 60 students using these proportions as a guide for the number of students he interviewed from each grade. Raoul recorded the following data.
Grade Level | Record Number of Hours Worked | Total Number of Students |
---|---|---|
\begin{align*}9^{th}\end{align*} grade | 0, 5, 4, 0, 0, 10, 5, 6, 0, 0, 2, 4, 0, 8, 0, 5, 7, 0 | 18 |
\begin{align*}10^{th}\end{align*} grade | 6, 10, 12, 0, 10, 15, 0, 0, 8, 5, 0, 7, 10, 12, 0, 0 | 16 |
\begin{align*}11^{th}\end{align*} grade | 0, 12, 15, 18, 10, 0, 0, 20, 8, 15, 10, 15, 0, 5 | 14 |
\begin{align*}12^{th}\end{align*} grade | 22, 15, 12, 15, 10, 0, 18, 20, 10, 0, 12, 16 | 12 |
- Construct a stem-and-leaf plot of the collected data.
- Construct a frequency table with bin size of 5.
- Draw a histogram of the data.
- Find the five number summary of the data and draw a box-and-whisker plot.
Solution
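1. The ordered stem-and-leaf plot looks as follows. The stems are the tens digits and the leaves are the ones digits, so a stem of 1 with a leaf of 2 means 12 hours.
Stem | Leaves |
---|---|
0 | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 4 4 5 5 5 5 5 6 6 7 7 8 8 8 |
1 | 0 0 0 0 0 0 0 0 2 2 2 2 2 5 5 5 5 5 5 6 8 8 |
2 | 0 0 2 |
We can easily see from the stem-and-leaf plot that the mode of the data is 0. This makes sense because many students do not work while in high school.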
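A stem-and-leaf plot can be generated from Raoul's data with a short Python sketch. It assumes the tens digit is used as the stem and the ones digit as the leaf; the function name `stem_and_leaf` is chosen for this example.

```python
from collections import defaultdict

hours = [
    0, 5, 4, 0, 0, 10, 5, 6, 0, 0, 2, 4, 0, 8, 0, 5, 7, 0,   # 9th grade
    6, 10, 12, 0, 10, 15, 0, 0, 8, 5, 0, 7, 10, 12, 0, 0,    # 10th grade
    0, 12, 15, 18, 10, 0, 0, 20, 8, 15, 10, 15, 0, 5,        # 11th grade
    22, 15, 12, 15, 10, 0, 18, 20, 10, 0, 12, 16,            # 12th grade
]

def stem_and_leaf(values):
    """Group values by tens digit (stem) and list the ones digits (leaves) in order."""
    plot = defaultdict(list)
    for value in sorted(values):
        plot[value // 10].append(value % 10)
    return dict(plot)

for stem, leaves in sorted(stem_and_leaf(hours).items()):
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```

The long run of 0 leaves on the first stem is what makes the mode easy to spot.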
2. We construct the frequency table with a bin size of 5 by counting how many students fit in each category.
Hours worked | Frequency |
---|---|
\begin{align*}0 \le x < 5\end{align*} | 23 |
\begin{align*}5 \le x < 10\end{align*} | 12 |
\begin{align*}10 \le x < 15\end{align*} | 13 |
\begin{align*}15 \le x < 20\end{align*} | 9 |
\begin{align*}20 \le x < 25\end{align*} | 3 |
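The counting can be double-checked with a few lines of Python. In this sketch each value is assigned to the bin that starts at the largest multiple of 5 not exceeding it; `frequency_table` is a hypothetical helper name.

```python
from collections import Counter

hours = [
    0, 5, 4, 0, 0, 10, 5, 6, 0, 0, 2, 4, 0, 8, 0, 5, 7, 0,   # 9th grade
    6, 10, 12, 0, 10, 15, 0, 0, 8, 5, 0, 7, 10, 12, 0, 0,    # 10th grade
    0, 12, 15, 18, 10, 0, 0, 20, 8, 15, 10, 15, 0, 5,        # 11th grade
    22, 15, 12, 15, 10, 0, 18, 20, 10, 0, 12, 16,            # 12th grade
]

def frequency_table(values, bin_size):
    """Count how many values fall in each bin [start, start + bin_size)."""
    counts = Counter((value // bin_size) * bin_size for value in values)
    return dict(sorted(counts.items()))

for start, count in frequency_table(hours, 5).items():
    print(f"{start} <= x < {start + 5}: {count}")
```

The frequencies should add up to 60, the number of students surveyed.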
3. The histogram associated with this frequency table is shown below.
4. The five number summary.
smallest number \begin{align*}= 0\end{align*}
largest number \begin{align*}= 22\end{align*}
Since there are 60 data points, \begin{align*}\frac{n+1}{2}=\frac{61}{2}=30.5\end{align*}. The median is the mean of the \begin{align*}30^{th}\end{align*} and the \begin{align*}31^{st}\end{align*} values.
median \begin{align*}= 6.5\end{align*}
Since each half of the list has 30 values in it, the first and third quartiles are the medians of each of the smaller lists. The first quartile is the mean of the \begin{align*}15^{th}\end{align*} and \begin{align*}16^{th}\end{align*} values.
first quartile \begin{align*}= 0\end{align*}
The third quartile is the mean of the \begin{align*}45^{th}\end{align*} and \begin{align*}46^{th}\end{align*} values.
third quartile \begin{align*}= 12\end{align*}
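The five-number summary can be verified with the Python sketch below. It follows the method used here, taking the quartiles as the medians of the lower and upper halves of the ordered list. Other quartile conventions exist and can give slightly different values, so this is one reasonable approach rather than the only one. The function names are chosen for this example.

```python
hours = [
    0, 5, 4, 0, 0, 10, 5, 6, 0, 0, 2, 4, 0, 8, 0, 5, 7, 0,   # 9th grade
    6, 10, 12, 0, 10, 15, 0, 0, 8, 5, 0, 7, 10, 12, 0, 0,    # 10th grade
    0, 12, 15, 18, 10, 0, 0, 20, 8, 15, 10, 15, 0, 5,        # 11th grade
    22, 15, 12, 15, 10, 0, 18, 20, 10, 0, 12, 16,            # 12th grade
]

def median(sorted_values):
    """Middle value of an ordered list, or the mean of the two middle values."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2

def five_number_summary(values):
    """Return (smallest, first quartile, median, third quartile, largest)."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]                   # values below the median position
    upper = data[half + len(data) % 2:]   # values above the median position
    return data[0], median(lower), median(data), median(upper), data[-1]

print(five_number_summary(hours))
```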
The associated box-and-whisker plot is shown below.
Review Questions
- For a class assignment, you have been asked to find out how students get to school. Do they take public transportation, drive themselves, get a ride from their parents, carpool or walk/bike? You decide to interview a sample of students. How will you choose those you wish to interview if you want your results to be reliable?
- Comment on the way the following samples have been chosen. For the unsatisfactory cases, suggest a way to improve the sample choice.
- You want to find whether wealthier people have more nutritious diets by interviewing people coming out of a five-star restaurant.
- You want to find out whether a pedestrian crossing is needed at a certain intersection by interviewing people walking by that intersection.
- You want to find out if women talk more than men by interviewing an equal number of men and women.
- You want to find whether students in your school get too much homework by interviewing a stratified sample of students from each grade level.
- You want to find out whether there should be more public busses running during rush hour by interviewing people getting off the bus.
- You want to find out whether children should be allowed to listen to music while doing their homework by interviewing a stratified sample of male and female students in your school.
- Melissa conducted a survey to answer the question “What sport do high school students like to watch on TV the most?” She collected the following information on her data collection sheet.
Sport | Tally |
---|---|
Baseball | 32 |
Basketball | 28 |
Football | 24 |
Soccer | 18 |
Gymnastics | 19 |
Figure Skating | 8 |
Hockey | 18 |
Total | 147 |
(a) Make a pie-chart of the results showing the percentage of people in each category.
(b) Make a bar-graph of the results.
- Samuel conducted a survey to answer the following question: “What is the favorite kind of pie of the people living in my town?” By standing in front of his grocery store, he collected the following information on his data collection sheet:
Type of Pie | Tally |
---|---|
Apple | 37 |
Pumpkin | 13 |
Lemon Meringue | 7 |
Chocolate Mousse | 23 |
Cherry | 4 |
Chicken Pot Pie | 31 |
Other | 7 |
Total | 122 |
(a) Make a pie chart of the results showing the percentage of people in each category.
(b) Make a bar graph of the results.
- Myra conducted a survey of people at her school to see “In which month does a person’s birthday fall?” She collected the following information in her data collection sheet:
Month | Tally |
---|---|
January | 16 |
February | 13 |
March | 12 |
April | 11 |
May | 13 |
June | 12 |
July | 9 |
August | 7 |
September | 9 |
October | 8 |
November | 13 |
December | 13 |
Total | 136 |
(a) Make a pie chart of the results showing the percentage of people whose birthday falls in each month.
(b) Make a bar graph of the results.
- Nam-Ling conducted a survey that answers the question “Which student would you vote for in your school’s elections?” She collected the following information:
Candidate | \begin{align*}9^{th}\end{align*} graders | \begin{align*}10^{th}\end{align*} graders | \begin{align*}11^{th}\end{align*} graders | \begin{align*}12^{th}\end{align*} graders | Total |
---|---|---|---|---|---|
Susan Cho | 19 | ||||
Margarita Martinez | 31 | ||||
Steve Coogan | 16 | ||||
Solomon Duning | 26 | ||||
Juan Rios |