Chapter 1: An Introduction to Analyzing Statistical Data
Chapter Outline
 1.1. Definitions of Statistical Terminology
 1.2. An Overview of Data
 1.3. Measures of Center
 1.4. Measures of Spread
Chapter Summary
Part One: Multiple Choice
 Which of the following is true for any set of data?
 The range is a resistant measure of spread.
 The standard deviation is not resistant.
 The range can be greater than the standard deviation.
 The IQR is always greater than the range.
 The range can be negative.
 The following shows the mean number of days of precipitation by month in Juneau, Alaska:
Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec 

\begin{align*}18\end{align*}  \begin{align*} 17 \end{align*}  \begin{align*} 18 \end{align*}  \begin{align*} 17 \end{align*}  \begin{align*} 17 \end{align*}  \begin{align*} 15 \end{align*}  \begin{align*} 17 \end{align*}  \begin{align*} 18 \end{align*}  \begin{align*} 20 \end{align*}  \begin{align*} 24 \end{align*}  \begin{align*} 20 \end{align*}  \begin{align*} 21 \end{align*} 
Source: http://www.met.utah.edu/jhorel/html/wx/climate/daysrain.html (2/06/08)
Which month contains the median number of days of rain?
(a) January
(b) February
(c) June
(d) July
(e) September
 Given this set of data: \begin{align*}2, 10, 14, 6\end{align*}, which of the following is equivalent to \begin{align*}\bar{x}\end{align*}?
 mode
 median
 midrange
 range
 None of these
 Place the following in order from smallest to largest. I. Range II. Standard Deviation III. Variance
 I, II, III
 I, III, II
 II, III, I
 II, I, III
 It is not possible to determine the correct answer.
 On the first day of school, a teacher asks her students to fill out a survey with their name, gender, age, and homeroom number. How many quantitative variables are there in this example?
 \begin{align*}0\end{align*}
 \begin{align*}1\end{align*}
 \begin{align*}2\end{align*}
 \begin{align*}3\end{align*}
 \begin{align*}4\end{align*}
 You collect data on the shoe sizes of the students in your school by recording the sizes of \begin{align*}50\end{align*} randomly selected males’ shoes. What is the highest level of measurement that you have demonstrated?
 nominal
 ordinal
 interval
 ratio
 Which of the following represents a true statistical experiment?
 Researchers collect temperatures from the Arctic Ocean to determine the rate of climate change.
 Researchers collect, tag, and release geese from Siberia to determine their migration patterns.
 Researchers select \begin{align*}50\end{align*} individuals who smoke \begin{align*}1\end{align*} pack of cigarettes a day and \begin{align*}50\end{align*} individuals who do not smoke to test their lung capacities.
 Researchers select \begin{align*}50\end{align*} individuals at random. \begin{align*}25\end{align*} are given a new drug to boost memory function, \begin{align*}25\end{align*} are given a water pill and told that it is medication that will help their memories (a placebo). The memory function of both groups is then tested and compared.
 Researchers select \begin{align*}50\end{align*} individuals at random and ask them questions about their diet. Each individual’s physical fitness is then tested and compared to determine if there is a relationship between diet and health.
 According to a 2002 study, the mean height of Chinese men between the ages of \begin{align*}30\end{align*} and \begin{align*}65\end{align*} is \begin{align*}164.8\;\mathrm{cm}\end{align*} with a standard deviation of \begin{align*}6.4\;\mathrm{cm}\end{align*} (http://aje.oxfordjournals.org/cgi/reprint/155/4/346.pdf accessed Feb 6, 2008). Which of the following statements is true based on this study?
 The interquartile range is \begin{align*}12.8\;\mathrm{cm}\end{align*}.
 All Chinese men are between \begin{align*}158.4\end{align*} and \begin{align*}171.2\;\mathrm{cm}\end{align*}.
 At least \begin{align*}75\%\end{align*} of Chinese men between \begin{align*}30\end{align*} and \begin{align*}65\end{align*} are between \begin{align*}158.4\end{align*} and \begin{align*}171.2\;\mathrm{cm}\end{align*}.
 At least \begin{align*}75\%\end{align*} of Chinese men between \begin{align*}30\end{align*} and \begin{align*}65\end{align*} are between \begin{align*}152\end{align*} and \begin{align*}177.6\;\mathrm{cm}\end{align*}.
 All Chinese men between \begin{align*}30\end{align*} and \begin{align*}65\end{align*} are between \begin{align*}152\end{align*} and \begin{align*}177.6\;\mathrm{cm}\end{align*}.
 Sampling error is best described as:
 The unintentional mistakes a researcher makes when collecting information.
 The natural variation that is present when you do not get data from the entire population.
 A researcher intentionally asking a misleading question hoping for a particular response.
 When a drug company does their own experiment that proves their medication is the best.
 When individuals in a sample answer a survey untruthfully.
 If the sum of the squared deviations for a sample of \begin{align*}20\end{align*} individuals is \begin{align*}277\end{align*}, the standard deviation is closest to:
 \begin{align*}3.82\end{align*}
 \begin{align*}3.85\end{align*}
 \begin{align*}13.72\end{align*}
 \begin{align*}14.58\end{align*}
 \begin{align*}191.82\end{align*}
Part One: Answers
 b
 a
 b
 e (Note: while the standard deviation can never exceed the range, the variance is not always smaller than the range. The variance is the square of the standard deviation, so it is larger than the standard deviation when the standard deviation is greater than 1, but smaller when the standard deviation is between 0 and 1. Challenge students to find examples of data sets that illustrate these points.)
 c
 c
 d
 d
 b
 a
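Note: the arithmetic behind items 8 and 10, and the challenge in the note for item 4, can be checked with a short Python sketch. The variable names and the two small data sets used for item 4 are illustrative choices of ours, not part of the test. The item 8 check assumes Chebyshev's Theorem with k = 2 standard deviations.

```python
import math
import statistics

# Item 10: a sample standard deviation divides the sum of squared
# deviations by n - 1 before taking the square root.
n = 20
sum_squared_deviations = 277
s = math.sqrt(sum_squared_deviations / (n - 1))
print(round(s, 2))  # 3.82

# Item 8: Chebyshev's Theorem says at least 1 - 1/k^2 of any data set
# lies within k standard deviations of the mean.
mean_height, sd_height, k = 164.8, 6.4, 2
low = round(mean_height - k * sd_height, 1)
high = round(mean_height + k * sd_height, 1)
at_least = 1 - 1 / k**2
print(low, high, at_least)  # 152.0 177.6 0.75

# Item 4: the order of range, standard deviation, and variance depends
# on the data. Both data sets below are made up for illustration.
def spread(data):
    """Return (range, sample standard deviation, sample variance)."""
    return (max(data) - min(data),
            statistics.stdev(data),
            statistics.variance(data))

print(spread([0, 10, 20]))      # sd < range < variance
print(spread([0.0, 0.5, 1.0]))  # variance < sd < range
```

The two made-up data sets put the three measures in different orders, which is why no single ordering can be chosen in item 4.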
Part Two: Open-Ended Questions
 Erica’s grades in her statistics classes are as follows:
Quizzes: \begin{align*}62, 88, 82\end{align*}
Labs: \begin{align*}89, 96\end{align*}
Tests: \begin{align*}87, 99\end{align*}
(a) In this class, quizzes count once, labs count twice as much as a quiz, and tests count three times. Determine the following:
(i) mode
(ii) mean
(iii) median
(iv) upper and lower quartiles
(v) midrange
(vi) range
(b) If Erica’s \begin{align*}62\end{align*} quiz was removed from the data, briefly describe (without recalculating) the anticipated effect on the statistics you calculated in part a.
 Mr. Crunchy’s sells small bags of potato chips that are advertised to contain \begin{align*}12 \;\mathrm{ounces}\end{align*} of potato chips. To minimize complaints from their customers, the factory sets the machines to fill bags with an average weight of \begin{align*}13 \;\mathrm{ounces}\end{align*}. For an experiment in his statistics class, Spud goes to \begin{align*}5\end{align*} different stores, purchases \begin{align*}1\end{align*} bag from each store, and then weighs the contents. The weights of the bags are: \begin{align*}13.18, 12.65, 12.87, 13.32,\end{align*} and \begin{align*}12.93 \;\mathrm{ounces}\end{align*}.
(a) Calculate the sample mean
(b) Complete the chart below to calculate the standard deviation of Spud’s sample.
Observed Data  Deviations  \begin{align*}(x - \bar{x})^2\end{align*} 

\begin{align*}13.18\end{align*}  
\begin{align*}12.65\end{align*}  
\begin{align*}12.87\end{align*}  
\begin{align*}13.32\end{align*}  
\begin{align*}12.93\end{align*}  
Sum of the squared deviations \begin{align*} \rightarrow\end{align*} 
(c) Calculate the variance
(d) Calculate the standard deviation
(e) Explain what the standard deviation means in the context of the problem.
 The following table includes data on the number of square kilometers of the more substantial islands of the Galapagos Archipelago (there are actually many more islands if you count all the small volcanic rock outcroppings as islands).
Island  Approximate Area (sq. km) 

Baltra  \begin{align*} 8\end{align*} 
Darwin  \begin{align*} 1.1\end{align*} 
Española  \begin{align*} 60\end{align*} 
Fernandina  \begin{align*} 642\end{align*} 
Floreana  \begin{align*} 173\end{align*} 
Genovesa  \begin{align*} 14\end{align*} 
Isabela  \begin{align*} 4640\end{align*} 
Marchena  \begin{align*} 130\end{align*} 
North Seymour  \begin{align*} 1.9\end{align*} 
Pinta  \begin{align*} 60\end{align*} 
Pinzón  \begin{align*} 18\end{align*} 
Rabida  \begin{align*} 4.9\end{align*} 
San Cristóbal  \begin{align*} 558\end{align*} 
Santa Cruz  \begin{align*} 986\end{align*} 
Santa Fe  \begin{align*} 24\end{align*} 
Santiago  \begin{align*} 585\end{align*} 
South Plaza  \begin{align*} 0.13\end{align*} 
Wolf  \begin{align*}1.3\end{align*} 
Source: http://en.wikipedia.org/wiki/Gal%C3%A1pagos_Islands
(a) Calculate the mode, mean, median, quartiles, range, and standard deviation for this data.
Mode:
Mean:
Median:
Upper Quartile:
Lower Quartile:
Range:
Standard Deviation:
(b) Explain why the mean is so much larger than the median in the context of this data.
(c) Explain why the standard deviation is so large.
 At http://content.usatoday.com/sports/baseball/salaries/default.aspx, USA Today keeps a database of major league baseball salaries. You will see a pull-down menu that says, “Choose an MLB Team”. Pick a team and find the salary statistics for that team. Next to the current year you will see the median salary. If this site is not available, a web search will most likely locate similar data.
(a) Record the median and verify that it is correct.
(b) Find the other measures of center and record them.
Mean:
Mode:
Midrange:
Lower Quartile:
Upper Quartile:
IQR:
(c) Explain the real-world meaning of each measure of center in the context of this data.
Mean:
Median:
Mode:
Midrange:
Lower Quartile:
Upper Quartile:
IQR:
(d) Find the following measures of spread:
Range:
Standard Deviation:
(e) Explain the real-world meaning of each measure of spread in the context of this situation.
(f) Write two sentences commenting on two interesting features about the way the salary data is distributed for this team.
Part Two: Answers


 mode \begin{align*}99\end{align*} and \begin{align*}87\end{align*}
 mean \begin{align*}89.23\end{align*}
 median \begin{align*}89\end{align*}
 upper and lower quartiles \begin{align*}Q1 = 87, Q3 = 97.5\end{align*}
 midrange \begin{align*}80.5\end{align*}
 range \begin{align*}37\end{align*}
 The \begin{align*}62\end{align*} is an outlier in the data set. This would usually cause the mean to be significantly lower than the median, but the three \begin{align*}99\end{align*}’s are balancing this out. When we remove the \begin{align*}62\end{align*}, the mode should not be affected. The mean should increase. The most dramatic changes will occur in the midrange and range. The range should be much smaller and the midrange should increase. We would expect little change to the median and quartiles, as they are resistant measures.
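Note: the weighted statistics in part (a) can be reproduced with the following Python sketch. It assumes that weighting is handled by repeating each score according to its weight, and that each quartile is the median of one half of the sorted data with the overall median left out. The helper name `quartiles` is our own.

```python
import statistics

# Repeat each score by its weight: quizzes x1, labs x2, tests x3.
quizzes, labs, tests = [62, 88, 82], [89, 96], [87, 99]
scores = sorted(quizzes * 1 + labs * 2 + tests * 3)

def quartiles(sorted_data):
    """Return (Q1, Q3) as the medians of the lower and upper halves,
    leaving out the overall median when the count is odd."""
    half = len(sorted_data) // 2
    lower = sorted_data[:half]
    upper = sorted_data[-half:]
    return statistics.median(lower), statistics.median(upper)

modes = sorted(statistics.multimode(scores))
mean = statistics.mean(scores)
median = statistics.median(scores)
q1, q3 = quartiles(scores)
midrange = (scores[0] + scores[-1]) / 2
data_range = scores[-1] - scores[0]

print("n =", len(scores))            # 13 weighted scores
print("modes:", modes)               # [87, 99]
print("mean:", round(mean, 2))       # 89.23
print("median:", median)             # 89
print("Q1, Q3:", q1, q3)             # 87.0 97.5
print("midrange:", midrange)         # 80.5
print("range:", data_range)          # 37
```

Other quartile conventions exist, so a calculator or spreadsheet may report slightly different quartiles for the same data.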
 \begin{align*} \bar{x}=12.99 \end{align*}
 \begin{align*} s = 0.264 \end{align*}
 \begin{align*} s^2 = 0.07 \end{align*}
 The standard deviation tells you that the “typical” or “average” bag of chips in this sample is within about \begin{align*}0.264\;\mathrm{ounces}\end{align*} of the mean weight. Based on our sample, we would not have reason to believe that the company is selling unusually light or heavy bags of chips. Their quality control department appears to be doing a good job! (Note: this answer is very subjective for now, but it is important to start thinking in this manner. In later chapters, we will examine more precise measures and conclusions for this process.)
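Note: the chart in part (b) and the results in parts (c) and (d) can be checked with a short Python sketch. The variable names are our own, and the sketch assumes the sample formulas, which divide by n - 1.

```python
import math

weights = [13.18, 12.65, 12.87, 13.32, 12.93]
n = len(weights)

mean = sum(weights) / n
deviations = [x - mean for x in weights]
squared_deviations = [d ** 2 for d in deviations]

# One row per bag: observed weight, deviation, squared deviation.
for x, d, sq in zip(weights, deviations, squared_deviations):
    print(f"{x:6.2f} {d:7.2f} {sq:8.4f}")

sum_squared = sum(squared_deviations)
variance = sum_squared / (n - 1)   # sample variance
sd = math.sqrt(variance)           # sample standard deviation

print("mean:", round(mean, 2))
print("sum of squared deviations:", round(sum_squared, 4))
print("variance:", round(variance, 2))
print("standard deviation:", round(sd, 3))
```

The plain deviations always sum to zero (up to rounding), which is why the squared deviations are the ones summed in the last row of the chart.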

 (a) Mode: \begin{align*}60 \;\mathrm{km}^2 \end{align*}
Mean: \begin{align*}439.3 \;\mathrm{km}^2 \end{align*}
Median: \begin{align*}42 \;\mathrm{km}^2 \end{align*}
Upper Quartile: \begin{align*}558 \;\mathrm{km}^2 \end{align*}
Lower Quartile: \begin{align*}4.9 \;\mathrm{km}^2 \end{align*}
Range: \begin{align*}4639.87 \;\mathrm{km}^2 \end{align*}
Standard Deviation: \begin{align*}1088.69 \;\mathrm{km}^2 \end{align*}
(b) There is one very extreme outlier. Isabela is by far the largest island. In addition to that, there are many points in the lower half of the data that are very closely grouped together. Many of these islands are volcanic rocks that barely poke above the surface of the ocean. The upper \begin{align*}50\%\end{align*} of the data is much more spread out. This creates a situation in which the median stays very small, but the mean will be strongly pulled towards the larger numbers because it is not resistant.
(c) The standard deviation is a statistic that is based on the mean. Because the mean is not resistant, the standard deviation is not resistant either, and it will also be influenced by the larger numbers. Since it is a measure of the “typical” distance from the mean, the larger points will have a disproportionate influence on the calculation. On a more intuitive level, if the upper \begin{align*}50\%\end{align*} of the data is very widely spread, the standard deviation reflects that extreme variation.
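Note: the values in part (a) can be checked with the Python sketch below, which uses the areas from the table. It assumes the sample standard deviation (dividing by n - 1) and takes each quartile as the median of one half of the sorted data. The names `areas` and `quartiles` are our own.

```python
import statistics

# Approximate areas in square kilometers, copied from the table.
areas = {
    "Baltra": 8, "Darwin": 1.1, "Española": 60, "Fernandina": 642,
    "Floreana": 173, "Genovesa": 14, "Isabela": 4640, "Marchena": 130,
    "North Seymour": 1.9, "Pinta": 60, "Pinzón": 18, "Rabida": 4.9,
    "San Cristóbal": 558, "Santa Cruz": 986, "Santa Fe": 24,
    "Santiago": 585, "South Plaza": 0.13, "Wolf": 1.3,
}
values = sorted(areas.values())

def quartiles(sorted_data):
    """Return (Q1, Q3) as the medians of the lower and upper halves,
    leaving out the overall median when the count is odd."""
    half = len(sorted_data) // 2
    return (statistics.median(sorted_data[:half]),
            statistics.median(sorted_data[-half:]))

mode = statistics.mode(values)
mean = statistics.mean(values)
median = statistics.median(values)
q1, q3 = quartiles(values)
data_range = values[-1] - values[0]
sd = statistics.stdev(values)      # sample standard deviation

print("mode:", mode)
print("mean:", round(mean, 1))
print("median:", median)
print("Q1, Q3:", q1, q3)
print("range:", round(data_range, 2))
print("standard deviation:", round(sd, 1))
```

Removing Isabela from `areas` and rerunning the sketch is a quick way for students to see how strongly one outlier affects the mean and standard deviation compared with the median and quartiles.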
 (a) Will vary
(b) Will vary
(c) Mean: The average salary of the players on this team in 2007.
Median: The salary at which half the players on the team make more than that, and half the players make less than that.
Mode: The salary that more players make than any other individual salary. Usually, this is a league minimum salary that many players make.
Midrange: The mean of just the highest-paid and lowest-paid players' salaries.
Lower Quartile: The salary at which only \begin{align*}25\%\end{align*} of the players on the team make less.
Upper Quartile: The salary at which \begin{align*}75\%\end{align*} of the players make less, or the salary at which only one quarter of the team makes more.
IQR: The salaries of the middle \begin{align*}50\%\end{align*} of the players vary by this amount.
(d) Range: The gap in salary between the highest-paid and lowest-paid players.
Standard Deviation: The amount by which a typical player’s salary varies from the mean salary.
(e).
(f) Answers will vary, but students should comment on spread in one sentence and center in the other. Since many baseball teams have a few star players who make much higher salaries, most examples should give the students an opportunity to comment on the presence of outliers and their effect on the statistical measures of center and spread.