1.4: Measures of Spread
Learning Objectives
- Calculate the range and interquartile range.
- Calculate the standard deviation for a population and a sample, and understand its meaning.
- Distinguish between the variance and the standard deviation.
- Calculate and apply Chebyshev’s Theorem to any set of data.
Introduction
In the last lesson we concentrated on statistics that provided information about the way in which a data set is centered. Another important feature that can help us understand more about a data set is the manner in which the data is distributed or spread. Variation and dispersion are words that are also commonly used to describe this feature. There are several commonly used statistical measures of spread that we will investigate in this lesson.
Range
For most students, their first introduction to a statistic that measures spread is the range. The range is simply the difference between the smallest value (minimum) and the largest value (maximum) in the data. Let’s return to the data set used in the previous lesson:
Most students find it intuitive to say that the values range from to . However, the range is a statistic, and as such is a single number. It is therefore more proper to say that the range is .
The range is useful because it requires very little calculation and therefore gives a quick and easy “snapshot” of how the data is spread, but it is limited because it only involves two values in the data set and it is not resistant to outliers.
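As a rough illustration of both points, here is a minimal Python sketch. The data values and the function name `value_range` are hypothetical, chosen only for this example; they are not the data set from the previous lesson.

```python
# Hypothetical data values, used only for illustration.
data = [75, 80, 90, 94, 96]


def value_range(values):
    """Return the range: the maximum value minus the minimum value."""
    return max(values) - min(values)


print(value_range(data))          # 96 - 75 = 21

# A single outlier changes the range dramatically, because the range
# depends only on the two most extreme values.
print(value_range(data + [250]))  # 250 - 75 = 175
```

Adding one unusual value moves the range from 21 to 175 even though the other five values are unchanged, which is what "not resistant to outliers" means in practice.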
Interquartile Range
Like the range, the interquartile range is a difference between two values, in this case the upper and lower quartiles. While the range tells us how widely spread the entire data set is, the interquartile range (abbreviated IQR) tells us how the middle 50% of the data is spread.
Example:
A recent study proclaimed Mobile, Alabama the “wettest” city in America (http://www.livescience.com/environment/070518_rainy_cities.html). The following table lists a measurement of the approximate annual rainfall in Mobile for the last 10 years. Find the range and IQR for this data.
Year | Rainfall (inches) |
---|---|
1998 | |
1999 | |
2000 | |
2001 | |
2002 | |
2003 | |
2004 | |
2005 | |
2006 | |
2007 |
Figure: Approximate Total Annual Rainfall, Mobile, Alabama. source: http://www.cwop1353.com/CoopGaugeData.htm
First, place the data in order from smallest to largest. The range is the difference between the minimum and maximum rainfall amounts.
To find the IQR, first identify the quartiles, and then subtract the lower quartile from the upper quartile: $IQR = Q_3 - Q_1$.
Even though we are doing easy calculations, statistics is never about meaningless arithmetic and you should always be thinking about what a particular statistical measure means in the real context of the data. In this example, the range tells us that there is a difference of of rainfall between the wettest and driest years in Mobile. The IQR shows that there is a difference of of rainfall even in the middle of the data. It appears that Mobile experiences wide fluctuations in yearly rainfall totals, which might be explained by its position near the Gulf of Mexico and its exposure to tropical storms and hurricanes.
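The same two calculations can be sketched in Python. This is only an illustration under stated assumptions: the rainfall values below are hypothetical (they are not the Mobile measurements), the function names are invented for this example, and the quartiles are taken as the medians of the lower and upper halves of the ordered data. Calculators and software sometimes use other quartile conventions and may give slightly different results.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Assumption: when the number of values is odd, the middle value is
    left out of both halves. Needs at least two values.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[-half:]
    return median(lower), median(upper)


def iqr(values):
    """Return the interquartile range, Q3 minus Q1."""
    q1, q3 = quartiles(values)
    return q3 - q1


# Hypothetical annual rainfall totals in inches (not the Mobile data).
rainfall = [52, 41, 70, 66, 49, 58, 63, 45, 77, 60]

print(max(rainfall) - min(rainfall))  # range: 77 - 41 = 36
print(quartiles(rainfall))            # Q1 = 49, Q3 = 66
print(iqr(rainfall))                  # 66 - 49 = 17
```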
Standard Deviation
The standard deviation is an extremely important measure of spread that is based on the mean. Recall that the mean is the numerical balancing point of the data. One way to measure how the data is spread is to look at how far away the values are from the mean. The difference between the actual value and the mean is called the deviation. Written symbolically, it would be: $\text{deviation} = x - \bar{x}$
Let’s take a simple data set of three randomly selected individuals’ shoe sizes:
and
The mean of this data set is . The deviations then would be as follows:
Notice that the deviation of a point that is less than the mean is negative. Points that are above the mean have positive deviations.
We need a statistic that can summarize all of the deviations. The standard deviation is such a summary. It is a measure of the “typical” or “average” deviation for all of the data points from the mean. However, the very property that makes the mean so special also makes it tricky to calculate a standard deviation. Because the mean is the balancing point of the data, when you add the deviations, they sum to 0, in effect canceling each other out.
Observed Data | Deviations |
---|---|
Sum of the deviations |
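The cancellation can be checked with a short Python sketch. The three shoe sizes below are assumed values chosen for illustration, not necessarily the ones in the table.

```python
# Hypothetical shoe sizes, used only for illustration.
sizes = [9.5, 11.5, 12.0]

mean = sum(sizes) / len(sizes)          # 33.0 / 3 = 11.0
deviations = [x - mean for x in sizes]  # [-1.5, 0.5, 1.0]

print(deviations)
print(sum(deviations))                  # 0.0
```

With other data, floating-point rounding may leave a sum that is extremely close to zero rather than exactly zero.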
So we need to keep the deviations from canceling before we add them up. One way to do this would be to simply make them positive by taking their absolute values. This is a technique we use for a similar measure called the mean absolute deviation, but for the standard deviation, we square all the deviations. The square of any real number is never negative, so the squared deviations cannot cancel each other out.
Observed Data | Deviations | Squared Deviations |
---|---|---|
Sum of the deviations |
Now find the sum of the squared deviations:
Observed Data | Deviations | Squared Deviations |
---|---|---|
Sum of the squared deviations |
Normally if you were finding a mean, you would now divide by the number of values, $n$. This is the part that puzzles many beginning statistics students. Instead of dividing by $n$, we divide by $n - 1$, which will be explained later in this section. Dividing by $n - 1$ gives:
Remember that this number was obtained by squaring the deviations, so it is expressed in squared units rather than in the units of the original data. This quantity is actually called the variance and it will be very important in later chapters. The final step is to “unsquare” the variance, or take the square root:
This is the standard deviation! This means that in our sample, the “typical” value is approximately away from the mean.
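The steps above can be collected into a short Python sketch. The function name and the shoe sizes are assumptions made for this illustration; the arithmetic follows the procedure just described (deviations, squares, sum, divide by $n - 1$, square root).

```python
import math


def sample_standard_deviation(values):
    """Compute the sample standard deviation step by step."""
    n = len(values)
    mean = sum(values) / n
    # Square each deviation so that negatives and positives cannot cancel.
    squared_deviations = [(x - mean) ** 2 for x in values]
    # Dividing by n - 1 (not n) gives the sample variance.
    variance = sum(squared_deviations) / (n - 1)
    # "Unsquare" the variance to return to the original units.
    return math.sqrt(variance)


# Hypothetical shoe sizes, used only for illustration.
sizes = [9.5, 11.5, 12.0]

print(round(sample_standard_deviation(sizes), 2))  # 1.32
```

For these assumed values the squared deviations are 2.25, 0.25, and 1.0, their sum is 3.5, the variance is 3.5 / 2 = 1.75, and the square root of 1.75 is about 1.32.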
Technology Note: standard deviation on the TI-83 or 84
- Enter the above data in list [L1], as you did in the previous lesson (see first screen below).
- Then choose 1-Var Stats from the [CALC] submenu of the [STAT] menu (second screen).
- Enter [L1] (third screen) and press [enter] to see the fourth screen.
- In the fourth screen, the symbol Sx is the standard deviation.
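For readers working without a graphing calculator, a roughly comparable summary can be produced with Python's standard `statistics` module. This is a sketch, not a replica of the calculator screen: the data values are hypothetical, and the dictionary keys are simply labels chosen to mirror the calculator's Sx and σx.

```python
import statistics

# Hypothetical values standing in for the data entered in list L1.
data = [9.5, 11.5, 12.0]

summary = {
    "mean": statistics.mean(data),       # the mean, x-bar
    "Sx": statistics.stdev(data),        # sample standard deviation (divides by n - 1)
    "sigma_x": statistics.pstdev(data),  # population standard deviation (divides by n)
    "n": len(data),
}

for name, value in summary.items():
    print(name, round(value, 3))
```

Note that the sample value (Sx) is always at least as large as the population value computed from the same list, because it divides by a smaller number.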
Why n-1?
There are several ways to look at the need to divide the sum by $n - 1$ when calculating the standard deviation. For now, we will skip the more technical explanations, which involve ideas such as degrees of freedom that you will cover in later chapters, and focus on adjusting for sampling error. Dividing by $n - 1$ is only necessary when calculating the standard deviation of a sample. When you are calculating the standard deviation of a population, you divide by the number of values, $N$. But when you have a sample, you are not getting data for the entire population and there is bound to be random variation due to sampling (remember that this is called sampling error).
When we claim to have the standard deviation, we are making the following statement:
“The typical distance of a point from the mean is …”
But we might be off by a little from using a sample, so it is better to err on the side of a slightly larger value of $s$ when estimating the standard deviation. Dividing by $n - 1$ instead of $n$ does exactly that.
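Sample Standard Deviation: $s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$
Population Standard Deviation: $\sigma = \sqrt{\dfrac{\sum (x - \mu)^2}{N}}$
Here $x$ is a data value, $\bar{x}$ is the sample mean, $n$ is the sample size, $\mu$ is the population mean, and $N$ is the population size.
Because the variance is the square of the standard deviation, the variance formulas are as follows:
Variance of a population: $\sigma^2 = \dfrac{\sum (x - \mu)^2}{N}$
Variance of a sample: $s^2 = \dfrac{\sum (x - \bar{x})^2}{n - 1}$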
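One way to see what the $n - 1$ adjustment accomplishes is to list every possible sample from a very small population and average the resulting sample variances. The sketch below does this in Python for a hypothetical four-value population, taking samples of size 2 with replacement. It is an illustration under those assumptions, not a proof.

```python
from itertools import product
from statistics import mean, pvariance, variance

# Hypothetical population of four values.
population = [2, 4, 6, 8]
sigma_squared = pvariance(population)  # population variance: divides by N

# Every possible ordered sample of size 2, drawn with replacement.
samples = list(product(population, repeat=2))

# Average of the sample variances when dividing by n - 1 ...
avg_n_minus_1 = mean(variance(s) for s in samples)
# ... and when dividing by n instead.
avg_n = mean(pvariance(s) for s in samples)

# Population variance is 5; the two averages are 5 and 2.5.
print(sigma_squared, avg_n_minus_1, avg_n)
```

In this small case, dividing by $n - 1$ gives sample variances that average out to the population variance, while dividing by $n$ gives values that are too small on average.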
Chebyshev’s Theorem
Pafnuty Chebyshev was a 19th-century Russian mathematician. The theorem named for him gives us information about how many elements of a data set are within a certain number of standard deviations of the mean.
The formal statement is as follows:
The proportion of data that lies within $k$ standard deviations of the mean is at least:
$1 - \dfrac{1}{k^2}$, where $k > 1$
As an example, let’s return to the rainfall data from Mobile. The mean yearly rainfall amount is and the sample standard deviation is about .
Let’s investigate the information that Chebyshev’s Theorem gives us about the proportion of data within standard deviations of the mean. If we replace with , the result is:
So the theorem predicts that at least of the data is within standard deviations of the mean.
According to the drawing, Chebyshev’s Theorem states that at least of the data is between and . Well, this probably doesn’t seem too significant in this example, because all of the data falls within that range. In a later chapter we will learn a more informative rule about standard deviation, but the advantage of Chebyshev’s Theorem is that it applies to any sample or population, no matter how it is distributed.
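To experiment with the theorem, the following Python sketch computes the guaranteed minimum proportion for several values of $k$ and compares it with the proportion actually observed in a data set. The function names and the rainfall values are hypothetical (they are not the Mobile measurements), and the sample standard deviation is used for the spread.

```python
from statistics import mean, stdev


def chebyshev_bound(k):
    """Minimum proportion of data within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's Theorem requires k > 1")
    return 1 - 1 / k ** 2


def proportion_within(values, k):
    """Observed proportion of values within k sample standard deviations."""
    center = mean(values)
    spread = stdev(values)
    inside = [x for x in values if abs(x - center) <= k * spread]
    return len(inside) / len(values)


# Hypothetical annual rainfall totals in inches (not the Mobile data).
rainfall = [52, 41, 70, 66, 49, 58, 63, 45, 77, 60]

for k in (1.5, 2, 3):
    # Columns: k, guaranteed minimum proportion, observed proportion.
    print(k, round(chebyshev_bound(k), 3), proportion_within(rainfall, k))
```

The observed proportion is never smaller than the bound, and it is often much larger, which is why the theorem can seem uninformative for well-behaved data.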
Lesson Summary
When examining a set of data, we also use descriptive statistics to provide information about how the data is spread out. The range is the difference between the largest and smallest numbers in a data set. The interquartile range is the difference between the upper and lower quartiles. A more informative measure of spread is based on the mean. We can look at how individual points vary from the mean by subtracting the mean from each data value. This difference is called the deviation. The standard deviation is a measure of the “average” deviation for the entire data set. Because the deviations always sum to zero, we square them before adding them up. When we have the entire population, the sum of the squared deviations is divided by the population size. This quantity is called the variance. Taking the square root of the variance gives the standard deviation. For a population, the standard deviation is notated $\sigma$. Because a sample is prone to random variation (sampling error), we adjust the sample standard deviation to make it a little larger by dividing the sum of the squared deviations by one less than the number of observations. The result of that division is the sample variance, and the square root of the sample variance is the sample standard deviation, usually notated as $s$. Chebyshev’s Theorem gives us information about the minimum percentage of data that is within a certain number of standard deviations of the mean. It applies to any population or sample, regardless of how that data is distributed.
Points to Consider
- How do you determine which measure of spread best describes a particular data set?
- What information does the standard deviation tell us about the specific, real data being observed?
- What are the effects of outliers on the various measures of spread?
- How does altering the spread of a data set affect its visual representation(s)?
Review Questions
- Use the rainfall data from figure 1 to answer this question
- Calculate and record the sample mean:
- Complete the chart to calculate the standard deviation and the variance.
Year | Rainfall (inches) | Deviation | Squared Deviations |
---|---|---|---|
1998 | |||
1999 | |||
2000 | |||
2001 | |||
2002 | |||
2003 | |||
2004 | |||
2005 | |||
2006 | |||
2007 | |||
Sum |
Variance:
Standard Deviation:
Use the Galapagos Tortoise data below to answer questions 2 and 3.
Island or Volcano | Number of Individuals Repatriated |
---|---|
Wolf | |
Darwin | |
Alcedo | |
Sierra Negra | |
Cerro Azul | |
Santa Cruz | |
Española | |
San Cristóbal | |
Santiago | |
Pinzón | |
Pinta |
- Calculate the Range and the IQR for this data.
- Calculate the standard deviation for this data.
- If , then the population standard deviation is:
- Which data set has the largest standard deviation?
Review Answers
- (a) (b)
Year | Rainfall (inches) | Deviation | Squared Deviations |
---|---|---|---|
1998 | |||
1999 | |||
2000 | |||
2001 | |||
2002 | |||
2003 | |||
2004 | |||
2005 | |||
2006 | |||
2007 | |||
Sum |
Variance:
Standard Deviation:
- RANGE: IQR:
- a
- b
Further Reading
- http://mathcentral.uregina.ca/QQ/database/QQ.09.99/freeman2.html
- http://mathforum.org/library/drmath/view/52722.html
- http://edhelper.com/statistics.htm
- http://www.newton.dep.anl.gov/newton/askasci/1993/math/MATH014.HTM
Vocabulary
- Range: The maximum value in a data set minus the minimum value.
- Interquartile Range (IQR): The upper quartile in a data set minus the lower quartile.
- Deviation: The result of subtracting the mean of a data set from an actual data value.
- Standard Deviation: A measure of the “typical” distance of the data points in a set from the mean.
- Population Standard Deviation: The square root of the result of dividing the sum of the squared deviations by the population size.
- Sample Standard Deviation: The square root of the result of dividing the sum of the squared deviations by one less than the sample size.
- Variance: The square of the standard deviation.