## Range, variance, standard deviation

Another important feature that can help us understand more about a data set is the manner in which the data are distributed, or spread. Variation and dispersion are words that are also commonly used to describe this feature. There are several commonly used statistical measures of spread that we will investigate in this lesson.

Range

One measure of spread is the range. The range is simply the difference between the largest value (maximum) and the smallest value (minimum) in the data.

#### Calculating the Range

Return to the data set used in the previous lesson, which is shown below:

75, 80, 90, 94, 96

The range of this data set is \begin{align*}96 - 75 = 21\end{align*}. This is telling us the distance between the maximum and minimum values in the data set.

The range is useful because it requires very little calculation, and therefore, gives a quick and easy snapshot of how the data are spread. However, it is limited, because it only involves two values in the data set, and it is not resistant to outliers.
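As a quick check of the calculation above, the range can be computed in a few lines of Python. This is only a sketch: the helper name `data_range` is made up for illustration, and the extra score of 20 is a hypothetical outlier (not part of the original data) added to show how strongly a single extreme value changes the range.

```python
# Quiz data from the lesson
scores = [75, 80, 90, 94, 96]


def data_range(values):
    """Return the difference between the maximum and minimum values."""
    return max(values) - min(values)


print(data_range(scores))  # 96 - 75 = 21

# Hypothetical outlier, added only for illustration
scores_with_outlier = scores + [20]
print(data_range(scores_with_outlier))  # 96 - 20 = 76
```

One added value more than triples the range, which is what "not resistant to outliers" means in practice.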

Interquartile Range

The interquartile range is the difference between \begin{align*}Q_3\end{align*} and \begin{align*}Q_1\end{align*}, and it is abbreviated \begin{align*}IQR\end{align*}. Thus, \begin{align*}IQR = Q_3-Q_1\end{align*}. The \begin{align*}IQR\end{align*} gives information about how the middle 50% of the data are spread. Fifty percent of the data values are always between \begin{align*}Q_1\end{align*} and \begin{align*}Q_3\end{align*}.

#### Calculating the Range and the Interquartile Range

A recent study proclaimed Mobile, Alabama the wettest city in America. The following table lists measurements of the approximate annual rainfall in Mobile over a 10 year period. Find the range and \begin{align*}IQR\end{align*} for this data.

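| Year | Rainfall (inches) |
|------|-------------------|
| 1998 | 90 |
| 1999 | 56 |
| 2000 | 60 |
| 2001 | 59 |
| 2002 | 74 |
| 2003 | 76 |
| 2004 | 81 |
| 2005 | 91 |
| 2006 | 47 |
| 2007 | 59 |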

Figure 1: Approximate Total Annual Rainfall, Mobile, Alabama.

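First, place the data in order from smallest to largest:

47, 56, 59, 59, 60, 74, 76, 81, 90, 91

The range is the difference between the minimum and maximum rainfall amounts: \begin{align*}91 - 47 = 44\end{align*}.

To find the \begin{align*}IQR\end{align*}, first identify the quartiles, and then compute \begin{align*}Q_3-Q_1\end{align*}. The lower half of the ordered data (47, 56, 59, 59, 60) has a median of \begin{align*}Q_1 = 59\end{align*}, and the upper half (74, 76, 81, 90, 91) has a median of \begin{align*}Q_3 = 81\end{align*}, so \begin{align*}IQR = 81 - 59 = 22\end{align*}.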

In this example, the range tells us that there is a difference of 44 inches of rainfall between the wettest and driest years in Mobile. The \begin{align*}IQR\end{align*} shows that there is a difference of 22 inches of rainfall, even in the middle 50% of the data. It appears that Mobile experiences wide fluctuations in yearly rainfall totals, which might be explained by its position near the Gulf of Mexico and its exposure to tropical storms and hurricanes.
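The range and interquartile range for the rainfall data can be reproduced with a short Python sketch. It assumes the quartiles are the medians of the lower and upper halves of the ordered data; the helper name `quartiles` is hypothetical. Statistical software often uses other quartile conventions and may report slightly different values.

```python
from statistics import median

# Approximate annual rainfall in Mobile, 1998-2007 (inches)
rainfall = [90, 56, 60, 59, 74, 76, 81, 91, 47, 59]


def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Assumption: when the number of values is odd, the overall median
    is left out of both halves.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]    # smallest half of the data
    upper = ordered[-half:]   # largest half of the data
    return median(lower), median(upper)


q1, q3 = quartiles(rainfall)
rainfall_range = max(rainfall) - min(rainfall)
iqr = q3 - q1

print(rainfall_range)  # 91 - 47 = 44
print(q1, q3, iqr)     # 59 81 22
```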

Standard Deviation

The standard deviation is an extremely important measure of spread that is based on the mean. Recall that the mean is the numerical balancing point of the data. One way to measure how the data are spread is to look at how far away each of the values is from the mean. The difference between a data value and the mean is called the deviation. Written symbolically, it would be as follows:

\begin{align*}\text{Deviation} = x-\overline{x}\end{align*}

Let’s take the simple data set of three randomly selected individuals’ shoe sizes shown below:

9.5, 11.5, 12

The mean of this data set is 11. The deviations are as follows:

Table of Deviations

| \begin{align*}x\end{align*} | \begin{align*}x-\overline{x}\end{align*} |
|------|------|
| 9.5 | \begin{align*}9.5 - 11 = -1.5\end{align*} |
| 11.5 | \begin{align*}11.5 - 11 = 0.5\end{align*} |
| 12 | \begin{align*}12 - 11 = 1\end{align*} |

Notice that if a data value is less than the mean, the deviation of that value is negative. Points that are above the mean have positive deviations.

The standard deviation is a measure of the typical, or average, deviation for all of the data points from the mean. However, the very property that makes the mean so special also makes it tricky to calculate a standard deviation. Because the mean is the balancing point of the data, when you add the deviations, they always sum to 0.

Table of Deviations, Including the Sum.

| Observed Data | Deviations |
|------|------|
| 9.5 | \begin{align*}9.5 - 11 = -1.5\end{align*} |
| 11.5 | \begin{align*}11.5 - 11 = 0.5\end{align*} |
| 12 | \begin{align*}12 - 11 = 1\end{align*} |
| Sum of deviations | \begin{align*}-1.5 + 0.5 + 1 = 0\end{align*} |

Therefore, we need all the deviations to be positive before we add them up. One way to do this would be to make them positive by taking their absolute values. This is a technique we use for a similar measure called the mean absolute deviation. For the standard deviation, though, we square all the deviations. The square of any real number is never negative, so the squared deviations cannot cancel each other out.

| Observed Data \begin{align*}x\end{align*} | Deviation \begin{align*}x-\overline{x}\end{align*} | \begin{align*}(x-\overline{x})^2\end{align*} |
|------|------|------|
| 9.5 | \begin{align*}-1.5\end{align*} | \begin{align*}(-1.5)^2=2.25\end{align*} |
| 11.5 | 0.5 | \begin{align*}(0.5)^2=0.25\end{align*} |
| 12 | 1 | \begin{align*}(1)^2=1\end{align*} |

\begin{align*}\text{Sum of the squared deviations} = 2.25 + 0.25 + 1 = 3.5\end{align*}

We want to find the average of the squared deviations. Usually, to find an average, you divide by the number of terms in your sum. In finding the standard deviation, however, we divide by \begin{align*}n-1\end{align*}. In this example, since \begin{align*}n=3\end{align*}, we divide by 2. The result, which is called the variance, is 1.75. The variance of a sample is denoted by \begin{align*}s^2\end{align*} and is a measure of how closely the data are clustered around the mean.

Because we squared the deviations before we added them, the units we were working in were also squared. To return to the original units, we must take the square root of our result: \begin{align*}\sqrt{1.75} \approx 1.32\end{align*}. This quantity is the sample standard deviation and is denoted by \begin{align*}s\end{align*}. The number indicates that in our sample, the typical data value is approximately 1.32 units away from the mean. It is a measure of how closely the data are clustered around the mean. A small standard deviation means that the data points are clustered close to the mean, while a large standard deviation means that the data points are spread out from the mean.
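The steps above (deviations, squared deviations, division by \begin{align*}n-1\end{align*}, square root) can be traced in a short Python sketch using the shoe-size data. The variable names are chosen only for illustration.

```python
from math import sqrt

# Shoe sizes from the lesson
shoe_sizes = [9.5, 11.5, 12]

n = len(shoe_sizes)
mean = sum(shoe_sizes) / n                    # 11.0
deviations = [x - mean for x in shoe_sizes]   # [-1.5, 0.5, 1.0]
squared = [d ** 2 for d in deviations]        # [2.25, 0.25, 1.0]

variance = sum(squared) / (n - 1)             # 3.5 / 2 = 1.75
std_dev = sqrt(variance)                      # about 1.32

print(sum(deviations))     # the deviations sum to 0
print(variance)            # sample variance, s squared
print(round(std_dev, 2))   # sample standard deviation, s
```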

#### Interpreting Variance

The following are scores for two different students on two quizzes:

Student 1: \begin{align*}100; \quad 0\end{align*}

Student 2: \begin{align*}50; \quad 50\end{align*}

Note that the mean score for each of these students is 50.

Student 1: Deviations: \begin{align*}100 - 50 = 50; \quad 0 - 50 = -50\end{align*}

Squared deviations: \begin{align*}2500; \quad 2500\end{align*}

Variance \begin{align*}= \frac{2500 + 2500}{2 - 1} = 5000\end{align*}

Standard Deviation \begin{align*}= \sqrt{5000} \approx 70.7\end{align*}

Student 2: Deviations: \begin{align*}50 - 50 = 0; \quad 50 - 50 = 0\end{align*}

Squared Deviations: \begin{align*}0; \quad 0\end{align*}

Variance \begin{align*}= 0\end{align*}

Standard Deviation \begin{align*}= 0\end{align*}

Student 2 has scores that are tightly clustered around the mean. In fact, the standard deviation of zero indicates that there is no variability. The student is absolutely consistent.
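The comparison of the two students can be checked with Python's built-in `statistics` module, whose `variance` and `stdev` functions use the sample formulas (division by \begin{align*}n-1\end{align*}). This is a small sketch, with list names chosen for illustration.

```python
from statistics import mean, stdev, variance

# Quiz scores from the lesson
student_1 = [100, 0]
student_2 = [50, 50]

for name, scores in [("Student 1", student_1), ("Student 2", student_2)]:
    # Same mean for both students, very different spread
    print(name, mean(scores), variance(scores), round(stdev(scores), 1))
```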

So, while the average of each of these students is the same (50), one of them is consistent in the work he/she does, and the other is not. This raises questions: Why did student 1 get a zero on the second quiz when he/she had a perfect paper on the first quiz? Was the student sick? Did the student forget about the quiz and not study? Or was the second quiz indicative of the work the student can do, and was the first quiz the one that was questionable? Did the student cheat on the first quiz?

There is one more question that we haven't answered regarding standard deviation, and that is, "Why \begin{align*}n-1\end{align*}?" Dividing by \begin{align*}n-1\end{align*} is only necessary for the calculation of the standard deviation of a sample. When you are calculating the standard deviation of a population, you divide by \begin{align*}N\end{align*}, the number of data points in your population. When you have a sample, you are not getting data for the entire population, and there is bound to be random variation due to sampling (remember that this is called sampling error).

When we claim to have the standard deviation, we are making the following statement:

“The typical distance of a point from the mean is ...”

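But we might be off by a little from using a sample, so it would be better to overestimate \begin{align*}s\end{align*} to represent the standard deviation. Dividing by \begin{align*}n-1\end{align*} rather than \begin{align*}n\end{align*} does this, because a smaller divisor gives a slightly larger result.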
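One way to see the effect of the divisor is to list every possible sample from a very small population and average the results. The sketch below makes several assumptions that are not part of the lesson: a hypothetical population of three values, samples of size 2 drawn with replacement, and a comparison of variances rather than standard deviations. Under those assumptions, dividing by \begin{align*}n\end{align*} underestimates the population variance on average, while dividing by \begin{align*}n-1\end{align*} matches it.

```python
from fractions import Fraction
from itertools import product

# Hypothetical tiny population, chosen only for illustration
population = [0, 2, 4]


def variance(values, divisor):
    """Sum of squared deviations from the mean, divided by `divisor`."""
    mean = Fraction(sum(values), len(values))
    return sum((v - mean) ** 2 for v in values) / divisor


# Population variance: divide by N
pop_var = variance(population, len(population))

# Every ordered sample of size 2, drawn with replacement
samples = list(product(population, repeat=2))

avg_div_n = sum(variance(s, 2) for s in samples) / len(samples)
avg_div_n_minus_1 = sum(variance(s, 1) for s in samples) / len(samples)

print(pop_var)            # 8/3
print(avg_div_n)          # 4/3, too small on average
print(avg_div_n_minus_1)  # 8/3, matches the population variance
```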

Formulas

Sample Standard Deviation:

\begin{align*}s=\sqrt{\frac{\sum_{i=1}^n (x_i-\overline{x})^2}{n-1}}\end{align*}

where:

\begin{align*}x_i\end{align*} is the \begin{align*}i^{\text{th}}\end{align*} data value.

\begin{align*}\overline{x}\end{align*} is the mean of the sample.

\begin{align*}n\end{align*} is the sample size.

Variance of a sample:

\begin{align*}s^2= \frac{\sum_{i=1}^n (x_i-\overline{x})^2}{n-1}\end{align*}

where:

\begin{align*}x_i\end{align*} is the \begin{align*}i^{\text{th}}\end{align*} data value.

\begin{align*}\overline{x}\end{align*} is the mean of the sample.

\begin{align*}n\end{align*} is the sample size.
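In Python, the sample formulas above correspond to `statistics.variance` and `statistics.stdev`, while `statistics.pvariance` and `statistics.pstdev` divide by the number of data points instead. The sketch below applies both pairs to the Mobile rainfall data from Figure 1 to show how much the choice of divisor matters for ten values.

```python
from statistics import pstdev, pvariance, stdev, variance

# Approximate annual rainfall in Mobile, 1998-2007 (inches)
rainfall = [90, 56, 60, 59, 74, 76, 81, 91, 47, 59]

# Sample formulas: divide by n - 1
print(round(variance(rainfall), 2))   # about 230.68
print(round(stdev(rainfall), 2))      # about 15.19

# Population formulas: divide by the number of data points
print(round(pvariance(rainfall), 2))  # about 207.61
print(round(pstdev(rainfall), 2))     # about 14.41
```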

Chebyshev’s Theorem

Pafnuty Chebyshev was a \begin{align*}19^{\text{th}}\end{align*} Century Russian mathematician. The theorem named for him gives us information about how many elements of a data set are within a certain number of standard deviations of the mean.

The formal statement for Chebyshev’s Theorem is as follows:

The proportion of data points that lie within \begin{align*}k\end{align*} standard deviations of the mean is at least:

\begin{align*}1-\frac{1}{k^2}, \ k>1\end{align*}

#### Using Chebyshev's Theorem

Given a group of data with mean 60 and standard deviation 15, at least what percent of the data will fall between 15 and 105?

15 is three standard deviations below the mean of 60, and 105 is 3 standard deviations above the mean of 60. Chebyshev’s Theorem tells us that at least \begin{align*}1-\frac{1}{3^2} = 1-\frac{1}{9} = \frac{8}{9} \approx 0.89 =89\%\end{align*} of the data will fall between 15 and 105.
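The same calculation can be sketched in Python. The function name `chebyshev_lower_bound` is hypothetical; it simply evaluates \begin{align*}1-\frac{1}{k^2}\end{align*} after working out how many standard deviations each endpoint lies from the mean.

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of data within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's Theorem requires k > 1")
    return 1 - 1 / k ** 2


mean, std_dev = 60, 15
low, high = 15, 105

# Number of standard deviations from the mean to each endpoint
k = (high - mean) / std_dev          # 3.0
assert k == (mean - low) / std_dev   # the interval is symmetric about the mean

print(round(chebyshev_lower_bound(k), 2))  # 8/9, about 0.89
```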

### Examples

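For the following problems use the rainfall data from Mobile. The mean yearly rainfall amount is 69.3, and the standard deviation is about 14.4. (This value comes from dividing by \begin{align*}n = 10\end{align*}; dividing by \begin{align*}n-1 = 9\end{align*} gives a sample standard deviation of about 15.2.)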

#### Example 1

What percentage of the data is within two standard deviations of the mean?

Chebyshev’s Theorem tells us about the proportion of data within \begin{align*}k\end{align*} standard deviations of the mean. If we replace \begin{align*}k\end{align*} with 2, the result is as shown:

\begin{align*}1-\frac{1}{2^2} = 1- \frac{1}{4}=\frac{3}{4}\end{align*}

So the theorem predicts that at least 75% of the data is within 2 standard deviations of the mean.

#### Example 2

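Is the answer from Example 1 significant?

Two standard deviations below and above the mean are \begin{align*}69.3 - 2(14.4) = 40.5\end{align*} and \begin{align*}69.3 + 2(14.4) = 98.1\end{align*}, so Chebyshev’s Theorem states that at least 75% of the data is between 40.5 and 98.1. This doesn’t seem too significant in this example, because all of the data falls within that range.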
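The comparison between Chebyshev's bound and the actual data can be sketched in Python. The mean of 69.3 and standard deviation of 14.4 are the values given in this lesson; the rest of the names are chosen for illustration.

```python
# Approximate annual rainfall in Mobile, 1998-2007 (inches)
rainfall = [90, 56, 60, 59, 74, 76, 81, 91, 47, 59]

mean = 69.3      # value given in the lesson
std_dev = 14.4   # value given in the lesson
k = 2

low = mean - k * std_dev    # about 40.5
high = mean + k * std_dev   # about 98.1

# Proportion of the data that actually falls inside the interval
inside = [x for x in rainfall if low <= x <= high]
observed = len(inside) / len(rainfall)

# Proportion guaranteed by Chebyshev's Theorem
bound = 1 - 1 / k ** 2

print(round(low, 1), round(high, 1))  # 40.5 98.1
print(observed, bound)                # 1.0 0.75
```

The guarantee of 75% holds, but it is far weaker than what this data set actually shows.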

#### Example 3

What is the main advantage of Chebyshev's Theorem?

The advantage of Chebyshev’s Theorem is that it applies to any sample or population, no matter how it is distributed.

### Review

1. Following are bowling scores for two people: Luna - 112, 105, 138, 125, and 115; Chris - 142, 116, 100, 132, and 105.
    1. Show that Chris and Luna have the same mean and range.
    2. Whose performance is more variable? Explain.
2. Use the rainfall data from Figure 1 to answer this question.
    1. Calculate and record the sample mean:
    2. Complete the chart to calculate the variance and the standard deviation.

| Year | Rainfall (inches) | Deviation | Squared Deviations |
|------|-------------------|-----------|--------------------|
| 1998 | 90 | | |
| 1999 | 56 | | |
| 2000 | 60 | | |
| 2001 | 59 | | |
| 2002 | 74 | | |
| 2003 | 76 | | |
| 2004 | 81 | | |
| 2005 | 91 | | |
| 2006 | 47 | | |
| 2007 | 59 | | |

For 3-4, use the Galapagos Tortoise data below.

| Island or Volcano | Number of Individuals Repatriated |
|-------------------|-----------------------------------|
| Wolf | 40 |
| Darwin | 0 |
| Alcedo | 0 |
| Sierra Negra | 286 |
| Cerro Azul | 357 |
| Santa Cruz | 210 |
| Española | 1293 |
| San Cristóbal | 55 |
| Santiago | 498 |
| Pinzón | 552 |
| Pinta | 0 |

3. Calculate the range and the \begin{align*}IQR\end{align*} for this data.
4. Calculate the sample standard deviation for this data.
5. If \begin{align*}\sigma^2=9\end{align*}, then the population standard deviation is:
    1. 3
    2. 8
    3. 9
    4. 81
6. Which data set has the largest standard deviation?
    1. 10 10 10 10 10
    2. 0 0 10 10 10
    3. 0 9 10 11 20
    4. 20 20 20 20 20
7. How do you determine which measure of spread best describes a particular data set?
8. What information does the standard deviation tell us about the specific, real data being observed?
9. What are the effects of outliers on the various measures of spread?
10. How does altering the spread of a data set affect its visual representation(s)?

Technology Notes:

Calculating Standard Deviation on the TI-83/84 Graphing Calculator

Enter the data 9.5, 11.5, 12 in list L1 (see first screen below).

Then choose '1-Var Stats' from the CALC submenu of the STAT menu (second screen).

Enter L1 (third screen) and press [ENTER] to see the fourth screen.

In the fourth screen, the symbol \begin{align*}s_x\end{align*} is the sample standard deviation.

### Vocabulary

Chebyshev's theorem

Chebyshev’s Theorem gives us information about the minimum percentage of data that falls within a certain number of standard deviations of the mean, and it applies to any population or sample, regardless of how that data set is distributed.

descriptive statistics

In descriptive statistics, the goal is to describe the data that are found in a sample or given in a problem.

deviation

Deviation is a measure of the difference between a given value and the mean.

Dispersion

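Dispersion describes how spread out the values in a data set are. The range is one measure of dispersion; the interquartile range, variance, and standard deviation are others.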

Interquartile range

The interquartile range is the difference between the third quartile and the first quartile (Q3-Q1).

Range

The range of a data set is the difference between the smallest value and the greatest value in the data set.

Sampling error (random variation)

Sampling error occurs whenever a sample is used instead of the entire population, where we have to accept that our results are merely estimates, and therefore, have some chance of being incorrect.

standard deviation

The square root of the variance is the standard deviation. Standard deviation is one way to measure the spread of a set of data.

variance

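A measure of the spread of a data set equal to the mean of the squared deviations of each data value from the mean of the data set. For a sample, the sum of the squared deviations is divided by \begin{align*}n-1\end{align*} rather than \begin{align*}n\end{align*}.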