Two groups of students that each have an average test score of 75 might have a score distribution that looks remarkably different. One class might be made up entirely of grades between 72 and 78 while the other class may have half the group around 50, with the other half getting near 100. **Variance** is a way of measuring the variation in a set of data or how spread out the data is. What is the mean and variance for the following sample test scores taken from a larger student population?

75, 73, 78, 90, 60, 51, 87, 79, 80, 77

### Finding Variance

The thought process of a person trying to describe the spread (variation) of some data for the first time must have been something like this.

Well, the average is 75. What if I try to just add up how different each number is from 75?

As the person calculates the numbers, they realize pretty quickly that this sum will always be zero, essentially by the definition of the mean. The differences for the numbers below 75 exactly cancel the differences for the numbers above 75.

Since I cannot add the differences directly, why don’t I just sum the absolute value of the differences?

This is a legitimate method for describing the spread of data. It is called **absolute deviation** and is simply the sum of the absolute values of each of the differences.

If I take the average absolute difference, I will be able to judge on average how far away each data point is from the mean. A larger difference means more spread out.

If you take the average of the absolute deviation, you get the **mean absolute deviation**. The mean absolute deviation is a legitimate, but limited, way of describing the spread of data. Eventually, a person trying to describe the spread of data for the first time might consider a method called **population variance**.
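The first two ideas can be checked numerically. The following is a minimal Python sketch using the ten test scores from the opening problem; the variable names are chosen here for illustration only.

```python
# Test scores from the opening problem.
scores = [75, 73, 78, 90, 60, 51, 87, 79, 80, 77]

mean = sum(scores) / len(scores)

# Signed differences from the mean cancel out.
deviations = [x - mean for x in scores]
sum_of_deviations = sum(deviations)

# Absolute differences do not cancel.
absolute_deviation = sum(abs(d) for d in deviations)

# Averaging the absolute differences gives the mean absolute deviation.
mean_absolute_deviation = absolute_deviation / len(scores)

print(mean)                     # 75.0
print(sum_of_deviations)        # 0.0
print(absolute_deviation)       # 82.0
print(mean_absolute_deviation)  # 8.2
```

On average, each score in this sample sits 8.2 points away from the mean.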

What if instead of using absolute value to solve the issue, I square each difference and then add them together? Of course I’d have to divide by the number of data points to get the average difference squared.

This method turns out to be extraordinarily powerful in statistics. One downside is that most of the time you cannot get data from the entire population; you usually only get it from a sample. Over time people realized that samples were typically less variable than their populations, so dividing by the number of data points consistently underestimated the true variance of the population. In other words, if \begin{align*}n\end{align*} is the size of the sample, then multiplying the sum of the squared differences by \begin{align*}\frac{1}{n}\end{align*} makes the variance too small. Research and theory progressed until it was realized that multiplying the sum of the squared differences by \begin{align*}\frac{1}{n-1}\end{align*} made the result slightly larger and properly estimated the variance of the population. Thus, there are two ways to calculate variance, one for populations and one for samples.
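One way to see the underestimate is a small simulation. The sketch below is an illustration under stated assumptions, not part of the original argument: the population is taken to be rolls of a fair six-sided die (true variance 35/12), and the sample size, number of trials, and seed are arbitrary choices.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable


def sum_of_squared_differences(sample):
    """Sum of squared differences between each value and the sample mean."""
    mean = sum(sample) / len(sample)
    return sum((mean - x) ** 2 for x in sample)


# Assumed population: rolls of a fair six-sided die, true variance 35/12.
true_variance = 35 / 12

n = 5           # size of each sample (arbitrary choice)
trials = 20000  # number of samples drawn (arbitrary choice)

total_divide_by_n = 0.0
total_divide_by_n_minus_1 = 0.0
for _ in range(trials):
    sample = [random.randint(1, 6) for _ in range(n)]
    ss = sum_of_squared_differences(sample)
    total_divide_by_n += ss / n
    total_divide_by_n_minus_1 += ss / (n - 1)

average_divide_by_n = total_divide_by_n / trials
average_divide_by_n_minus_1 = total_divide_by_n_minus_1 / trials

print(true_variance)
print(average_divide_by_n)          # tends to fall below the true variance
print(average_divide_by_n_minus_1)  # tends to land close to the true variance
```

Averaged over many samples, dividing by the sample size gives a value noticeably below the true variance, while dividing by one fewer than the sample size lands close to it.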

Hey wait, by squaring the differences, doesn’t that mean that the units are squared? What if I want to describe the spread in the regular units? Should I just take the square root of the variance?

Yes. The square root of the variance is called the **standard deviation**, and it is measured in the same units as the data. The Greek letter lowercase sigma, \begin{align*}\sigma\end{align*}, is used for the standard deviation of a population and \begin{align*}\sigma ^2\end{align*} is the symbol for the variance of a population. The letters \begin{align*}s\end{align*} and \begin{align*}s^2\end{align*} are used for sample standard deviation and sample variance. The Greek letter mu, \begin{align*}\mu\end{align*}, is the symbol used for the mean of a population, while \begin{align*}\overline{x}\end{align*} is the symbol used for the mean of a sample.

#### Mean and variance for the population: \begin{align*}x_1, x_2, x_3, \ldots, x_n\end{align*}

\begin{align*}\mu&=\frac{1}{n} \cdot \sum\limits_{i=1}^n x_i\\ \sigma^2&=\frac{1}{n} \cdot \sum\limits_{i=1}^n (\mu - x_i)^2 \end{align*}

#### Mean and variance for a sample from a population: \begin{align*}x_1, x_2, x_3, \ldots , x_m\end{align*}

\begin{align*}\overline{x}&=\frac{1}{m} \cdot \sum\limits_{i=1}^m x_i\\ s^2&=\frac{1}{m-1} \cdot \sum\limits_{i=1}^m (\overline{x} - x_i)^2 \end{align*}

Remember that variance is a measure of the spread of data. The bigger the variance, the more spread out the data points.
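The two formulas translate almost directly into code. The following Python sketch is one possible rendering; the function names and the small data list are made up for illustration and do not come from the lesson.

```python
def population_variance(data):
    """Divide the sum of squared differences from the mean by n."""
    n = len(data)
    mu = sum(data) / n
    return sum((mu - x) ** 2 for x in data) / n


def sample_variance(data):
    """Divide the sum of squared differences from the mean by m - 1."""
    m = len(data)
    if m < 2:
        raise ValueError("sample variance needs at least two data points")
    x_bar = sum(data) / m
    return sum((x_bar - x) ** 2 for x in data) / (m - 1)


# Made-up data, used only to exercise the two functions.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(population_variance(data))  # 4.0
print(sample_variance(data))      # slightly larger, since it divides by 7 instead of 8
```

The only difference between the two functions is the divisor, so the sample variance of a data set is always a little larger than its population variance.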

Take a six-sided die. Since the population for a six-sided die is entirely known, you would use the population formulas to calculate the mean and variance. You would get:

\begin{align*}\mu=\frac{1}{6} (1+2+3+4+5+6)=\frac{1}{6}\cdot 21=3.5\end{align*}

\begin{align*}\sigma^2 &=\frac{1}{6}\left[(3.5-1)^2+(3.5-2)^2+(3.5-3)^2+(3.5-4)^2+(3.5-5)^2+(3.5-6)^2 \right] \\ &=\frac{1}{6}[6.25+2.25+0.25+0.25+2.25+6.25] \\ & \approx 2.9167\end{align*}
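The die calculation can be double-checked with Python's standard `statistics` module. This is a sketch that keeps the six faces as exact fractions so that no rounding enters the result; the variable names are illustrative.

```python
from fractions import Fraction
from statistics import mean, pvariance

# The six faces of the die, kept as exact fractions to avoid rounding.
faces = [Fraction(k) for k in range(1, 7)]

mu = mean(faces)                  # population mean
sigma_squared = pvariance(faces)  # population variance (divides by n)

print(mu)                    # 7/2
print(sigma_squared)         # 35/12
print(float(sigma_squared))
```

The exact value 35/12 is the fraction behind the rounded decimal 2.9167.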

### Examples

#### Example 1

Earlier, you were asked to find the mean and variance for the following sample test scores taken from a larger student population:

75, 73, 78, 90, 60, 51, 87, 79, 80, 77

The mean of the test scores is 75. To calculate the variance, first find the difference between each number and the mean, square each difference, and sum the squares.

\begin{align*}0^2+2^2+3^2+15^2+15^2+24^2+12^2+4^2+5^2+2^2=1228\end{align*}

Since this data is a sample, you divide the sum by one fewer than the number of terms.

\begin{align*}\frac{1228}{10-1}\approx 136.4444\end{align*}

If you knew the variances for two samples, each from a different class, you could quickly determine which class had test scores that were more spread out.
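The steps of this example can be retraced in a few lines of Python. This is a sketch with illustrative variable names; the last line cross-checks the result against `statistics.variance`, which uses the sample formula.

```python
from statistics import variance

scores = [75, 73, 78, 90, 60, 51, 87, 79, 80, 77]

mean = sum(scores) / len(scores)

# Square each difference from the mean, then add the squares.
squared_differences = [(x - mean) ** 2 for x in scores]
sum_of_squares = sum(squared_differences)

# The data is a sample, so divide by one fewer than the number of terms.
sample_variance = sum_of_squares / (len(scores) - 1)

print(sum_of_squares)              # 1228.0
print(round(sample_variance, 4))   # 136.4444
print(round(variance(scores), 4))  # 136.4444
```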

#### Example 2

Calculate the mean and variance of the following data sample of lap times.

59.8, 57.1, 58.2, 58.6, 57.8, 57.9, 58.0, 57.3

\begin{align*}\overline{x}=\frac{1}{8}(59.8+57.1+58.2+ 58.6+57.8+57.9+58.0+57.3)=58.0875\end{align*}

This is a sample, so you should use the sample variance formula.

\begin{align*}s^2 &=\frac{1}{8-1} \cdot \left[(\overline{x}-59.8)^2+(\overline{x}-57.1)^2+(\overline{x}-58.2)^2+(\overline{x}-58.6)^2+(\overline{x}-57.8)^2 \right .\\ & \quad \left . +(\overline{x}-57.9)^2+(\overline{x}-58.0)^2+(\overline{x}-57.3)^2 \right] \\ &=\frac{1}{7} \left[(-1.7125)^2+0.9875^2+(-0.1125)^2+(-0.5125)^2+0.2875^2+0.1875^2+0.0875^2+0.7875^2\right] \\ & \approx \frac{1}{7}[2.9327+0.9751+0.0126+0.2626+0.0826+0.0351+0.0076+0.6201] \\ & \approx \frac{1}{7}[4.9288]\\ & \approx 0.7041 \end{align*}

#### Example 3

Use a calculator to calculate the variance from Example 2.

To calculate variance on your calculator, enter the data in a list, choose 1-Var Stats, and run it on the list where you entered the data.

The two outputs that are important for you to interpret are:

\begin{align*}Sx = 0.839110924\end{align*}

\begin{align*}\sigma x =0.7849163968\end{align*}

Since the calculator does not know whether the data is a population or a sample, it produces both. Since this problem is about a sample, the number of interest is \begin{align*}Sx\end{align*}. This number does not match the variance from Example 2 because it is the sample standard deviation, which is the square root of the sample variance. The calculator produces standard deviations, so you need to square that number to get the variance.

\begin{align*}0.8391^2\approx 0.7041\end{align*}
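The same two outputs are available in Python's standard `statistics` module, which can stand in for the calculator. In this sketch, `stdev` plays the role of the sample output and `pstdev` the role of the population output; the variable names are illustrative.

```python
from statistics import pstdev, stdev

lap_times = [59.8, 57.1, 58.2, 58.6, 57.8, 57.9, 58.0, 57.3]

s_x = stdev(lap_times)       # sample standard deviation (divides by n - 1)
sigma_x = pstdev(lap_times)  # population standard deviation (divides by n)

print(round(s_x, 4))       # 0.8391
print(round(s_x ** 2, 4))  # 0.7041
print(sigma_x)             # smaller than s_x, since it divides by n
```

Squaring the sample standard deviation recovers the sample variance found by hand.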

#### Example 4

Calculate the standard deviation for the following 6 numbers by hand. Assume the numbers are a population.

2, 4, 6, 8, 12, 19

\begin{align*}\mu &=\frac{1}{6}(2+4+6+8+12+19)=8.5 \\ \sigma^2 &=\frac{1}{6}((8.5-2)^2+(8.5-4)^2+(8.5-6)^2+(8.5-8)^2+(8.5-12)^2+(8.5-19)^2) \\ & =\frac{1}{6}(6.5^2+4.5^2+2.5^2+0.5^2+(-3.5)^2+(-10.5)^2 )\\ & =\frac{1}{6}(42.25+20.25+6.25+0.25+12.25+110.25) \\ & =\frac{1}{6}(191.5) \\ & \approx 31.9167 \\ \sigma &\approx 5.6495 \end{align*}

#### Example 5

Use a spreadsheet to organize your calculations for computing the variance of the following numbers. Assume these numbers are a true population.

14, 15, 7, 15, 2, 0, 6, 5, 12, 3

After entering the data in a column, you can use the spreadsheet's built-in functions to fill a second column with the average.

- The average command is: “\begin{align*}=\text{average}\text{(A2:A11)}\end{align*}”

You can subtract one cell from another cell to find the difference. You can then square the difference to find the difference squared. You can then sum these values using the sum command.

- The sum command is: “\begin{align*}\text{= sum(D2:D11)}\end{align*}”

Finally, just divide the sum by the number of observations (which is 10) to get the variance.
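For readers without a spreadsheet at hand, the same layout can be mimicked in Python. This sketch assumes the columns are arranged as A for the data, B for the average, C for the difference, and D for the difference squared; only the A and D ranges are named in the commands above, so the rest of the layout is an assumption.

```python
values = [14, 15, 7, 15, 2, 0, 6, 5, 12, 3]  # column A: the data

average = sum(values) / len(values)          # column B: =average(A2:A11)
differences = [x - average for x in values]  # column C (assumed): value minus average
squares = [d ** 2 for d in differences]      # column D: difference squared

total = sum(squares)                         # =sum(D2:D11)
variance = total / len(values)               # population, so divide by 10

# Print the table one row at a time, rounded for readability.
for value, difference, square in zip(values, differences, squares):
    print(value, round(average, 2), round(difference, 2), round(square, 2))

print(round(variance, 2))  # 28.89
```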

### Review

1. What are the similarities and differences between standard deviation and variance?

2. Data Set A has a mean of 30 and a standard deviation of 10. Data Set B also has a mean of 30, but a standard deviation of 2. What does this mean about Data Set A compared to Data Set B?

Calculate the variance of each set of data by hand.

3. Sample: 1, 4, 7, 10, 3, 6, 12, 5, 8, 16, 21, 3, 1, 5

4. Population: 23, 27, 19, 24, 20, 22, 31, 30, 28

5. Sample: 64, 62, 60, 58, 54, 60, 61, 63, 47, 100, 29, 59

Calculate the variance of each set of data using your calculator. Compare your results with your answers to 3-5.

6. Sample: 1, 4, 7, 10, 3, 6, 12, 5, 8, 16, 21, 3, 1, 5

7. Population: 23, 27, 19, 24, 20, 22, 31, 30, 28

8. Sample: 64, 62, 60, 58, 54, 60, 61, 63, 47, 100, 29, 59

9. If \begin{align*}\sigma^2=16\end{align*}, what is the population standard deviation?

10. Which data set has the largest standard deviation?

- 10 10 10 10 10
- 0 0 10 10 10
- 0 9 10 11 20
- 20 20 20 20 20

11. What will a large variance look like on a histogram? What will a small variance look like on a histogram?

12. You find some data organized in a bar graph. Could you calculate the variance of this data? Explain.

13. A sample set of 20 exam scores is 67, 94, 88, 76, 85, 93, 55, 87, 80, 81, 80, 61, 90, 84, 75, 93, 75, 68, 100, 98. Calculate the mean, variance, and standard deviation for this data.

14. All of Mike’s bowling scores are: 1, 1, 2, 10, 12, 1, 9, 6, 7, 8, 4, 3, 4, 1, 4, 1, 6, 7, 11, 5. Calculate the mean, variance, and standard deviation for this data.

15. Why can’t you always calculate the population variance and standard deviation? Why do you sometimes have to calculate the sample variance and standard deviation?

### Review (Answers)

To see the Review answers, open this PDF file and look for section 15.5.