Variance of a Data Set

The mean of the squared deviations of the data values from their mean

Variance

Two groups of students that each have an average test score of 75 might have score distributions that look remarkably different. One class might be made up entirely of grades between 72 and 78, while the other class may have half the group around 50 and the other half near 100. Variance is a way of measuring the variation in a set of data, or how spread out the data is. What are the mean and variance for the following sample test scores taken from a larger student population?

75, 73, 78, 90, 60, 51, 87, 79, 80, 77

Finding Variance

The thought process of a person trying to describe the spread (variation) of some data for the first time must have been something like this.

Well, the average is 75. What if I try to just add up how different each number is from 75?

As the person calculates the numbers, they realize pretty quickly that this sum will be zero, essentially by definition. This is because the numbers that occur below 75 precisely cancel out with the numbers above 75.
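This cancellation can be checked with a short Python sketch using the ten test scores from the opening question. The variable names are illustrative only.

```python
scores = [75, 73, 78, 90, 60, 51, 87, 79, 80, 77]
mean = sum(scores) / len(scores)

# Signed difference of each score from the mean
deviations = [x - mean for x in scores]

print(mean)             # 75.0
print(sum(deviations))  # 0.0
```

The positive and negative differences cancel exactly, so the plain sum says nothing about spread.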

Since I cannot add the differences directly, why don’t I just sum the absolute value of the differences?

This is a legitimate method for describing the spread of data. It is called absolute deviation and is simply the sum of the absolute values of each of the differences.

If I take the average absolute difference, I will be able to judge on average how far away each data point is from the mean. A larger difference means more spread out.

If you take the average of the absolute deviation, you get the mean absolute deviation. The mean absolute deviation is a legitimate, but limited, way of describing the spread of data. Eventually, a person trying to describe the spread of data for the first time might consider a method called population variance.
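As a minimal sketch, here is how the absolute deviation and the mean absolute deviation could be computed in Python for the ten test scores from the opening question. The variable names are illustrative only.

```python
scores = [75, 73, 78, 90, 60, 51, 87, 79, 80, 77]
mean = sum(scores) / len(scores)

# Sum of the absolute values of the differences from the mean
absolute_deviation = sum(abs(x - mean) for x in scores)

# Average distance of a data point from the mean
mean_absolute_deviation = absolute_deviation / len(scores)

print(absolute_deviation)       # 82.0
print(mean_absolute_deviation)  # 8.2
```

On average, a score in this sample sits 8.2 points away from the mean of 75.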

What if, instead of using absolute value to solve the issue, I square each difference and then add them together? Of course, I’d have to divide by the number of data points to get the average difference squared.

This method turns out to be extraordinarily powerful in statistics. One downside is that most of the time you cannot get data from the entire population; you usually only get it from a sample. Over time, people realized that samples were typically less variable than their populations, so dividing by the number of data points consistently underestimated the true variance of the population. In other words, if \begin{align*}n\end{align*} is the size of the sample, then multiplying the sum of the squared differences by \begin{align*}\frac{1}{n}\end{align*} makes the variance too small. Research and theory progressed until it was realized that multiplying the sum of the squared differences by \begin{align*}\frac{1}{n-1}\end{align*} instead makes the result slightly larger and properly estimates the variance of the population. Thus, there are two ways to calculate variance, one for populations and one for samples.
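One way to see the underestimate concretely is the Python sketch below. It is an illustration under stated assumptions, not a proof. The population is a fair six-sided die, whose population variance is 35/12 ≈ 2.9167. The sample size of 3 and the helper name `sum_sq_dev` are choices made only for this example. The code lists every equally likely sample of three rolls and averages both versions of the calculation.

```python
from fractions import Fraction
from itertools import product


def sum_sq_dev(sample):
    """Sum of squared differences from the sample's own mean."""
    mean = Fraction(sum(sample), len(sample))
    return sum((mean - x) ** 2 for x in sample)


faces = range(1, 7)
n = 3
samples = list(product(faces, repeat=n))  # all 216 possible samples of 3 rolls

# Average result when dividing by n versus by n - 1
avg_div_n = sum(sum_sq_dev(s) / n for s in samples) / len(samples)
avg_div_n_minus_1 = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)

print(avg_div_n)          # 35/18
print(avg_div_n_minus_1)  # 35/12
```

Dividing by n averages out to 35/18 ≈ 1.9444, which is too small, while dividing by n − 1 averages out to exactly the population variance of 35/12.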

Hey wait, by squaring the differences, doesn’t that mean that the units are squared? What if I want to describe the spread in the regular units? Should I just take the square root of the variance?

Yes. The square root of the variance is called the standard deviation, and it is measured in the same units as the data. The Greek letter lowercase sigma, \begin{align*}\sigma\end{align*}, is used for the standard deviation of a population, and \begin{align*}\sigma ^2\end{align*} is the symbol for the variance of a population. The letters \begin{align*}s\end{align*} and \begin{align*}s^2\end{align*} are used for the sample standard deviation and sample variance. The Greek letter mu, \begin{align*}\mu\end{align*}, is the symbol used for the mean of a population, while \begin{align*}\overline{x}\end{align*} is the symbol used for the mean of a sample.

Mean and variance for the population: \begin{align*}x_1, x_2, x_3, \ldots, x_n\end{align*}

\begin{align*}\mu&=\frac{1}{n} \cdot \sum\limits_{i=1}^n x_i\\ \sigma^2&=\frac{1}{n} \cdot \sum\limits_{i=1}^n (\mu - x_i)^2 \end{align*}

Mean and variance for a sample from a population: \begin{align*}x_1, x_2, x_3, \ldots , x_m\end{align*}

\begin{align*}\overline{x}&=\frac{1}{m} \cdot \sum\limits_{i=1}^m x_i\\ s^2&=\frac{1}{m-1} \cdot \sum\limits_{i=1}^m (\overline{x} - x_i)^2 \end{align*}
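The two formulas can be written as a pair of short Python functions. This is a minimal sketch. The function names and the data list are made up for illustration and do not come from the lesson.

```python
def population_variance(data):
    """Divide the sum of squared differences by n."""
    n = len(data)
    mu = sum(data) / n
    return sum((mu - x) ** 2 for x in data) / n


def sample_variance(data):
    """Divide the sum of squared differences by m - 1."""
    m = len(data)
    x_bar = sum(data) / m
    return sum((x_bar - x) ** 2 for x in data) / (m - 1)


# Illustrative data only: mean 5, sum of squared differences 32
data = [2, 4, 4, 4, 5, 5, 7, 9]

print(population_variance(data))  # 4.0
print(sample_variance(data))      # 32/7, about 4.5714
```

The only difference between the two functions is the divisor, so the sample variance of a data set is always at least as large as its population variance.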

Remember that variance is a measure of the spread of data. The bigger the variance, the more spread out the data points.

Take a six-sided die. Since the population for a six-sided die is entirely known, you would use the population formulas to calculate the mean and variance. You would get:

 \begin{align*}\mu=\frac{1}{6} (1+2+3+4+5+6)=\frac{1}{6}\cdot 21=3.5\end{align*}

\begin{align*}\sigma^2 &=\frac{1}{6}\left[(3.5-1)^2+(3.5-2)^2+(3.5-3)^2+(3.5-4)^2+(3.5-5)^2+(3.5-6)^2 \right] \\ &=\frac{1}{6}[6.25+2.25+0.25+0.25+2.25+6.25] \\ & \approx 2.9167\end{align*}
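The die calculation can be confirmed with Python's standard `statistics` module, whose `pvariance` function computes the population variance. Using `Fraction` values here is an optional choice that keeps the result exact.

```python
from fractions import Fraction
import statistics

# The six equally likely faces form the whole population
faces = [Fraction(k) for k in range(1, 7)]

mu = statistics.mean(faces)
sigma_sq = statistics.pvariance(faces)

print(mu, sigma_sq)              # 7/2 35/12
print(round(float(sigma_sq), 4)) # 2.9167
```

The exact value 35/12 is the fraction behind the rounded 2.9167.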

Examples

Example 1

Earlier, you were asked to find the mean and variance for the following sample test scores taken from a larger student population:

75, 73, 78, 90, 60, 51, 87, 79, 80, 77

The mean of the test scores is 75. To calculate the variance, start by taking the difference of each number from the mean, squaring each difference, and summing the squares.

\begin{align*}0^2+2^2+3^2+15^2+15^2+24^2+12^2+4^2+5^2+2^2=1228\end{align*}

Since this data is a sample, you divide the sum by one fewer than the number of terms.

\begin{align*}\frac{1228}{10-1}\approx 136.4444\end{align*}
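The same steps can be traced in a short Python sketch. The variable names are illustrative. `statistics.variance` is the standard-library function for sample variance, which divides by one fewer than the number of values.

```python
import statistics

scores = [75, 73, 78, 90, 60, 51, 87, 79, 80, 77]

mean = statistics.mean(scores)
sum_sq = sum((x - mean) ** 2 for x in scores)  # sum of squared differences
s2 = sum_sq / (len(scores) - 1)                # divide by n - 1 for a sample

print(mean, sum_sq)  # 75 1228
print(round(s2, 4))  # 136.4444

# The library's sample variance agrees with the hand calculation
print(round(statistics.variance(scores), 4))  # 136.4444
```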

If you knew the variances for two samples, each from a different class, you could quickly determine which class had test scores that were more spread out.

Example 2

Calculate the mean and variance of the following data sample of lap times.

59.8, 57.1, 58.2, 58.6, 57.8, 57.9, 58.0, 57.3

 \begin{align*}\overline{x}=\frac{1}{8}(59.8+57.1+58.2+ 58.6+57.8+57.9+58.0+57.3)=58.0875\end{align*}

This is a sample, so you should use the sample variance formula.

\begin{align*}s^2 &=\frac{1}{8-1} \cdot \left[(\overline{x}-59.8)^2+(\overline{x}-57.1)^2+(\overline{x}-58.2)^2+(\overline{x}-58.6)^2+(\overline{x}-57.8)^2 \right .\\ & \quad \left . +(\overline{x}-57.9)^2+(\overline{x}-58.0)^2+(\overline{x}-57.3)^2 \right] \\ &=\frac{1}{7} \left[(-1.7125)^2+0.9875^2+(-0.1125)^2+(-0.5125)^2+0.2875^2+0.1875^2+0.0875^2+0.7875^2\right] \\ & \approx \frac{1}{7}[2.9327+0.9751+0.0126+0.2626+0.0826+0.0351+0.0076+0.6201] \\ & \approx \frac{1}{7}[4.9288]\\ & \approx 0.7041 \end{align*}

Example 3

Use a calculator to calculate the variance from Example 2.

To calculate variance on your calculator, enter the data in a list, choose 1-Var Stats, and run 1-Var Stats on the list where you entered the data.


The two outputs that are important for you to interpret are:

\begin{align*}Sx = 0.839110924\end{align*}

\begin{align*}\sigma x =0.7849163968\end{align*}

Since the calculator does not know whether the data is a population or a sample, it produces both. Since this problem is about a sample, the number of interest is \begin{align*}Sx\end{align*}. This number does not match the variance from Example 2 because it is the sample standard deviation, which is the square root of the sample variance. The calculator produces standard deviations, so you need to square that number to get the variance.

\begin{align*}0.8391^2\approx 0.7041\end{align*}
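If you do not have a graphing calculator, Python's standard `statistics` module offers a rough equivalent. In this sketch, `stdev` plays the role of \begin{align*}Sx\end{align*} and `pstdev` plays the role of \begin{align*}\sigma x\end{align*}. The variable names are illustrative.

```python
import statistics

lap_times = [59.8, 57.1, 58.2, 58.6, 57.8, 57.9, 58.0, 57.3]

sx = statistics.stdev(lap_times)        # sample standard deviation
sigma_x = statistics.pstdev(lap_times)  # population standard deviation

print(round(sx, 4))       # 0.8391
print(round(sx ** 2, 4))  # 0.7041

# Squaring the sample standard deviation gives the sample variance
print(round(statistics.variance(lap_times), 4))  # 0.7041
```

Because the population formula divides by 8 instead of 7, `sigma_x` is always the smaller of the two standard deviations.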

Example 4

Calculate the standard deviation for the following 6 numbers by hand. Assume the numbers are a population.

2, 4, 6, 8, 12, 19

\begin{align*}\mu &=\frac{1}{6}(2+4+6+8+12+19)=8.5 \\ \sigma^2 &=\frac{1}{6}((8.5-2)^2+(8.5-4)^2+(8.5-6)^2+(8.5-8)^2+(8.5-12)^2+(8.5-19)^2) \\ & =\frac{1}{6}(6.5^2+4.5^2+2.5^2+0.5^2+(-3.5)^2+(-10.5)^2 )\\ & =\frac{1}{6}(42.25+20.25+6.25+0.25+12.25+110.25) \\ & =\frac{1}{6}(191.75) \\ & \approx 31.9583 \\ \sigma &\approx 5.6532 \end{align*}

Example 5

Use a spreadsheet to organize your calculations for computing the variance of the following numbers. Assume these numbers are a true population.

14, 15, 7, 15, 2, 0, 6, 5, 12, 3

After entering the data in a column, you can use the power of the embedded programming of the spreadsheet to make a second column of just the average.

  • The average command is: \begin{align*}\text{=average(A2:A11)}\end{align*}

You can subtract one cell from another cell to find the difference. You can then square the difference to find the difference squared. You can then sum these values using the sum command.

  • The sum command is: \begin{align*}\text{=sum(D2:D11)}\end{align*}

Finally, just divide the sum by the number of observations (which is 10) to get the variance.
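The spreadsheet layout can be mirrored in a short Python sketch, with one list per column. The use of columns B and C for the average and the difference is an assumption, since the lesson names only columns A and D. The variable names are illustrative.

```python
values = [14, 15, 7, 15, 2, 0, 6, 5, 12, 3]     # column A: the data

average = sum(values) / len(values)             # like =average(A2:A11)
averages = [average] * len(values)              # column B (assumed): the average

# Column C (assumed): difference between each value and the average
differences = [a - b for a, b in zip(values, averages)]

squared = [d ** 2 for d in differences]         # column D: difference squared

total = sum(squared)                            # like =sum(D2:D11)
variance = total / len(values)                  # population: divide by n = 10

print(average)             # 7.9
print(round(total, 2))     # 288.9
print(round(variance, 2))  # 28.89
```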

Review

1. What are the similarities and differences between standard deviation and variance?

2. Data Set A has a mean of 30 and a standard deviation of 10. Data Set B also has a mean of 30, but a standard deviation of 2. What does this mean about Data Set A compared to Data Set B?

Calculate the variance of each set of data by hand.

3. Sample: 1, 4, 7, 10, 3, 6, 12, 5, 8, 16, 21, 3, 1, 5

4. Population: 23, 27, 19, 24, 20, 22, 31, 30, 28

5. Sample: 64, 62, 60, 58, 54, 60, 61, 63, 47, 100, 29, 59

Calculate the variance of each set of data using your calculator. Compare the results with your answers to 3-5.

6. Sample: 1, 4, 7, 10, 3, 6, 12, 5, 8, 16, 21, 3, 1, 5

7. Population: 23, 27, 19, 24, 20, 22, 31, 30, 28

8. Sample: 64, 62, 60, 58, 54, 60, 61, 63, 47, 100, 29, 59

9. If \begin{align*}\sigma^2=16\end{align*}, what is the population standard deviation?

10. Which data set has the largest standard deviation?

  1. 10 10 10 10 10
  2. 0 0 10 10 10
  3. 0 9 10 11 20
  4. 20 20 20 20 20

11. What will a large variance look like on a histogram? What will a small variance look like on a histogram?

12. You find some data organized in a bar graph. Could you calculate the variance of this data? Explain.

13. A sample set of 20 exam scores is 67, 94, 88, 76, 85, 93, 55, 87, 80, 81, 80, 61, 90, 84, 75, 93, 75, 68, 100, 98. Calculate the mean, variance, and standard deviation for this data.

14. All of Mike’s bowling scores are: 1, 1, 2, 10, 12, 1, 9, 6, 7, 8, 4, 3, 4, 1, 4, 1, 6, 7, 11, 5. Calculate the mean, variance, and standard deviation for this data.

15. Why can’t you always calculate the population variance and standard deviation? Why do you sometimes have to calculate the sample variance and standard deviation?

Review (Answers)

To see the Review answers, open this PDF file and look for section 15.5. 


Vocabulary

absolute deviation

The absolute deviation is the sum of the absolute values of the differences between each number and the mean.

deviation

Deviation is a measure of the difference between a given value and the mean.

Mean

The mean of a data set is the average of the data set. The mean is found by calculating the sum of the values in the data set and then dividing by the number of values in the data set.

mean absolute deviation

The mean absolute deviation is an alternate measure of how spread out the data is. It involves finding the mean of the distance between each data value and the mean. While this method might seem more intuitive, in statistics it has been found to be too limited and is not commonly used.

Population

In statistics, the population is the entire group of interest from which the sample is drawn.

Sample

A sample is a specified part of a population, intended to represent the population as a whole.

Skew

To skew a given set means to cause the trend of data to favor one end or the other.

standard deviation

The square root of the variance is the standard deviation. Standard deviation is one way to measure the spread of a set of data.

variance

A measure of the spread of the data set equal to the mean of the squared deviations of each data value from the mean of the data set.
