2.3: Box-and-Whisker Plots
Learning Objectives
- Calculate the values of the five-number summary.
- Draw and translate data sets to and from a box-and-whisker plot.
- Interpret the shape of a box-and-whisker plot.
- Compare distributions of univariate data (shape, center, spread, and outliers).
- Describe the effects of changing units on summary measures.
Introduction
In this section, the box-and-whisker plot will be introduced, and the basic ideas of shape, center, spread, and outliers will be studied in this context.
The Five-Number Summary
The five-number summary is a numerical description of a data set comprised of the following measures (in order): minimum value, lower quartile, median, upper quartile, maximum value.
Example: The huge population growth in the western United States in recent years, along with a trend toward less annual rainfall in many areas and even drought conditions in others, has put tremendous strain on the available water resources and heightened the need to protect them in the years to come. Here is a listing of the reservoir capacities of the major water sources for Arizona:
Lake/Reservoir | % of Capacity |
---|---|
Salt River System | 59 |
Lake Pleasant | 49 |
Verde River System | 33 |
San Carlos | 9 |
Lyman Reservoir | 3 |
Show Low Lake | 51 |
Lake Havasu | 98 |
Lake Mohave | 85 |
Lake Mead | 95 |
Lake Powell | 89 |
Figure: Arizona Reservoir Capacity, 12/31/98. Source: http://www.seattlecentral.edu/qelp/sets/008/008.html
This data set was collected in 1998, and the water levels in many states have since taken a dramatic turn for the worse. For example, Lake Powell is currently at less than 50% of capacity.
Placing the data in order from smallest to largest gives the following:
3, 9, 33, 49, 51, 59, 85, 89, 95, 98
Since there are 10 numbers, the median is the average of 51 and 59, which is 55. Recall that the lower quartile is the 25th percentile, or the value below which 25% of the data lie. In this data set, that number is 33. Similarly, the upper quartile is the 75th percentile, which here is 89. Therefore, the five-number summary is as shown:
$\{3, 33, 55, 89, 98\}$
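The calculation above can be sketched in a few lines of Python. This is only an illustration: the function name `five_number_summary` is hypothetical, and the quartile rule it uses (the median of the lower half and the median of the upper half, leaving out the middle value when the number of data points is odd) is an assumption chosen because it reproduces the quartiles quoted in this section. Other software may use a different quartile convention and give slightly different values.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    data = sorted(values)
    n = len(data)
    lower_half = data[: n // 2]           # values below the median position
    upper_half = data[n // 2 + n % 2 :]   # values above it (middle value skipped if n is odd)
    return (data[0], median(lower_half), median(data), median(upper_half), data[-1])

# Arizona reservoir data (% of capacity) from the table above.
arizona = [59, 49, 33, 9, 3, 51, 98, 85, 95, 89]
print(five_number_summary(arizona))  # prints (3, 33, 55.0, 89, 98)
```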
Box-and-Whisker Plots
A box-and-whisker plot is a very convenient and informative way to represent single-variable data. To create the 'box' part of the plot, draw a rectangle that extends from the lower quartile to the upper quartile. Draw a line through the interior of the rectangle at the median. Then connect the ends of the box to the minimum and maximum values using line segments to form the 'whiskers'. Here is the box plot for this data:
The plot divides the data into quarters. If the number of data points is divisible by 4, then there will be exactly the same number of values in each of the two whiskers, as well as the two sections in the box. In this example, because there are 10 data points, the number of values in each section will only be approximately the same, but about 25% of the data appears in each section. You can also usually learn something about the shape of the distribution from the sections of the plot. If each of the four sections of the plot is about the same length, then the data will be symmetric. In this example, the different sections are not exactly the same length. The left whisker is longer than the right (30 versus 9 percentage points), and the right half of the box is longer than the left (34 versus 22). We would most likely say that this distribution is moderately symmetric. In other words, there is roughly the same amount of data in each section. The different lengths of the sections tell us how the data are spread in each section. The numbers in the left whisker (lowest 25% of the data) are spread more widely than those in the right whisker.
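The lengths of the four sections can be read directly from the five-number summary. The short sketch below does this for the Arizona data; the variable names are illustrative only, and the summary values are the ones computed earlier in this section.

```python
# Five-number summary for the Arizona data, as found earlier in this section.
minimum, q1, med, q3, maximum = 3, 33, 55, 89, 98

# Length of each quarter of the plot, in percentage points of capacity.
section_lengths = {
    "left whisker": q1 - minimum,
    "left half of box": med - q1,
    "right half of box": q3 - med,
    "right whisker": maximum - q3,
}

for name, length in section_lengths.items():
    print(f"{name}: {length}")
```

A longer section means that quarter of the data is spread over a wider interval, not that it contains more values.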
Here is the box plot (as the name is sometimes shortened) for reservoirs and lakes in Colorado:
In this case, the third quarter of data (between the median and upper quartile) appears to be a bit more densely concentrated in a smaller area. The data values in the lower whisker also appear to be much more widely spread than in the other sections. Looking at the dot plot for the same data shows that this spread in the lower whisker gives the data a slightly skewed-left appearance (though it is still roughly symmetric).
Comparing Multiple Box Plots
Box-and-whisker plots are often used to get a quick and efficient comparison of the general features of multiple data sets. In the previous example, we looked at data for both Arizona and Colorado. How do their reservoir capacities compare? You will often see multiple box plots either stacked on top of each other, or drawn side-by-side for easy comparison. Here are the two box plots:
The plots seem to be spread the same if we just look at the range, but with the box plots, we have an additional indicator of spread if we examine the length of the box (or interquartile range). This tells us how the middle 50% of the data is spread, and Arizona's data values appear to have a wider spread. The center of the Colorado data (as evidenced by the location of the median) is higher, which would tend to indicate that, in general, Arizona's capacities are lower. Recall that the median is a resistant measure of center, because it is not affected by outliers. The mean is not resistant, because it will be pulled toward outlying points. When a data set is skewed strongly in a particular direction, the mean will be pulled in the direction of the skewing, but the median will not be affected. For this reason, the median is a more appropriate measure of center to use for strongly skewed data.
Even though we wouldn't characterize either of these data sets as strongly skewed, this effect is still visible. Here are both distributions with the means plotted for each.
Notice that the long left whisker in the Colorado data causes the mean to be pulled toward the left, making it lower than the median. In the Arizona plot, you can see that the mean is slightly higher than the median, due to the slightly elongated right side of the box. If these data sets were perfectly symmetric, the mean would be equal to the median in each case.
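The comparison of mean and median can be checked numerically for the Arizona data using Python's standard `statistics` module. This is a minimal sketch; the Colorado values are not listed in this section, so only Arizona is shown.

```python
from statistics import mean, median

# Arizona reservoir data (% of capacity) from the table earlier in this section.
arizona = [59, 49, 33, 9, 3, 51, 98, 85, 95, 89]

arizona_mean = mean(arizona)      # sum of the ten values divided by 10
arizona_median = median(arizona)  # average of the two middle values, 51 and 59

print(f"mean = {arizona_mean:.1f}, median = {arizona_median:.1f}")
```

The mean (57.1) comes out slightly above the median (55), in line with the description of the Arizona plot.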
Outliers in Box-and-Whisker Plots
Here are the reservoir data for California (the names of the lakes and reservoirs have been omitted):
80, 83, 77, 95, 85, 74, 34, 68, 90, 82, 75
At first glance, the 34 should stand out. It appears as if this point is significantly different from the rest of the data. Let's use a graphing calculator to investigate this plot. Enter your data into a list as we have done before, and then choose a plot. Under 'Type', you will notice what looks like two different box and whisker plots. For now choose the second one (even though it appears on the second line, you must press the right arrow to select these plots).
Setting a window is not as important for a box plot, so we will use the calculator's ability to automatically scale a window to our data by pressing [ZOOM] and selecting '9:Zoom Stat'.
While box plots give us a nice summary of the important features of a distribution, we lose the ability to identify individual points. The left whisker is elongated, but if we did not have the data, we would not know if all the points in that section of the data were spread out, or if it were just the result of the one outlier. It is more typical to use a modified box plot. This box plot will show an outlier as a single, disconnected point and will stop the whisker at the previous point. Go back and change your plot to the first box plot option, which is the modified box plot, and then graph it.
Notice that without the outlier, the distribution is really roughly symmetric.
This data set had one obvious outlier, but when is a point far enough away to be called an outlier? We need a standard accepted practice for defining an outlier in a box plot. This rather arbitrary definition is that any point that lies more than 1.5 times the interquartile range (IQR) below the lower quartile or above the upper quartile will be considered an outlier. Because the IQR is the same as the length of the box, any point that is more than one-and-a-half box lengths from either quartile is plotted as an outlier.
A common misconception of students is that you stop the whisker at this boundary line. In fact, the last point on the whisker that is not an outlier is where the whisker stops.
The calculations for determining the outlier in this case are as follows:
Lower Quartile: 74
Upper Quartile: 85
Interquartile range: $IQR = 85 - 74 = 11$
Cut-off for outliers in left whisker: $74 - (1.5 \times 11) = 57.5$. Thus, any value less than 57.5 is considered an outlier.
Notice that we did not even bother to test the calculation on the right whisker, because it should be obvious from a quick visual inspection that there are no points that are farther than even one box length away from the upper quartile.
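The outlier test can also be written out as a short Python sketch. The variable names are illustrative, and the quartiles are computed as the medians of the lower and upper halves of the sorted data (an assumption that matches the quartiles of 74 and 85 given above). For completeness, the sketch also computes the right-hand cut-off and the points where the whiskers of a modified box plot would stop.

```python
from statistics import median

# California reservoir data (% of capacity) from this section.
california = [80, 83, 77, 95, 85, 74, 34, 68, 90, 82, 75]
data = sorted(california)
n = len(data)

# Quartiles as medians of the two halves (middle value left out because n is odd).
q1 = median(data[: n // 2])
q3 = median(data[n // 2 + n % 2 :])
iqr = q3 - q1

# Cut-offs lie 1.5 box lengths beyond each quartile.
lower_cutoff = q1 - 1.5 * iqr
upper_cutoff = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_cutoff or x > upper_cutoff]

# Whiskers stop at the most extreme data values that are not outliers,
# not at the cut-off lines themselves.
not_outliers = [x for x in data if lower_cutoff <= x <= upper_cutoff]
whisker_low, whisker_high = not_outliers[0], not_outliers[-1]

print("IQR:", iqr)
print("cut-offs:", lower_cutoff, upper_cutoff)
print("outliers:", outliers)
print("whiskers end at:", whisker_low, whisker_high)
```

Here the only outlier is 34, and the left whisker stops at 68, the smallest value above the cut-off of 57.5.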
If you press [TRACE] and use the left or right arrows, the calculator will trace the values of the five-number summary, as well as the outlier.
The Effects of Changing Units on Shape, Center, and Spread
In the previous lesson, we looked at data for the materials in a typical desktop computer.
Material | Kilograms |
---|---|
Plastics | 6.21 |
Lead | 1.71 |
Aluminum | 3.83 |
Iron | 5.54 |
Copper | 2.12 |
Tin | 0.27 |
Zinc | 0.60 |
Nickel | 0.23 |
Barium | 0.05 |
Other elements and chemicals | 6.44 |
Here is the data set given in pounds. The weight of each in kilograms was multiplied by 2.2.
Material | Pounds |
---|---|
Plastics | 13.7 |
Lead | 3.8 |
Aluminum | 8.4 |
Iron | 12.2 |
Copper | 4.7 |
Tin | 0.6 |
Zinc | 1.3 |
Nickel | 0.5 |
Barium | 0.1 |
Other elements and chemicals | 14.2 |
When all values are multiplied by a factor of 2.2, the mean is also multiplied by 2.2, so the center of the distribution increases by the same factor. Similarly, the range, interquartile range, and standard deviation will also increase by the same factor. In other words, the center and the measures of spread will increase proportionally.
Example: This is easier to think of with numbers. Suppose that your mean is 20, and that two of the data values in your distribution are 21 and 23. If you multiply 21 and 23 by 2, you get 42 and 46, and your mean also changes by a factor of 2 and is now 40. Before, your deviations were $21 - 20 = 1$ and $23 - 20 = 3$, but now, your deviations are $42 - 40 = 2$ and $46 - 40 = 6$, so your deviations are getting twice as big as well.
This should result in the graph maintaining the same shape, but being stretched out, or elongated. Here are the side-by-side box plots for both distributions showing the effects of changing units.
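The proportional scaling described in this section can be checked numerically. The sketch below is illustrative only: the helper name `center_and_spread` is hypothetical, quartiles are taken as medians of the lower and upper halves of the sorted data, the sample standard deviation is used (the population version scales the same way), and the pounds are left unrounded so that the ratios are not distorted by rounding to one decimal place.

```python
from statistics import mean, median, stdev

# Kilograms of each material in a typical desktop computer (table above).
kilograms = [6.21, 1.71, 3.83, 5.54, 2.12, 0.27, 0.60, 0.23, 0.05, 6.44]
pounds = [2.2 * kg for kg in kilograms]  # unrounded conversion

def center_and_spread(values):
    """Return the measures of center and spread discussed in this section."""
    data = sorted(values)
    n = len(data)
    q1 = median(data[: n // 2])
    q3 = median(data[n // 2 + n % 2 :])
    return {
        "mean": mean(data),
        "median": median(data),
        "range": data[-1] - data[0],
        "IQR": q3 - q1,
        "standard deviation": stdev(data),
    }

kg_stats = center_and_spread(kilograms)
lb_stats = center_and_spread(pounds)

# Every ratio should equal the conversion factor, up to floating-point error.
ratios = {name: lb_stats[name] / kg_stats[name] for name in kg_stats}
for name, ratio in ratios.items():
    print(f"{name}: pounds / kilograms = {ratio:.3f}")
```

Each ratio is 2.2, so every measure of center and spread is stretched by the conversion factor while the relative positions of the data values, and therefore the shape, stay the same.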
On the Web
http://tinyurl.com/34s6sm Investigate the mean, median and box plots.
http://tinyurl.com/3ao9px More investigation of boxplots.
Lesson Summary
The five-number summary is a useful collection of statistical measures consisting of the following in ascending order: minimum, lower quartile, median, upper quartile, maximum. A box-and-whisker plot is a graphical representation of the five-number summary showing a box bounded by the lower and upper quartiles and the median as a line in the box. The whiskers are line segments extended from the quartiles to the minimum and maximum values. Each whisker and section of the box contains approximately 25% of the data. The width of the box is the interquartile range, or IQR, and shows the spread of the middle 50% of the data. Box-and-whisker plots are effective at giving an overall impression of the shape, center, and spread of a data set. While an outlier is simply a point that is not typical of the rest of the data, there is an accepted definition of an outlier in the context of a box-and-whisker plot. Any point that is more than 1.5 times the length of the box from either end of the box is considered to be an outlier. When changing the units of a distribution, the center and spread will be affected, but the shape will stay the same.
Points to Consider
- What characteristics of a data set make it easier or harder to represent it using dot plots, stem-and-leaf plots, histograms, and box-and-whisker plots?
- Which plots are most useful to interpret the ideas of shape, center, and spread?
- What effects do other transformations of the data have on the shape, center, and spread?
Multimedia Links
For a description of how to draw a box-and-whisker plot from given data, see patrickJMT, Box and Whisker Plot (5:53).
Review Questions
- Here are the 1998 data on the percentage of capacity of reservoirs in Idaho.
- Find the five-number summary for this data set.
- Show all work to determine if there are true outliers according to the 1.5 × IQR rule.
- Create a box-and-whisker plot showing any outliers.
- Describe the shape, center, and spread of the distribution of reservoir capacities in Idaho in 1998.
- Based on your answer in part (d), how would you expect the mean to compare to the median? Calculate the mean to verify your expectation.
- Here are the 1998 data on the percentage of capacity of reservoirs in Utah.
- Find the five-number summary for this data set.
- Show all work to determine if there are true outliers according to the 1.5 × IQR rule.
- Create a box-and-whisker plot showing any outliers.
- Describe the shape, center, and spread of the distribution of reservoir capacities in Utah in 1998.
- Based on your answer in part (d), how would you expect the mean to compare to the median? Calculate the mean to verify your expectation.
- Graph the box plots for Idaho and Utah on the same axes. Write a few statements comparing the water levels in Idaho and Utah by discussing the shape, center, and spread of the distributions.
- If the median of a distribution is less than the mean, which of the following statements is the most correct?
- The distribution is skewed left.
- The distribution is skewed right.
- There are outliers on the left side.
- There are outliers on the right side.
- (b) or (d) could be true.
- The following table contains recent data on the average price of a gallon of gasoline for states that share a border crossing into Canada.
- Find the five-number summary for this data.
- Show all work to test for outliers.
- Graph the box-and-whisker plot for this data.
- Canadian gasoline is sold in liters. Suppose a Canadian crossed the border into one of these states and wanted to compare the cost of gasoline. There are approximately 4 liters in a gallon. If we were to convert the distribution to liters, describe the resulting shape, center, and spread of the new distribution.
- Complete the following table. Convert to cost per liter by dividing by 3.7854, and then graph the resulting box plot.
As an interesting extension to this problem, you could look up the current data and compare that distribution with the data presented here. You could also find the exchange rate for Canadian dollars and convert the prices into the other currency.
State | Average Price of a Gallon of Gasoline (US$) | Average Price of a Liter of Gasoline (US$) |
---|---|---|
Alaska | 3.458 | |
Washington | 3.528 | |
Idaho | 3.26 | |
Montana | 3.22 | |
North Dakota | 3.282 | |
Minnesota | 3.12 | |
Michigan | 3.352 | |
New York | 3.393 | |
Vermont | 3.252 | |
New Hampshire | 3.152 | |
Maine | 3.309 | |
Figure: Average prices of a gallon of gasoline on March 16, 2008. Source: AAA, http://www.fuelgaugereport.com/sbsavg.asp
References
Kunzig, Robert. Drying of the West. National Geographic, February 2008, Vol. 213, No. 2, Page 94.