2.3: Box-and-Whisker Plots
Learning Objectives
- Calculate the values of the five-number summary.
- Draw and translate data sets to and from a box-and-whisker plot.
- Interpret the shape of a box-and-whisker plot.
- Compare distributions of univariate data (shape, center, spread, and outliers).
- Describe the effects of changing units on summary measures.
Introduction
In this section we will round out our investigation of different types of visual displays by introducing box-and-whisker plots. The basic ideas of shape, center, spread, and outliers will be investigated in this context, and students will be asked to become proficient in translating and interpreting graphs of univariate data of the various types.
The Five-Number Summary
The five-number summary is a numerical description of a data set comprised of the following measures (in order):
Minimum value, lower quartile, median, upper quartile, maximum value.
In order to review finding these values from Chapter One, let’s turn to another recycling/conservation related issue. The huge population growth in the western United States in recent years, along with a trend toward less annual rainfall in many areas and even drought conditions in others, has put tremendous strain on the water resources available now and the need to protect them in the years to come. Here is a listing of the reservoir capacities of the major water sources for Arizona:
Lake/Reservoir | % of Capacity |
---|---|
Salt River System | |
Lake Pleasant | |
Verde River System | |
San Carlos | |
Lyman Reservoir | |
Show Low Lake | |
Lake Havasu | |
Lake Mohave | |
Lake Mead | |
Lake Powell |
Figure: Arizona Reservoir Capacity, 12 / 31 / 98. Source: http://www.seattlecentral.edu/qelp/sets/008/008.html
This data was collected in 1998 and the water levels in many states have taken a dramatic turn for the worse. For example, Lake Powell is currently at less than of capacity.
Placing the data in order from smallest to largest gives:
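With 10 numbers, the median is the mean of the two middle values (the 5th and 6th in the ordered list). Recall that the lower quartile is the 25th percentile, the value below which 25% of the data falls. In this data set it is the median of the lower five values, which is the 3rd value in the ordered list. Similarly, the upper quartile is the median of the upper five values, which is the 8th value in the ordered list. Therefore the five-number summary is: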
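The procedure reviewed above can be sketched in a few lines of Python. This is only an illustration: the function name `five_number_summary` is made up for this example, the quartiles are computed as the medians of the lower and upper halves (other software may use a slightly different quartile rule), and the `sample` values are hypothetical, not the Arizona data.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves method."""
    data = sorted(values)
    n = len(data)
    if n < 2:
        raise ValueError("need at least two values")
    half = n // 2
    lower = data[:half]           # values below the median position
    upper = data[half + n % 2:]   # values above the median position
    return (data[0], median(lower), median(data), median(upper), data[-1])

# Hypothetical percentages of capacity (NOT the Arizona values).
sample = [98, 12, 60, 35, 83, 47, 66, 58, 90, 71]
print(five_number_summary(sample))
```

With an odd number of values, the middle value is left out of both halves before the quartiles are found.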
Box-and-Whisker Plots
A box-and-whisker plot is a very convenient and informative way to represent single-variable data. To create the “box” part of the plot, draw a rectangle that extends from the lower quartile to the upper quartile. Draw a line through the interior of the rectangle at the median. Then connect the ends of the box to the minimum and maximum values with line segments to form the “whiskers”. Here is the box plot for this data:
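As a rough illustration of the drawing steps just described, the sketch below renders a box plot as a single line of text. The function `ascii_box_plot` and the five-number summary passed to it are hypothetical and stand in for a real plotting tool. The `=` characters form the box, `#` marks the median, and `-` characters form the whiskers.

```python
def ascii_box_plot(summary, width=44):
    """Draw one line of text for a (min, Q1, median, Q3, max) tuple."""
    lo, q1, med, q3, hi = summary
    if hi == lo:
        raise ValueError("need a nonzero range to scale the plot")
    scale = (width - 1) / (hi - lo)

    def col(x):
        # Map a data value to a character column between 0 and width - 1.
        return round((x - lo) * scale)

    row = ["-"] * width                 # whiskers span the full range
    for i in range(col(q1), col(q3) + 1):
        row[i] = "="                    # box from Q1 to Q3
    row[0] = row[-1] = "|"              # minimum and maximum
    row[col(q1)] = "["
    row[col(q3)] = "]"
    row[col(med)] = "#"                 # median line inside the box
    return "".join(row)

# Hypothetical five-number summary, not taken from the reservoir data.
plot = ascii_box_plot((12, 47, 63, 83, 98))
print(plot)
```

The sketch assumes the quartiles do not coincide with the minimum or maximum; in that case the bracket would overwrite the end marker.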
The plot divides the data into quarters. If the number of data points is divisible by 4, then there will be exactly the same number of values in each of the two whiskers, as well as in each of the two sections of the box. In this example, because there are 10 data points, the counts will only be approximately equal, but approximately 25% of the data appears in each section. You can also usually learn something about the shape of the distribution from the sections of the plot. If each of the four sections of the plot is about the same length, then the data will be roughly symmetric. Of course, the distribution could also be uniform or bimodal, so you would have to look at a dot plot or histogram as well to get a more complete picture of the shape.
In this example, the different sections are not exactly the same length. The left whisker is slightly longer than the right, and the right half of the box is slightly longer than the left. We would most likely say that this distribution is moderately symmetric. Many students initially misinterpret the plot to mean that longer sections contain more data and shorter ones contain less. This is not true: roughly the same amount of data is in each section. What the lengths do tell us is how the data is spread within each section. The numbers in the left whisker (the lowest 25% of the data) are spread more widely than those in the right whisker.
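To make the point about section lengths concrete, here is a minimal sketch that measures the length of each of the four sections. The helper `section_lengths` and the data are hypothetical, and the quartiles are again taken as medians of the lower and upper halves. A longer section means the values in that quarter are more spread out, not that it holds more values.

```python
from statistics import median

def section_lengths(values):
    """Length of each quarter of a box plot (median-of-halves quartiles)."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    q1 = median(data[:half])
    q3 = median(data[half + n % 2:])
    med = median(data)
    return {
        "left whisker": q1 - data[0],
        "left box": med - q1,
        "right box": q3 - med,
        "right whisker": data[-1] - q3,
    }

# Hypothetical data set with a long left whisker.
sample = [12, 35, 47, 58, 60, 66, 71, 83, 90, 98]
for name, length in section_lengths(sample).items():
    print(f"{name}: {length}")
```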
Here is the box plot (as the name is sometimes shortened) for reservoirs and lakes in Colorado:
In this case, the third quarter of the data (between the median and upper quartile) appears to be a bit more densely concentrated in a smaller area. The data in the lower whisker also appears to be much more widely spread than it is in the other sections. Looking at the dot plot for the same data shows that this spread in the lower whisker gives the data a slightly skewed-left appearance (though it is still roughly symmetric).
Comparing Multiple Box Plots: Resistance Revisited
Box-and-whisker plots are often used to get a quick and efficient comparison of the general features of multiple data sets. In the previous example, we looked at data for both Arizona and Colorado. How do their reservoir capacities compare? You will often see multiple box plots either stacked on top of each other or drawn side-by-side for easy comparison. Here are the two box plots:
The plots seem to be spread the same if we just look at the range, but with the box plots, we have an additional indicator of spread if we examine the length of the box (or Interquartile Range). This tells us how the middle 50% of the data is spread, and Arizona’s appears to have a wider spread. The center of the Colorado data (as evidenced by the location of the median) is higher, which would tend to indicate that, in general, Arizona’s capacities are lower. In the first chapter we talked about the concept of resistance. Recall that the median is a resistant measure of center because it is not affected by outliers, but the mean is not resistant because it will be pulled toward outlying points. This is also true of skewed data. When a data set is skewed strongly in a particular direction, the mean will be pulled in the direction of the skewing, but the median will not be affected. For this reason, the median is a more appropriate measure of center to use for strongly skewed data.
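The comparison of range and interquartile range made above can be checked numerically. The sketch below uses two hypothetical data sets (labelled `state_a` and `state_b`; they are not the Arizona or Colorado values) that share the same range but differ in IQR and median. The helper name `spread_and_center` is an assumption for this example.

```python
from statistics import median

def spread_and_center(values):
    """Return the range, IQR, and median (median-of-halves quartiles)."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    q1 = median(data[:half])
    q3 = median(data[half + n % 2:])
    return {"range": data[-1] - data[0], "iqr": q3 - q1, "median": median(data)}

# Hypothetical percentages of capacity for two states.
state_a = [20, 40, 50, 60, 70, 80, 90, 100]
state_b = [20, 60, 65, 70, 75, 80, 85, 100]

a = spread_and_center(state_a)
b = spread_and_center(state_b)
print("A:", a)
print("B:", b)
```

Both sets run from 20 to 100, yet the middle half of `state_a` is spread over twice the distance of the middle half of `state_b`, and `state_b` is centered higher.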
Even though we wouldn’t characterize either of these data sets as strongly skewed, this effect is still visible. Here are both distributions with the means plotted for each.
Notice that the long left whisker in the Colorado data causes the mean to be pulled toward the left, making it lower than the median. In the Arizona plot, you can see that the mean is slightly higher than the median due to the slightly elongated right side of the box. If these data sets were perfectly symmetric, the mean would be equal to the median in each case.
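A small numerical sketch shows the same resistance effect. The two lists are hypothetical: the second is identical to the first except that its lowest value has been replaced by a much lower one, which plays the role of a long left whisker.

```python
from statistics import mean, median

# Hypothetical data: symmetric, then with one value pulled far to the left.
symmetric = [60, 65, 70, 75, 80]
long_left_tail = [10, 65, 70, 75, 80]

for name, data in [("symmetric", symmetric), ("long left tail", long_left_tail)]:
    print(f"{name}: mean = {mean(data)}, median = {median(data)}")
```

The median is 70 in both cases, while the mean drops from 70 to 60 when the low value is introduced.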
Outliers in Box-and-Whisker Plots
Here is the reservoir data for California (the names of the lakes and reservoirs have been omitted):
At first glance, the 34 should stand out. It appears as if this point is significantly isolated from the rest of the data, which is the textbook definition of an outlier. Let’s use a graphing calculator to investigate this plot. Enter your data into a list as we have done before, and then choose a plot. Under Type, you will notice what looks like two different box and whisker plots. For now choose the second one (even though it appears on the second line, you must press the right arrow to select these plots).
Setting a window is not as important for a box plot, so we will use the calculator’s ability to automatically scale a window to our data by pressing [ZOOM] and selecting number 9 (ZoomStat).
While box plots give us a nice summary of the important features of a distribution, we lose the ability to identify individual points. The left whisker is elongated, but if we did not have the data, we would not know whether all the points in that section were spread out or whether the long whisker was just the result of the one outlier. It is more typical to use a modified box plot. This box plot shows an outlier as a single, disconnected point and stops the whisker at the previous point. Go back and change your plot to the first box plot option, which is the modified box plot, and then graph it.
Notice that without the outlier, the distribution is really roughly symmetric.
This data set had one obvious outlier, but when is a point far enough away to be called an outlier? We need a standard, accepted practice for defining an outlier in a box plot. The rather arbitrary definition is that any point more than 1.5 times the Interquartile Range (IQR) beyond either quartile is considered an outlier. Because the IQR is the same as the length of the box, any point that is more than one and a half box lengths from either quartile is plotted as an outlier.
A common misconception of students is that you stop the whisker at this boundary line. In fact, the last point on the whisker that is not an outlier is where the whisker stops.
The calculations for determining the outlier in this case are as follows:
Lower Quartile:
Upper Quartile:
Interquartile range (IQR): $Q_3 - Q_1$
Cut-off for outliers in left whisker: $Q_1 - 1.5 \times \text{IQR}$
Notice that we did not even bother to test the calculation on the right whisker because it should be obvious from a quick visual inspection that there are no points that are farther than even one box length away from the upper quartile.
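The outlier test and the whisker rule for a modified box plot can be sketched as follows. The function name `modified_box_plot_stats` and the data are hypothetical (the values are not the California data), and the quartiles are computed as medians of the lower and upper halves, which may differ slightly from a given calculator or software package.

```python
from statistics import median

def modified_box_plot_stats(values, k=1.5):
    """Fences, whisker ends, and outliers using the k * IQR rule."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    q1 = median(data[:half])
    q3 = median(data[half + n % 2:])
    iqr = q3 - q1
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    inside = [x for x in data if low_fence <= x <= high_fence]
    outliers = [x for x in data if x < low_fence or x > high_fence]
    return {
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        "fences": (low_fence, high_fence),
        # Whiskers stop at the last data points that are NOT outliers,
        # not at the fences themselves.
        "whiskers": (inside[0], inside[-1]),
        "outliers": outliers,
    }

# Hypothetical percentages of capacity with one isolated low value.
sample = [34, 71, 76, 78, 80, 82, 84, 86, 90, 95]
stats = modified_box_plot_stats(sample)
print(stats)
```

Here the lower fence is 61, so 34 is flagged as an outlier and the left whisker stops at 71, the smallest value that is not an outlier.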
If you press [TRACE] and use the left or right arrows, the calculator will trace the values of the five-number summary, as well as the last point on the left whisker.
The Effects of Changing Units on Shape, Center, and Spread
In the previous lesson, we looked at data for the materials in a typical desktop computer.
Material | Kilograms |
---|---|
Plastics | |
Lead | |
Aluminum | |
Iron | |
Copper | |
Tin | |
Zinc | |
Nickel | |
Barium | |
Other elements and chemicals |
Here is a similar set of data given in pounds.
Material | Pounds |
---|---|
Plastics | |
Lead | |
Aluminum | |
Iron | |
Copper | |
Tin | |
Zinc | |
Nickel | |
Barium | |
Other elements and chemicals |
The source of this data set was in India, so, as in much of the rest of the world, the data was given in metric units (kilograms). If we want to convert these weights to pounds, what would be different about this distribution? To convert from kilograms to pounds, we multiply the number of kilograms by approximately 2.2. Think about how, if at all, the shape, center, and spread would change. If you multiply all values by a factor of 2.2, then the mean is also multiplied by 2.2, so the center of the distribution increases by the same factor. Similarly, the range, interquartile range, and standard deviation also increase by the same factor. So the center and the measures of spread increase proportionally. This results in the graph maintaining the same shape, but being stretched out or elongated. Here are the side-by-side box plots for both distributions showing the effects of changing units.
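The effect of a unit change can be verified with a short sketch. The weights below are hypothetical (they are not the values from the computer-materials table), `KG_TO_LB` is the approximate conversion factor 2.2, and the helper `summary` is a name invented for this example. Each measure of center and spread in pounds should be 2.2 times its value in kilograms.

```python
from statistics import mean, median, stdev

KG_TO_LB = 2.2  # approximate pounds per kilogram

def summary(values):
    """Center and spread measures for a list of numbers."""
    data = sorted(values)
    return {
        "mean": mean(data),
        "median": median(data),
        "range": data[-1] - data[0],
        "stdev": stdev(data),
    }

# Hypothetical weights in kilograms.
kilograms = [0.5, 1.0, 2.0, 4.0, 6.0]
pounds = [KG_TO_LB * x for x in kilograms]

kg_summary = summary(kilograms)
lb_summary = summary(pounds)
for name in kg_summary:
    ratio = lb_summary[name] / kg_summary[name]
    print(f"{name}: pounds / kilograms = {ratio:.2f}")
```

Because every measure is scaled by the same factor, the relative positions of the points do not change, which is why the shape of the distribution is preserved.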
Lesson Summary
The five-number summary is a useful collection of statistical measures consisting of the following in ascending order:
Minimum, lower quartile, median, upper quartile, maximum
A box-and-whisker plot is a graphical representation of the five-number summary, showing a box bounded by the lower and upper quartiles and the median as a line in the box. The whiskers are line segments extended from the quartiles to the minimum and maximum values. Each whisker and each section of the box contains approximately 25% of the data. The width of the box is the interquartile range (IQR), and shows the spread of the middle 50% of the data. Box-and-whisker plots are effective at giving an overall impression of the shape, center, and spread. While an outlier is simply a point that is not typical of the rest of the data, there is an accepted definition of an outlier in the context of a box-and-whisker plot. Any point that is more than 1.5 times the length of the box (IQR) from either end of the box is considered to be an outlier. When changing the units of a distribution, the center and spread will be affected, but the shape will stay the same.
Points to Consider
- What characteristics of a data set make it easier or harder to represent it using dot plots, stem and leaf plots, histograms, and box and whisker plots?
- Which plots are most useful to interpret the ideas of shape, center, and spread?
- What effects do other transformations of the data have on the shape, center, and spread?
Review Questions
- Here is the 1998 data on the percentage of capacity of reservoirs in Idaho.
- Find the five-number summary for this data set.
- Show all work to determine if there are true outliers according to the 1.5 × IQR rule.
- Create a box-and-whisker plot showing any outliers.
- Describe the shape, center, and spread of the distribution of reservoir capacities in Idaho in 1998.
- Based on your answer in part d., how would you expect the mean to compare to the median? Calculate the mean to verify your expectation.
- Here is the 1998 data on the percentage of capacity of reservoirs in Utah.
- Find the five-number summary for this data set.
- Show all work to determine if there are true outliers according to the 1.5 × IQR rule.
- Create a box-and-whisker plot showing any outliers.
- Describe the shape, center, and spread of the distribution of reservoir capacities in Utah in 1998.
- Based on your answer in part d., how would you expect the mean to compare to the median? Calculate the mean to verify your expectation.
- Graph the box plots for Idaho and Utah on the same axes. Write a few statements comparing the water levels in Idaho and Utah by discussing the shape, center, and spread of the distributions.
- If the median of a distribution is less than the mean, which of the following statements is the most correct?
- The distribution is skewed left.
- The distribution is skewed right.
- There are outliers on the left side.
- There are outliers on the right side.
- b or d could be true.
- The following table contains recent data on the average price of a gallon of gasoline for states that share a border crossing into Canada.
- Find the five-number summary for this data.
- Show all work to test for outliers.
- Graph the box-and-whisker plot for this data.
- Canadian gasoline is sold in liters. Suppose a Canadian crossed the border into one of these states and wanted to compare the cost of gasoline. There are approximately 3.785 liters in a gallon. If we were to convert the distribution to liters, describe the resulting shape, center, and spread of the new distribution.
- Complete the following table. Convert to cost per liter by dividing by 3.785, and then graph the resulting box plot.
As an interesting extension to this problem, you could look up the current data and compare that distribution with the data presented here. You could also find the exchange rate for Canadian dollars and convert the prices into the other currency.
State | Average Price of a Gallon of Gasoline | Average Price of a Liter of Gasoline |
---|---|---|
Alaska | ||
Washington | ||
Idaho | ||
Montana | ||
North Dakota | ||
Minnesota | ||
Michigan | ||
New York | ||
Vermont | ||
New Hampshire | ||
Maine |
Figure: Average prices of a gallon of gasoline on March 16, 2008. Source: AAA, http://www.fuelgaugereport.com/sbsavg.asp
Review Answers
- (a) (b) Upper bound for outliers: $Q_3 + 1.5 \times \text{IQR}$. There is no data above this bound. Lower bound for outliers: $Q_1 - 1.5 \times \text{IQR}$. There is no data below this bound, so there are no outliers. (c) (d) The distribution of Idaho reservoir capacities is roughly symmetric and is centered somewhere in the mid-60 percent range. The capacities range from up to and the middle 50% of the data is between and . (e) The data between the median and the upper quartile is slightly more compressed, which causes the median to be slightly larger than the mean. The mean is approximately .
- (a) (b) Upper bound for outliers: $Q_3 + 1.5 \times \text{IQR}$. There is no data above this bound. Lower bound for outliers: $Q_1 - 1.5 \times \text{IQR}$. Three values, including 46, fall below this bound, so they are all outliers. (c) (d) The distribution of Utah reservoir capacities has three outliers. If those points were removed, the distribution would be roughly symmetric and centered somewhere in the low percents. The capacities range from up to including the outliers and the middle 50% of the data is between and . (e) There are three extreme low outliers. Because the mean is not resistant to the pull of outliers, it will be significantly lower than the median. The mean is approximately .
- If we disregard the three outliers, the distribution of water capacities in Utah is higher than that of Idaho at every point in the five-number summary. From this we might conclude that it is centered higher and that the reservoir system is overall at safer levels in Utah than in Idaho. Again eliminating the outliers, both distributions are roughly symmetric and their spreads are also fairly similar. The middle 50% of the data for Utah is just slightly more closely grouped.
- (e) If the mean is greater than the median, then it has been pulled to the right either by an outlier, or by a skewed right shape. The median will not be affected by either of those things.
- (a) (b) Upper bound for outliers: $Q_3 + 1.5 \times \text{IQR}$. There is no data above this bound. Lower bound for outliers: $Q_1 - 1.5 \times \text{IQR}$. There is no data below this bound, so there are no outliers. (c) (d) By dividing the data by 3.785, we will obtain the average cost per liter. The mean and median will decrease, each being divided by 3.785. The same is true for the measures of spread (range, IQR, and standard deviation), which will result in the data being compressed into a smaller area if we were to graph both distributions on the same scale. The shape of the distributions will remain the same. (e)
State | Average Price of a Gallon of Gasoline | Average Price of a Liter of Gasoline |
---|---|---|
Alaska | ||
Washington | ||
Idaho | ||
Montana | ||
North Dakota | ||
Minnesota | ||
Michigan | ||
New York | ||
Vermont | ||
New Hampshire | ||
Maine |