This Concept introduces the box-and-whisker plot, along with the basic ideas of shape, center, and spread. We will also compare several box-and-whisker plots with one another.
Watch This
For a description of how to draw a box-and-whisker plot from given data (14.0), see patrickJMT, Box and Whisker Plot (5:53).
Guidance
The huge population growth in the western United States in recent years, along with a trend toward less annual rainfall in many areas and even drought conditions in others, has put tremendous strain on the available water resources and heightened the need to protect them in the years to come. Here is a listing of the amount of water held by each major reservoir in Arizona, stated as a percentage of that reservoir's total capacity.
Lake/Reservoir | % of Capacity |
---|---|
Salt River System | 59 |
Lake Pleasant | 49 |
Verde River System | 33 |
San Carlos | 9 |
Lyman Reservoir | 3 |
Show Low Lake | 51 |
Lake Havasu | 98 |
Lake Mohave | 85 |
Lake Mead | 95 |
Lake Powell | 89 |
Figure: Arizona Reservoir Capacity, 12/31/98. Source: http://www.seattlecentral.edu/qelp/sets/008/008.html
This data set was collected in 1998, and the water levels in many states have since taken a dramatic turn for the worse. For example, Lake Powell is currently at less than 50% of capacity. What would be a good way to summarize this data and display it visually?
The Five-Number Summary
The five-number summary is a numerical description of a data set consisting of the following measures (in order): minimum value, lower quartile, median, upper quartile, maximum value.
Example A
Find the Five-Number Summary for the reservoir capacities of the major water sources for Arizona, as shown above.
Solution:
Placing the data in order from smallest to largest gives the following:
3, 9, 33, 49, 51, 59, 85, 89, 95, 98
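Since there are 10 numbers, the median is the average of the two middle values, 51 and 59, which is 55. Recall that the lower quartile is the 25th percentile, the value below which 25% of the data fall. Here it is the median of the lower half of the data (3, 9, 33, 49, 51), which is 33. Likewise, the upper quartile is the median of the upper half (59, 85, 89, 95, 98), which is 89. Therefore, the five-number summary is as shown:
{3, 33, 55, 89, 98}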
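The calculation in Example A can be sketched in Python. This is a minimal sketch that assumes quartiles are taken as the medians of the lower and upper halves of the sorted data, which matches the values found above; other software may use slightly different quartile conventions. The function name `five_number_summary` is only illustrative.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves method."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]           # values below the median position
    upper = data[half + n % 2:]   # values above it (the middle value is skipped when n is odd)
    return (data[0], median(lower), median(data), median(upper), data[-1])

# Arizona reservoir levels (% of capacity) from the table above
arizona = [59, 49, 33, 9, 3, 51, 98, 85, 95, 89]
print(five_number_summary(arizona))  # (3, 33, 55.0, 89, 98)
```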
Next we want to think about how we can display this information, to learn from it visually.
Box-and-Whisker Plots
A box-and-whisker plot is a very convenient and informative way to represent single-variable data. To create the 'box' part of the plot, draw a rectangle that extends from the lower quartile to the upper quartile. Draw a line through the interior of the rectangle at the median. Then connect the ends of the box to the minimum and maximum values using line segments to form the 'whiskers'.
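The construction steps just described can be sketched as a rough text drawing. This is only an illustration, not a replacement for a properly scaled graph: the function name `ascii_box_plot`, the 0 to 100 scale, and the characters used for the box and whiskers are all assumptions made for this sketch. It is shown with the Arizona five-number summary from Example A.

```python
def ascii_box_plot(minimum, q1, med, q3, maximum, lo=0, hi=100, width=51):
    """Render a one-line text box plot on a scale running from lo to hi."""
    def col(x):
        # Map a data value to a character column on the scale.
        return round((x - lo) / (hi - lo) * (width - 1))

    row = [" "] * width
    for c in range(col(minimum), col(maximum) + 1):
        row[c] = "-"              # whiskers run from the minimum to the maximum
    for c in range(col(q1), col(q3) + 1):
        row[c] = "="              # the box runs from Q1 to Q3
    row[col(minimum)] = "|"       # end of the left whisker
    row[col(maximum)] = "|"       # end of the right whisker
    row[col(q1)] = "["            # left edge of the box
    row[col(q3)] = "]"            # right edge of the box
    row[col(med)] = "M"           # median line inside the box
    return "".join(row)

# Arizona five-number summary: min 3, Q1 33, median 55, Q3 89, max 98
print(ascii_box_plot(3, 33, 55, 89, 98))
```

Passing a larger `width` gives a finer scale; with `width=101` on the default 0 to 100 scale, each character column corresponds to one percentage point.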
Example B
Create a box plot for the reservoir capacities of the major water sources for Arizona.
Solution:
Here is the box plot for this data:
The plot divides the data into quarters. If the number of data points is divisible by 4, then there will be exactly the same number of values in each of the two whiskers, as well as in each of the two sections of the box. In this example, because there are 10 data points, the number of values in each section will only be approximately the same, but about 25% of the data appears in each section.
You can also usually learn something about the shape of the distribution from the sections of the plot. If each of the four sections of the plot is about the same length, then the data are roughly symmetric. In this example, the sections are not the same length. The left whisker (from 3 to 33) is longer than the right whisker (from 89 to 98), and the right half of the box (from 55 to 89) is longer than the left half (from 33 to 55). We would most likely say that this distribution is moderately symmetric. Because each section holds roughly the same amount of data, the different lengths of the sections tell us how the data are spread within each section. The numbers in the left whisker (lowest 25% of the data) are spread more widely than those in the right whisker.
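The lengths of the four sections can be read off the five-number summary by subtracting neighboring values. Here is a small sketch using the Arizona summary from Example A; the variable names are only illustrative.

```python
# Arizona five-number summary: min, Q1, median, Q3, max
summary = (3, 33, 55, 89, 98)
labels = ["left whisker", "left half of box", "right half of box", "right whisker"]

# Each section's length is the gap between two neighboring summary values.
lengths = [b - a for a, b in zip(summary, summary[1:])]

for label, length in zip(labels, lengths):
    print(f"{label}: {length}")
# left whisker: 30
# left half of box: 22
# right half of box: 34
# right whisker: 9
```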
How does this box-and-whisker plot compare to other box-and-whisker plots? Let's look at another example.
Example C
Here is the box plot (as the name is sometimes shortened) for reservoirs and lakes in Colorado:
In this case, the third quarter of the data (between the median and upper quartile) appears to be a bit more densely concentrated in a smaller area. The data values in the lower whisker also appear to be much more widely spread than in the other sections. Looking at the dot plot for the same data shows that this spread in the lower whisker gives the data a slightly skewed-left appearance (though it is still roughly symmetric).
Comparing Multiple Box Plots
We have looked at box plots for reservoirs in Arizona and Colorado, individually. Box-and-whisker plots are often used to get a quick and efficient comparison of the general features of multiple data sets.
Example D
In the previous example, we looked at data for both Arizona and Colorado. How do their reservoir capacities compare? You will often see multiple box plots either stacked on top of each other, or drawn side-by-side for easy comparison. Here are the two box plots:
If we look only at the range, the two data sets appear to have the same spread, but box plots give us an additional indicator of spread: the length of the box (the interquartile range). This tells us how the middle 50% of the data is spread, and Arizona's data values appear to have a wider spread. The center of the Colorado data (as evidenced by the location of the median) is higher, which would tend to indicate that, in general, Arizona's reservoirs are less full, as a percentage of their individual capacities, than Colorado's.
Recall that the median is a resistant measure of center, because it is not affected by outliers. The mean is not resistant, because it is pulled toward outlying points. When a data set is skewed strongly in a particular direction, the mean is pulled in the direction of the skew, but the median is largely unaffected. For this reason, the median is a more appropriate measure of center to use for strongly skewed data.
Even though we wouldn't characterize either of these data sets as strongly skewed, this effect is still visible. Here are both distributions with the means plotted for each.
Notice that the long left whisker in the Colorado data causes the mean to be pulled toward the left, making it lower than the median. In the Arizona plot, you can see that the mean is slightly higher than the median, due to the slightly elongated right side of the box. If these data sets were perfectly symmetric, the mean would be equal to the median in each case.
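The Arizona comparison can be checked directly from the table of reservoir levels. The Colorado values are not listed in this Concept, so only Arizona is shown in this short sketch.

```python
from statistics import mean, median

# Arizona reservoir levels (% of capacity) from the table above
arizona = [59, 49, 33, 9, 3, 51, 98, 85, 95, 89]

arizona_mean = mean(arizona)      # sum is 571, so the mean is 57.1
arizona_median = median(arizona)  # average of 51 and 59, which is 55

print(arizona_mean, arizona_median)
print(arizona_mean > arizona_median)  # True: the mean sits slightly above the median
```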
Guided Practice
Given the following five-number summary:
Median: 176
Quartiles: 154 189
Extremes: 122 224
a. Find the value of the range for these data.
b. About what percent of the data is in the interval 154 to 189?
c. Draw a box-and-whisker plot for this data.
Solutions:
a. The range for these data is 224 – 122 = 102.
b. The interval 154 to 189 is the interval between the first and third quartiles. About 50% of the data always lies between these two quartiles.
c. Here is the box-and-whisker plot:
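As a quick check of parts a and b, the range and the length of the box (the interquartile range) can be computed from the given summary. The variable names in this sketch are only illustrative.

```python
# Five-number summary given in the Guided Practice
minimum, q1, med, q3, maximum = 122, 154, 176, 189, 224

data_range = maximum - minimum  # part a: distance between the extremes
iqr = q3 - q1                   # length of the box, which holds the middle 50% of the data

print(data_range)  # 102
print(iqr)         # 35
```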
Explore More
For 1-4, here are the 1998 data on the percentage of capacity of reservoirs in Idaho.
1. Find the five-number summary for this data set.
2. Show all work to determine if there are true outliers according to the 1.5 × IQR rule.
3. Describe the shape, center, and spread of the distribution of reservoir capacities in Idaho in 1998.
4. Based on your answer to question 3, how would you expect the mean to compare to the median? Calculate the mean to verify your expectation.
For 5-8, here are the 1998 data on the percentage of capacity of reservoirs in Utah.
5. Find the five-number summary for this data set.
6. Show all work to determine if there are true outliers according to the 1.5 × IQR rule.
7. Describe the shape, center, and spread of the distribution of reservoir capacities in Utah in 1998.
8. Based on your answer to question 7, how would you expect the mean to compare to the median? Calculate the mean to verify your expectation.
9. Graph the box plots for Idaho and Utah on the same axes. Write a few statements comparing the water levels in Idaho and Utah by discussing the shape, center, and spread of the distributions.
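For the outlier questions, here is a minimal sketch of one way to organize the work. It assumes the rule intended is the common 1.5 × IQR rule, under which a value is flagged if it lies more than 1.5 × IQR below the lower quartile or above the upper quartile. The function name `outlier_fences` is hypothetical, and the Arizona data from Example A are used for illustration.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return the (lower, upper) cutoffs under the k * IQR rule (k = 1.5 is assumed here)."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Arizona reservoir levels and quartiles from Example A
arizona = [3, 9, 33, 49, 51, 59, 85, 89, 95, 98]
low, high = outlier_fences(33, 89)  # IQR = 56, so the fences are 33 - 84 and 89 + 84

outliers = [x for x in arizona if x < low or x > high]
print(low, high)  # -51.0 173.0
print(outliers)   # []
```

Under this assumption, none of the Arizona values counts as an outlier, since every value lies between the two fences.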