# Box-and-Whisker Plots

In this Concept, we introduce the box-and-whisker plot and use it to examine the basic ideas of shape, center, and spread. We will also compare more than one box-and-whisker plot.

### Watch This

For a description of how to draw a box-and-whisker plot from given data (14.0), see patrickJMT, Box and Whisker Plot (5:53).

### Guidance

The huge population growth in the western United States in recent years, along with a trend toward less annual rainfall in many areas and even drought conditions in others, has put tremendous strain on the water resources available now and heightened the need to protect them in the years to come. Here is a listing of the amount of water held by each major reservoir in Arizona, stated as a percentage of that reservoir's total capacity.

| Lake/Reservoir | % of Capacity |
|---|---|
| Salt River System | 59 |
| Lake Pleasant | 49 |
| Verde River System | 33 |
| San Carlos | 9 |
| Lyman Reservoir | 3 |
| Show Low Lake | 51 |
| Lake Havasu | 98 |
| Lake Mohave | 85 |
| Lake Powell | 89 |

Figure: Arizona Reservoir Capacity, 12/31/98. Source: http://www.seattlecentral.edu/qelp/sets/008/008.html

This data set was collected in 1998, and since then the water levels in many states have taken a dramatic turn for the worse. For example, Lake Powell is currently at less than 50% of capacity. What would be a good way to summarize this data and display it visually?

The Five-Number Summary

The five-number summary is a numerical description of a data set consisting of the following measures (in order): minimum value, lower quartile, median, upper quartile, maximum value.

#### Example A

Find the five-number summary for the reservoir capacities of the major water sources for Arizona, as shown above.

Solution:

Placing the data in order from smallest to largest gives the following:

3, 9, 33, 49, 51, 59, 85, 89, 95, 98

Since there are 10 numbers, the median is the average of the two middle values, 51 and 59, which is 55. Recall that the lower quartile is the $25^{\text{th}}$ percentile, the value below which 25% of the data lies. Here it is the median of the lower half of the data (3, 9, 33, 49, 51), which is 33. Likewise, the upper quartile is the median of the upper half (59, 85, 89, 95, 98), which is 89. Therefore, the five-number summary is as shown:

$\left \{3, 33, 55, 89, 98 \right \}$
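The calculation in Example A can be sketched in a few lines of Python. This is only an illustration under one assumption: quartiles are taken as the medians of the lower and upper halves of the ordered data (leaving out the middle value when the count is odd), which matches the result above. Other software may use a different quartile convention. The function name `five_number_summary` is our own.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves method.

    Assumes at least two values.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]            # values before the median position
    upper = data[half + n % 2:]    # skips the middle value when n is odd
    return (data[0], median(lower), median(data), median(upper), data[-1])

# The ten ordered values from Example A.
arizona = [3, 9, 33, 49, 51, 59, 85, 89, 95, 98]
print(five_number_summary(arizona))  # (3, 33, 55.0, 89, 98)
```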

Next, we consider how to display this information so that we can learn from it visually.

Box-and-Whisker Plots

A box-and-whisker plot is a very convenient and informative way to represent single-variable data. To create the 'box' part of the plot, draw a rectangle that extends from the lower quartile to the upper quartile. Draw a line through the interior of the rectangle at the median. Then connect the ends of the box to the minimum and maximum values using line segments to form the 'whiskers'.
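The drawing steps just described can be imitated with a rough text rendering. The sketch below is not a substitute for a proper graph; it only shows where each of the five numbers lands on a horizontal axis. The axis limits (0 to 100), the width of 51 characters, and the symbols used for the box and whiskers are all assumptions made for this illustration.

```python
def ascii_boxplot(summary, lo=0, hi=100, width=51):
    """Render a five-number summary as a one-line text box plot.

    All five values are assumed to lie between lo and hi.
    """
    def col(x):
        # Map a data value to a character column with integer arithmetic.
        return int((x - lo) * (width - 1) // (hi - lo))

    c_min, c_q1, c_med, c_q3, c_max = (col(v) for v in summary)
    row = [" "] * width
    for i in range(c_min, c_max + 1):
        row[i] = "-"                 # whiskers span minimum to maximum
    for i in range(c_q1, c_q3 + 1):
        row[i] = "="                 # box spans lower to upper quartile
    row[c_min] = row[c_max] = "|"    # whisker ends
    row[c_q1], row[c_q3] = "[", "]"  # box edges
    row[c_med] = "|"                 # median line inside the box
    return "".join(row)

# Five-number summary of the Arizona reservoir data.
print(ascii_boxplot((3, 33, 55, 89, 98)))
```

Each character column stands for two percentage points here, so positions are approximate.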

#### Example B

Create a box plot for the reservoir capacities of the major water sources for Arizona.

Solution:

Here is the box plot for this data:

The plot divides the data into quarters. If the number of data points is divisible by 4, then there will be exactly the same number of values in each of the two whiskers, as well as in each of the two sections of the box. In this example, because there are 10 data points, the number of values in each section will only be approximately the same, but about 25% of the data appears in each section.

You can also usually learn something about the shape of the distribution from the sections of the plot. If each of the four sections of the plot is about the same length, then the data will be roughly symmetric. In this example, the sections are not exactly the same length. The left whisker is longer than the right, and the right half of the box is longer than the left. We would most likely say that this distribution is moderately symmetric.

Remember that there is roughly the same amount of data in each section, so the different lengths of the sections tell us how the data are spread in each section. The numbers in the left whisker (lowest 25% of the data) are spread more widely than those in the right whisker.
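To make the comparison of section lengths concrete, the lengths can be read directly off the five-number summary. The short sketch below does this for the Arizona summary; the function name `section_lengths` is illustrative only.

```python
def section_lengths(summary):
    """Return the lengths of the four sections of a box plot:
    (left whisker, left half of box, right half of box, right whisker)."""
    minimum, q1, med, q3, maximum = summary
    return (q1 - minimum, med - q1, q3 - med, maximum - q3)

arizona_summary = (3, 33, 55, 89, 98)
print(section_lengths(arizona_summary))  # (30, 22, 34, 9)
```

A longer section means that quarter of the data is spread over a wider interval, not that it contains more values.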

How does this box-and-whisker plot compare to other box-and-whisker plots? Let's look at another example.

#### Example C

Here is the box plot (as the name is sometimes shortened) for reservoirs and lakes in Colorado:

In this case, the third quarter of the data (between the median and upper quartile) appears to be a bit more densely concentrated in a smaller area. The data values in the lower whisker also appear to be much more widely spread than in the other sections. Looking at the dot plot for the same data shows that this spread in the lower whisker gives the data a slightly skewed-left appearance (though it is still roughly symmetric).

Comparing Multiple Box Plots

We have looked at box plots for reservoirs in Arizona and Colorado, individually. Box-and-whisker plots are often used to get a quick and efficient comparison of the general features of multiple data sets.

#### Example D

In the previous example, we looked at data for both Arizona and Colorado. How do their reservoir capacities compare? You will often see multiple box plots either stacked on top of each other, or drawn side-by-side for easy comparison. Here are the two box plots:

Judging by the range alone, the two data sets seem to have the same spread, but box plots give us an additional indicator of spread: the length of the box, or interquartile range. This tells us how the middle 50% of the data is spread, and Arizona's data values appear to have a wider spread. The center of the Colorado data (as evidenced by the location of the median) is higher, which would tend to indicate that, in general, Arizona's reservoirs are less full, as a percentage of their individual capacities, than Colorado's.

Recall that the median is a resistant measure of center, because it is not affected by outliers. The mean is not resistant, because it will be pulled toward outlying points. When a data set is skewed strongly in a particular direction, the mean will be pulled in the direction of the skewing, but the median will not be affected. For this reason, the median is a more appropriate measure of center to use for strongly skewed data.
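A small numerical experiment illustrates what "resistant" means. The values below are hypothetical, chosen only for this illustration and not taken from the reservoir data: replacing the largest value with an extreme one moves the mean a great deal while leaving the median unchanged.

```python
from statistics import mean, median

# Hypothetical values, chosen only to illustrate resistance.
symmetric = [50, 52, 54, 56, 58]
with_outlier = [50, 52, 54, 56, 158]   # largest value replaced by an outlier

print(mean(symmetric), median(symmetric))        # 54 54
print(mean(with_outlier), median(with_outlier))  # 74 54
```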

Even though we wouldn't characterize either of these data sets as strongly skewed, this effect is still visible. Here are both distributions with the means plotted for each.

Notice that the long left whisker in the Colorado data causes the mean to be pulled toward the left, making it lower than the median. In the Arizona plot, you can see that the mean is slightly higher than the median, due to the slightly elongated right side of the box. If these data sets were perfectly symmetric, the mean would be equal to the median in each case.
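The Arizona claim can be checked against the ten values listed in Example A. The sketch below uses Python's standard `statistics` module. The Colorado values are not listed in this Concept, so only Arizona is shown.

```python
from statistics import mean, median

# The ten ordered Arizona values from Example A.
arizona = [3, 9, 33, 49, 51, 59, 85, 89, 95, 98]

arizona_mean = mean(arizona)
arizona_median = median(arizona)
print(arizona_mean, arizona_median)  # 57.1 55.0
```

The mean (57.1) sits slightly above the median (55), in line with the description above.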

### Guided Practice

Given the following five-number summary:

Median: 176

Quartiles: 154 189

Extremes: 122 224

a. Find the value of the range for these data.

b. About what percent of the data is in the interval 154 to 189?

c. Draw a box-and-whisker plot for this data.

Solutions:

a. The range for these data is 224 – 122 = 102.

b. The interval 154 to 189 is the interval between the first and third quartiles. About 50% of the data always lies between these two quartiles.

c. Here is the box-and-whisker plot:

### Explore More

For 1-4, here are the 1998 data on the percentage of capacity of reservoirs in Idaho.

$70, 84, 62, 80, 75, 95, 69, 48, 76, 70, 45, 83, 58, 75, 85, 70,\\62, 64, 39, 68, 67, 35, 55, 93, 51, 67, 86, 58, 49, 47, 42, 75$

1. Find the five-number summary for this data set.
2. Show all work to determine if there are true outliers according to the $1.5 \times \text{IQR}$ rule.
3. Describe the shape, center, and spread of the distribution of reservoir capacities in Idaho in 1998.
4. Based on your answer in part (3), how would you expect the mean to compare to the median? Calculate the mean to verify your expectation.
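The 1.5 × IQR rule used in these exercises flags any value that lies more than 1.5 times the interquartile range below the lower quartile or above the upper quartile. A minimal sketch follows, applied to the Arizona values from Example A rather than to the exercise data. The function names and the `k` parameter are illustrative choices, not part of any standard library.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) cutoffs for the k * IQR outlier rule."""
    iqr = q3 - q1
    return (q1 - k * iqr, q3 + k * iqr)

def find_outliers(values, q1, q3, k=1.5):
    """List the values that fall outside the k * IQR fences."""
    low, high = iqr_fences(q1, q3, k)
    return [v for v in values if v < low or v > high]

# Arizona data with its quartiles from Example A (Q1 = 33, Q3 = 89).
arizona = [3, 9, 33, 49, 51, 59, 85, 89, 95, 98]
print(iqr_fences(33, 89))              # (-51.0, 173.0)
print(find_outliers(arizona, 33, 89))  # []
```

Because every Arizona value lies between the two fences, the rule flags no outliers for that data set.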

For 5-8, here are the 1998 data on the percentage of capacity of reservoirs in Utah.

$80, 46, 83, 75, 83, 90, 90, 72, 77, 4, 83, 105, 63,\\ 87, 73, 84, 0, 70, 65, 96, 89, 78, 99, 104, 83, 81$

5. Find the five-number summary for this data set.
6. Show all work to determine if there are true outliers according to the $1.5 \times \text{IQR}$ rule.
7. Describe the shape, center, and spread of the distribution of reservoir capacities in Utah in 1998.
8. Based on your answer in part (7), how would you expect the mean to compare to the median? Calculate the mean to verify your expectation.
9. Graph the box plots for Idaho and Utah on the same axes. Write a few statements comparing the water levels in Idaho and Utah by discussing the shape, center, and spread of the distributions.

### Vocabulary

arithmetic mean

The arithmetic mean is also called the average.

back-to-back stem plots

A back-to-back stem plot is a modified stem-and-leaf plot with the stem in the center and the leaves on the sides. It is used to compare two different related sets of data (bivariate data).

bell shaped

A bell-shaped histogram is a histogram with a prominent ‘mound’ in the center and similar tapering to the left and right.

bins

Bins are groups of data plotted on the x-axis.

bivariate data

Bivariate data consists of two paired sets of data.

box-and-whisker plot

A box-and-whisker plot is a graphic display of quantitative data that shows the five-number summary.

calculated data

Calculated data has values that are the result of computations performed on the input variable.

dependent variable

The dependent variable is the output variable in an equation or function, commonly represented by $y$ or $f(x)$.

explanatory variables

Explanatory variables are another name for independent variables.

extreme outliers

Extreme outliers are data points that lie more than 3 times the interquartile range (the spread of the middle half of the data) above the upper quartile or below the lower quartile.

Extremes

The extremes are the maximum and minimum values in a data set.

five-point summary

The numbers needed to construct a box-and-whisker plot are called the five-point summary, also known as the five-number summary. The five points are the minimum, the lower median (Q1), the median, the upper median (Q3), and the maximum.

independent variable

The independent variable is the input variable in an equation or function, commonly represented by $x$.

input variables

Input variables are another name for independent variables.

Interquartile range

The interquartile range is the difference between the third quartile and the first quartile (Q3 - Q1).

Leaf

The leaves of a stem-and-leaf plot are the rightmost digits of each of the original data values.

line of best fit

A line of best fit is a straight line drawn on a scatter plot such that the sums of the distances to the points on either side of the line are approximately equal and such that there are an equal number of points above and below the line.

line of fit

A line of fit is a straight or continuously curved line representing the trend of changes in the comparison of two data sets (or one set of bivariate data).

linear regression

In statistics, linear regression is a process that attempts to model the relationship between two variables by fitting a linear equation to the data.

lower median

The lower median is the first quartile (Q1) in the box-and-whisker plot.

Median

The median of a data set is the middle value of an organized data set.

mild outliers

Mild outliers are data points that lie more than 1.5 times the interquartile range (the spread of the middle half of the data) above the upper quartile or below the lower quartile.

modified box-plot

A modified box plot has whiskers that extend to the highest and lowest non-outlier values.

normally distributed

If data is normally distributed, the data set creates a symmetric histogram that looks like a bell.

observed data

Observed data are the values that result from computations performed on the input variable.

Outlier

In statistics, an outlier is a data value that is far from other data values.

output variables

Output variables are another name for dependent variables.

Quartile

A quartile is each of four equal groups that a data set can be divided into.

range

The range of a set of data is the difference in value between the least and greatest values in the set.

response variables

Response variables are another name for dependent variables.

skewed

As with the horizontal skewing of a histogram, stem plots with an obvious skew toward one end or the other tend to indicate an increased number of outliers either less than or greater than the mode.

statistical correlation

Statistical correlation is a representation of possible related changes in values between the two sets of data.

stem

A stem in a stem plot is a value or column of values that represents the greatest place value(s) in a set of data.

Stem-and-leaf plot

A stem-and-leaf plot is a way of organizing data values from least to greatest using place value. Usually, the last digit of each data value becomes the "leaf" and the other digits become the "stem".

trends

Trends in data sets or samples are indicators found by reviewing the data from a general or overall standpoint.

uniform

A uniform-shaped histogram indicates data that is very consistent; the frequency of each class is very similar to that of the others.

upper median

The upper median is the third quartile (Q3) in the box-and-whisker plot.