In this Concept, you will learn about the effects of outliers and changing units on box-and-whisker plots.
For a description of how to draw a box-and-whisker plot from given data (14.0) , see patrickJMT, Box and Whisker Plot (5:53).
Here is some data for reservoirs in California (the names of the lakes and reservoirs have been omitted):
80, 83, 77, 95, 85, 74, 34, 68, 90, 82, 75
At first glance, the 34 should stand out. It appears as if this point is significantly different from the rest of the data. What effect does this one point have on a box-and-whisker plot?
Use a graphing calculator to investigate the box-and-whisker plot for the California reservoir data.
Enter your data into a list as we have done before, and then choose a plot. Under 'Type', you will notice what looks like two different box and whisker plots. For now choose the second one (even though it appears on the second line, you must press the right arrow to select these plots).
Setting a window is not as important for a box plot, so we will use the calculator's ability to automatically scale a window to our data by pressing [ZOOM] and selecting '9:Zoom Stat'.
Outliers in Box-and-Whisker Plots
While box plots give us a nice summary of the important features of a distribution, we lose the ability to identify individual points. The left whisker is elongated, but if we did not have the data, we would not know if all the points in that section of the data were spread out, or if it were just the result of the one outlier. It is more typical to use a modified box plot. This box plot will show an outlier as a single, disconnected point and will stop the whisker at the previous point.
Make a modified box plot for the California reservoir data.
Go back and change your plot to the first box plot option, which is the modified box plot, and then graph it.
Notice that without the outlier, the distribution is roughly symmetric.
The California reservoir data set had one obvious outlier, but when is a point far enough away to be called an outlier? We need a standard accepted practice for defining an outlier in a box plot. This rather arbitrary definition is that any point that lies more than 1.5 times the interquartile range (IQR) below the lower quartile or above the upper quartile is considered an outlier. Because the IQR is the same as the length of the box, any point that is more than one-and-a-half box lengths from either quartile is plotted as an outlier.
A common misconception of students is that you stop the whisker at this boundary line. In fact, the last point on the whisker that is not an outlier is where the whisker stops.
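The difference between the boundary line and the end of the whisker can be checked with a short Python sketch. It is only an illustration: the function name `whisker_ends` and the parameter `k` are made up for this example, and the quartiles 74 and 85 are the ones worked out for the California reservoir data in this Concept.

```python
def whisker_ends(data, q1, q3, k=1.5):
    """Return the smallest and largest data values that are NOT outliers.

    q1 and q3 are the lower and upper quartiles; k is the multiplier
    applied to the box length (IQR). All names here are hypothetical.
    """
    iqr = q3 - q1
    lower_boundary = q1 - k * iqr   # boundary line on the left
    upper_boundary = q3 + k * iqr   # boundary line on the right
    # Points more than k box lengths from a quartile are outliers;
    # everything else stays inside the whiskers.
    inside = [x for x in data if lower_boundary <= x <= upper_boundary]
    return min(inside), max(inside)


reservoirs = [80, 83, 77, 95, 85, 74, 34, 68, 90, 82, 75]
low, high = whisker_ends(reservoirs, q1=74, q3=85)
print(low, high)  # 68 95
```

The left boundary line is at 57.5, but the left whisker stops at 68, because 68 is the last data point that is not an outlier.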
Determine whether there are any outliers for the California reservoir data.
The calculations for determining the outlier in this case are as follows:
Lower Quartile: 74
Upper Quartile: 85
Interquartile Range (IQR): 85 – 74 = 11
Cut-off for outliers in left whisker: 74 – 1.5 × 11 = 57.5. Thus, any value less than 57.5 is considered an outlier.
Notice that we did not even bother to test the calculation on the right whisker, because it should be obvious from a quick visual inspection that there are no points that are farther than even one box length away from the upper quartile.
If you press [TRACE] and use the left or right arrows, the calculator will trace the values of the five-number summary, as well as the outlier.
There is only one outlier, and that is the data point 34.
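The same test can be sketched in Python as a cross-check on the calculator. This is a minimal sketch, not the calculator's own procedure: the function names `quartiles` and `find_outliers` are hypothetical, and the quartiles are assumed to be the medians of the lower and upper halves of the ordered data (leaving out the middle value when the number of data points is odd). Other software may use a different quartile convention and give slightly different cut-offs.

```python
from statistics import median


def quartiles(data):
    """Return (lower quartile, median, upper quartile).

    Assumes the 'median of each half' convention; the middle value is
    left out of both halves when the number of data points is odd.
    """
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower_half = values[:half]
    upper_half = values[half + 1:] if n % 2 else values[half:]
    return median(lower_half), median(values), median(upper_half)


def find_outliers(data, k=1.5):
    """Return (left cut-off, right cut-off, list of outliers)."""
    q1, _, q3 = quartiles(data)
    iqr = q3 - q1
    left_cutoff = q1 - k * iqr
    right_cutoff = q3 + k * iqr
    flagged = [x for x in data if x < left_cutoff or x > right_cutoff]
    return left_cutoff, right_cutoff, flagged


reservoirs = [80, 83, 77, 95, 85, 74, 34, 68, 90, 82, 75]
print(quartiles(reservoirs))      # (74, 80, 85)
print(find_outliers(reservoirs))  # (57.5, 101.5, [34])
```

The right cut-off of 101.5 confirms the visual inspection: the largest value, 95, is well inside it.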
The Effects of Changing Units on Shape, Center, and Spread
In a previous Concept, we looked at data for the materials in a typical desktop computer.
|Other elements and chemicals||6.44|
Here is the data set given in pounds. The weight of each material in kilograms was multiplied by 2.2.
|Other elements and chemicals||14.2|
What effect does this conversion from kilograms to pounds have on some of the statistics we use to summarize data?
Determine the effect of the conversion from kilograms to pounds on the mean, standard deviation and box plots.
When all values are multiplied by a factor of 2.2, the calculation of the mean is also multiplied by 2.2, so the center of the distribution would be increased by the same factor. Similarly, calculations of the range, interquartile range, and standard deviation will also be increased by the same factor. In other words, the center and the measures of spread will increase proportionally.
Note: This is easier to convince yourself of when you are working with actual numbers. Suppose that your mean is 20, and that two of the data values in your distribution are 21 and 23. If you multiply every value by 2, then 21 and 23 become 42 and 46, and your mean also changes by a factor of 2 and is now 40. Before, your deviations were 21 – 20 = 1 and 23 – 20 = 3, but now your deviations are 42 – 40 = 2 and 46 – 40 = 6, so your deviations are getting twice as big as well.
Since each number in the data set is multiplied by 2.2, each value in the five-number summary is multiplied by 2.2, and so is every value in the box plot. This results in the graph maintaining the same shape, but being stretched out, or elongated. Here are the side-by-side box plots for both distributions showing the effects of changing units.
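The claim that the center and the measures of spread all scale by the conversion factor can be checked numerically. The sketch below is an illustration under stated assumptions: it uses the California reservoir values from earlier in this Concept as stand-in numbers (any data set would behave the same way), and the helper name `summarize` is hypothetical.

```python
import math
from statistics import mean, stdev


def summarize(data):
    """Return the mean, sample standard deviation, and range of the data."""
    return {
        "mean": mean(data),
        "standard deviation": stdev(data),
        "range": max(data) - min(data),
    }


factor = 2.2  # the kilograms-to-pounds multiplier used in the text

# Stand-in values (an assumption for illustration, not the computer data).
original = [80, 83, 77, 95, 85, 74, 34, 68, 90, 82, 75]
converted = [factor * x for x in original]

before = summarize(original)
after = summarize(converted)

# Each ratio should equal the conversion factor, up to rounding error.
ratios = {name: after[name] / before[name] for name in before}
for name, ratio in ratios.items():
    print(name, round(ratio, 6))
```

Every ratio comes out as 2.2, which is the numerical version of the statement that the center and spread increase proportionally while the shape stays the same.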
On the Web
http://tinyurl.com/3ao9px More investigation of boxplots.
While an outlier is simply a point that is not typical of the rest of the data, there is an accepted definition of an outlier in the context of a box-and-whisker plot. Any point that is more than 1.5 times the length of the box from either end of the box is considered to be an outlier. When changing the units of a distribution, the center and spread will be affected, but the shape will stay the same.
Given the following data set:
111, 122, 133, 149, 126, 117, 101, 121
a. Find the median value for the data set.
b. Find the values of the upper and lower quartiles.
c. Find the value of the interquartile range (IQR).
d. Identify any outliers in the dataset.
e. Draw a box and whisker plot for this data.
a. To find the median, put the data in order and find the middle data point. That is, find the data point that has 50% of the data below it and 50% of the data above it. The data in order: 101, 111, 117, 121, 122, 126, 133, 149. There are 8 data points. The median would be between the 4th and 5th data points. In this case, the median is 121.5. Note that the median does not have to be a data point.
b. The lower quartile separates the lowest fourth of the data from the upper 75% of the data, and the upper quartile separates the upper fourth of the data from the lower 75% of the data. In this data set, the lower quartile is the median of 101, 111, 117, 121, which is 114, and the upper quartile is the median of 122, 126, 133, 149, which is 129.5.
c. The interquartile range (IQR) is 129.5 – 114 = 15.5.
d. Use the 1.5*IQR rule: 1.5*IQR = 23.25. 129.5 + 23.25 = 152.75. Any value greater than 152.75 would be an outlier. There are no such values in this data set. 114 – 23.25 = 90.75. Any value less than 90.75 would be considered an outlier. There are no such values in this data set.
For 1-7, use the table below, which contains recent data on the average price of a gallon of gasoline for states that share a border crossing into Canada.
- Find the five-number summary for this data.
- Show all work to test for outliers.
- Graph the box-and-whisker plot for this data.
- Canadian gasoline is sold in liters. Suppose a Canadian crossed the border into one of these states and wanted to compare the cost of gasoline. There are 3.7854 liters in a gallon. If we were to convert the distribution to liters, describe the resulting shape, center, and spread of the new distribution.
- Complete the following table. Convert to cost per liter by dividing by 3.7854, and then graph the resulting box plot.
- Look up the current data and compare that distribution with the data presented here.
- Find the exchange rate for Canadian dollars and convert the prices from American dollars into Canadian dollars.
|State||Average Price of a Gallon of Gasoline (US$)||Average Price of a Liter of Gasoline (US$)|
Figure: Average prices of a gallon of gasoline on March 16, 2008. Source: AAA, http://fuelgaugereport.opisnet.com/sbsavg.html
- What characteristics of a data set make it easier or harder to represent it using dot plots, stem-and-leaf plots, histograms, and box-and-whisker plots?
- Which plots are most useful to interpret the ideas of shape, center, and spread?
- What effects do other transformations of the data have on the shape, center, and spread?
If the median of a distribution is less than the mean, which of the following statements is the most correct?
a. The distribution is skewed left.
b. The distribution is skewed right.
c. There are outliers on the left side.
d. There are outliers on the right side.
e. (b) or (d) could be true.
Given the following data set: 111, 122, 133, 149, 126, 117, 101, 121
- Find the median value for the data set.
- Find the values of the upper and lower quartiles.
- Find the value of the interquartile range (IQR).
- Identify any outliers in the dataset.
- Draw a box and whisker plot for this data.