5.5: Numerical Data: Box Plots & Outliers
Learning Objectives
- Calculate the five number summary for a set of numerical data
- Construct box plots
- Calculate IQR and standard deviation for a set of numerical data
- Determine which numerical summary is more appropriate for a given distribution
- Determine whether or not any values are outliers based on the 1.5*(IQR) criterion
- Describe distributions in context, including shape, outliers, center, and spread
Box Plots
A box plot (also called a box-and-whisker plot) is another type of graph used to display data. A box plot divides a set of numerical data into quarters. It shows how the data are dispersed around the median, but it does not show specific values in the data. It does not show a distribution in as much detail as a stem plot or a histogram does, but it clearly shows where the data are located. This type of graph is often used when the number of data values is large or when two or more data sets are being compared. The center and spread of the distribution are easy to read from the graph: you can see the range of the values as well as how the values are distributed around the middle value. The smaller the box, the more tightly the middle half of the data clusters around the median. The shape of the box plot gives a general idea of the shape of the distribution, but a histogram or stem plot will do this more accurately. Any outliers will show up as long whiskers. The box in the box plot contains the middle 50% of the data, and each 'whisker' contains 25% of the data.
The Five Number Summary
In order to divide the data into fourths, it is necessary to find five numbers. This list of five values is called the five number summary. The numbers in the list are {minimum value, Quartile 1, Median, Quartile 3, maximum value}. We have already learned how to find the median of a set of numbers (put them in order and find the middle value), and the minimum and maximum are the smallest and largest numbers. Now we will learn how to find the quartiles.
Quartiles
The first step is to list all of the numbers in order from least to greatest. The minimum and maximum are now on the ends of the list, and we can count in to find the median. Circle these three values. Finding the quartiles is just like finding the median. Quartile 1 is the 'median' of all of the values to the left of the median (do NOT include the median itself). Quartile 3 is the 'median' of all of the values to the right of the median (again, do not include the median). When the number of values is even, the median falls between two data values, so every value belongs to one half or the other.
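The procedure above can be sketched in a few lines of Python. This is only an illustration, not part of the calculator instructions used in this book: the function name `five_number_summary` and the sample values are made up for the example, and the function assumes at least two data values.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the quartile method described above.

    Assumes at least two data values (hypothetical helper, for illustration only).
    """
    data = sorted(values)              # list the numbers from least to greatest
    n = len(data)
    half = n // 2
    lower = data[:half]                # values to the left of the median
    if n % 2 == 1:
        upper = data[half + 1:]        # odd count: skip the median itself
    else:
        upper = data[half:]            # even count: median lies between two values
    return (data[0], median(lower), median(data), median(upper), data[-1])

# Hypothetical data, chosen only to illustrate the steps
sample = [12, 4, 7, 1, 10, 3, 8]
summary = five_number_summary(sample)
print(summary)  # (1, 3, 7, 10, 12)
```

Statistical software and calculators do not all use this same rule for quartiles, so a value computed elsewhere may differ slightly from the one found by hand with this method.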
Constructing a Box Plot
Now list the five number summary in order {min, Q1, Med, Q3, max}. The next step is to mark an axis that covers the entire range of the data. Mark the numbers along the axis before you make the box plot, so that the resulting plot shows the shape of the data. The last step is to place a dot above the axis for each of the five numbers from the five number summary, then draw a 'box' from the second dot to the fourth dot, mark a line through the middle dot to show the median, and draw 'whiskers' from the box out to the first and fifth dots.
Example 1
You have a summer job working at Paddy’s Pond, a recreational fishing spot where children can catch salmon that have been raised in a nearby fish hatchery and then transferred into the pond. The cost of fishing depends upon the length of the fish caught ($0.75 per inch). Your job is to transfer 15 fish into the pond three times a day. Before the fish are transferred, you must measure the length of each one and record the results. Below are the lengths (in inches) of the first 15 fish you transferred to the pond. Calculate the five number summary and construct a box plot for the lengths of these fish.
Solution
Since box plots are based on the median and quartiles, the first step is to organize the data in order from smallest to largest.
6,7,8,9,10,10,11,{\color{blue}13},13,13,14,15,15,17,21
The minimum is the smallest number (min = 6), and the maximum is the largest number (max = 21). Next, we need to find the median. There is an odd number of data values, so the median of all the data is the value in the middle position (Med = 13). There are 7 numbers before and 7 numbers after this middle value. The next step is to find the median of the first half of the data: the 7 numbers before the median, not including the median itself. This value is called the lower quartile, since it marks the top of the first quarter of the data. On the graphing calculator this value is referred to as Q1.
6,7,8,{\color{blue}9},10,10,11
Quartile 1 is the median of the lower half of the data (Q_{1} = 9).
This step must be repeated for the upper half of the data: the 7 numbers after the median of 13. This value is called the upper quartile, since it marks the top of the third quarter of the data. On the graphing calculator this value is referred to as Q3.
13,13,14,{\color{blue}15},15,17,21
Quartile 3 is the median of the upper half of the data (Q_{3} = 15).
Now that the five numbers have all been determined, it is time to construct the actual graph. The graph is drawn above a number line that includes all the values in the data set (graph paper works very well since the numbers can be placed evenly using the lines of the graph paper). For this example we will need to mark from at least 6 to at least 21. Be sure to mark your axis before you start to construct the box plot. Next, represent the following values by placing dots above their corresponding values on the number line:
- Minimum: 6
- Quartile 1: 9
- Median: 13
- Quartile 3: 15
- Maximum: 21
The five data values listed above are often called the five number summary for the data set and are necessary to graph every box plot.
Make the 'box' part around the Q_{1} and Q_{3} values, make 'whiskers' out to the min and max values, and make a vertical line to show the location of the median. This will complete the box plot.
[Box plot: Length of fish (in inches)]
Five number summary = {6, 9, 13, 15, 21}
The five numbers divide the data into four equal parts. In other words:
- One-quarter of the data values are located between 6 and 9
- One-quarter of the data values are located between 9 and 13
- One-quarter of the data values are located between 13 and 15
- One-quarter of the data values are located between 15 and 21
More Measures of Spread
Range
We have already learned how to find the range of a set of data. The range represents the entire spread of all of the data.
The formula for calculating the range is:
max - min = range
Interquartile Range
The quartiles give us one more measure of spread called the interquartile range (sometimes written as inner quartile range). The interquartile range (IQR) is the distance between the lower and upper quartiles. To find the IQR, subtract the quartile 1 value from the quartile 3 value (Q_{3} - Q_{1} = IQR). The IQR represents the spread, or range, of the middle 50% of the data. The IQR is the measure of spread that is used when the median is the measure of central tendency.
The formula for calculating the IQR is:
Q_{3} - Q_{1} = IQR
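As a quick check of both formulas, here is a minimal Python sketch that applies them to the five number summary from Example 1. The variable names are chosen only for this illustration.

```python
# Five number summary from Example 1 (fish lengths, in inches)
minimum, q1, med, q3, maximum = 6, 9, 13, 15, 21

data_range = maximum - minimum   # spread of all of the data
iqr = q3 - q1                    # spread of the middle 50% of the data

print("range =", data_range)     # range = 15
print("IQR =", iqr)              # IQR = 6
```

The whole data set spans 15 inches, while the middle half of the fish spans only 6 inches.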
Standard Deviation
Another measure of spread that is used in statistics is called the standard deviation. The standard deviation measures the spread around the mean. This value is more difficult to calculate than the range or IQR, but the formula takes all of the data values in the distribution into account. Standard deviation is the appropriate measure of spread when the mean is the measure of center. However, the standard deviation is easily affected by outliers or skewness because every value enters the formula. The symbol for the standard deviation of a sample is s (on graphing calculators it is S_{x}), and for a population it is σ (sigma).
The standard deviation can be any number zero or greater. It will only be equal to zero if there is no spread (i.e. all values are exactly the same). The more spread out the data is, the larger the standard deviation will be. The standard deviation is most appropriate when you have a very symmetrical, bell-shaped distribution called a normal distribution. We will study this type of distribution in chapter 7.
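A short Python sketch can show how strongly one extreme value pulls on the mean and standard deviation. It uses the fish lengths from Example 1 and then, purely as a hypothetical, replaces the longest fish (21 inches) with an invented 61-inch value. The median is printed alongside for comparison.

```python
from statistics import mean, median, stdev

# Fish lengths from Example 1 (inches), already in order
lengths = [6, 7, 8, 9, 10, 10, 11, 13, 13, 13, 14, 15, 15, 17, 21]

# Hypothetical change: pretend the longest fish measured 61 inches instead of 21
altered = lengths[:-1] + [61]

for label, data in [("original", lengths), ("altered", altered)]:
    print(label,
          "| median:", median(data),
          "| mean:", round(mean(data), 2),
          "| s:", round(stdev(data), 2))
```

The median is 13 in both lists, while the mean and the sample standard deviation both increase when the single extreme value is introduced.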
Which Numerical Summary Should We Use?
We have learned several statistics that are measures of central tendency and several that are measures of spread. How do we know which ones to use? The mean and standard deviation go together, and the median goes with the IQR (or range). The most important thing to remember is that the mean and the standard deviation are both affected by outliers and by skewness in a distribution. So if either of these is present, then the mean and standard deviation are not appropriate. However, it is always an option, and often interesting, to calculate all of the statistics and compare them to one another. The general guidelines are:
- If the distribution is roughly symmetric with no outliers, use the mean and standard deviation.
- If the distribution is skewed or contains outliers, use the median and IQR.
How to Calculate the Standard Deviation With the Formula
In order to calculate the standard deviation you must have all of the values. Then you follow these steps:
- Calculate the mean of the values.
- Subtract the mean from each data value. These are the individual deviations.
- Each of these deviations is squared.
- All of the squared deviations are added up.
- This total of the squared deviations is divided by one less than the number of deviations. This is the variance.
- Take the square root of the variance. This is the standard deviation.
The formula for calculating the variance is:
s^{2} = \frac{1}{n-1}\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}
The formula for calculating standard deviation is:
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}
As you can probably tell, this formula is very time consuming when you have a large set of data. Also, it is easy to make a mistake in your calculations. We will show the process with a small set of data, but generally we will use our calculator to find the standard deviation. See the appendix for the calculator instructions on how to do this.
Example 2
There are five teenage girls on Buhl Street whom the Millers often have babysit their three rambunctious sons. Their ages are 12, 15, 14, 17, and 19 years old. Find the mean and standard deviation for the ages of the Millers' babysitters.
Solution
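- Calculate the mean of the values.
\bar{x} = \frac{12+15+14+17+19}{5} = \frac{77}{5} = 15.4
- Subtract the mean from each data value. These are the individual deviations.
12 - 15.4 = -3.4, 15 - 15.4 = -0.4, 14 - 15.4 = -1.4, 17 - 15.4 = 1.6, 19 - 15.4 = 3.6
- Each of these deviations is squared.
(-3.4)^{2} = 11.56, (-0.4)^{2} = 0.16, (-1.4)^{2} = 1.96, (1.6)^{2} = 2.56, (3.6)^{2} = 12.96
- All of the squared deviations are added up.
11.56 + 0.16 + 1.96 + 2.56 + 12.96 = 29.2
- This total of the squared deviations is divided by one less than the number of deviations. This is the variance.
s^{2} = \frac{29.2}{5-1} = 7.3
- Take the square root of the variance. This is the standard deviation.
s = \sqrt{7.3} \approx 2.7019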
The mean age of the Miller family's babysitters is 15.4 years old and the standard deviation is 2.7019 years.
The standard deviation is tedious to calculate. For any problem where you are asked to calculate the standard deviation, you may use your calculator or a computer to find it.
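If a computer is available, the six steps can be followed one line at a time. The sketch below is one possible way to do this in Python for the ages in Example 2. The variable names are chosen for illustration, and the last line compares the result with Python's built-in `statistics.stdev`, which also divides by n - 1.

```python
from math import sqrt
from statistics import stdev

ages = [12, 15, 14, 17, 19]                 # babysitter ages from Example 2

n = len(ages)
mean = sum(ages) / n                        # step 1: the mean
deviations = [x - mean for x in ages]       # step 2: individual deviations
squared = [d ** 2 for d in deviations]      # step 3: square each deviation
total = sum(squared)                        # step 4: add the squared deviations
variance = total / (n - 1)                  # step 5: divide by one less than n
s = sqrt(variance)                          # step 6: take the square root

print(round(mean, 1), round(total, 2), round(variance, 2), round(s, 4))
# 15.4 29.2 7.3 2.7019

print(round(stdev(ages), 4))                # built-in check of the same value
```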
Example 3
After one month of growing, the heights of 30 parsley seed plants were measured and recorded. The measurements (in inches) are shown in the table below.
22 | 28 | 30 | 40 | 38 | 18 |
11 | 37 | 12 | 34 | 49 | 17 |
25 | 37 | 46 | 39 | 8 | 27 |
16 | 38 | 18 | 23 | 26 | 14 |
6 | 26 | 23 | 33 | 11 | 26 |
a) Calculate the five number summary and construct a box plot to represent the data.
b) Describe the distribution.
c) Calculate the mean and standard deviation.
d) Calculate the median and IQR.
Solution
a) Five number summary and box plot:
Order the values: the data organized from smallest to largest are shown below. (You could use your calculator to quickly sort these values.)
Heights of Parsley (in.): 6, 8, 11, 11, 12, 14, 16, 17, 18, 18, 22, 23, 23, 25, 26, 26, 26, 27, 28, 30, 33, 34, 37, 37, 38, 38, 39, 40, 46, 49
Five number summary: This time there is an even number of data values, so the median is the mean of the two middle values (the 15th and 16th).
Med = \frac{26+26}{2} = 26
Because the median falls between two data values, all 30 values are used when finding the quartiles: the lower half is the first 15 values and the upper half is the last 15 values. The median of the lower half is the number in the 8th position, which is 17. The median of the upper half is the number in the 23rd position (8th from the top), which is 37. The smallest number is 6 and the largest number is 49.
Five number summary = {6, 17, 26, 37, 49} (all in inches)
b) Describe the distribution (don't forget your S.O.C.C.S!):
The heights of these parsley plants ranged from 6 inches to 49 inches after one month. The distribution is roughly symmetric and does not contain any outliers. The median height for these parsley plants was 26 inches. The middle 50% of the plants were between 17 inches and 37 inches tall.
c) The mean and standard deviation were calculated using the TI-84+.
\bar{x} = 25.9333 inches
s = 11.4709 inches
d) The median is part of the five number summary. The IQR = Q_{3} - Q_{1} = 37 - 17 = 20.
Med = 26 inches
IQR = 20 inches
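The values in this example can also be checked with a computer instead of a calculator. The following Python sketch is one way to do so, under the assumption that the quartiles are found with the same split-the-halves method used above. It relies only on Python's standard `statistics` module, and the variable names are chosen for illustration.

```python
from statistics import mean, median, stdev

# Parsley heights from Example 3 (inches), in the order given in the table
heights = [22, 28, 30, 40, 38, 18,
           11, 37, 12, 34, 49, 17,
           25, 37, 46, 39, 8, 27,
           16, 38, 18, 23, 26, 14,
           6, 26, 23, 33, 11, 26]

data = sorted(heights)
half = len(data) // 2            # 30 values, so each half has 15 values
q1 = median(data[:half])         # median of the lower half
q3 = median(data[half:])         # median of the upper half
med = median(data)               # mean of the two middle values

print("five number summary:", (data[0], q1, med, q3, data[-1]))
print("IQR:", q3 - q1)
print("mean:", round(mean(heights), 4))
print("s:", round(stdev(heights), 4))
```

Splitting the sorted list exactly in half works here only because the number of values is even. With an odd count, the middle value would have to be left out of both halves.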
Outliers
So far we have noticed values that appear to be outliers, but we have not defined how far from the rest of the data a value must be to count as one. The common outlier test uses the IQR. This test, often called the 1.5*(IQR) Criterion, says that any value that is more than one and one-half times the width of the IQR box away from the box is an outlier. In other words, a value is a low outlier if it is less than Q_{1} - 1.5(IQR) and a high outlier if it is greater than Q_{3} + 1.5(IQR).
Example 4
Test the sodium in the McDonald's® sandwiches for outliers. The data can be found in Section 5.5 Exercises, problem #1. Use the 1.5*(IQR) Criterion. Show your steps.
Solution
Calculate the five number summary for the amount of sodium (in mg):
five number summary = {520, 835, 1095, 1285, 2070}
First find the IQR:
IQR = 1285 - 835 = 450
Test for low outliers:
Q_{1} - 1.5(IQR)
835 - 1.5(450) = 160
Test for high outliers:
Q_{3} + 1.5(IQR)
1285 + 1.5(450) = 1960
Check the data to see if we have any outliers:
We have no sandwiches with less than 160 mg sodium, so we have no low outliers.
We have one value that is greater than this cutoff of 1960 mg. The Angus Bacon & Cheese burger has 2070 mg of sodium, so we have one high outlier.
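The two cutoffs in this test are easy to compute with a small helper. The Python sketch below is an illustration only: the function name `outlier_fences` is hypothetical, and because the full sodium table is not repeated here, the check uses just the minimum and maximum from the five number summary.

```python
def outlier_fences(q1, q3):
    """Return the (low, high) cutoffs for the 1.5*(IQR) criterion."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Sodium (mg): values taken from the five number summary in Example 4
sodium_min, sodium_q1, sodium_q3, sodium_max = 520, 835, 1285, 2070
low, high = outlier_fences(sodium_q1, sodium_q3)
print(low, high)                               # 160.0 1960.0
print("low outlier present:", sodium_min < low)     # False
print("high outlier present:", sodium_max > high)   # True

# Fish lengths (inches): Q1 = 9 and Q3 = 15 from Example 1
fish_low, fish_high = outlier_fences(9, 15)
print(fish_low, fish_high)                     # 0.0 24.0
```

Comparing only the minimum and maximum shows whether at least one outlier exists on each side. Counting the outliers requires checking every data value against the cutoffs. For the fish in Example 1, the longest fish (21 inches) is below the upper cutoff of 24 inches, so it is not an outlier even though its whisker is long.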
Problem Set 5.5
Section 5.5 Exercises
1) Here is some nutritional information about a few of the sandwiches on the McDonald's® menu.
Source: http://nutrition.mcdonalds.com. July 27, 2011.
Determine the median and the IQR for the following data regarding the McDonald's® sandwiches:
a) Calories from fat
b) Cholesterol
2) Analyze the calories for these McDonald's® sandwiches.
a) Calculate the five number summary and construct an accurate box plot for the calories for these sandwiches.
b) Use the outlier test to determine whether there are any outliers for calories. Test for both high and low outliers. Show your steps.
c) Describe the distribution in context. Remember your S.O.C.C.S!
3) Analyze the sodium content further.
a) Construct a box plot for sodium.
b) Calculate the median and IQR for sodium (see example 4).
c) Calculate the mean and standard deviation for sodium (use a calculator).
Now remove the high outlier from the data.
d) Re-calculate the median and IQR for sodium with the Angus Bacon & Cheese data removed. Did either value change from part (b)?
e) Re-calculate the mean and standard deviation for sodium with the Angus Bacon & Cheese data removed. Did either value change from part (c)?
4) The following table shows the potential energy that could be saved by manufacturing each type of material using the maximum percentage of recycled materials, as opposed to using all new materials.
Manufactured Material | Energy Saved (millions of BTUs per ton) |
Aluminum Cans | 206 |
Copper Wire | 83 |
Steel Cans | 20 |
LDPE Plastics (e.g. trash bags) | 56 |
PET Plastics (e.g. beverage bottles) | 53 |
HDPE Plastics (e.g. household cleaner bottles) | 51 |
Personal Computers | 43 |
Carpet | 106 |
Glass | 2 |
Corrugated Cardboard | 15 |
Newspaper | 16 |
Phone Books | 11 |
Magazines | 11 |
Office Paper | 10 |
Source: National Geographic, January 2008. Volume 213 No., pg 82-
a) Calculate the five number summary and construct an accurate box plot for the Energy Saved data.
b) Use the outlier test to determine whether there are any outliers. Show your steps.
c) Calculate the mean and standard deviation for the Energy Saved data. How do the mean and the median compare?
d) Delete any outliers. Recalculate the five number summary, mean and standard deviation. Which values changed?
5) The Burj Dubai is the world’s tallest building. It is more than twice the height of the Empire State Building in New York. The chart lists 15 of the tallest buildings in the world.
Building | City | Height (ft) |
Taipei 101 | Taipei | 1671 |
Shanghai World Financial Center | Shanghai | 1614 |
Petronas Tower | Kuala Lumpur | 1483 |
Sears Tower | Chicago | 1451 |
Jin Mao Tower | Shanghai | 1380 |
Two International Finance Center | Hong Kong | 1362 |
CITIC Plaza | Guangzhou | 1283 |
Shun Hing Square | Shenzhen | 1260 |
Empire State Building | New York | 1250 |
Central Plaza | Hong Kong | 1227 |
Bank of China Tower | Hong Kong | 1205 |
Bank of America Tower | New York | 1200 |
Emirates Office Tower | Dubai | 1163 |
Tuntex Sky Tower | Kaohsiung | 1140 |
Burj Dubai | Dubai | 2717 |
a) Calculate the five number summary for these 15 buildings and construct an accurate box plot.
b) Use the outlier test to determine whether there are any outliers among these 15 buildings. Test for both high and low outliers. Show your steps.
c) Describe the shape of the distribution. Remember your S.O.C.C.S!
d) Within what range of heights are the middle 50% of these buildings?
6) The table shows the mean travel time to work (in minutes), for workers age 16+, for 16 cities in Minnesota. This is according to the U.S. Census website. Source: http://quickfacts.census.gov
a) Construct a box plot for the mean travel time for residents of these Minnesota cities.
b) Make a statement, in context, about what the 'box' part of the box plot tells you.
c) Describe the distribution. Remember your S.O.C.C.S! Identify any unusual values specifically.
7) Several game critics rated the Wow So Fit game, on a scale of 1 to 100 (100 being the highest). The results are presented in this stem plot:
a) Calculate the five number summary for the Wow So Fit data.
b) Construct a box plot for the data.
c) Describe this distribution.
d) Make a statement, in context, about what the "box" part of the box plot tells us.
Review Exercises
8) Read each of the criticisms below and determine whether the person making the statement is questioning the validity, the reliability, or the presence of bias in the test. Explain.
a) "The game critics get free copies of the games for their families. So, these ratings are inflated."
b) "The game critics have no set guidelines to use when critiquing the games. So, these ratings are meaningless."
c) "The game critics may give different ratings to the same game, when asked at different times. So, these ratings are inconsistent."
9) Construct a tree diagram that shows all possible outcomes, in regard to gender, of a family with three children.
10) Assuming that P(boy) = P(girl) = 0.5, find the following probabilities:
a) P(boy, girl, then boy)
b) P(exactly two girls)
c) P(at least one boy)