# 7.3: Box-and-Whisker Plots

**Learning Objectives**

- Construct and interpret a box-and-whisker plot.
- Use technology to create box-and-whisker plots.

**Box-and-Whisker Plots**

In traditional statistics, data is organized by using a frequency distribution. The results of the frequency distribution can then be used to create various graphs, such as a histogram or a frequency polygon, which indicate the shape or nature of the distribution. The shape of the distribution will allow you to confirm various conjectures about the nature of the data.

To examine data in order to identify patterns, trends, or relationships, exploratory data analysis is used. In exploratory data analysis, organized data is displayed in order to make decisions or suggestions regarding further actions. A **box-and-whisker plot** (often called a box plot) can be used to graphically represent the data set, and the graph involves plotting 5 specific values. The 5 specific values are often referred to as a **five-number summary** of the organized data set. The five-number summary consists of the following:

- The lowest number in the data set (minimum value)
- The lower quartile, $Q_1$ (median of the first half of the data set)
- The median of the entire data set (median)
- The upper quartile, $Q_3$ (median of the second half of the data set)
- The highest number in the data set (maximum value)

The display of the five-number summary produces a box-and-whisker plot as shown below:

The above model of a box-and-whisker plot shows 2 horizontal lines (the whiskers) that each contain 25% of the data and are of the same length. In addition, it shows that the median of the data set is in the middle of the box, which contains 50% of the data. The lengths of the whiskers and the location of the median with respect to the center of the box are used to describe the distribution of the data. It's important to note that this is just an example. Not all box-and-whisker plots have the median in the middle of the box and whiskers of the same size.

Information about the data set that can be determined from the box-and-whisker plot with respect to the location of the median includes the following:

a) If the median is located in the center or near the center of the box, the distribution is approximately symmetric.

b) If the median is located to the left of the center of the box, the distribution is positively skewed.

c) If the median is located to the right of the center of the box, the distribution is negatively skewed.

Information about the data set that can be determined from the box-and-whisker plot with respect to the length of the whiskers includes the following:

a) If the whiskers are the same or almost the same length, the distribution is approximately symmetric.

b) If the right whisker is longer than the left whisker, the distribution is positively skewed.

c) If the left whisker is longer than the right whisker, the distribution is negatively skewed.

The length of the whiskers also gives you information about how spread out the data is.
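The two sets of rules above can be sketched in a few lines of Python. This is only an illustration: the function name `describe_shape`, the `tolerance` cutoff for "approximately the same", and the sample values are assumptions made for the sketch, not part of the lesson.

```python
def describe_shape(minimum, q1, median, q3, maximum, tolerance=0.1):
    """Apply the median-position and whisker-length rules to a five-number summary.

    `tolerance` is an assumed cutoff (not from the lesson): two lengths count as
    approximately equal when they differ by at most this fraction of the longer one.
    """
    def compare(left, right):
        longer = max(left, right)
        if longer == 0 or abs(left - right) <= tolerance * longer:
            return "approximately symmetric"
        # A longer right-hand part means the data trail off to the right.
        return "positively skewed" if right > left else "negatively skewed"

    return {
        # A median left of the box's center makes the right part of the box longer.
        "by_median": compare(median - q1, q3 - median),
        "by_whiskers": compare(q1 - minimum, maximum - q3),
    }


# Made-up five-number summaries, used only to exercise the rules.
print(describe_shape(1, 3, 5, 7, 9))    # both rules: approximately symmetric
print(describe_shape(1, 2, 3, 6, 12))   # both rules: positively skewed
```

The two rules do not always agree on real data, which is why the sketch reports them separately.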

A box-and-whisker plot is often used when the number of data values is large. The center of the distribution, the nature of the distribution, and the range of the data are very obvious from the graph. The five-number summary divides the data into quarters by use of the medians of the upper and lower halves of the data. Remember that, unlike the mean, the median of the entire data set is not affected by outliers, so it is the measure of central tendency that is most often used in exploratory data analysis.

*Example 24*

For the following data sets, determine the five-number summaries:

a) 12, 16, 36, 10, 31, 23, 58

b) 144, 240, 153, 629, 540, 300

*Solution:*

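a) The first step is to organize the values in the data set from least to greatest:

10, 12, 16, 23, 31, 36, 58

The median is the middle (4th) value, 23. $Q_1$ is the median of the values below it (10, 12, 16), and $Q_3$ is the median of the values above it (31, 36, 58).

Now complete the following list:

- Minimum value: 10
- $Q_1$: 12
- Median: 23
- $Q_3$: 36
- Maximum value: 58

b) The first step is to organize the values in the data set from least to greatest:

144, 153, 240, 300, 540, 629

There is an even number of values, so the median is the mean of the 2 middle values: $\frac{240 + 300}{2} = 270$. $Q_1$ is the median of the lower half (144, 153, 240), and $Q_3$ is the median of the upper half (300, 540, 629).

Now complete the following list:

- Minimum value: 144
- $Q_1$: 153
- Median: 270
- $Q_3$: 540
- Maximum value: 629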
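The hand calculation for the two data sets in Example 24 can be checked with a short Python sketch. The function name `five_number_summary` is an assumption made for illustration. The sketch uses the median-of-halves rule described in this lesson and leaves the middle value out of both halves when the number of values is odd. It assumes the data set has at least 2 values.

```python
from statistics import median


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using the median-of-halves rule.

    When the number of values is odd, the middle value is left out of both halves.
    """
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]               # values below the median position
    upper = data[len(data) - half:]   # values above the median position
    return (data[0], median(lower), median(data), median(upper), data[-1])


print(five_number_summary([12, 16, 36, 10, 31, 23, 58]))     # (10, 12, 23, 36, 58)
print(five_number_summary([144, 240, 153, 629, 540, 300]))   # (144, 153, 270.0, 540, 629)
```

Statistical software sometimes uses other quartile conventions, so results from other tools may differ slightly from the values found by hand here.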

*Example 25*

Use the data set for Example 24 part a) and the five-number summary to construct a box-and-whisker plot to model the data set.

*Solution:*

The five-number summary can now be used to construct a box-and-whisker plot. Be sure to provide a scale on the number line that includes the range from the minimum value to the maximum value.

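a) The five-number summary for the data set is:

- Minimum value: 10
- $Q_1$: 12
- Median: 23
- $Q_3$: 36
- Maximum value: 58

The right whisker (from 36 to 58) is clearly much longer than the left whisker (from 10 to 12). This indicates that the distribution is positively skewed.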
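As a rough illustration of how the five-number summary maps onto the plot, the sketch below draws a one-line text version of the box-and-whisker plot for data set a) of Example 24, whose five-number summary is 10, 12, 23, 36, 58. The function name `text_box_plot` and the character choices are assumptions made for this sketch. It does not draw the scaled number line that a proper plot needs.

```python
def text_box_plot(minimum, q1, median, q3, maximum, width=49):
    """Draw a one-line box-and-whisker plot from a five-number summary.

    Assumes minimum < maximum. `width` is the number of character columns.
    """
    def column(value):
        # Scale a data value to a column index between 0 and width - 1.
        return round((value - minimum) / (maximum - minimum) * (width - 1))

    cells = ["-"] * width                        # start with whisker characters
    for i in range(column(q1), column(q3) + 1):
        cells[i] = "="                           # the box spans Q1 to Q3
    for value in (minimum, q1, q3, maximum):
        cells[column(value)] = "|"               # ends of the whiskers and the box
    cells[column(median)] = "M"                  # mark the median last so it stays visible
    return "".join(cells)


print(text_box_plot(10, 12, 23, 36, 58))
```

With the default width of 49 columns, each column corresponds to 1 unit of this data set, so the short left whisker and long right whisker are easy to compare.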

*Example 26*

For each box-and-whisker plot, list the five-number summary and describe the distribution based on the location of the median.

*Solution:*

a) Minimum value

Median

Maximum value

The median of the data set is located to the right of the center of the box, which indicates that the distribution is negatively skewed.

b) Minimum value

Median

Maximum value

The median of the data set is located to the right of the center of the box, which indicates that the distribution is negatively skewed.

c) Minimum value

Median

Maximum value

The median of the data set is located to the left of the center of the box, which indicates that the distribution is positively skewed.

*Example 27*

The numbers of square feet (in 100s) of 10 of the largest museums in the world are shown below:

650, 547, 204, 213, 343, 288, 222, 250, 287, 269

Construct a box-and-whisker plot for the above data set and describe the distribution.

*Solution:*

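The first step is to organize the data values as follows:

204, 213, 222, 250, 269, 287, 288, 343, 547, 650

Now calculate the median, $Q_1$, and $Q_3$. With 10 values, the median is the mean of the 5th and 6th values, and each half of the data set contains 5 values:

$\text{Median} = \frac{269 + 287}{2} = 278 \qquad Q_1 = 222 \qquad Q_3 = 343$

Next, complete the following list:

- Minimum value: 204
- $Q_1$: 222
- Median: 278
- $Q_3$: 343
- Maximum value: 650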

The right whisker is longer than the left whisker, which indicates that the distribution is positively skewed.

The TI-83 or TI-84 can also be used to create a box-and-whisker plot. In the following examples, the TI-83 is used. In the next chapter, keystrokes for the TI-84 will be presented. The five-number summary values can be determined by using the TRACE feature of the calculator or by using CALC and 1-Var Stats.

*Example 28*

The following numbers represent the number of siblings in each family for 15 randomly selected students:

Use technology to construct a box-and-whisker plot to display the data. List the five-number summary values.

*Solution:*

Note that when creating a box-and-whisker plot with a TI calculator, you don't have to actually sort the data. The calculator will sort the data automatically when creating the box-and-whisker plot.

The five-number summary can be obtained from the calculator in 2 ways.

1. The following results are obtained by simply using the TRACE feature and the left and right arrows:

The values at the bottom of each screen are the five-number summary.

2. The second method involves pressing STAT and using 1-Var Stats on the CALC menu for L1:

Many data sets contain values that are either extremely high values or extremely low values compared to the rest of the data values. These values are called **outliers**. There are several reasons why a data set may contain an outlier. Some of these are listed below:

- The value may be the result of an error made in measurement or in observation. The researcher may have measured the variable incorrectly.
- The value may simply be an error made by the researcher in recording the value. The value may have been written or typed incorrectly.
- The value could be a result obtained from a subject not within the defined population. A researcher recording marks from a math 12 examination may have recorded a mark by a student in grade 11 who was taking math 12.
- The value could be one that is legitimate but is extreme compared to the other values in the data set. (This rarely occurs, but it is a possibility.)

If an outlier is present because of an error in measurement, observation, or recording, then either the error should be corrected, or the outlier should be omitted from the data set. If the outlier is a legitimate value, then the statistician must make a decision as to whether or not to include it in the set of data values. There is no rule that tells you what to do with an outlier in this case.

One method for checking a data set for the presence of an outlier is to follow the procedure below:

1. Organize the given data set and determine the values of $Q_1$ and $Q_3$.
2. Calculate the difference between $Q_1$ and $Q_3$. This difference is called the **interquartile range (IQR)**: $IQR = Q_3 - Q_1$.
3. Multiply the difference by 1.5, subtract this result from $Q_1$, and add it to $Q_3$.
4. The results from Step 3 form the range into which all values of the data set should fit. Any values that are below or above this range are considered outliers.

*Example 29*

Using the procedure outlined above, check the following data sets for outliers:

a) 18, 20, 24, 21, 5, 23, 19, 22

b) 13, 15, 19, 14, 26, 17, 12, 42, 18

*Solution:*

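a) Organize the given data set as follows:

5, 18, 19, 20, 21, 22, 23, 24

Determine the values for $Q_1$ and $Q_3$. With 8 values, each half of the data set contains 4 values:

$Q_1 = \frac{18 + 19}{2} = 18.5 \qquad Q_3 = \frac{22 + 23}{2} = 22.5$

Calculate the difference between $Q_1$ and $Q_3$: $Q_3 - Q_1 = 22.5 - 18.5 = 4.0$.

Multiply this difference by 1.5: $(4.0)(1.5) = 6.0$.

Finally, compute the range.

$Q_1 - 6.0 = 18.5 - 6.0 = 12.5$ and $Q_3 + 6.0 = 22.5 + 6.0 = 28.5$.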

Are there any data values below 12.5? Yes, the value of 5 is below 12.5 and is, therefore, an outlier.

Are there any values above 28.5? No, there are no values above 28.5.

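b) Organize the given data set as follows:

12, 13, 14, 15, 17, 18, 19, 26, 42

Determine the values for $Q_1$ and $Q_3$. With 9 values, the median is the 5th value, 17, and each half of the data set contains the 4 values on either side of it:

$Q_1 = \frac{13 + 14}{2} = 13.5 \qquad Q_3 = \frac{19 + 26}{2} = 22.5$

Calculate the difference between $Q_1$ and $Q_3$: $Q_3 - Q_1 = 22.5 - 13.5 = 9.0$.

Multiply this difference by 1.5: $(9.0)(1.5) = 13.5$.

Finally, compute the range.

$Q_1 - 13.5 = 13.5 - 13.5 = 0$ and $Q_3 + 13.5 = 22.5 + 13.5 = 36.0$.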

Are there any data values below 0? No, there are no values below 0.

Are there any values above 36.0? Yes, the value of 42 is above 36.0 and is, therefore, an outlier.
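The outlier check used in Example 29 can also be sketched in Python. The function name `find_outliers` is an assumption made for illustration. The quartiles are found with the same median-of-halves rule used in this lesson, and the sketch assumes the data set has at least 2 values.

```python
from statistics import median


def find_outliers(values):
    """Return (lower_limit, upper_limit, outliers) using the 1.5 x IQR procedure."""
    data = sorted(values)
    half = len(data) // 2
    q1 = median(data[:half])                 # median of the lower half
    q3 = median(data[len(data) - half:])     # median of the upper half
    step = 1.5 * (q3 - q1)                   # 1.5 times the interquartile range
    lower_limit, upper_limit = q1 - step, q3 + step
    outliers = [v for v in data if v < lower_limit or v > upper_limit]
    return lower_limit, upper_limit, outliers


print(find_outliers([18, 20, 24, 21, 5, 23, 19, 22]))        # (12.5, 28.5, [5])
print(find_outliers([13, 15, 19, 14, 26, 17, 12, 42, 18]))   # (0.0, 36.0, [42])
```

The sketch only flags values outside the range. Deciding whether a flagged value is an error or a legitimate extreme value is still up to the statistician.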

**Lesson Summary**

You have learned the significance of the median as it applies to dividing a set of data values into quartiles. You have also learned how to apply these values to the five-number summary needed to construct a box-and-whisker plot. In addition, you have learned how to construct a box-and-whisker plot and how to obtain the five-number summary by using technology. The last topic that you learned about in this lesson was the meaning of the term outlier. Some reasons why an outlier might exist in a data set and the procedure for determining whether or not a data set contains an outlier were also discussed.

**Points to Consider**

- Are there still other ways to represent data graphically?
- Are there other uses for a box-and-whisker plot?
- Can box-and-whisker plots be used for comparing data sets?

**Vocabulary**

- Bar graph
- A plot made of bars whose heights (vertical bars) or lengths (horizontal bars) represent the frequencies of each category.

- Bins
- Quantitative or qualitative categories. Bins are also known as classes.

- Box-and-whisker plot
- A graph of a data set in which the five-number summary is plotted. 50 percent of the data values are in the box, and the remaining 50 percent are divided equally on the whiskers.

- Broken-line graph
- A graph that is used when it is necessary to show change over time. A line is used to join the values, but the line has no defined slope.

- Continuous data
- Data for which the plotted points can be joined.

- Continuous variable
- A variable that can assume all values between 2 consecutive values of a data set.

- Correlation
- A statistical method used to determine whether or not there is a linear relationship between 2 variables.

- Data set
- A collection of observations of a variable.

- Dependent variable
- The variable represented by the values that are plotted on the $y$-axis.

- Discrete data
- Data for which the plotted points cannot be joined.

- Discrete variable
- A variable that can only assume values that can be counted.

- Five-number summary
- 5 values for a data set that include the smallest value, the lower quartile, the median, the upper quartile, and the largest value.

- Frequency distribution
- A table that lists all of the classes and the number of data values that belong to each of the classes.

- Frequency polygon
- A graph that uses lines to join the midpoints of the tops of the bars of a histogram or to join the midpoints of the classes.

- Histogram
- A graph in which the classes, or bins, are on the horizontal axis and the frequencies are plotted on the vertical axis. The frequencies are represented by vertical bars that are drawn adjacent to each other.

- Independent variable
- The variable represented by the values that are plotted on the $x$-axis.

- Interquartile range (IQR)
- The difference between the third quartile and the first quartile.

- Left-skewed distribution
- A distribution in which most of the data values are located to the right of the mean.

- Line of best fit
- A straight line drawn on a scatter plot such that the sums of the distances to points on either side of the line are approximately equal and such that there are an equal number of points above and below the line.

- Midpoint
- The value obtained by adding the lower and upper limits of a class and dividing the sum by 2.

- Outliers
- Extremely high values or extremely low values compared to the rest of the data values.

- Pie chart
- A circle that is divided into sections (slices) according to the percentage of the frequencies in each class.

- Qualitative variable
- A variable that can be placed into specific categories according to some defined characteristic.

- Quantitative variable
- A variable that is numerical in nature and that can be ordered.

- Right-skewed distribution
- A distribution in which most of the data values are located to the left of the mean.

- Scatter plot
- A graph used to investigate whether or not there is a relationship between 2 sets of data. The data is plotted on a graph such that one quantity is plotted on the $x$-axis and one quantity is plotted on the $y$-axis.

- Stem-and-leaf plot
- A method of organizing data that includes sorting the data and graphing it at the same time. This type of graph uses the stem as the leading part of the data value and the leaf as the remaining part of the value.

- Symmetric histogram
- A histogram for which the values of the mean, median, and mode are all the same and are all located at the center of the distribution.

- Variable
- A characteristic that is being studied.