### Box-and-Whisker Plots

In traditional statistics, data is organized by using a frequency distribution. The results of the frequency distribution can then be used to create various graphs, such as a histogram or a frequency polygon, which indicate the shape or nature of the distribution. The shape of the distribution will allow you to confirm various conjectures about the nature of the data.

To examine data in order to identify patterns, trends, or relationships, exploratory data analysis is used. In exploratory data analysis, organized data is displayed in order to make decisions or suggestions regarding further actions. A **box-and-whisker plot** (often called a box plot) can be used to graphically represent the data set, and the graph involves plotting 5 specific values. The 5 specific values are often referred to as a **five-number summary** of the organized data set. The five-number summary consists of the following:

- The lowest number in the data set (minimum value)
- The lower quartile, \begin{align*}Q_1\end{align*} (the median of the lower half of the data set)
- The median of the entire data set (median)
- The upper quartile, \begin{align*}Q_3\end{align*} (the median of the upper half of the data set)
- The highest number in the data set (maximum value)
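The five values above can be computed directly. The following is a minimal Python sketch, assuming the median-of-halves convention used in this section (when the number of values is odd, the middle value is left out of both halves). The function name `five_number_summary` is an illustrative choice, and the sample data is the set from Example 1 later in this section.

```python
from statistics import median


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using medians of halves."""
    data = sorted(values)
    n = len(data)
    if n < 2:
        raise ValueError("need at least two values")
    half = n // 2
    lower = data[:half]            # values before the middle position
    upper = data[half + n % 2:]    # skips the middle value when n is odd
    return (data[0], median(lower), median(data), median(upper), data[-1])


print(five_number_summary([12, 16, 36, 10, 31, 23, 58]))
# (10, 12, 23, 36, 58)
```

Other quartile conventions exist, so software may report slightly different values for \begin{align*}Q_1\end{align*} and \begin{align*}Q_3\end{align*} on the same data.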

The display of the five-number summary produces a box-and-whisker plot as shown below:

The above model of a box-and-whisker plot shows 2 horizontal lines (the whiskers) that each contain 25% of the data and are of the same length. In addition, it shows that the median of the data set is in the middle of the box, which contains 50% of the data. The lengths of the whiskers and the location of the median with respect to the center of the box are used to describe the distribution of the data. It's important to note that this is just an example. Not all box-and-whisker plots have the median in the middle of the box and whiskers of the same size.

Information about the data set that can be determined from the box-and-whisker plot with respect to the location of the median includes the following:

a. If the median is located in the center or near the center of the box, the distribution is approximately symmetric.

b. If the median is located to the left of the center of the box, the distribution is positively skewed.

c. If the median is located to the right of the center of the box, the distribution is negatively skewed.
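The three rules above can be written as a short check. This is only a sketch: the function name `skew_from_median` and the `tolerance` parameter (how close to the center counts as "near") are assumptions made for illustration, since the text does not give a numeric cutoff.

```python
def skew_from_median(q1, median, q3, tolerance=0.1):
    """Classify skew from where the median sits inside the box.

    tolerance is a hypothetical cutoff, expressed as a fraction of the
    box width, for treating the median as "near the center".
    """
    box_width = q3 - q1
    if box_width == 0:
        # Degenerate box: treated as symmetric here (an assumption).
        return "approximately symmetric"
    center = (q1 + q3) / 2
    offset = (median - center) / box_width
    if abs(offset) <= tolerance:
        return "approximately symmetric"
    # Median left of center -> positive skew; right of center -> negative skew.
    return "positively skewed" if offset < 0 else "negatively skewed"


print(skew_from_median(70, 75, 95))
# positively skewed
```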

Information about the data set that can be determined from the box-and-whisker plot with respect to the length of the whiskers includes the following:

a. If the whiskers are the same or almost the same length, the distribution is approximately symmetric.

b. If the right whisker is longer than the left whisker, the distribution is positively skewed.

c. If the left whisker is longer than the right whisker, the distribution is negatively skewed.
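The whisker rules can be sketched the same way. As before, the name `skew_from_whiskers` and the `tolerance` value for "almost the same length" are illustrative assumptions, not part of the lesson.

```python
def skew_from_whiskers(minimum, q1, q3, maximum, tolerance=0.1):
    """Classify skew by comparing the lengths of the two whiskers.

    tolerance is a hypothetical cutoff: whiskers whose lengths differ by
    no more than this fraction of the longer one count as equal.
    """
    left = q1 - minimum
    right = maximum - q3
    longest = max(left, right)
    if longest == 0 or abs(right - left) / longest <= tolerance:
        return "approximately symmetric"
    # Longer right whisker -> positive skew; longer left -> negative skew.
    return "positively skewed" if right > left else "negatively skewed"


print(skew_from_whiskers(10, 12, 36, 58))
# positively skewed
```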

The length of the whiskers also gives you information about how spread out the data is.

A box-and-whisker plot is often used when the number of data values is large. The center of the distribution, the nature of the distribution, and the range of the data are very obvious from the graph. The five-number summary divides the data into quarters by use of the medians of the upper and lower halves of the data. Many data sets contain values that are either extremely high values or extremely low values compared to the rest of the data values. These values are called **outliers**. There are several reasons why a data set may contain an outlier. Some of these are listed below:

- The value may be the result of an error made in measurement or in observation. The researcher may have measured the variable incorrectly.
- The value may simply be an error made by the researcher in recording the value. The value may have been written or typed incorrectly.
- The value could be a result obtained from a subject not within the defined population. A researcher recording marks from a math 12 examination may have recorded a mark by a student in grade 11 who was taking math 12.
- The value could be one that is legitimate but is extreme compared to the other values in the data set. (This rarely occurs, but it is a possibility.)

If an outlier is present because of an error in measurement, observation, or recording, then either the error should be corrected, or the outlier should be omitted from the data set. If the outlier is a legitimate value, then the statistician must make a decision as to whether or not to include it in the set of data values. There is no rule that tells you what to do with an outlier in this case.

One method for checking a data set for the presence of an outlier is to follow the procedure below:

- Organize the given data set and determine the values of \begin{align*}Q_1\end{align*} and \begin{align*}Q_3\end{align*}.
- Calculate the difference between \begin{align*}Q_3\end{align*} and \begin{align*}Q_1\end{align*}. This difference is called the **interquartile range (IQR)**: \begin{align*}IQR = Q_3-Q_1\end{align*}.
- Multiply the IQR by 1.5, subtract this result from \begin{align*}Q_1\end{align*}, and add it to \begin{align*}Q_3\end{align*}.
- The two results from the previous step mark the ends of the range into which all values of the data set should fit. Any values that are below or above this range are considered outliers.
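The procedure can be sketched in a few lines of Python. The function name `find_outliers` is an illustrative choice, and the quartiles are computed with the median-of-halves convention assumed throughout this section. The sample data is the first data set checked in the Checking for Outliers section below.

```python
from statistics import median


def find_outliers(values):
    """Return (lower limit, upper limit, outliers) using the 1.5 x IQR rule."""
    data = sorted(values)
    n = len(data)
    q1 = median(data[: n // 2])          # median of the lower half
    q3 = median(data[(n + 1) // 2 :])    # median of the upper half
    step = 1.5 * (q3 - q1)               # 1.5 times the interquartile range
    low, high = q1 - step, q3 + step
    outliers = [x for x in data if x < low or x > high]
    return low, high, outliers


print(find_outliers([18, 20, 24, 21, 5, 23, 19, 22]))
# (12.5, 28.5, [5])
```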

#### Listing a Five-Number Summary and Describing the Distribution

For each box-and-whisker plot, list the five-number summary and describe the distribution based on the location of the median.

a. Minimum value \begin{align*}\rightarrow 4\end{align*}

\begin{align*}Q_1 \rightarrow 6\end{align*}

Median \begin{align*}\rightarrow 9\end{align*}

\begin{align*}Q_3 \rightarrow 10\end{align*}

Maximum value \begin{align*}\rightarrow 12\end{align*}

The median of the data set is located to the right of the center of the box, which indicates that the distribution is negatively skewed.

b. Minimum value \begin{align*}\rightarrow 225\end{align*}

\begin{align*}Q_1 \rightarrow 250\end{align*}

Median \begin{align*}\rightarrow 300\end{align*}

\begin{align*}Q_3 \rightarrow 325\end{align*}

Maximum value \begin{align*}\rightarrow 350\end{align*}

The median of the data set is located to the right of the center of the box, which indicates that the distribution is negatively skewed.

c. Minimum value \begin{align*}\rightarrow 60\end{align*}

\begin{align*}Q_1 \rightarrow 70\end{align*}

Median \begin{align*}\rightarrow 75\end{align*}

\begin{align*}Q_3 \rightarrow 95\end{align*}

Maximum value \begin{align*}\rightarrow 100\end{align*}

The median of the data set is located to the left of the center of the box, which indicates that the distribution is positively skewed.
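In each case, the center of the box is the midpoint of \begin{align*}Q_1\end{align*} and \begin{align*}Q_3\end{align*}, and the conclusion follows from comparing the median with that midpoint:

\begin{align*}& \text{a.} \quad \frac{6+10}{2} = 8, \quad \text{median } 9 > 8\\ & \text{b.} \quad \frac{250+325}{2} = 287.5, \quad \text{median } 300 > 287.5\\ & \text{c.} \quad \frac{70+95}{2} = 82.5, \quad \text{median } 75 < 82.5\end{align*}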

#### Constructing a Box-and-Whisker Plot

The numbers of square feet (in 100s) of 10 of the largest museums in the world are shown below:

650, 547, 204, 213, 343, 288, 222, 250, 287, 269

Construct a box-and-whisker plot for the above data set and describe the distribution.

The first step is to convert the values to square feet (multiply each by 100) and arrange them in order from least to greatest:

\begin{align*}20,400 \quad 21,300 \quad 22,200 \quad 25,000 \quad 26,900 \quad 28,700 \quad 28,800 \quad 34,300 \quad 54,700 \quad 65,000\end{align*}

Now calculate the median, \begin{align*}Q_1\end{align*}, and \begin{align*}Q_3\end{align*}.

\begin{align*}20,400 \quad 21,300 \quad 22,200 \quad 25,000 \quad \boxed{26,900 \quad 28,700} \quad 28,800 \quad 34,300 \quad 54,700 \quad 65,000\end{align*}

\begin{align*}\text{Median} \rightarrow \frac{26,900+28,700}{2} = \frac{55,600}{2} = 27,800 \end{align*}

\begin{align*}Q_1 = 22,200\end{align*}

\begin{align*}Q_3 = 34,300\end{align*}

Next, complete the following list:

Minimum value \begin{align*}\rightarrow 20,400\end{align*}

\begin{align*}Q_1 \rightarrow 22,200\end{align*}

Median \begin{align*}\rightarrow 27,800\end{align*}

\begin{align*}Q_3 \rightarrow 34,300\end{align*}

Maximum value \begin{align*}\rightarrow 65,000\end{align*}

The right whisker is longer than the left whisker, which indicates that the distribution is positively skewed.
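The whisker lengths can be checked from the five-number summary:

\begin{align*}\text{Left whisker} &= Q_1 - \text{minimum} = 22,200 - 20,400 = 1,800\\ \text{Right whisker} &= \text{maximum} - Q_3 = 65,000 - 34,300 = 30,700\end{align*}

The center of the box is \begin{align*}\frac{22,200+34,300}{2} = 28,250\end{align*}, so the median of 27,800 lies slightly to the left of center, which is consistent with a positively skewed distribution.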

#### Checking for Outliers

Using the procedure outlined above, check the following data sets for outliers:

a. 18, 20, 24, 21, 5, 23, 19, 22

Organize the given data set as follows:

\begin{align*}& 18, \ 20, \ 24, \ 21, \ 5, \ 23, \ 19, \ 22\\ & 5, \ 18, \ 19, \ 20, \ 21, \ 22, \ 23, \ 24\end{align*}

Determine the values for \begin{align*}Q_1\end{align*} and \begin{align*}Q_3\end{align*}.

\begin{align*}5, \ \boxed{18, \ 19}, \ 20, \ 21, \ \boxed{22, \ 23}, \ 24\end{align*}

\begin{align*}Q_1 = \frac{18+19}{2} = \frac{37}{2}= 18.5 \qquad Q_3=\frac{22+23}{2}=\frac{45}{2}=22.5\end{align*}

Calculate the difference between \begin{align*}Q_1\end{align*} and \begin{align*}Q_3\end{align*}: \begin{align*}Q_3-Q_1=22.5-18.5=4.0.\end{align*}

Multiply this difference by 1.5: \begin{align*}(4.0)(1.5)=6.0\end{align*}.

Finally, compute the range.

\begin{align*}Q_1-6.0=18.5-6.0=12.5 \end{align*}

\begin{align*}Q_3+6.0=22.5+6.0=28.5.\end{align*}

Are there any data values below 12.5? Yes, the value of 5 is below 12.5 and is, therefore, an outlier.

Are there any values above 28.5? No, there are no values above 28.5.

b. 13, 15, 19, 14, 26, 17, 12, 42, 18

Organize the given data set as follows:

\begin{align*}& 13, \ 15, \ 19, \ 14, \ 26, \ 17, \ 12, \ 42, \ 18\\ & 12, \ 13, \ 14, \ 15, \ 17, \ 18, \ 19, \ 26, \ 42\end{align*}

Determine the values for \begin{align*}Q_1\end{align*} and \begin{align*}Q_3\end{align*}.

\begin{align*}12, \ \boxed{13, \ 14}, \ 15, \ \boxed{17}, \ 18, \ \boxed{19, \ 26}, \ 42\end{align*}

\begin{align*}Q_1=\frac{13+14}{2} = \frac{27}{2}=13.5 \qquad Q_3 = \frac{19+26}{2} = \frac{45}{2} = 22.5\end{align*}

Calculate the difference between \begin{align*}Q_1\end{align*} and \begin{align*}Q_3\end{align*}: \begin{align*}Q_3-Q_1=22.5-13.5=9.0.\end{align*}

Multiply this difference by 1.5: \begin{align*}(9.0)(1.5)=13.5\end{align*}.

Finally, compute the range.

\begin{align*}Q_1-13.5=13.5-13.5=0\end{align*}

\begin{align*}Q_3+13.5=22.5+13.5=36.0\end{align*}

Are there any data values below 0? No, there are no values below 0.

Are there any values above 36.0? Yes, the value of 42 is above 36.0 and is, therefore, an outlier.

**Points to Consider**

- Are there still other ways to represent data graphically?
- Are there other uses for a box-and-whisker plot?
- Can box-and-whisker plots be used for comparing data sets?


### Examples

For the following data sets, determine the five-number summaries:

#### Example 1

12, 16, 36, 10, 31, 23, 58

The first step is to organize the values in the data set as shown below:

\begin{align*}&12, \ 16, \ 36, \ 10, \ 31, \ 23, \ 58\\ &10, \ 12, \ 16, \ 23, \ 31, \ 36, \ 58\end{align*}

Now complete the following list:

Minimum value \begin{align*}\rightarrow 10\end{align*}

\begin{align*}Q_1 \rightarrow 12\end{align*}

Median \begin{align*}\rightarrow 23\end{align*}

\begin{align*}Q_3 \rightarrow 36\end{align*}

Maximum value \begin{align*}\rightarrow 58\end{align*}

#### Example 2

144, 240, 153, 629, 540, 300

The first step is to organize the values in the data set as shown below:

\begin{align*}&144, \ 240, \ 153, \ 629, \ 540, \ 300\\ &144, \ 153, \ 240, \ 300, \ 540, \ 629\end{align*}

Now complete the following list:

Minimum value \begin{align*}\rightarrow 144\end{align*}

\begin{align*}Q_1 \rightarrow 153\end{align*}

Median \begin{align*}\rightarrow 270\end{align*}

\begin{align*}Q_3 \rightarrow 540\end{align*}

Maximum value \begin{align*}\rightarrow 629\end{align*}
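When checking answers like these with software, note that many tools compute quartiles by interpolation rather than by taking the medians of the two halves. The sketch below compares the two approaches on the Example 2 data, using Python's standard-library `statistics.quantiles` with its default settings. The variable names are illustrative.

```python
from statistics import median, quantiles

data = sorted([144, 240, 153, 629, 540, 300])

# Median-of-halves convention, as used in this section.
half = len(data) // 2
q1_halves = median(data[:half])
q3_halves = median(data[half + len(data) % 2:])

# Standard-library default (the "exclusive" method), which interpolates
# between neighboring values.
q1_lib, q2_lib, q3_lib = quantiles(data, n=4)

print(q1_halves, q3_halves)
# 153 540
print(q1_lib, q2_lib, q3_lib)
# 150.75 270.0 562.25
```

The median agrees in both cases, while the quartiles differ slightly. Neither result is wrong. They follow different conventions, so it helps to know which one a given tool uses.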

#### Example 3

Use the data set from Example 1 and the five-number summary to construct a box-and-whisker plot to model the data set.

The five-number summary found in Example 1 can now be used to construct the box-and-whisker plot. Be sure to provide a scale on the number line that includes the range from the minimum value to the maximum value.

Minimum value \begin{align*}\rightarrow 10\end{align*}

\begin{align*}Q_1 \rightarrow 12\end{align*}

Median \begin{align*}\rightarrow 23\end{align*}

\begin{align*}Q_3 \rightarrow 36\end{align*}

Maximum value \begin{align*}\rightarrow 58\end{align*}

The right whisker is clearly much longer than the left whisker, which indicates that the distribution is positively skewed.
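A rough text rendering can help in checking the construction. The sketch below uses a hypothetical helper, `ascii_boxplot`, that is not part of the lesson. It draws the whiskers as `-`, the box as `=`, the quartiles as `[` and `]`, and the median as `M`. With the default width of 49 columns and this data set, each column corresponds to one unit on the number line.

```python
def ascii_boxplot(minimum, q1, median, q3, maximum, width=49):
    """Draw a five-number summary as a single line of text.

    Hypothetical helper for illustration; assumes maximum > minimum.
    """
    span = maximum - minimum

    def col(value):
        # Scale a data value to a column index from 0 to width - 1.
        return round((value - minimum) / span * (width - 1))

    line = [" "] * width
    for i in range(col(minimum), col(q1)):
        line[i] = "-"                      # left whisker
    for i in range(col(q3), col(maximum) + 1):
        line[i] = "-"                      # right whisker
    for i in range(col(q1), col(q3) + 1):
        line[i] = "="                      # box
    # Markers are placed last so they are not overwritten.
    line[col(minimum)] = "|"
    line[col(maximum)] = "|"
    line[col(q1)] = "["
    line[col(q3)] = "]"
    line[col(median)] = "M"
    return "".join(line)


plot = ascii_boxplot(10, 12, 23, 36, 58)
print(plot)
```

In the printed line, the run of dashes to the right of `]` is far longer than the run to the left of `[`, which matches the long right whisker described above.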

### Review

- Which of the following is not a part of the five-number summary?
  - \begin{align*}Q_1\end{align*} and \begin{align*}Q_3\end{align*}
  - the mean
  - the median
  - minimum and maximum values

- What percent of the data is contained in the box of a box-and-whisker plot?
  - 25%
  - 100%
  - 50%
  - 75%

- What name is given to the horizontal lines to the left and right of the box of a box-and-whisker plot?
  - axis
  - whisker
  - range
  - plane

- What term describes the distribution of a data set if the median of the data set is located to the left of the center of the box in a box-and-whisker plot?
  - positively skewed
  - negatively skewed
  - approximately symmetric
  - not skewed

- What 2 values of the five-number summary are connected with 2 horizontal lines on a box-and-whisker plot?
  - Minimum value and the median
  - Maximum value and the median
  - Minimum and maximum values
  - \begin{align*}Q_1\end{align*} and \begin{align*}Q_3\end{align*}

- For the following data sets, determine the five-number summaries:
  - 74, 69, 83, 79, 60, 75, 67, 71
  - 6, 9, 3, 12, 11, 9, 15, 5, 7

- For each of the following box-and-whisker plots, list the five-number summary and comment on the distribution of the data:
- The following data represent the number of coins that 12 randomly selected people had in their piggy banks: \begin{align*}35 \quad 58 \quad 29 \quad 44 \quad 104 \quad 39 \quad 72 \quad 34 \quad 50 \quad 41 \quad 64 \quad 54\end{align*} Construct a box-and-whisker plot for the above data.
- The following data represent the time (in minutes) that each of 20 people waited in line at a local book store to purchase the latest Harry Potter book: \begin{align*}& 15 \quad 8 \quad 5 \quad \ 10 \quad 14 \quad 17 \quad 21 \quad 23 \quad 6 \quad 19 \quad 31 \quad 34 \quad 30 \quad 31\\ & 3 \quad 22 \quad 17 \quad 25 \quad 5 \quad 16\end{align*} Construct a box-and-whisker plot for the above data. Are the data skewed in any direction?
- Firman’s Fitness Factory is a new gym that offers reasonably-priced family packages. The following table represents the number of family packages sold during the opening month: \begin{align*}& 24 \quad 21 \quad 31 \quad 28 \quad 29\\ & 27 \quad 22 \quad 27 \quad 30 \quad 32\\ & 26 \quad 35 \quad 24 \quad 22 \quad 34\\ & 30 \quad 28 \quad 24 \quad 32 \quad 27\\ & 32 \quad 28 \quad 27 \quad 32 \quad 23\\ & 20 \quad 32 \quad 28 \quad 32 \quad 34\end{align*} Construct a box-and-whisker plot for the data. Are the data symmetric or skewed?
- Shown below is the number of new stage shows that appeared in Las Vegas for each of the past several years. Construct a box-and-whisker plot for the data and comment on the shape of the distribution. \begin{align*}31 \quad 29 \quad 34 \quad 30 \quad 38 \quad 40 \quad 36 \quad 38 \quad 32 \quad 39 \quad 35\end{align*}
- The following data represent the average snowfall (in centimeters) for 18 Canadian cities for the month of January. Construct a box-and-whisker plot to model the data. Is the data skewed? Justify your answer.

| Name of City | Amount of Snow (cm) |
|---|---|
| Calgary | 123.4 |
| Charlottetown | 74.5 |
| Edmonton | 80.6 |
| Fredericton | 73.8 |
| Halifax | 64.0 |
| Labrador City | 110.4 |
| Moncton | 82.4 |
| Montreal | 63.6 |
| Ottawa | 48.9 |
| Quebec City | 53.8 |
| Regina | 35.9 |
| Saskatoon | 25.4 |
| St. John’s | 97.5 |
| Sydney | 44.2 |
| Toronto | 21.8 |
| Vancouver | 12.8 |
| Victoria | 8.3 |
| Winnipeg | 76.2 |

- Using the procedure outlined in this concept, check the following data sets for outliers:
  - 25, 33, 55, 32, 17, 19, 15, 18, 21
  - 149, 123, 126, 122, 129, 120

### Review (Answers)

To view the Review answers, open this PDF file and look for section 7.11.