- Construct parallel box plots
- Construct back-to-back stem plots
- Compare more than one set of numerical data in context
Parallel Box Plots
Parallel box plots (also called side-by-side box plots) are very useful when two or more numerical data sets need to be compared. The box plots are drawn parallel to one another along the same number line. This can be done vertically or horizontally and for as many data sets as needed.
The figure shows the distributions of the temperatures for three different cities. By graphing the three box plots along the same axis, it becomes very easy to compare the temperatures of the three cities. What are some conclusions that can be drawn about the temperatures in these three cities?
Here are some conclusions that might be made, based on the graphs. Think S.O.C.C.S! Be sure to compare the distributions to one another, using statistics to support your observations.
The first quartile for City 2 is higher than the third quartile for City 1 and the median for City 3. Also, the minimum temperature in City 2 is at about the median for the other two cities.
City 2 is generally warmer than both of the other cities. Cities 1 and 3 have nearly the same median temperature, around 60° to 63°, whereas the median temperature in City 2 is around 82°.
City 3 has a much larger range in temperatures (35° to 85°) than City 1 (45° to 75°) or City 2 (62° to 95°). City 1 has the smallest range, so its temperatures are the most consistent of the three.
The temperature distributions in all three cities are fairly symmetrical and none have any outliers.
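The range comparison above can be checked with a few lines of Python. This is only a minimal sketch: it uses the approximate minimum and maximum temperatures quoted in the text (read from the graph), and the names `extremes` and `ranges` are chosen here for illustration.

```python
# Approximate (minimum, maximum) temperatures quoted in the text above.
extremes = {
    "City 1": (45, 75),
    "City 2": (62, 95),
    "City 3": (35, 85),
}

# Range = maximum - minimum for each city.
ranges = {city: high - low for city, (low, high) in extremes.items()}

for city, value in ranges.items():
    print(f"{city}: range = {value} degrees")

# The city with the smallest range has the most consistent temperatures.
most_consistent = min(ranges, key=ranges.get)
print("Most consistent:", most_consistent)
```

The ranges come out as 30, 33, and 50 degrees for Cities 1, 2, and 3, which supports the observation that City 1 is the most consistent.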
Comparing Numerical Data Sets
When you are given numerical data sets for more than one variable and asked to compare them, you will need to construct a graphical representation of each data set. To compare them to one another, the scales must match. When comparing more than one box plot, we construct parallel box plots. When using histograms, we can match the horizontal and vertical scales so that the separate histograms 'line up'. Dot plots work the same way as histograms. Such comparisons are also possible with stem plots. Two sets of numerical data can share the stems in the middle, with one set's 'leaves' going to the right and the other set's 'leaves' going to the left. On both sides of the plot, the 'leaves' are listed in increasing order moving outward from the stems. Plots like these are called back-to-back stem plots.
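The construction of a back-to-back stem plot can be sketched in Python. This is one possible text layout, not a standard routine: the function name `back_to_back_stem_plot` is hypothetical, the two data sets are made up for illustration, and the sketch assumes non-negative whole numbers whose stems are the tens digits.

```python
def back_to_back_stem_plot(left, right):
    """Return the text lines of a back-to-back stem plot.

    Assumes non-negative whole numbers. The stem is the tens part
    (value // 10) and the leaf is the ones digit (value % 10).
    """
    all_values = list(left) + list(right)
    stems = range(min(all_values) // 10, max(all_values) // 10 + 1)

    rows = []
    for stem in stems:
        # Left leaves are written in decreasing order so that they
        # increase when read outward (leftward) from the stem.
        left_leaves = sorted((v % 10 for v in left if v // 10 == stem), reverse=True)
        right_leaves = sorted(v % 10 for v in right if v // 10 == stem)
        rows.append((
            " ".join(map(str, left_leaves)),
            stem,
            " ".join(map(str, right_leaves)),
        ))

    # Right-justify the left side so the stems line up in one column.
    width = max(len(row[0]) for row in rows)
    return [f"{l.rjust(width)} | {stem} | {r}".rstrip() for l, stem, r in rows]


# Made-up values for illustration only; not data from this section.
group_a = [62, 68, 71, 75, 75, 83, 89, 94]
group_b = [65, 70, 72, 78, 81, 84, 86, 97]

for line in back_to_back_stem_plot(group_a, group_b):
    print(line)
```

For these made-up values the row for stem 7 prints as `5 5 1 | 7 | 0 2 8`: the shared stem sits in the middle and the leaves on each side increase moving away from it.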
Once you have constructed any of these types of comparative graphical representations (on the same scale), you can make observations about how the data sets are the same and how they are different. Just as we have been doing up to this point, those comparisons should be made in context. The observations might address the shapes of the distributions and whether or not any outliers are present. It is important to compare the centers of the distributions (means, medians, or modes). The spreads of the distributions should also be addressed (ranges, IQRs, or standard deviations).
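As a rough illustration of comparing center and spread numerically, the sketch below uses Python's standard `statistics` module. The helper name `center_and_spread` and both data sets are invented for this example, and the standard deviation shown is the sample version (dividing by n - 1).

```python
import statistics


def center_and_spread(data):
    """Summarize one data set with measures of center and spread."""
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "range": max(data) - min(data),
        "stdev": statistics.stdev(data),  # sample standard deviation (n - 1)
    }


# Made-up values for illustration only; not data from this section.
set_1 = [2, 4, 4, 5, 7, 8]
set_2 = [1, 3, 5, 5, 9, 13]

for name, data in (("Set 1", set_1), ("Set 2", set_2)):
    summary = center_and_spread(data)
    print(name, {key: round(value, 2) for key, value in summary.items()})
```

Here the second set has the larger range and the larger standard deviation, so it would be described as more spread out (less consistent) than the first. In a real comparison those numbers would be reported in context, with units.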
A teacher gave the same physics exam to her two sections of physics. She has been wondering whether the first period and fifth period classes are learning the same amount as one another. She constructed this back-to-back stem plot to compare the test scores for the two different classes.
a) Calculate the five number summary for both classes.
b) Calculate the mean and standard deviation for both classes.
c) Compare the two classes' test scores in context.
a) The numbers in the stem plots are already in order, so these statistics could be found by hand or with a graphing calculator.
b) These statistics are most efficiently found using a graphing calculator.
c) Overall, Class A did better on this test than Class B did. Class A's scores on this test are skewed to the left, but Class B's scores are skewed to the right. Neither class has any outliers among the test scores. Class A's mean score is about 9 points higher (85.7 compared to 76.6) and its median score is 15 points higher (90.5 compared to 75.5). The overall range for the two classes is fairly similar, but the Class A students' scores were less consistent. The ranges (32 and 40), IQRs (14 and 19), and standard deviations (10.1 and 12.6) all show that Class B's scores are less spread out than Class A's scores.
An oil company claims that its premium grade gasoline contains an additive that significantly increases gas mileage. The company conducted the following experiment in an effort to prove its claim. It selected 15 drivers who all drove the same make, model, and year of car. Starting with an empty gas tank, each car was filled with 45 L of one of the two types of gasoline (selected in a random order). The driver was asked to drive until the fuel warning light came on. The number of kilometers was recorded, and then the car was filled with the other type of gasoline (whichever had not been used yet). The process was repeated and the number of kilometers was again recorded. The results below show the number of kilometers each car traveled.
Display each set of data to explain whether or not the claim made by the oil company is true or false.
Order the data: List the values in order for each set of data.
Five number summaries: Determine the five number summary for each set of data separately. Be sure to report your five number summary, whether asked to or not.
Box plots: Mark your number axis so that it covers the entire range needed, from the smallest minimum to the largest maximum (we need 500 to 709 for these two data sets). Then graph each box plot along the same axis, parallel to each other. This allows the two data sets to be easily compared to one another.
Key: blue = regular gasoline
gold = premium gasoline
Conclusions: Make comparisons by looking for any similarities and differences between the two distributions. Remember your S.O.C.C.S!
Based on this experiment, the number of kilometers that the cars were able to travel on the premium gasoline was generally greater than the number of kilometers that the same cars were able to travel on the regular gasoline. The median number of kilometers for premium gasoline was 637, compared to 587 for regular gasoline. The first quartile for premium was higher than the third quartile for regular. Also, 25% of those with the premium gasoline went farther than all of those using regular gasoline. The distribution for the regular fuel is slightly skewed to the right, but doesn't have any outliers. However, the premium distribution is strongly skewed to the left, toward one outlier on the low end (500 km). Based on these results, it appears that the additive in the premium gasoline does improve gas mileage for this make and model of car. Further tests should be done on other types of vehicles.
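The steps above (order the data, find each five number summary, choose a shared axis, check for outliers) can be sketched in Python. This is a sketch under stated assumptions rather than the method used for the figure: the quartiles are taken as the medians of the lower and upper halves of the ordered data (leaving out the middle value when the count is odd), which is one common textbook convention, and other software may give slightly different quartiles. The function names and both samples are hypothetical, and each data set needs at least two values.

```python
import statistics


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum).

    Q1 and Q3 are the medians of the lower and upper halves of the
    ordered data; the middle value is left out when the count is odd.
    """
    ordered = sorted(data)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[-half:]
    return (
        ordered[0],
        statistics.median(lower),
        statistics.median(ordered),
        statistics.median(upper),
        ordered[-1],
    )


def outliers(data):
    """Return the values that fall outside the 1.5 * IQR fences."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [x for x in data if x < low_fence or x > high_fence]


# Made-up values for illustration only; not the data from the experiment.
sample_a = [40, 52, 55, 57, 58, 60, 61, 63, 65]
sample_b = [50, 56, 59, 62, 64, 66, 67, 70, 72]

# The shared axis must run from the smallest minimum to the largest maximum.
axis_limits = (min(sample_a + sample_b), max(sample_a + sample_b))

print(five_number_summary(sample_a), outliers(sample_a))
print(five_number_summary(sample_b), outliers(sample_b))
print("Axis limits:", axis_limits)
```

With these made-up values, the first sample has one low outlier (40) and the second has none, so the two box plots would be drawn on a common axis covering 40 to 72.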
The heights of a group of students are all included in the first histogram. The second histogram contains only the data from the male students, and the third is a graph of the heights of only the female students. Explain what these histograms show.
The range of heights of all students in this group is approximately 20 inches. However, the female heights span only about 11 inches and the male heights span only about 13 inches. The females' height distribution is the most symmetrical of the three. There is one male whose height is a high outlier, but there are no outliers among the females. The median height is around 70 inches for the whole group, slightly higher at around 72 inches for the males, and around 65 inches for the females. In general, the female students tend to be shorter than the male students.
Problem Set 5.6
Section 5.6 Exercises
1) Compare the % Daily Value for Total Fat (g) to the % Daily Value for Saturated Fat (g) for these McDonald's® sandwiches.
a) Calculate the five number summary for both %Daily Values.
b) Construct parallel box plots for both.
c) Make at least four observations to compare these two distributions.
Source: http://nutrition.mcdonalds.com. July 27, 2011.
2) The heights of the students in a statistics class were all measured to the nearest inch. The results are presented in this back-to-back stem plot. Notice that it is also a split stem plot. The girls' heights are on the right side, increasing outward to the right, and the boys' heights are on the left side, increasing outward to the left.
a) Compute the standard deviation, the range, and the IQR for both girls and boys.
b) Compare the spread for the two groups, based on your answers to (a), in context.
c) Compute the mean, median, and mode for both boys and girls.
d) Compare the center for the two groups, based on your answers to (c), in context.
e) Compare the shape of the distributions, based on the graph, in context.
3) Compare the results of the Probability and Statistics District Common Assessment for two statistics classes.
a) Construct back-to-back stem plots (use split-stems) for these two classes.
b) Calculate the five number summaries for both classes.
c) Calculate the following statistics for both classes: mean, standard deviation, mode, range, and IQR.
d) Compare and contrast the two distributions. This should be in context and you should make at least four distinct observations.
4) The number of home-runs during a season is one of the statistics recorded about baseball players. The following table has the number of home-runs (over many seasons) for several of the best hitters in baseball. Compare the home-run hitting performance of these exceptional baseball players.
a) Calculate the following statistics for all four players:
b) Construct parallel box plots for the four players. Be sure to use the same scale for all four graphs and to label each graph.
c) Test for outliers, for all four players, using the 1.5 × IQR criterion (show work).
d) Compare and contrast the four distributions. This should be in context and you should make at least four distinct observations.
5) The following box plots show the average miles per gallon (city) for various types of vehicles. Comment on what these parallel box plots show. This should be in context and include at least 4 distinct observations. The dots represent outliers for that data set.
Figure: parallel box plots of city miles per gallon by vehicle type, drawn in R with boxplot(MPG.city ~ Type) from the base package.
6) Refer to the four dot plots to answer the questions that follow.
a) Identify the overall shape of each distribution.
b) How would you characterize the center(s) of these distributions?
c) Name at least two statistics that would most likely be the same for all four of these distributions.
d) Which of these distributions has the smallest standard deviation? Which of these distributions has the largest standard deviation? Explain.
e) For which of these distributions would it be appropriate to use the mean and standard deviation as numerical summaries? For which would the five number summary be more appropriate?