Graphs for Univariate Data
Univariate data consists of observations of a single numerical variable.
A dot plot is one of the simplest ways to represent numerical data. After choosing an appropriate scale on the axes, each data point is plotted as a single dot. Multiple points at the same value are stacked on top of each other using equal spacing to help convey the shape and center.
Constructing a Dot Plot
The following is a data set representing the percentage of paper packaging manufactured from recycled materials for a select group of countries.
|Country||% of Paper Packaging Recycled|
The dot plot for this data would look like this:
Notice that this data set is centered at a recycled-content rate of between 65 and 70 percent. It is spread from 34% to 98%, and appears very roughly symmetric, perhaps even slightly skewed left. Dot plots have the advantage of showing all the data points and giving a quick and easy snapshot of the shape, center, and spread. Dot plots are not much help when there is little repetition in the data. They can also be very tedious to create by hand for large data sets, though computer software makes quick work of such plots.
One of the shortcomings of dot plots is that they do not show the actual values of the data. You have to read or infer them from the graph. From the previous example, you might have been able to guess that the lowest value is 34%, but you would have to look in the data table itself to know for sure. A stem-and-leaf plot is a similar plot in which it is much easier to read the actual data values. In a stem-and-leaf plot, each data value is represented by two digits: the stem and the leaf. In this example, it makes sense to use the tens digits for the stems and the ones digits for the leaves. The stems are on the left of a dividing line as follows:
Once the stems are decided, the leaves representing the ones digits are listed in numerical order from left to right:
It is important to explain the meaning of the data in the plot for someone who is viewing it without seeing the original data. For example, you could place the following sentence at the bottom of the chart:
Note: 5 | 6 9 means 56% and 59% are the two values in the 50's.
If you could rotate this plot on its side, you would see the similarities with the dot plot. The general shape and center of the plot are easily found, and we know exactly what each point represents. This plot also shows the slight skewing to the left that we suspected from the dot plot. Stem plots can be difficult to create, depending on the numerical qualities and the spread of the data. If the data values contain more than two digits, you will need to remove some of the information by rounding. A data set that has large gaps between values can also make the stem plot hard to create and less useful when interpreting the data.
Creating a Stem-and-Leaf Plot
Consider the following populations of counties in California.
Butte - 220,748
Calaveras - 45,987
Del Norte - 29,547
Fresno - 942,298
Humboldt - 132,755
Imperial - 179,254
San Francisco - 845,999
Santa Barbara - 431,312
To construct a stem-and-leaf plot, we first need to make sure each piece of data has the same number of digits. In our data, we will add a 0 at the beginning of each five-digit data point so that all data points have six digits. Then, we can either round or truncate each data point to its first two digits.
|Value||Value Rounded||Value Truncated|
represents when data has been truncated
represents when data has been rounded.
If we decide to round the above data, we have:
Butte - 220,000
Calaveras - 050,000
Del Norte - 030,000
Fresno - 940,000
Humboldt - 130,000
Imperial - 180,000
San Francisco - 850,000
Santa Barbara - 430,000
The stem-and-leaf plot will then be as follows:
Source: California State Association of Counties
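As a minimal sketch of the rounding and truncating steps, the Python below works through the county populations listed above. The function names (`round_to_unit`, `truncate`, `stem_plot`) and the decision to print empty stems are illustrative assumptions, not part of the original example. Stems are the hundred-thousands digit and leaves are the ten-thousands digit.

```python
# County populations from the example above.
populations = {
    "Butte": 220_748,
    "Calaveras": 45_987,
    "Del Norte": 29_547,
    "Fresno": 942_298,
    "Humboldt": 132_755,
    "Imperial": 179_254,
    "San Francisco": 845_999,
    "Santa Barbara": 431_312,
}

UNIT = 10_000  # leaf unit: the ten-thousands place


def truncate(value, unit=UNIT):
    """Drop every digit below the leaf unit."""
    return value // unit * unit


def round_to_unit(value, unit=UNIT):
    """Round (half up) to the nearest multiple of the leaf unit."""
    return (value + unit // 2) // unit * unit


def stem_plot(values, unit=UNIT):
    """Return the lines of a stem-and-leaf plot, keeping empty stems."""
    stems = {}
    for v in sorted(values):
        stem, leaf = divmod(v // unit, 10)
        stems.setdefault(stem, []).append(leaf)
    lines = []
    for stem in range(min(stems), max(stems) + 1):
        leaves = " ".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())
    return lines


rounded = {name: round_to_unit(p) for name, p in populations.items()}
truncated = {name: truncate(p) for name, p in populations.items()}

for line in stem_plot(rounded.values()):
    print(line)
print("Key: 2 | 2 means about 220,000 people")
```

Calling `stem_plot(truncated.values())` instead gives the truncated version; the two plots differ wherever rounding moves a value up to the next ten thousand, as it does for Calaveras.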
Back-to-Back Stem Plots
Stem plots can also be a useful tool for comparing two distributions when placed next to each other. These are commonly called back-to-back stem plots.
Constructing a Back-To-Back Stem Plot
In a previous example, we looked at recycling in paper packaging. Here are the same countries and their percentages of recycled material used to manufacture glass packaging:
|Country||% of Glass Packaging Recycled|
In a back-to-back stem plot, one of the distributions simply works off the left side of the stems. In this case, the spread of the glass distribution is wider, so we will have to add a few extra stems. Even if there are no data values in a stem, you must include it to preserve the spacing, or you will not get an accurate picture of the shape and spread.
We have already mentioned that the spread was larger in the glass distribution, and it is easy to see this in the comparison plot. You can also see that the glass distribution is more symmetric and is centered lower (around the mid-50's), which seems to indicate that overall, these countries manufacture a smaller percentage of glass from recycled material than they do paper. It is interesting to note that Sweden imports glass from other countries for recycling, so its effective percentage in this data set is more than 100.
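The layout of a back-to-back stem plot can be sketched in a few lines of Python. The two samples below are hypothetical numbers invented purely for illustration (they are not the paper or glass data), and `back_to_back` is an assumed helper name. The sketch keeps every stem between the smallest and largest, even when a stem has no leaves, so the spacing is preserved.

```python
def back_to_back(left, right):
    """Return the lines of a back-to-back stem plot for two lists of whole numbers.

    Stems are the tens (and higher) digits; leaves are the ones digits.
    Left-hand leaves are printed in descending order so that they read
    outward from the shared stem column.
    """
    def group(values):
        stems = {}
        for v in sorted(values):
            stems.setdefault(v // 10, []).append(v % 10)
        return stems

    l, r = group(left), group(right)
    lo = min(min(l), min(r))
    hi = max(max(l), max(r))
    # Width of the widest left-hand row: one character per leaf plus separators.
    width = max(len(leaves) for leaves in l.values()) * 2 - 1
    lines = []
    for stem in range(lo, hi + 1):  # empty stems are kept on purpose
        left_leaves = " ".join(str(d) for d in reversed(l.get(stem, [])))
        right_leaves = " ".join(str(d) for d in r.get(stem, []))
        lines.append(f"{left_leaves:>{width}} | {stem:>2} | {right_leaves}".rstrip())
    return lines


# Hypothetical percentages, invented for illustration only.
sample_a = [18, 22, 27, 41, 45, 58, 63]
sample_b = [44, 47, 52, 56, 59, 61]

for line in back_to_back(sample_a, sample_b):
    print(line)
```

In this made-up example the left-hand sample needs stems 1 through 3 that the right-hand sample never uses, which mirrors the extra stems added for the wider distribution in the text.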
The following examples use the data set below.
Here are the ages, arranged in order, for the CEOs of the 60 top-ranked small companies in America in 1993:
32, 33, 36, 37, 38, 40, 41, 43, 43, 44, 44, 45, 45, 45, 45, 46, 46, 47, 47, 47, 48, 48, 48, 48, 49, 50, 50, 50, 50, 50, 50, 51, 51, 52, 53, 53, 53, 55, 55, 55, 56, 56, 56, 56, 57, 57, 58, 58, 59, 60, 61, 61, 61, 62, 62, 63, 69, 69, 70, 74
Create a stem-and-leaf plot for these ages.
Here is the stem-and-leaf plot:
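Because the plot is built mechanically (tens digit as stem, ones digit as leaf, leaves in increasing order), it can be reproduced with a short Python sketch. The helper name `stem_and_leaf` is an assumption made for this illustration.

```python
# CEO ages from the data set above.
ages = [
    32, 33, 36, 37, 38, 40, 41, 43, 43, 44, 44, 45, 45, 45, 45,
    46, 46, 47, 47, 47, 48, 48, 48, 48, 49, 50, 50, 50, 50, 50,
    50, 51, 51, 52, 53, 53, 53, 55, 55, 55, 56, 56, 56, 56, 57,
    57, 58, 58, 59, 60, 61, 61, 61, 62, 62, 63, 69, 69, 70, 74,
]


def stem_and_leaf(values):
    """Return one line per stem: tens digit, a divider, then the sorted ones digits."""
    stems = {}
    for v in sorted(values):
        stems.setdefault(v // 10, []).append(v % 10)
    lines = []
    for stem in range(min(stems), max(stems) + 1):
        leaves = " ".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())
    return lines


for line in stem_and_leaf(ages):
    print(line)
print("Key: 3 | 2 means 32 years old")
```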
Create a dot plot for these ages.
Here is the dot plot:
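A text-only dot plot can be sketched the same way: count how often each age occurs, then stack one mark per occurrence above that age's position on the axis. The function name `dot_plot`, the use of `*` as the dot, and the tick labels every ten years are choices made for this sketch only.

```python
from collections import Counter

# CEO ages from the data set above.
ages = [
    32, 33, 36, 37, 38, 40, 41, 43, 43, 44, 44, 45, 45, 45, 45,
    46, 46, 47, 47, 47, 48, 48, 48, 48, 49, 50, 50, 50, 50, 50,
    50, 51, 51, 52, 53, 53, 53, 55, 55, 55, 56, 56, 56, 56, 57,
    57, 58, 58, 59, 60, 61, 61, 61, 62, 62, 63, 69, 69, 70, 74,
]


def dot_plot(values):
    """Return text rows: stacked marks, a rule, and an axis labelled every 10 units."""
    counts = Counter(values)
    lo, hi = min(values), max(values)
    rows = []
    # Build from the tallest stack down so the plot prints top to bottom.
    for level in range(max(counts.values()), 0, -1):
        row = "".join("*" if counts[x] >= level else " " for x in range(lo, hi + 1))
        rows.append(row.rstrip())
    rows.append("-" * (hi - lo + 1))
    # Each label starts in the column of the value it names.
    axis = [" "] * (hi - lo + 2)
    for tick in range(lo + (-lo % 10), hi + 1, 10):
        label = str(tick)
        axis[tick - lo: tick - lo + len(label)] = label
    rows.append("".join(axis).rstrip())
    return rows


for row in dot_plot(ages):
    print(row)
```

Each column is one year of age, so gaps in the data (for example between 63 and 69) show up as empty columns rather than being squeezed out.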
Describe the shape of this data set.
The data set is approximately symmetric, with the ages concentrated in the forties and fifties (44 of the 60 CEOs) and the largest group in their fifties.
Are there any outliers in this data set?
There do not appear to be any outliers.
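One way to back up this visual judgment is the 1.5 × IQR convention, which this lesson does not state and which is used here only as an assumed rule of thumb: values more than 1.5 interquartile ranges below the first quartile or above the third quartile are flagged. The sketch uses Python's `statistics.quantiles` with its default quartile method.

```python
import statistics

# CEO ages from the data set above.
ages = [
    32, 33, 36, 37, 38, 40, 41, 43, 43, 44, 44, 45, 45, 45, 45,
    46, 46, 47, 47, 47, 48, 48, 48, 48, 49, 50, 50, 50, 50, 50,
    50, 51, 51, 52, 53, 53, 53, 55, 55, 55, 56, 56, 56, 56, 57,
    57, 58, 58, 59, 60, 61, 61, 61, 62, 62, 63, 69, 69, 70, 74,
]

# Quartiles with the library's default ("exclusive") method.
q1, median, q3 = statistics.quantiles(ages, n=4)
iqr = q3 - q1

# Assumed convention: flag anything beyond 1.5 * IQR from the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [a for a in ages if a < lower_fence or a > upper_fence]

print("Q1:", q1, "median:", median, "Q3:", q3)
print("fences:", lower_fence, upper_fence)
print("flagged values:", outliers)
```

Under these assumptions no age is flagged, which agrees with the answer above. The oldest age, 74, sits close to the upper fence, so a different way of computing quartiles could change that result.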
For 1-4, the following table gives the percentages of municipal waste recycled by state in the United States, including the District of Columbia, in 1998. Data was not available for Idaho or Texas.
|District of Columbia||8|
Source: Zero Waste America
- Create a dot plot for this data.
- Discuss the shape, center, and spread of this distribution.
- Create a stem-and-leaf plot for the data.
- Use your stem-and-leaf plot to find the median percentage for this data.
For 5-8, identify the important features of the shape of the distribution.
For 9-12, refer to the following dot plots:
- Identify the overall shape of each distribution.
- How would you characterize the center(s) of these distributions?
- Which of these distributions has the smallest standard deviation?
- Which of these distributions has the largest standard deviation?
- What characteristics of a data set make it easier or harder to represent using dot plots, stem-and-leaf plots, or histograms?
- Here are the ages, arranged in order, for the CEOs of the 60 top-ranked small companies in America in 1993 (source: http://lib.stat.cmu.edu/DASL/Datafiles/ceodat.html): 32, 33, 36, 37, 38, 40, 41, 43, 43, 44, 44, 45, 45, 45, 45, 46, 46, 47, 47, 47, 48, 48, 48, 48, 49, 50, 50, 50, 50, 50, 50, 51, 51, 52, 53, 53, 53, 55, 55, 55, 56, 56, 56, 56, 57, 57, 58, 58, 59, 60, 61, 61, 61, 62, 62, 63, 69, 69, 70, 74
- Create a stem-and-leaf plot for these ages.
- Create a dot plot for these ages.
- Describe the shape of this dataset.
- Are there any outliers in this dataset?
- Give an example in which the same measurement taken on the same individual would be considered to be an outlier in one dataset but not in another dataset.
- Does a stem-and-leaf plot provide enough information to determine if there are any outliers in the data set? Explain.
- Does a five-number summary provide enough information to determine if there are any outliers in the data set? Explain.
- A set of 17 exam scores is 67, 94, 88, 76, 85, 93, 55, 87, 80, 81, 80, 61, 90, 84, 75, 93, 75.
- Draw a stem-and-leaf plot of the scores.
- Draw a dot plot of the scores.
- Make a stem-and-leaf plot of the mean high temperature in December (Fahrenheit) in 15 cities in California. The “stem” gives the first digit of a temperature, while the “leaf” gives the second digit. You can find the data at: http://countrystudies.us/united-states/weather/California/beverly-hills.htm
- Describe the shape of the dataset. Is it skewed or is it symmetric?
- What is the highest temperature in the dataset?
- What is the lowest temperature in the dataset?
- What percent of the 15 cities have a mean high December temperature in the 60s?
To view the Review answers, open this PDF file and look for section 2.3.