5.3: Numerical Data: Dot Plots & Stem Plots
Learning Objectives
- Construct dot plots, stem plots and split-stem plots
- Calculate numerical statistics for quantitative data
- Identify potential outliers in a distribution
- Describe distributions in context, including shape, outliers, center, and spread
Dot Plots
One convenient way to organize numerical data is a dot plot. A dot plot is a simple display that places a dot (or X, or another symbol) above an axis for each datum value (datum is the singular of data). The axis should cover the entire range of the data. Even numbers that will have no data marked above them should be included, so that outliers and gaps are visible. There is a dot for each value, so values that occur more than once will be shown by stacked dots. Dot plots are especially useful when you are working with a small set of data across a reasonably small range of values. This type of graph gives a clear view of the shape, any mode(s), and the range of a set of data. Because the values appear in order along the axis, finding the median is fairly quick, and any outliers are easy to spot.
Ages of all of the Sales People at Stinky's Car Dealership.
License: CC BY-NC 3.0
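The construction described above can be sketched in a few lines of Python. This is only an illustrative text version, assuming whole-number data. The function name `dot_plot` and the sample values are made up for this sketch and are not the car dealership data.

```python
from collections import Counter

def dot_plot(values, symbol="X"):
    """Return a text dot plot: one symbol per datum, stacked above an integer axis."""
    counts = Counter(values)
    # Include every whole number from min to max so gaps and outliers stay visible.
    axis = list(range(min(values), max(values) + 1))
    width = max(len(str(v)) for v in axis) + 1
    rows = []
    # Build the plot from the tallest stack down to the axis.
    for level in range(max(counts.values()), 0, -1):
        row = "".join((symbol if counts[v] >= level else " ").rjust(width) for v in axis)
        rows.append(row.rstrip())
    rows.append("".join(str(v).rjust(width) for v in axis))
    return "\n".join(rows)

sample = [3, 5, 5, 6, 6, 6, 9]  # made-up values, for illustration only
print(dot_plot(sample))
```

Because the axis runs through every whole number between the minimum and the maximum, the empty positions at 4, 7, and 8 still appear, which is what makes the gap before 9 easy to see.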
Describing a Numerical Distribution
Once you have constructed a graphical representation of a data set, the next step is to describe what the graph shows. There are several characteristics that should be mentioned when describing a numerical distribution, and your description needs to explain what this specific data represents. Describe the shape of the graph, whether or not there are any outliers present in the data, the location of the center of the data and how spread out the data is. All of this should be done in the specific context of the individuals and variable being studied. We will use an acronym to help you remember what to include in your descriptions (S.O.C.C.S.) - shape, outliers, context, center and spread. An explanation of each of these characteristics follows.
Shape
Once a graphical display is constructed, we can describe the distribution. When describing the distribution, we should be sure to address its shape. Although many graphs will not have a clear or exact shape, we can usually identify the shape as symmetrical or skewed. A symmetrical distribution will have a middle where we can draw an imaginary line through the center, and a fairly equal "look" on either side of that imaginary line. If you were to fold along the imaginary center line, the two sides would almost match up. Many symmetrical distributions are bell shaped: they are tall in the middle with the two sides thinning out. The sides are referred to as tails. A skewed distribution is one in which the bulk of the data is concentrated on one end, with the other side being a longer tail. The direction of the longer tail is the direction of the skew. A distribution that is skewed right will have a longer tail to the right, toward the higher numbers. A distribution that is skewed left will have a longer tail to the left, toward the lower values. Other shapes that you might see are uniform (almost consistent height all the way across) and bimodal (having two peaks in the distribution).
Outliers
If there are any outliers, gaps, groupings, or other unusual features in the distribution, we should be sure to mention them. An outlier is a value that does not fit with the rest of the data. Some distributions will have several outliers, while others will not have any. We should always look for outliers because they can affect many of our statistics. Also, sometimes an outlier is actually an error that needs to be corrected. If you have ever 'bombed' one test in a class, you probably discovered that it had a big impact on your overall average in that class. This is because the mean is affected by an outlier: it is pulled toward the outlier. This is another reason why we should be sure to look at the data, not just look at the statistics about the data. When an outlier is part of the data and we do not realize it, we can be misled by the mean to believe that the numbers are higher or lower than they really are.
Context
Do not forget that the graph, the numbers, and the descriptions are all about something: the context. All of these elements of the distribution should be described in the specific context of the situation in question.
Center
The center of the distribution should always be included in the verbal analysis as well. People often wonder what the 'average' is. The measure for center can be reported as the median, the mean, or the mode. Even better, give more than one of these in your description. Remember that outliers pull the mean toward them, but have little or no effect on the median. For example, the median of a list of data will stay in the center even when the largest value increases tremendously, but such a change would affect the mean quite a bit.
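The effect of a single extreme value on the mean and the median can be checked with a short Python sketch. The scores below are made up for illustration, and the standard-library `statistics` module is used as one convenient way to do the arithmetic.

```python
from statistics import mean, median

scores = [70, 72, 75, 78, 80]     # made-up values, for illustration only
stretched = scores[:-1] + [800]   # same list, but the largest value increases tremendously

# The median stays at the middle value; the mean is pulled toward the extreme value.
print("original :", "mean =", mean(scores), "median =", median(scores))
print("stretched:", "mean =", mean(stretched), "median =", median(stretched))
```

In this made-up list the median is 75 both times, while the mean moves from 75 to 219, a value larger than four of the five data values.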
Spread
Another thing to include in the description is the spread of the data. The spread is the specific range of the data. When analyzing a distribution, we don't want to simply say that the range is equal to some number. It is much more informative to say that the data ranges from _____ to _____ (minimum value to maximum value). For example, if the news reports that the temperature in St. Paul had a range of 20° during a given week, this could mean very different temperatures depending on the time of year. It would be more informative to say something specific like, "The temperature in St. Paul ranged from 68° to 88° last week."
S.O.C.C.S.
So, when you describe the distribution of a numerical variable, there are several things to include. This text will use the acronym S.O.C.C.S! (shape, outliers, context, center, spread) to help us remember what characteristics to include in our descriptions.
Example 1
An anthropology instructor at the community college is interested in analyzing the age distribution of her students. The students in her Anthropology 102 class are: 21, 23, 25, 26, 25, 24, 26, 19, 18, 19, 26, 28, 24, 22, 24, 19, 23, 24, 24, 21, 23, and 28 years old. Organize the data in a dot plot. Calculate the mean, median, mode, and range for the distribution. Describe the distribution. Be sure to include the shape, outliers, center, context, and spread.
Solution
a) construct a dot plot
Ages of Students in Anthropology 102
b) mean: (18+19+19+19+21+21+22+23+23+23+24+24+24+24+24+25+25+26+26+26+28+28)/22 = 512/22 = 23.2727..., so the mean is about 23.27 years old
median: the ages above are already listed in order, so count to find the middle. With 22 values, the middle falls between the 11th and 12th values, which are 24 and 24, so the median is (24+24)/2 = 24 years old
mode: the most frequent age is 24, so the mode is 24 years old
range: the minimum age is 18 and the maximum age is 28, so the range is 28 - 18 = 10 years, or ages range from 18 to 28 years
c) describe- address the shape, outliers, center, context, and spread of the distribution (This could be described as fairly symmetrical or slightly skewed to the left)
The distribution of student ages in this Anthropology 102 class is fairly symmetrical with no clear outliers. The ages of students range from 18 to 28 years old. The median and mode for age are both 24 years old and the mean is 23.27 years. Thus, the typical student in this class is 23-24 years of age.
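The hand calculations in part b) can be double-checked with software. The following is a minimal sketch using Python's standard `statistics` module, with the ages from Example 1. The variable names are chosen only for this illustration.

```python
from statistics import mean, median, multimode

# Ages of the Anthropology 102 students, as listed in Example 1.
ages = [21, 23, 25, 26, 25, 24, 26, 19, 18, 19, 26,
        28, 24, 22, 24, 19, 23, 24, 24, 21, 23, 28]

age_mean = mean(ages)
age_median = median(ages)
age_modes = multimode(ages)            # a list, since a data set can have more than one mode
age_min, age_max = min(ages), max(ages)
age_range = age_max - age_min

print(f"n = {len(ages)}")
print(f"mean = {age_mean:.2f}, median = {age_median}, mode(s) = {age_modes}")
print(f"range = {age_range} (ages run from {age_min} to {age_max})")
```

Reporting the minimum and maximum alongside the single range number follows the advice in the Spread section above.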
Stem Plots
In statistics, data is represented in tables, charts or graphs. One disadvantage of representing data in these ways is that the specific data values are often not retained. Using a stem plot is one way to ensure that the data values are kept intact. A stem plot is a method of organizing the data that includes sorting the data and graphing it at the same time. This type of graph uses the stem as the leading part of the data value and the leaf as the remaining part of the value. The result is a graph that displays the sorted data in groups or classes. A stem plot is used with numerical data when it will be helpful to see the actual values organized in order.
To construct a stem plot, you must first determine the range of your distribution. Build the stems so that they cover the entire range, and include every stem even if it will have no values after it. This will allow us to see the true shape of the distribution, including outliers, whether it is skewed, and any gaps. Then place all of the "leaves" after the appropriate stems. Place the leaves in ascending order moving outward from the stem, and include all values, so repeats will show more than once. Some people like to put the numbers in order before they construct the stem plot, some like to try to put them in order as they make the plot, and others like to make a rough draft first without regard to order and then make a final copy with the numbers in the correct order. Any of these methods will result in a correct stem plot.
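The construction steps above can be sketched in Python. This is one possible version, assuming non-negative whole numbers where the stem is the tens part and the leaf is the ones digit. The function name `stem_plot` and the sample values are made up for illustration.

```python
from collections import defaultdict

def stem_plot(values):
    """Text stem plot for non-negative whole numbers: stem = tens, leaf = ones digit."""
    leaves = defaultdict(list)
    for v in sorted(values):               # sorting first keeps the leaves in ascending order
        leaves[v // 10].append(v % 10)
    lowest, highest = min(values) // 10, max(values) // 10
    lines = []
    for stem in range(lowest, highest + 1):  # every stem is listed, even with no leaves
        row = f"{stem:>2} | " + " ".join(str(leaf) for leaf in leaves[stem])
        lines.append(row.rstrip())
    return "\n".join(lines)

sample = [10, 14, 31, 35, 35, 42, 47, 58, 61]  # made-up values, for illustration only
print(stem_plot(sample))
print("Key: 4 | 2 = 42")
```

The stem 2 is printed with no leaves, so the gap in the 20s remains visible, and the repeated value 35 appears as two leaves on the stem 3.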
Example 2
A researcher was studying the growth of a certain plant. She planted 25 seeds and kept watering, sunlight, and temperature as consistent as possible. The following numbers represent the growth (in centimeters) of the plants after 28 days.
a) Construct a stem plot
b) Describe the distribution.
Solution
a) Construct a stem plot- Notice that the stem plot has the numbers in the correct order (ascending as you go out), and includes a key and title.
b) Describe the distribution- Be sure to address shape, outliers, center, context, & spread.
The distribution of growth at 28 days ranged from 10 to 61 centimeters for these plants, with the majority of plants growing at least 30 cm. The median height was 41 cm after 28 days. The shape is bimodal and there is a gap in the distribution because there are no plants in the 20-29 cm class. There are some possible low outliers, but no high outliers for plant growth.
Example 3
Sometimes a stem plot ends up looking too crowded. When the data is concentrated in a few rows, or 'classes', it can be difficult to determine what the shape is or whether there are any outliers in the data. In this example, the stem plot for the ages of a group of people was really concentrated in the 30s and 40s (plot on left). However, the statistician looking at this was not satisfied with the crowded appearance, so she decided to 'split' the stems. The resulting graph on the right, called a split-stem plot, shows very different results. Describe the distribution based on the split-stem plot.
Key: 5|3 = 53 years [Figure 6]
Solution
To split the stems, each stem was written twice. The first copy is for the leaves in the first half of that class, and the second copy is for the leaves in the second half of that class. For example, the first stem of 4 gets 40 to 44, and the second 4 gets 45 to 49. So, when splitting stems into two, a leaf of 5 is the cutoff for moving into the second part (just like rounding).
The split-stem plot shows that the distribution of ages in this example is bimodal and skewed to the left (lower numbers). It also shows that the ages of 20 and 22 appear to be low outliers. None of this was visible in the regular stem plot. Both plots show that the ages range from 20 to 54 years, with a median age of 41 years old and a mode age of 47 years old.
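The splitting rule from this solution (leaves 0 to 4 on the first copy of a stem, leaves 5 to 9 on the second) can be sketched in Python. The function name `split_stem_plot` and the sample values are made up for illustration; they are not the ages shown in the figure for this example.

```python
def split_stem_plot(values):
    """Text split-stem plot: each tens stem appears twice (leaves 0-4, then leaves 5-9)."""
    lowest, highest = min(values) // 10, max(values) // 10
    rows = {}
    for stem in range(lowest, highest + 1):
        rows[(stem, 0)] = []   # first copy of the stem: leaves 0 to 4
        rows[(stem, 1)] = []   # second copy of the stem: leaves 5 to 9
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        rows[(stem, leaf // 5)].append(leaf)   # a leaf of 5 or more moves to the second copy
    lines = []
    for (stem, _half), leaves in rows.items():
        lines.append(f"{stem:>2} | {' '.join(map(str, leaves))}".rstrip())
    return "\n".join(lines)

sample = [23, 27, 31, 34, 36, 39, 40, 45, 48]  # made-up values, for illustration only
print(split_stem_plot(sample))
print("Key: 4 | 5 = 45")
```

This sketch always prints both copies of the lowest and highest stems, even when one of them has no leaves. A hand-drawn plot might trim an empty row at either end.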
Problem Set 5.3
Section 5.3 Exercises
1) The following is data representing the percentage of paper packaging manufactured from recycled materials for a select group of countries.
Country | % of Paper Packaging Recycled |
Estonia | 34 |
New Zealand | 40 |
Poland | 40 |
Cyprus | 42 |
Portugal | 56 |
United States | 59 |
Italy | 62 |
Spain | 63 |
Australia | 66 |
Greece | 70 |
Finland | 70 |
Ireland | 70 |
Netherlands | 70 |
Sweden | 70 |
France | 76 |
Germany | 83 |
Austria | 83 |
Belgium | 83 |
Japan | 98 |
The dot plot for this data would look like this:
a) Calculate the mean, median, mode, and range for this set of data
b) Describe the distribution in context. Remember your S.O.C.C.S!
2) At the local veterinarian school, the number of animals treated each day over a period of 20 days was recorded.
a) Construct a stem plot for the data
b) Describe the distribution thoroughly. Remember your S.O.C.C.S!
3) The following table reports the percent of students who took the SAT for the 20 U.S. States with the highest participation rates for the 2004 SAT test. Source: http://mathforum.org
a) Create a split-stem plot for the data.
b) Find the median percentage for this data.
c) If we included the data from the other 30 states, would our mean and median be higher or lower? Explain.
d) Describe the distribution thoroughly. Remember your S.O.C.C.S! Specifically identify any states that stand out.
4) This stem plot is one that looks too crowded.
a) Create a split-stem plot for this example.
b) Name at least two things that are visible in the second plot that were not apparent in the first plot.
c) Invent a scenario that this data could represent.
5) Several game critics rated the Wow So Fit game, on a scale of 1 to 100 (100 being the highest rating). The results are presented in this stem plot:
a) Find the three measures of central tendency for the game rating data (mean, median and mode).
b) Which of these three measures of central tendency gives the best impression of the 'average' (typical) rating for this game? Explain.
6) These dot plots do not have any numbers or context. For each of the following dot plots:
a) Identify the shape of each distribution and whether or not there appear to be any outliers.
b) For each plot, determine whether the mean or median would be greater, or if they would be similar.
c) Suggest a possible variable that might have such a distribution. (In other words, invent a context that fits the graph.)
i)
ii)
iii)
iv)
7) This table displays statistics for 21 of the Wild players for 2010-2011 regular season games. We are going to analyze the variable 'GP', which stands for games played.
source: http://wild.nhl.com. July 25, 2011
a) Create a stem plot for the number of games played by these Wild players.
b) Calculate the mean, median, mode, and range for the number of games played by these Wild players.
c) Describe the distribution of the number of games played by these players. Remember your S.O.C.C.S!
8) Now, you will examine the +/- data.
a) Find out what +/- stands for.
b) Construct a dot plot to show the +/- data.
c) Describe the distribution.
Review Exercises
9) A random poll was conducted in Springfield to determine what percent of people enjoy watching The Simpsons. Of the 1245 people surveyed, 1002 said that they do enjoy watching The Simpsons. Identify each of the following.
a) population of interest
b) parameter of interest
c) sample
d) statistic
e) margin of error
f) 95% confidence interval
g) confidence statement