2.2: Common Graphs and Data Plots
Learning Objectives
- Identify and translate data sets to and from a bar graph and a pie graph.
- Identify and translate data sets to and from a dot plot.
- Identify and translate data sets to and from a stem-and-leaf plot.
- Identify and translate data sets to and from a scatterplot and a line graph.
- Identify graph distribution shapes as skewed or symmetric, and understand the basic implication of these shapes.
- Compare distributions of univariate data (shape, center, spread, and outliers).
Introduction
In this section, we will continue to investigate the different types of graphs that can be used to interpret a data set. In addition to a few more ways to represent single numerical variables, we will also study methods for displaying categorical variables. You will also be introduced to using a scatterplot and a line graph to show the relationship between two variables.
Categorical Variables: Bar Graphs and Pie Graphs
Example: E-Waste and Bar Graphs
We live in an age of unprecedented access to increasingly sophisticated and affordable personal technology. Cell phones, computers, and televisions now improve so rapidly that, while they may still be in working condition, the drive to make use of the latest technological breakthroughs leads many to discard usable electronic equipment. Much of that ends up in landfills, where the chemicals from batteries and other electronics add toxins to the environment. Approximately 80% of the electronics discarded in the United States is also exported to third world countries, where it is disposed of under generally hazardous conditions by unprotected workers.\begin{align*}^1\end{align*} The following table shows the tonnage of the most common types of electronic equipment discarded in the United States in 2005.
Electronic Equipment | Thousands of Tons Discarded |
---|---|
Cathode Ray Tube (CRT) TV's | 7591.1 |
CRT Monitors | 389.8 |
Printers, Keyboards, Mice | 324.9 |
Desktop Computers | 259.5 |
Laptop Computers | 30.8 |
Projection TV's | 132.8 |
Cell Phones | 11.7 |
LCD Monitors | 4.9 |
Figure: Electronics Discarded in the US (2005). Source: National Geographic, January 2008. Volume 213 No.1, pg 73.
The type of electronic equipment is a categorical variable, and therefore, this data can easily be represented using the bar graph below:
While this looks very similar to a histogram, the bars in a bar graph are usually separated slightly, because the graph is just a series of disjoint categories.
Please note that discussions of shape, center, and spread have no meaning for a bar graph, and it is not, in fact, even appropriate to refer to this graph as a distribution. For example, some students misinterpret a graph like this by saying it is skewed right. If we rearranged the categories in a different order, the same data set could be made to look skewed left. Do not try to infer any of these concepts from a bar graph!
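As a rough illustration of the point about category order, the following Python sketch draws a text-only bar graph of the e-waste data twice: once in table order and once sorted by size. The function name `text_bar_graph` and the bar width of 50 characters are assumptions made for this example, not part of the original lesson.

```python
# Thousands of tons discarded, copied from the table above.
discarded = [
    ("Cathode Ray Tube (CRT) TV's", 7591.1),
    ("CRT Monitors", 389.8),
    ("Printers, Keyboards, Mice", 324.9),
    ("Desktop Computers", 259.5),
    ("Laptop Computers", 30.8),
    ("Projection TV's", 132.8),
    ("Cell Phones", 11.7),
    ("LCD Monitors", 4.9),
]


def text_bar_graph(rows, width=50):
    """Return one text line per category; bar length is proportional to the value."""
    largest = max(value for _, value in rows)
    label_width = max(len(label) for label, _ in rows)
    lines = []
    for label, value in rows:
        bar = "#" * round(width * value / largest)
        lines.append(f"{label:<{label_width}} | {bar} {value}")
    return lines


# Table order.
print("\n".join(text_bar_graph(discarded)))
print()
# Same data, categories sorted from smallest to largest: the apparent
# "shape" changes even though nothing about the data has changed.
print("\n".join(text_bar_graph(sorted(discarded, key=lambda row: row[1]))))
```

Because the categories have no natural order, either arrangement is an equally valid picture of the data, which is why shape words such as "skewed" do not apply.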
Pie Graphs
Usually, data that can be represented in a bar graph can also be shown using a pie graph (also commonly called a circle graph or pie chart). In this representation, we convert the count into a percentage so we can show each category relative to the total. Each percentage is then converted into a proportionate sector of the circle. To make this conversion, write the percentage as a decimal and multiply it by 360, which is the total number of degrees in a circle. For example, a category that makes up 25% of the total gets a sector of \begin{align*}0.25 \times 360 = 90\end{align*} degrees.
Here is a table with the percentages and the approximate angle measure of each sector:
Electronic Equipment | Thousands of Tons Discarded | Percentage of Total Discarded | Angle Measure of Circle Sector |
---|---|---|---|
Cathode Ray Tube (CRT) TV's | 7591.1 | 86.8 | 312.5 |
CRT Monitors | 389.8 | 4.5 | 16.0 |
Printers, Keyboards, Mice | 324.9 | 3.7 | 13.4 |
Desktop Computers | 259.5 | 3.0 | 10.7 |
Laptop Computers | 30.8 | 0.4 | 1.3 |
Projection TV's | 132.8 | 1.5 | 5.5 |
Cell Phones | 11.7 | 0.1 | 0.5 |
LCD Monitors | 4.9 | \begin{align*}\sim 0\end{align*} | 0.2 |
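The conversion from amounts to percentages and sector angles can be sketched in a few lines of Python. The function name `pie_sectors` is an assumption for this example. The sketch multiplies the unrounded fraction by 360, so a result may differ in the last digit from an angle computed by hand from an already rounded percentage.

```python
# Thousands of tons discarded, copied from the table above.
discarded = {
    "Cathode Ray Tube (CRT) TV's": 7591.1,
    "CRT Monitors": 389.8,
    "Printers, Keyboards, Mice": 324.9,
    "Desktop Computers": 259.5,
    "Laptop Computers": 30.8,
    "Projection TV's": 132.8,
    "Cell Phones": 11.7,
    "LCD Monitors": 4.9,
}


def pie_sectors(amounts):
    """Return {category: (percentage of total, sector angle in degrees)}."""
    total = sum(amounts.values())
    sectors = {}
    for category, amount in amounts.items():
        fraction = amount / total  # share of the whole, between 0 and 1
        sectors[category] = (100 * fraction, 360 * fraction)
    return sectors


sectors = pie_sectors(discarded)
for category, (percent, angle) in sectors.items():
    print(f"{category:30s} {percent:5.1f}%  {angle:6.1f} degrees")
```

A useful check on any table like this is that the percentages add up to 100 and the angles add up to 360, apart from small rounding differences.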
And here is the completed pie graph:
Displaying Univariate Data
Dot Plots
A dot plot is one of the simplest ways to represent numerical data. After choosing an appropriate scale on the axes, each data point is plotted as a single dot. Multiple points at the same value are stacked on top of each other using equal spacing to help convey the shape and center.
Example: The following is a data set representing the percentage of paper packaging manufactured from recycled materials for a select group of countries.
Country | % of Paper Packaging Recycled |
---|---|
Estonia | 34 |
New Zealand | 40 |
Poland | 40 |
Cyprus | 42 |
Portugal | 56 |
United States | 59 |
Italy | 62 |
Spain | 63 |
Australia | 66 |
Greece | 70 |
Finland | 70 |
Ireland | 70 |
Netherlands | 70 |
Sweden | 70 |
France | 76 |
Germany | 83 |
Austria | 83 |
Belgium | 83 |
Japan | 98 |
The dot plot for this data would look like this:
Notice that this data set is centered at a manufacturing rate for using recycled materials of between 65 and 70 percent. It is spread from 34% to 98%, and appears very roughly symmetric, perhaps even slightly skewed left. Dot plots have the advantage of showing all the data points and giving a quick and easy snapshot of the shape, center, and spread. Dot plots are not much help when there is little repetition in the data. They can also be very tedious if you are creating them by hand with large data sets, though computer software can make quick and easy work of creating dot plots from such data sets.
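For readers who want to try this with software, here is a minimal Python sketch that builds a text version of a dot plot. The function name `dot_plot` and the axis range of 30 to 100 are choices made for this example. Sweden's paper rate is entered as 70, the figure given in the two-variable table later in this section.

```python
from collections import Counter

# Paper packaging percentages (Sweden entered as 70; see the note above).
paper = [34, 40, 40, 42, 56, 59, 62, 63, 66,
         70, 70, 70, 70, 70, 76, 83, 83, 83, 98]


def dot_plot(values, low, high):
    """Return text lines: stacked dots over an axis with a tick every 10 units."""
    counts = Counter(values)
    tallest = max(counts.values())
    columns = range(low, high + 1)
    lines = []
    # Build rows from the top of the tallest stack down to the axis.
    for level in range(tallest, 0, -1):
        row = "".join("o" if counts[v] >= level else " " for v in columns)
        lines.append(row.rstrip())
    # Axis line: '+' marks each multiple of 10.
    lines.append("".join("+" if v % 10 == 0 else "-" for v in columns))
    # Label line: write each multiple of 10 starting under its tick.
    labels = [" "] * (len(columns) + 3)
    for v in columns:
        if v % 10 == 0:
            text = str(v)
            start = v - low
            labels[start:start + len(text)] = text
    lines.append("".join(labels).rstrip())
    return lines


print("\n".join(dot_plot(paper, 30, 100)))
```

Each repeated value adds one more dot to its stack, so the height of a stack shows how often that value occurs.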
Stem-and-Leaf Plots
One of the shortcomings of dot plots is that they do not show the actual values of the data. You have to read or infer them from the graph. From the previous example, you might have been able to guess that the lowest value is 34%, but you would have to look in the data table itself to know for sure. A stem-and-leaf plot is a similar plot in which it is much easier to read the actual data values. In a stem-and-leaf plot, each data value is represented by two digits: the stem and the leaf. In this example, it makes sense to use the tens digits for the stems and the ones digits for the leaves. The stems are on the left of a dividing line as follows:
Once the stems are decided, the leaves representing the ones digits are listed in numerical order from left to right:
It is important to explain the meaning of the data in the plot for someone who is viewing it without seeing the original data. For example, you could place the following sentence at the bottom of the chart:
Note: \begin{align*}5|69\end{align*} means 56% and 59% are the two values in the 50's.
If you could rotate this plot on its side, you would see the similarities with the dot plot. The general shape and center of the plot are easily found, and we know exactly what each point represents. This plot also shows the slight skewing to the left that we suspected from the dot plot. Stem plots can be difficult to create, depending on the numerical qualities and the spread of the data. If the data values contain more than two digits, you will need to remove some of the information by rounding. A data set that has large gaps between values can also make the stem plot hard to create and less useful when interpreting the data.
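The construction described above (tens digits as stems, ones digits as leaves, leaves in increasing order) can be sketched in Python as follows. The function name `stem_and_leaf` is an assumption for this example, and Sweden's paper rate is entered as 70, the figure given in the two-variable table later in this section.

```python
# Paper packaging percentages (Sweden entered as 70; see the note above).
paper = [34, 40, 40, 42, 56, 59, 62, 63, 66,
         70, 70, 70, 70, 70, 76, 83, 83, 83, 98]


def stem_and_leaf(values):
    """Return text lines 'stem | leaves' with tens as stems and ones as leaves."""
    stems = {}
    for v in sorted(values):  # sorting puts the leaves in increasing order
        stems.setdefault(v // 10, []).append(v % 10)
    lines = []
    # Walk through every stem in the range, even empty ones, so that
    # gaps in the data stay visible in the plot.
    for stem in range(min(stems), max(stems) + 1):
        leaves = "".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())
    return lines


print("\n".join(stem_and_leaf(paper)))
```

Reading one row back, a line such as `5 | 69` stands for the two values 56 and 59.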
Example: Consider the following populations of counties in California.
Butte - 220,748
Calaveras - 45,987
Del Norte - 29,547
Fresno - 942,298
Humboldt - 132,755
Imperial - 179,254
San Francisco - 845,999
Santa Barbara - 431,312
To construct a stem-and-leaf plot, we need to either round or truncate to two digits.
Value | Value Rounded | Value Truncated |
---|---|---|
149 | 15 | 14 |
657 | 66 | 65 |
188 | 19 | 18 |
\begin{align*}2|2\end{align*} represents \begin{align*}220,000 - 229,999\end{align*} when data has been truncated.
\begin{align*}2|2\end{align*} represents \begin{align*}215,000 - 224,999\end{align*} when data has been rounded.
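The difference between rounding and truncating, and the range of values that a single stem-and-leaf entry can stand for, can be checked with a short Python sketch. The function names and the `leaf_unit` parameter (the place value of the leaf digit, here 10,000) are assumptions introduced for this example. Integer arithmetic is used so that halves always round up.

```python
def truncate_to_two_digits(value):
    """Drop the ones digit of a three-digit value, for example 149 -> 14."""
    return value // 10


def round_to_two_digits(value):
    """Round a three-digit value to the nearest ten, for example 149 -> 15."""
    return (value + 5) // 10


def represented_range(stem, leaf, leaf_unit, rounded):
    """Return (lowest, highest) whole number that the entry stem|leaf can represent."""
    base = (10 * stem + leaf) * leaf_unit
    if rounded:
        # Values within half a leaf unit of the base round to this entry.
        return base - leaf_unit // 2, base + leaf_unit // 2 - 1
    # Truncation keeps everything from the base up to the next leaf.
    return base, base + leaf_unit - 1


for value in (149, 657, 188):
    print(value, round_to_two_digits(value), truncate_to_two_digits(value))

print(represented_range(2, 2, 10_000, rounded=False))
print(represented_range(2, 2, 10_000, rounded=True))
```

Whichever method is chosen, it should be stated next to the plot, since the same entry covers a different range of original values under each method.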
If we decide to round the above data, we have:
Butte - 220,000
Calaveras - 46,000
Del Norte - 30,000
Fresno - 940,000
Humboldt - 130,000
Imperial - 180,000
San Francisco - 850,000
Santa Barbara - 430,000
And the stem-and-leaf plot will be as follows:
where:
\begin{align*}2|2\end{align*} represents \begin{align*}215,000 - 224,999\end{align*}.
Source: California State Association of Counties http://www.counties.org/default,asp?id=399
Back-to-Back Stem Plots
Stem plots can also be a useful tool for comparing two distributions when placed next to each other. These are commonly called back-to-back stem plots.
In the previous example, we looked at recycling in paper packaging. Here are the same countries and their percentages of recycled material used to manufacture glass packaging:
Country | % of Glass Packaging Recycled |
---|---|
Cyprus | 4 |
United States | 21 |
Poland | 27 |
Greece | 34 |
Portugal | 39 |
Spain | 41 |
Australia | 44 |
Ireland | 55 |
Italy | 56 |
Finland | 56 |
France | 59 |
Estonia | 64 |
New Zealand | 72 |
Netherlands | 76 |
Germany | 81 |
Austria | 86 |
Japan | 96 |
Belgium | 98 |
Sweden | 100 |
In a back-to-back stem plot, one of the distributions simply works off the left side of the stems. In this case, the spread of the glass distribution is wider, so we will have to add a few extra stems. Even if there are no data values in a stem, you must include it to preserve the spacing, or you will not get an accurate picture of the shape and spread.
We have already mentioned that the spread was larger in the glass distribution, and it is easy to see this in the comparison plot. You can also see that the glass distribution is more symmetric and is centered lower (around the mid-50's), which seems to indicate that overall, these countries manufacture a smaller percentage of glass from recycled material than they do paper. It is interesting to note in this data set that Sweden actually imports glass from other countries for recycling, so its effective percentage is actually more than 100.
Displaying Bivariate Data
Scatterplots and Line Plots
Bivariate simply means two variables. All our previous work was with univariate, or single-variable data. The goal of examining bivariate data is usually to show some sort of relationship or association between the two variables.
Example: We have looked at recycling rates for paper packaging and glass. It would be interesting to see if there is a predictable relationship between the percentages of each material that a country recycles. Following is a data table that includes both percentages.
Country | % of Paper Packaging Recycled | % of Glass Packaging Recycled |
---|---|---|
Estonia | 34 | 64 |
New Zealand | 40 | 72 |
Poland | 40 | 27 |
Cyprus | 42 | 4 |
Portugal | 56 | 39 |
United States | 59 | 21 |
Italy | 62 | 56 |
Spain | 63 | 41 |
Australia | 66 | 44 |
Greece | 70 | 34 |
Finland | 70 | 56 |
Ireland | 70 | 55 |
Netherlands | 70 | 76 |
Sweden | 70 | 100 |
France | 76 | 59 |
Germany | 83 | 81 |
Austria | 83 | 44 |
Belgium | 83 | 98 |
Japan | 98 | 96 |
Figure: Paper and Glass Packaging Recycling Rates for 19 countries
Scatterplots
We will place the paper recycling rates on the horizontal axis and those for glass on the vertical axis. Next, we will plot a point that shows each country's rate of recycling for the two materials. This series of disconnected points is referred to as a scatterplot.
Recall that one of the things you saw from the stem-and-leaf plot is that, in general, a country's recycling rate for glass is lower than its paper recycling rate. On the next graph, we have plotted a line that represents the paper and glass recycling rates being equal. If all the countries had the same paper and glass recycling rates, each point in the scatterplot would be on the line. Because most of the points are actually below this line, you can see that the glass rate is lower than would be expected if they were similar.
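The comparison with the line of equal rates can also be checked numerically. The Python sketch below counts how many countries fall below, above, or on that line, using the (paper, glass) pairs from the two-variable table above. The variable names are assumptions for this example.

```python
# (paper %, glass %) for each country, copied from the two-variable table above.
rates = {
    "Estonia": (34, 64), "New Zealand": (40, 72), "Poland": (40, 27),
    "Cyprus": (42, 4), "Portugal": (56, 39), "United States": (59, 21),
    "Italy": (62, 56), "Spain": (63, 41), "Australia": (66, 44),
    "Greece": (70, 34), "Finland": (70, 56), "Ireland": (70, 55),
    "Netherlands": (70, 76), "Sweden": (70, 100), "France": (76, 59),
    "Germany": (83, 81), "Austria": (83, 44), "Belgium": (83, 98),
    "Japan": (98, 96),
}

# A point lies below the line glass = paper when its glass rate is the smaller one.
below = [country for country, (paper, glass) in rates.items() if glass < paper]
above = [country for country, (paper, glass) in rates.items() if glass > paper]
on_line = [country for country, (paper, glass) in rates.items() if glass == paper]

print(len(below), "countries below the line")
print(len(above), "countries above the line:", ", ".join(above))
print(len(on_line), "countries on the line")
```

With these values, 14 of the 19 points fall below the line and 5 fall above it.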
With univariate data, we initially characterize a data set by describing its shape, center, and spread. For bivariate data, we will also discuss three important characteristics: shape, direction, and strength. These characteristics will inform us about the association between the two variables. The easiest way to describe these traits for this scatterplot is to think of the data as a cloud. If you draw an ellipse around the data, the general trend is that the ellipse is rising from left to right.
Data that are oriented in this manner are said to have a positive linear association. That is, as one variable increases, the other variable also increases. In this example, it is mostly true that countries with higher paper recycling rates have higher glass recycling rates. Lines that rise in this direction have a positive slope, and lines that trend downward from left to right have a negative slope. If the ellipse cloud were trending down in this manner, we would say the data had a negative linear association. For example, we might expect this type of relationship if we graphed a country's glass recycling rate with the percentage of glass that ends up in a landfill. As the recycling rate increases, the landfill percentage would have to decrease.
The ellipse cloud also gives us some information about the strength of the linear association. If there were a strong linear relationship between the glass and paper recycling rates, the cloud of data would be much longer than it is wide. Long and narrow ellipses mean a strong linear association, while shorter and wider ones show a weaker linear relationship. In this example, there are some countries for which the glass and paper recycling rates do not seem to be related.
New Zealand, Estonia, and Sweden (circled in yellow) have much lower paper recycling rates than their glass recycling rates, and Austria (circled in green) is an example of a country with a much lower glass recycling rate than its paper recycling rate. These data points are spread away from the rest of the data enough to make the ellipse much wider, weakening the association between the variables.
On the Web
http://tinyurl.com/y8vcm5y Guess the correlation.
Line Plots
Example: The following data set shows the change in the total amount of municipal waste generated in the United States during the 1990's:
Year | Municipal Waste Generated (Millions of Tons) |
---|---|
1990 | 269 |
1991 | 294 |
1992 | 281 |
1993 | 292 |
1994 | 307 |
1995 | 323 |
1996 | 327 |
1997 | 327 |
1998 | 340 |
Figure: Total Municipal Waste Generated in the US by Year in Millions of Tons. Source: http://www.zerowasteamerica.org/MunicipalWasteManagementReport1998.htm
In this example, the time in years is considered the explanatory variable, or independent variable, and the amount of municipal waste is the response variable, or dependent variable. It is not only the passage of time that causes our waste to increase. Other factors, such as population growth, economic conditions, and societal habits and attitudes also contribute as causes. However, it would not make sense to view the relationship between time and municipal waste in the opposite direction.
When one of the variables is time, it will almost always be the explanatory variable. Because time is a continuous variable, and we are very often interested in the change a variable exhibits over a period of time, there is some meaning to the connection between the points in a plot involving time as an explanatory variable. In this case, we use a line plot. A line plot is simply a scatterplot in which we connect successive chronological observations with a line segment to give more information about how the data values are changing over a period of time. Here is the line plot for the US Municipal Waste data:
It is easy to see general trends from this type of plot. For example, we can spot the interval in which the most dramatic increase occurred (1990 to 1991) by looking for the steepest line segment. We can also spot the intervals in which the waste output decreased or stayed the same (1991 to 1992 and 1996 to 1997). It would be interesting to investigate some possible reasons for the behavior in these individual years.
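The steepness of each line segment corresponds to the change from one year to the next, so the same observations can be read from a table of differences. Here is a minimal Python sketch, with variable names chosen for this example only.

```python
# Municipal waste generated in the US, in millions of tons (from the table above).
waste = {
    1990: 269, 1991: 294, 1992: 281, 1993: 292, 1994: 307,
    1995: 323, 1996: 327, 1997: 327, 1998: 340,
}

years = sorted(waste)
# Change over each one-year interval, keyed by (start year, end year).
changes = {(a, b): waste[b] - waste[a] for a, b in zip(years, years[1:])}

# The steepest upward segment is the interval with the largest increase.
steepest = max(changes, key=changes.get)
# Intervals in which waste decreased or stayed the same.
not_increasing = [interval for interval, change in changes.items() if change <= 0]

for (start, end), change in changes.items():
    print(f"{start} to {end}: {change:+d}")
print("Largest increase:", steepest)
print("No increase:", not_increasing)
```

A positive change corresponds to a rising segment, a negative change to a falling one, and a change of zero to a flat one.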
Lesson Summary
Bar graphs are used to represent categorical data in a manner that looks similar to, but is not the same as, a histogram. Pie (or circle) graphs are also useful ways to display categorical variables, especially when it is important to show how percentages of an entire data set fit into individual categories.
A dot plot is a convenient way to represent univariate numerical data by plotting individual dots along a single number line to represent each value. They are especially useful in giving a quick impression of the shape, center, and spread of the data set, but are tedious to create by hand when dealing with large data sets. Stem-and-leaf plots show similar information with the added benefit of showing the actual data values.
Bivariate data can be represented using a scatterplot to show what, if any, association there is between the two variables. Usually one of the variables, the explanatory (independent) variable, can be identified as having an impact on the value of the other variable, the response (dependent) variable. The explanatory variable should be placed on the horizontal axis, and the response variable should be on the vertical axis. Each point is plotted individually on a scatterplot. If there is an association between the two variables, it can be identified as being strong if the points form a very distinct shape with little variation from that shape in the individual points. It can be identified as being weak if the points appear more randomly scattered. If the values of the response variable generally increase as the values of the explanatory variable increase, the data have a positive association. If the response variable generally decreases as the explanatory variable increases, the data have a negative association.
In a line graph, there is significance to the change between consecutive points, so these points are connected. Line graphs are often used when the explanatory variable is time.
Points to Consider
- What characteristics of a data set make it easier or harder to represent using dot plots, stem-and-leaf plots, or histograms?
- Which plots are most useful to interpret the ideas of shape, center, and spread?
- What effects does the shape of a data set have on the statistical measures of center and spread?
Multimedia Links
For a description of how to draw a stem-and-leaf plot, as well as how to derive information from one (14.0), see APUS07, Stem-and-Leaf Plot (8:08).
Review Questions
- Computer equipment contains many elements and chemicals that are either hazardous or potentially valuable when recycled. The following data set shows the contents of a typical desktop computer weighing approximately 27 kg. Some of the more hazardous substances, like mercury, have been included in the 'other' category, because they occur in relatively small amounts that are still dangerous and toxic.
Material | Kilograms |
---|---|
Plastics | 6.21 |
Lead | 1.71 |
Aluminum | 3.83 |
Iron | 5.54 |
Copper | 2.12 |
Tin | 0.27 |
Zinc | 0.60 |
Nickel | 0.23 |
Barium | 0.05 |
Other elements and chemicals | 6.44 |
Figure: Weight of materials that make up the total weight of a typical desktop computer. Source: http://dste.puducherry.gov.in/envisnew/INDUSTRIAL%20SOLID%20WASTE.htm
(a) Create a bar graph for this data.
(b) Complete the chart below to show the approximate percentage of the total weight for each material.
Material | Kilograms | Approximate Percentage of Total Weight |
---|---|---|
Plastics | 6.21 | |
Lead | 1.71 | |
Aluminum | 3.83 | |
Iron | 5.54 | |
Copper | 2.12 | |
Tin | 0.27 | |
Zinc | 0.60 | |
Nickel | 0.23 | |
Barium | 0.05 | |
Other elements and chemicals | 6.44 | |
(c) Create a circle graph for this data.
- The following table gives the percentages of municipal waste recycled by state in the United States, including the District of Columbia, in 1998. Data was not available for Idaho or Texas.
State | Percentage |
---|---|
Alabama | 23 |
Alaska | 7 |
Arizona | 18 |
Arkansas | 36 |
California | 30 |
Colorado | 18 |
Connecticut | 23 |
Delaware | 31 |
District of Columbia | 8 |
Florida | 40 |
Georgia | 33 |
Hawaii | 25 |
Illinois | 28 |
Indiana | 23 |
Iowa | 32 |
Kansas | 11 |
Kentucky | 28 |
Louisiana | 14 |
Maine | 41 |
Maryland | 29 |
Massachusetts | 33 |
Michigan | 25 |
Minnesota | 42 |
Mississippi | 13 |
Missouri | 33 |
Montana | 5 |
Nebraska | 27 |
Nevada | 15 |
New Hampshire | 25 |
New Jersey | 45 |
New Mexico | 12 |
New York | 39 |
North Carolina | 26 |
North Dakota | 21 |
Ohio | 19 |
Oklahoma | 12 |
Oregon | 28 |
Pennsylvania | 26 |
Rhode Island | 23 |
South Carolina | 34 |
South Dakota | 42 |
Tennessee | 40 |
Utah | 19 |
Vermont | 30 |
Virginia | 35 |
Washington | 48 |
West Virginia | 20 |
Wisconsin | 36 |
Wyoming | 5 |
Source: http://www.zerowasteamerica.org/MunicipalWasteManagementReport1998.htm
(a) Create a dot plot for this data.
(b) Discuss the shape, center, and spread of this distribution.
(c) Create a stem-and-leaf plot for the data.
(d) Use your stem-and-leaf plot to find the median percentage for this data.
- Identify the important features of the shape of each of the following distributions.
Questions 4-7 refer to the following dot plots:
- Identify the overall shape of each distribution.
- How would you characterize the center(s) of these distributions?
- Which of these distributions has the smallest standard deviation?
- Which of these distributions has the largest standard deviation?
- In question 2, you looked at the percentage of waste recycled in each state. Do you think there is a relationship between the percentage recycled and the total amount of waste that a state generates? Here are the data, including both variables.
State | Percentage | Total Amount of Municipal Waste in Thousands of Tons |
---|---|---|
Alabama | 23 | 5549 |
Alaska | 7 | 560 |
Arizona | 18 | 5700 |
Arkansas | 36 | 4287 |
California | 30 | 45000 |
Colorado | 18 | 3084 |
Connecticut | 23 | 2950 |
Delaware | 31 | 1189 |
District of Columbia | 8 | 246 |
Florida | 40 | 23617 |
Georgia | 33 | 14645 |
Hawaii | 25 | 2125 |
Illinois | 28 | 13386 |
Indiana | 23 | 7171 |
Iowa | 32 | 3462 |
Kansas | 11 | 4250 |
Kentucky | 28 | 4418 |
Louisiana | 14 | 3894 |
Maine | 41 | 1339 |
Maryland | 29 | 5329 |
Massachusetts | 33 | 7160 |
Michigan | 25 | 13500 |
Minnesota | 42 | 4780 |
Mississippi | 13 | 2360 |
Missouri | 33 | 7896 |
Montana | 5 | 1039 |
Nebraska | 27 | 2000 |
Nevada | 15 | 3955 |
New Hampshire | 25 | 1200 |
New Jersey | 45 | 8200 |
New Mexico | 12 | 1400 |
New York | 39 | 28800 |
North Carolina | 26 | 9843 |
North Dakota | 21 | 510 |
Ohio | 19 | 12339 |
Oklahoma | 12 | 2500 |
Oregon | 28 | 3836 |
Pennsylvania | 26 | 9440 |
Rhode Island | 23 | 477 |
South Carolina | 34 | 8361 |
South Dakota | 42 | 510 |
Tennessee | 40 | 9496 |
Utah | 19 | 3760 |
Vermont | 30 | 600 |
Virginia | 35 | 9000 |
Washington | 48 | 6527 |
West Virginia | 20 | 2000 |
Wisconsin | 36 | 3622 |
Wyoming | 5 | 530 |
(a) Identify the variables in this example, and specify which one is the explanatory variable and which one is the response variable.
(b) How much municipal waste was created in Illinois?
(c) Draw a scatterplot for this data.
(d) Describe the direction and strength of the association between the two variables.
- The following line graph shows the recycling rates of two different types of plastic bottles in the US from 1995 to 2001.
- Explain the general trends for both types of plastics over these years.
- What was the total change in PET bottle recycling from 1995 to 2001?
- Can you think of a reason to explain this change?
- During what years was this change the most rapid?
References
National Geographic, January 2008. Volume 213 No.1
\begin{align*}^1\end{align*}http://www.etoxics.org/site/PageServer?pagename=svtc_global_ewaste_crisis'
http://www.earth-policy.org/Updates/2006/Update51_data.htm
Technology Notes: Scatterplots on the TI-83/84 Graphing Calculator
Press [STAT][ENTER], and enter the following data, with the explanatory variable in L1 and the response variable in L2. Next, press [2ND][STAT-PLOT] to enter the STAT-PLOTS menu, and choose the first plot.
Change the settings to match the following screenshot: