2.2: Common Graphs and Data Plots
Learning Objectives
 Identify and translate data sets to and from a bar graph and a pie graph.
 Identify and translate data sets to and from a dot plot.
 Identify and translate data sets to and from a stem-and-leaf plot.
 Identify and translate data sets to and from a scatterplot and a line graph.
 Identify graph distribution shapes as skewed or symmetric, and understand the basic implication of these shapes.
 Compare distributions of univariate data (shape, center, spread, and outliers).
Introduction
In this section, we will continue to investigate the different types of graphs that can be used to interpret a data set. In addition to a few more ways to represent single numerical variables, we will also study methods for displaying categorical variables. You will also be introduced to using a scatterplot and a line graph to show the relationship between two variables.
Categorical Variables: Bar Graphs and Pie Graphs
Example: E-Waste and Bar Graphs
We live in an age of unprecedented access to increasingly sophisticated and affordable personal technology. Cell phones, computers, and televisions now improve so rapidly that, while they may still be in working condition, the drive to make use of the latest technological breakthroughs leads many to discard usable electronic equipment. Much of that ends up in a landfill, where the chemicals from batteries and other electronics add toxins to the environment. Approximately 80% of the electronics discarded in the United States is also exported to third world countries, where it is disposed of under generally hazardous conditions by unprotected workers.
Electronic Equipment  Thousands of Tons Discarded 

Cathode Ray Tube (CRT) TV's  7591.1 
CRT Monitors  389.8 
Printers, Keyboards, Mice  324.9 
Desktop Computers  259.5 
Laptop Computers  30.8 
Projection TV's  132.8 
Cell Phones  11.7 
LCD Monitors  4.9 
Figure: Electronics Discarded in the US (2005). Source: National Geographic, January 2008. Volume 213 No.1, pg 73.
The type of electronic equipment is a categorical variable, and therefore, this data can easily be represented using the bar graph below:
While this looks very similar to a histogram, the bars in a bar graph usually are separated slightly. The graph is just a series of disjoint categories.
Please note that discussions of shape, center, and spread have no meaning for a bar graph, and it is not, in fact, even appropriate to refer to this graph as a distribution. For example, some students misinterpret a graph like this by saying it is skewed right. If we rearranged the categories in a different order, the same data set could be made to look skewed left. Do not try to infer any of these concepts from a bar graph!
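The point about reordering can be checked with a rough text version of the bar graph. The sketch below is only an illustration: the function name text_bar_graph and the scale of 200 thousand tons per mark are assumptions made for this example. It draws the same data twice, once with the largest category first and once with the smallest first.

```python
# Thousands of tons discarded, copied from the table above.
discarded = {
    "Cathode Ray Tube (CRT) TV's": 7591.1,
    "CRT Monitors": 389.8,
    "Printers, Keyboards, Mice": 324.9,
    "Desktop Computers": 259.5,
    "Laptop Computers": 30.8,
    "Projection TV's": 132.8,
    "Cell Phones": 11.7,
    "LCD Monitors": 4.9,
}

def text_bar_graph(counts, order, tons_per_mark=200):
    """Return one text bar per category, drawn in the given order."""
    lines = []
    for category in order:
        # Every category gets at least one mark so that small bars stay visible.
        marks = max(1, round(counts[category] / tons_per_mark))
        lines.append(f"{category:<28} {'#' * marks}")
    return lines

largest_first = sorted(discarded, key=discarded.get, reverse=True)
smallest_first = list(reversed(largest_first))

print("\n".join(text_bar_graph(discarded, largest_first)))
print()
print("\n".join(text_bar_graph(discarded, smallest_first)))
```

Both printouts contain exactly the same bars. Only the order changes, which is why the apparent "shape" of a bar graph carries no information about the data.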
Pie Graphs
Usually, data that can be represented in a bar graph can also be shown using a pie graph (also commonly called a circle graph or pie chart). In this representation, we convert the count into a percentage so we can show each category relative to the total. Each percentage is then converted into a proportionate sector of the circle. To make this conversion, write the percentage as a decimal (a proportion of the total) and multiply it by 360, which is the total number of degrees in a circle. For example, 86.8% becomes 0.868 × 360 ≈ 312.5 degrees.
Here is a table with the percentages and the approximate angle measure of each sector:
Electronic Equipment  Thousands of Tons Discarded  Percentage of Total Discarded  Angle Measure of Circle Sector 

Cathode Ray Tube (CRT) TV's  7591.1  86.8  312.5 
CRT Monitors  389.8  4.5  16.0 
Printers, Keyboards, Mice  324.9  3.7  13.4 
Desktop Computers  259.5  3.0  10.7 
Laptop Computers  30.8  0.4  1.3 
Projection TV's  132.8  1.5  5.5 
Cell Phones  11.7  0.1  0.5 
LCD Monitors  4.9  0.1  0.2 
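The conversion from counts to percentages and sector angles can be sketched in a few lines of Python. This is a minimal illustration; the function name pie_sectors is an assumption made for this example. Because it multiplies the unrounded proportion by 360, its angles can differ slightly from values obtained by rounding the percentage first.

```python
# Thousands of tons discarded, copied from the table above.
discarded = {
    "Cathode Ray Tube (CRT) TV's": 7591.1,
    "CRT Monitors": 389.8,
    "Printers, Keyboards, Mice": 324.9,
    "Desktop Computers": 259.5,
    "Laptop Computers": 30.8,
    "Projection TV's": 132.8,
    "Cell Phones": 11.7,
    "LCD Monitors": 4.9,
}

def pie_sectors(counts):
    """Return {category: (percentage of total, sector angle in degrees)}."""
    total = sum(counts.values())
    sectors = {}
    for category, count in counts.items():
        proportion = count / total  # share of the whole, between 0 and 1
        sectors[category] = (100 * proportion, 360 * proportion)
    return sectors

sectors = pie_sectors(discarded)
for category, (percent, angle) in sectors.items():
    print(f"{category:<28} {percent:5.1f}%  {angle:6.1f} degrees")
```

A useful check on any pie graph is that the sector angles add up to 360 degrees, apart from small rounding differences.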
And here is the completed pie graph:
Displaying Univariate Data
Dot Plots
A dot plot is one of the simplest ways to represent numerical data. After choosing an appropriate scale on the axes, each data point is plotted as a single dot. Multiple points at the same value are stacked on top of each other using equal spacing to help convey the shape and center.
Example: The following is a data set representing the percentage of paper packaging manufactured from recycled materials for a select group of countries.
Country  % of Paper Packaging Recycled 

Estonia  34 
New Zealand  40 
Poland  40 
Cyprus  42 
Portugal  56 
United States  59 
Italy  62 
Spain  63 
Australia  66 
Greece  70 
Finland  70 
Ireland  70 
Netherlands  70 
Sweden  76 
France  76 
Germany  83 
Austria  83 
Belgium  83 
Japan  98 
The dot plot for this data would look like this:
Notice that this data set is centered at a recycled-material rate of roughly 65 to 70 percent. It is spread from 34% to 98%, and appears very roughly symmetric, perhaps even slightly skewed left. Dot plots have the advantage of showing all the data points and giving a quick and easy snapshot of the shape, center, and spread. Dot plots are not much help when there is little repetition in the data. They can also be very tedious if you are creating them by hand with large data sets, though computer software can make quick and easy work of creating dot plots from such data sets.
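As a rough illustration of how software can build a dot plot, the following Python sketch stacks one asterisk per data value above a simple text axis. The function name dot_plot_rows and the axis range of 30 to 100 are choices made for this example only.

```python
from collections import Counter

# Paper packaging percentages, copied from the table above.
paper = [34, 40, 40, 42, 56, 59, 62, 63, 66, 70, 70, 70, 70,
         76, 76, 83, 83, 83, 98]

def dot_plot_rows(values, low, high):
    """Return text rows for a dot plot with one column per integer from low to high."""
    counts = Counter(values)
    tallest = max(counts.values())
    rows = []
    # Build the plot from the top of the tallest stack down to the axis.
    for level in range(tallest, 0, -1):
        row = "".join("*" if counts[x] >= level else " " for x in range(low, high + 1))
        rows.append(row.rstrip())
    return rows

rows = dot_plot_rows(paper, 30, 100)
for row in rows:
    print(row)
print("-" * 71)  # the axis, one character per percentage point from 30 to 100
print("".join(f"{tick:<10d}" for tick in range(30, 101, 10)).rstrip())  # labels every 10
```

Repeated values, such as the four countries at 70%, appear as taller stacks, which is what makes the center of the data easy to see.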
Stem-and-Leaf Plots
One of the shortcomings of dot plots is that they do not show the actual values of the data. You have to read or infer them from the graph. From the previous example, you might have been able to guess that the lowest value is 34%, but you would have to look in the data table itself to know for sure. A stem-and-leaf plot is a similar plot in which it is much easier to read the actual data values. In a stem-and-leaf plot, each data value is split into two parts: the stem and the leaf. In this example, it makes sense to use the tens digits for the stems and the ones digits for the leaves. The stems are on the left of a dividing line as follows:
Once the stems are decided, the leaves representing the ones digits are listed in numerical order from left to right:
It is important to explain the meaning of the data in the plot for someone who is viewing it without seeing the original data. For example, you could place the following sentence at the bottom of the chart:
Note:
If you could rotate this plot on its side, you would see the similarities with the dot plot. The general shape and center of the plot is easily found, and we know exactly what each point represents. This plot also shows the slight skewing to the left that we suspected from the dot plot. Stem plots can be difficult to create, depending on the numerical qualities and the spread of the data. If the data values contain more than two digits, you will need to remove some of the information by rounding. A data set that has large gaps between values can also make the stem plot hard to create and less useful when interpreting the data.
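The construction described above, with tens digits as stems and ones digits as leaves, can be sketched in Python as follows. The function name stem_and_leaf and the wording of the printed key are assumptions made for this illustration.

```python
# Paper packaging percentages, copied from the earlier table.
paper = [34, 40, 40, 42, 56, 59, 62, 63, 66, 70, 70, 70, 70,
         76, 76, 83, 83, 83, 98]

def stem_and_leaf(values):
    """Return {stem: sorted leaves}, using tens digits as stems and ones digits as leaves."""
    low, high = min(values) // 10, max(values) // 10
    # Create every stem in the range, even if it ends up with no leaves.
    stems = {stem: [] for stem in range(low, high + 1)}
    for value in sorted(values):
        stems[value // 10].append(value % 10)
    return stems

plot = stem_and_leaf(paper)
for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
print("Key: 5 | 6 means 56%")
```

Sorting the values before splitting them guarantees that the leaves on each stem appear in increasing order.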
Example: Consider the following populations of counties in California.
Butte  220,748
Calaveras  45,987
Del Norte  29,547
Fresno  942,298
Humboldt  132,755
Imperial  179,254
San Francisco  845,999
Santa Barbara  431,312
To construct a stem-and-leaf plot, we need to either round or truncate to two digits.
Value  Value Rounded  Value Truncated 

149  15  14 
657  66  65 
188  19  18 
If we decide to round the above data, we have:
Butte  220,000
Calaveras  46,000
Del Norte  30,000
Fresno  940,000
Humboldt  130,000
Imperial  180,000
San Francisco  850,000
Santa Barbara  430,000
And the stem-and-leaf plot will be as follows:
where:
Source: California State Association of Counties http://www.counties.org/default,asp?id=399
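The difference between rounding and truncating to two digits can be sketched in Python. The function name two_leading_digits is hypothetical, and the sketch assumes positive whole numbers with at least two digits. It keeps the place value of the result, so 149 becomes 150 or 140 rather than 15 or 14.

```python
# County populations, copied from the table above.
populations = {
    "Butte": 220748,
    "Calaveras": 45987,
    "Del Norte": 29547,
    "Fresno": 942298,
    "Humboldt": 132755,
    "Imperial": 179254,
    "San Francisco": 845999,
    "Santa Barbara": 431312,
}

def two_leading_digits(n, truncate=False):
    """Round (or truncate) a positive whole number so only its two leading digits remain."""
    place = 10 ** (len(str(n)) - 2)  # place value of the second digit
    if truncate:
        return (n // place) * place
    return ((n + place // 2) // place) * place  # round half up

for county, population in populations.items():
    rounded = two_leading_digits(population)
    truncated = two_leading_digits(population, truncate=True)
    print(f"{county:<14} {population:>8,}  rounded: {rounded:>8,}  truncated: {truncated:>8,}")
```

Whichever method is chosen, it should be applied to every value in the data set and stated in the key that accompanies the plot.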
Back-to-Back Stem Plots
Stem plots can also be a useful tool for comparing two distributions when placed next to each other. These are commonly called back-to-back stem plots.
In the previous example, we looked at recycling in paper packaging. Here are the same countries and their percentages of recycled material used to manufacture glass packaging:
Country  % of Glass Packaging Recycled 

Cyprus  4 
United States  21 
Poland  27 
Greece  34 
Portugal  39 
Spain  41 
Australia  44 
Ireland  56 
Italy  56 
Finland  56 
France  59 
Estonia  64 
New Zealand  72 
Netherlands  76 
Germany  81 
Austria  86 
Japan  96 
Belgium  98 
Sweden  100 
In a back-to-back stem plot, one of the distributions simply works off the left side of the stems. In this case, the spread of the glass distribution is wider, so we will have to add a few extra stems. Even if there are no data values in a stem, you must include it to preserve the spacing, or you will not get an accurate picture of the shape and spread.
We have already mentioned that the spread was larger in the glass distribution, and it is easy to see this in the comparison plot. You can also see that the glass distribution is more symmetric and is centered lower (around the mid-50's), which seems to indicate that overall, these countries manufacture a smaller percentage of glass from recycled material than they do paper. It is interesting to note in this data set that Sweden actually imports glass from other countries for recycling, so its effective percentage is actually more than 100.
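A back-to-back stem plot can be sketched in Python using the two data tables from this section. The function name back_to_back and the choice to place glass on the left are assumptions made for this illustration. Note that every stem from 0 to 10 is printed, including those with no leaves on one or both sides.

```python
# Percentages copied from the paper and glass packaging tables in this section.
paper = [34, 40, 40, 42, 56, 59, 62, 63, 66, 70, 70, 70, 70,
         76, 76, 83, 83, 83, 98]
glass = [4, 21, 27, 34, 39, 41, 44, 56, 56, 56, 59, 64, 72, 76,
         81, 86, 96, 98, 100]

def back_to_back(left, right):
    """Return text rows with shared stems; left-hand leaves grow outward to the left."""
    low = min(left + right) // 10
    high = max(left + right) // 10
    parts = []
    for stem in range(low, high + 1):  # keep empty stems to preserve the spacing
        left_leaves = sorted((v % 10 for v in left if v // 10 == stem), reverse=True)
        right_leaves = sorted(v % 10 for v in right if v // 10 == stem)
        parts.append((" ".join(map(str, left_leaves)), stem, " ".join(map(str, right_leaves))))
    width = max(len(left_text) for left_text, _, _ in parts)
    return [f"{left_text:>{width}} | {stem:2d} | {right_text}".rstrip()
            for left_text, stem, right_text in parts]

rows = back_to_back(glass, paper)
print("glass | stem | paper")
for row in rows:
    print(row)
```

Leaves on the left side are listed in decreasing order so that, on both sides, the smallest leaf sits closest to the stem.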
Displaying Bivariate Data
Scatterplots and Line Plots
Bivariate simply means two variables. All our previous work was with univariate, or single-variable, data. The goal of examining bivariate data is usually to show some sort of relationship or association between the two variables.
Example: We have looked at recycling rates for paper packaging and glass. It would be interesting to see if there is a predictable relationship between the percentages of each material that a country recycles. Following is a data table that includes both percentages.
Country  % of Paper Packaging Recycled  % of Glass Packaging Recycled 

Estonia  34  64 
New Zealand  40  72 
Poland  40  27 
Cyprus  42  4 
Portugal  56  39 
United States  59  21 
Italy  62  56 
Spain  63  41 
Australia  66  44 
Greece  70  34 
Finland  70  56 
Ireland  70  55 
Netherlands  70  76 
Sweden  70  100 
France  76  59 
Germany  83  81 
Austria  83  44 
Belgium  83  98 
Japan  98  96 
Figure: Paper and Glass Packaging Recycling Rates for 19 countries
Scatterplots
We will place the paper recycling rates on the horizontal axis and those for glass on the vertical axis. Next, we will plot a point that shows each country's rate of recycling for the two materials. This series of disconnected points is referred to as a scatterplot.
Recall that one of the things you saw from the stem-and-leaf plot is that, in general, a country's recycling rate for glass is lower than its paper recycling rate. On the next graph, we have plotted a line that represents the paper and glass recycling rates being equal. If all the countries had the same paper and glass recycling rates, each point in the scatterplot would be on the line. Because most of the points are actually below this line, you can see that the glass rate is lower than would be expected if they were similar.
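The comparison with the equal-rate line can also be checked directly from the data table. In the Python sketch below, a point lies below the line when a country's glass rate is less than its paper rate. The variable names are chosen for this illustration only.

```python
# (paper %, glass %) pairs, copied from the two-variable table above.
rates = {
    "Estonia": (34, 64), "New Zealand": (40, 72), "Poland": (40, 27),
    "Cyprus": (42, 4), "Portugal": (56, 39), "United States": (59, 21),
    "Italy": (62, 56), "Spain": (63, 41), "Australia": (66, 44),
    "Greece": (70, 34), "Finland": (70, 56), "Ireland": (70, 55),
    "Netherlands": (70, 76), "Sweden": (70, 100), "France": (76, 59),
    "Germany": (83, 81), "Austria": (83, 44), "Belgium": (83, 98),
    "Japan": (98, 96),
}

# Paper is on the horizontal axis and glass on the vertical axis,
# so "below the line" means the glass rate is smaller than the paper rate.
below = [country for country, (paper, glass) in rates.items() if glass < paper]
above = [country for country, (paper, glass) in rates.items() if glass > paper]

print(f"Below the equal-rate line: {len(below)} countries")
print(f"Above the equal-rate line: {len(above)} countries: {', '.join(above)}")
```

Counting points on each side of the line gives a quick numerical confirmation of what the scatterplot shows visually.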
With univariate data, we initially characterize a data set by describing its shape, center, and spread. For bivariate data, we will also discuss three important characteristics: shape, direction, and strength. These characteristics will inform us about the association between the two variables. The easiest way to describe these traits for this scatterplot is to think of the data as a cloud. If you draw an ellipse around the data, the general trend is that the ellipse is rising from left to right.
Data that are oriented in this manner are said to have a positive linear association. That is, as one variable increases, the other variable also increases. In this example, it is mostly true that countries with higher paper recycling rates have higher glass recycling rates. Lines that rise in this direction have a positive slope, and lines that trend downward from left to right have a negative slope. If the ellipse cloud were trending down in this manner, we would say the data had a negative linear association. For example, we might expect this type of relationship if we graphed a country's glass recycling rate with the percentage of glass that ends up in a landfill. As the recycling rate increases, the landfill percentage would have to decrease.
The ellipse cloud also gives us some information about the strength of the linear association. If there were a strong linear relationship between the glass and paper recycling rates, the cloud of data would be much longer than it is wide. Long and narrow ellipses mean a strong linear association, while shorter and wider ones show a weaker linear relationship. In this example, there are some countries for which the glass and paper recycling rates do not seem to be related.
New Zealand, Estonia, and Sweden (circled in yellow) have much lower paper recycling rates than their glass recycling rates, and Austria (circled in green) is an example of a country with a much lower glass recycling rate than its paper recycling rate. These data points are spread away from the rest of the data enough to make the ellipse much wider, weakening the association between the variables.
On the Web
http://tinyurl.com/y8vcm5y Guess the correlation.
Line Plots
Example: The following data set shows the change in the total amount of municipal waste generated in the United States during the 1990's:
Year  Municipal Waste Generated (Millions of Tons) 

1990  269 
1991  294 
1992  281 
1993  292 
1994  307 
1995  323 
1996  327 
1997  327 
1998  340 
Figure: Total Municipal Waste Generated in the US by Year in Millions of Tons. Source: http://www.zerowasteamerica.org/MunicipalWasteManagementReport1998.htm
In this example, the time in years is considered the explanatory variable, or independent variable, and the amount of municipal waste is the response variable, or dependent variable. It is not only the passage of time that causes our waste to increase. Other factors, such as population growth, economic conditions, and societal habits and attitudes also contribute as causes. However, it would not make sense to view the relationship between time and municipal waste in the opposite direction.
When one of the variables is time, it will almost always be the explanatory variable. Because time is a continuous variable, and we are very often interested in the change a variable exhibits over a period of time, there is some meaning to the connection between the points in a plot involving time as an explanatory variable. In this case, we use a line plot. A line plot is simply a scatterplot in which we connect successive chronological observations with a line segment to give more information about how the data values are changing over a period of time. Here is the line plot for the US Municipal Waste data:
It is easy to see general trends from this type of plot. For example, we can spot the period in which the most dramatic increase occurred (from 1990 to 1991) by looking for the steepest line segment. We can also spot the years after which the waste output decreased or remained about the same (1991 and 1996). It would be interesting to investigate some possible reasons for the behaviors of these individual years.
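The steepness of each segment in a line plot is simply the change between consecutive observations. The following Python sketch computes those year-to-year changes from the table above. The variable names are assumptions made for this illustration.

```python
# Municipal waste generated in millions of tons, copied from the table above.
waste = {1990: 269, 1991: 294, 1992: 281, 1993: 292, 1994: 307,
         1995: 323, 1996: 327, 1997: 327, 1998: 340}

years = sorted(waste)
# Change along each line segment, keyed by (start year, end year).
changes = {(start, end): waste[end] - waste[start]
           for start, end in zip(years, years[1:])}

steepest = max(changes, key=changes.get)
not_increasing = [segment for segment, change in changes.items() if change <= 0]

for (start, end), change in changes.items():
    print(f"{start} to {end}: {change:+d} million tons")
print("Largest increase:", steepest)
print("Decrease or no change:", not_increasing)
```

Because the observations are one year apart, the change over each segment is also the slope of that segment in millions of tons per year.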
Lesson Summary
Bar graphs are used to represent categorical data in a manner that looks similar to, but is not the same as, a histogram. Pie (or circle) graphs are also useful ways to display categorical variables, especially when it is important to show how percentages of an entire data set fit into individual categories.
A dot plot is a convenient way to represent univariate numerical data by plotting individual dots along a single number line to represent each value. They are especially useful in giving a quick impression of the shape, center, and spread of the data set, but are tedious to create by hand when dealing with large data sets. Stem-and-leaf plots show similar information with the added benefit of showing the actual data values.
Bivariate data can be represented using a scatterplot to show what, if any, association there is between the two variables. Usually one of the variables, the explanatory (independent) variable, can be identified as having an impact on the value of the other variable, the response (dependent) variable. The explanatory variable should be placed on the horizontal axis, and the response variable should be on the vertical axis. Each point is plotted individually on a scatterplot. If there is an association between the two variables, it can be identified as being strong if the points form a very distinct shape with little variation from that shape in the individual points. It can be identified as being weak if the points appear more randomly scattered. If the values of the response variable generally increase as the values of the explanatory variable increase, the data have a positive association. If the response variable generally decreases as the explanatory variable increases, the data have a negative association. In a line graph, there is significance to the change between consecutive points, so these points are connected. Line graphs are often used when the explanatory variable is time.
Points to Consider
 What characteristics of a data set make it easier or harder to represent using dot plots, stem-and-leaf plots, or histograms?
 Which plots are most useful to interpret the ideas of shape, center, and spread?
 What effects does the shape of a data set have on the statistical measures of center and spread?
Multimedia Links
For a description of how to draw a stem-and-leaf plot, as well as how to derive information from one (14.0), see APUS07, Stem-and-Leaf Plot (8:08).
Review Questions
 Computer equipment contains many elements and chemicals that are either hazardous, or potentially valuable when recycled. The following data set shows the contents of a typical desktop computer weighing approximately 27 kg. Some of the more hazardous substances, like Mercury, have been included in the 'other' category, because they occur in relatively small amounts that are still dangerous and toxic.
Material  Kilograms 

Plastics  6.21 
Lead  1.71 
Aluminum  3.83 
Iron  5.54 
Copper  2.12 
Tin  0.27 
Zinc  0.60 
Nickel  0.23 
Barium  0.05 
Other elements and chemicals  6.44 
Figure: Weight of materials that make up the total weight of a typical desktop computer. Source: http://dste.puducherry.gov.in/envisnew/INDUSTRIAL%20SOLID%20WASTE.htm
(a) Create a bar graph for this data.
(b) Complete the chart below to show the approximate percentage of the total weight for each material.
Material  Kilograms  Approximate Percentage of Total Weight 

Plastics  6.21  
Lead  1.71  
Aluminum  3.83  
Iron  5.54  
Copper  2.12  
Tin  0.27  
Zinc  0.60  
Nickel  0.23  
Barium  0.05  
Other elements and chemicals  6.44 
(c) Create a circle graph for this data.
 The following table gives the percentages of municipal waste recycled by state in the United States, including the District of Columbia, in 1998. Data was not available for Idaho or Texas.
State  Percentage 

Alabama  23 
Alaska  7 
Arizona  18 
Arkansas  36 
California  30 
Colorado  18 
Connecticut  23 
Delaware  31 
District of Columbia  8 
Florida  40 
Georgia  33 
Hawaii  25 
Illinois  28 
Indiana  23 
Iowa  32 
Kansas  11 
Kentucky  28 
Louisiana  14 
Maine  41 
Maryland  29 
Massachusetts  33 
Michigan  25 
Minnesota  42 
Mississippi  13 
Missouri  33 
Montana  5 
Nebraska  27 
Nevada  15 
New Hampshire  25 
New Jersey  45 
New Mexico  12 
New York  39 
North Carolina  26 
North Dakota  21 
Ohio  19 
Oklahoma  12 
Oregon  28 
Pennsylvania  26 
Rhode Island  23 
South Carolina  34 
South Dakota  42 
Tennessee  40 
Utah  19 
Vermont  30 
Virginia  35 
Washington  48 
West Virginia  20 
Wisconsin  36 
Wyoming  5 
Source: http://www.zerowasteamerica.org/MunicipalWasteManagementReport1998.htm
(a) Create a dot plot for this data.
(b) Discuss the shape, center, and spread of this distribution.
(c) Create a stem-and-leaf plot for the data.
(d) Use your stem-and-leaf plot to find the median percentage for this data.
 Identify the important features of the shape of each of the following distributions.
Questions 4-7 refer to the following dot plots:
 Identify the overall shape of each distribution.
 How would you characterize the center(s) of these distributions?
 Which of these distributions has the smallest standard deviation?
 Which of these distributions has the largest standard deviation?
 In question 2, you looked at the percentage of waste recycled in each state. Do you think there is a relationship between the percentage recycled and the total amount of waste that a state generates? Here are the data, including both variables.
State  Percentage  Total Amount of Municipal Waste in Thousands of Tons 

Alabama  23  5549 
Alaska  7  560 
Arizona  18  5700 
Arkansas  36  4287 
California  30  45000 
Colorado  18  3084 
Connecticut  23  2950 
Delaware  31  1189 
District of Columbia  8  246 
Florida  40  23617 
Georgia  33  14645 
Hawaii  25  2125 
Illinois  28  13386 
Indiana  23  7171 
Iowa  32  3462 
Kansas  11  4250 
Kentucky  28  4418 
Louisiana  14  3894 
Maine  41  1339 
Maryland  29  5329 
Massachusetts  33  7160 
Michigan  25  13500 
Minnesota  42  4780 
Mississippi  13  2360 
Missouri  33  7896 
Montana  5  1039 
Nebraska  27  2000 
Nevada  15  3955 
New Hampshire  25  1200 
New Jersey  45  8200 
New Mexico  12  1400 
New York  39  28800 
North Carolina  26  9843 
North Dakota  21  510 
Ohio  19  12339 
Oklahoma  12  2500 
Oregon  28  3836 
Pennsylvania  26  9440 
Rhode Island  23  477 
South Carolina  34  8361 
South Dakota  42  510 
Tennessee  40  9496 
Utah  19  3760 
Vermont  30  600 
Virginia  35  9000 
Washington  48  6527 
West Virginia  20  2000 
Wisconsin  36  3622 
Wyoming  5  530 
(a) Identify the variables in this example, and specify which one is the explanatory variable and which one is the response variable.
(b) How much municipal waste was created in Illinois?
(c) Draw a scatterplot for this data.
(d) Describe the direction and strength of the association between the two variables.
 The following line graph shows the recycling rates of two different types of plastic bottles in the US from 1995 to 2001.
 Explain the general trends for both types of plastics over these years.
 What was the total change in PET bottle recycling from 1995 to 2001?
 Can you think of a reason to explain this change?
 During what years was this change the most rapid?
References
National Geographic, January 2008. Volume 213 No.1
http://www.earthpolicy.org/Updates/2006/Update51_data.htm
Technology Notes: Scatterplots on the TI-83/84 Graphing Calculator
Press [STAT][ENTER], and enter the following data, with the explanatory variable in L1 and the response variable in L2. Next, press [2ND][STATPLOT] to enter the STATPLOTS menu, and choose the first plot.
Change the settings to match the following screenshot: