CK-12 Probability and Statistics - Advanced (Second Edition)

2.2: Common Graphs and Data Plots

Difficulty Level: At Grade

Created by: CK-12

Learning Objectives

• Identify and translate data sets to and from a bar graph and a pie graph.
• Identify and translate data sets to and from a dot plot.
• Identify and translate data sets to and from a stem-and-leaf plot.
• Identify and translate data sets to and from a scatterplot and a line graph.
• Identify graph distribution shapes as skewed or symmetric, and understand the basic implication of these shapes.
• Compare distributions of univariate data (shape, center, spread, and outliers).

Introduction

In this section, we will continue to investigate the different types of graphs that can be used to interpret a data set. In addition to a few more ways to represent single numerical variables, we will also study methods for displaying categorical variables. You will also be introduced to using a scatterplot and a line graph to show the relationship between two variables.

Categorical Variables: Bar Graphs and Pie Graphs

Example: E-Waste and Bar Graphs

We live in an age of unprecedented access to increasingly sophisticated and affordable personal technology. Cell phones, computers, and televisions now improve so rapidly that, while they may still be in working condition, the drive to make use of the latest technological breakthroughs leads many to discard usable electronic equipment. Much of it ends up in a landfill, where the chemicals from batteries and other electronics add toxins to the environment. Approximately 80% of the electronics discarded in the United States is also exported to third world countries, where it is disposed of under generally hazardous conditions by unprotected workers.$^1$ The following table shows the tonnage of the most common types of electronic equipment discarded in the United States in 2005.

Electronic Equipment Thousands of Tons Discarded
Cathode Ray Tube (CRT) TV's 7591.1
CRT Monitors 389.8
Printers, Keyboards, Mice 324.9
Desktop Computers 259.5
Laptop Computers 30.8
Projection TV's 132.8
Cell Phones 11.7
LCD Monitors 4.9

Figure: Electronics Discarded in the US (2005). Source: National Geographic, January 2008. Volume 213 No.1, pg 73.

The type of electronic equipment is a categorical variable, and therefore, this data can easily be represented using the bar graph below:

While this looks very similar to a histogram, the bars in a bar graph usually are separated slightly. The graph is just a series of disjoint categories.

Please note that discussions of shape, center, and spread have no meaning for a bar graph, and it is not, in fact, even appropriate to refer to this graph as a distribution. For example, some students misinterpret a graph like this by saying it is skewed right. If we rearranged the categories in a different order, the same data set could be made to look skewed left. Do not try to infer any of these concepts from a bar graph!
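The point about reordering can be checked with a small text-only sketch in Python. It draws the same table as a bar graph in two category orders. The function name `bar_lines` and the 40-character scale are illustrative choices, not part of the lesson.

```python
# Tonnage table from the lesson (thousands of tons discarded, US, 2005).
discarded = {
    "CRT TVs": 7591.1,
    "CRT Monitors": 389.8,
    "Printers, Keyboards, Mice": 324.9,
    "Desktop Computers": 259.5,
    "Laptop Computers": 30.8,
    "Projection TVs": 132.8,
    "Cell Phones": 11.7,
    "LCD Monitors": 4.9,
}

def bar_lines(table, order, width=40):
    """Return one text bar per category, scaled so the largest bar is `width` long."""
    largest = max(table.values())
    lines = []
    for name in order:
        bar = "#" * round(table[name] / largest * width)
        lines.append(f"{name:<26}|{bar} {table[name]}")
    return lines

# Two equally valid orderings of the same categories.
descending = sorted(discarded, key=discarded.get, reverse=True)
ascending = descending[::-1]

print("\n".join(bar_lines(discarded, descending)))
print()
print("\n".join(bar_lines(discarded, ascending)))
```

Both printouts contain exactly the same bars. Only the order changes, which is why the apparent "skew" of a bar graph carries no information.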

Pie Graphs

Usually, data that can be represented in a bar graph can also be shown using a pie graph (also commonly called a circle graph or pie chart). In this representation, we convert the count into a percentage so we can show each category relative to the total. Each percentage is then converted into a proportionate sector of the circle. To make this conversion, write the percentage as a decimal and multiply it by 360, the total number of degrees in a circle. For example, 86.8% becomes $0.868 \times 360 \approx 312.5$ degrees.

Here is a table with the percentages and the approximate angle measure of each sector:

Electronic Equipment Thousands of Tons Discarded Percentage of Total Discarded Angle Measure of Circle Sector
Cathode Ray Tube (CRT) TV's 7591.1 86.8 312.5
CRT Monitors 389.8 4.5 16.0
Printers, Keyboards, Mice 324.9 3.7 13.4
Desktop Computers 259.5 3.0 10.7
Laptop Computers 30.8 0.4 1.3
Projection TV's 132.8 1.5 5.5
Cell Phones 11.7 0.1 0.5
LCD Monitors 4.9 $\sim 0$ 0.2
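The percentages and sector angles can be reproduced with a few lines of Python. This is a minimal sketch. The helper name `sector` is hypothetical, and the angles are computed from the unrounded proportions.

```python
# Thousands of tons discarded, copied from the table.
tons = {
    "CRT TVs": 7591.1,
    "CRT Monitors": 389.8,
    "Printers, Keyboards, Mice": 324.9,
    "Desktop Computers": 259.5,
    "Laptop Computers": 30.8,
    "Projection TVs": 132.8,
    "Cell Phones": 11.7,
    "LCD Monitors": 4.9,
}
total = sum(tons.values())

def sector(value, total):
    """Return (percentage, angle in degrees) for one category."""
    fraction = value / total  # proportion of the whole, as a decimal
    return fraction * 100, fraction * 360

for name, value in tons.items():
    pct, angle = sector(value, total)
    print(f"{name:<26}{pct:6.1f}%{angle:8.1f} degrees")
```

Small differences from a hand-built table can appear depending on whether each angle is computed from the rounded or the unrounded percentage.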

And here is the completed pie graph:

Displaying Univariate Data

Dot Plots

A dot plot is one of the simplest ways to represent numerical data. After choosing an appropriate scale on the axes, each data point is plotted as a single dot. Multiple points at the same value are stacked on top of each other using equal spacing to help convey the shape and center.

Example: The following is a data set representing the percentage of paper packaging manufactured from recycled materials for a select group of countries.

Percentage of the paper packaging used in a country that is recycled. Source: National Geographic, January 2008. Volume 213 No.1, pg 86-87.
Country % of Paper Packaging Recycled
Estonia 34
New Zealand 40
Poland 40
Cyprus 42
Portugal 56
United States 59
Italy 62
Spain 63
Australia 66
Greece 70
Finland 70
Ireland 70
Netherlands 70
Sweden 76
France 76
Germany 83
Austria 83
Belgium 83
Japan 98

The dot plot for this data would look like this:

Notice that this data set is centered at a recycled-content rate of between 65 and 70 percent. It is spread from 34% to 98% and appears very roughly symmetric, perhaps even slightly skewed left. Dot plots have the advantage of showing all the data points and giving a quick and easy snapshot of the shape, center, and spread. Dot plots are not much help when there is little repetition in the data. They can also be very tedious if you are creating them by hand with large data sets, though computer software can make quick and easy work of creating dot plots from such data sets.
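As one illustration of how software can build a dot plot, here is a minimal text-only sketch in Python. It is not the tool used for the lesson's figure. The function name `dot_plot_rows` is hypothetical, and each `o` stands for one country.

```python
from collections import Counter

# Paper packaging recycling rates (%) from the table above.
paper = [34, 40, 40, 42, 56, 59, 62, 63, 66, 70, 70, 70, 70, 76, 76, 83, 83, 83, 98]

def dot_plot_rows(values):
    """Return text rows, top row first; repeated values are stacked vertically."""
    counts = Counter(values)
    low, high = min(values), max(values)
    rows = []
    for level in range(max(counts.values()), 0, -1):
        # Put a dot in this row wherever a value occurs at least `level` times.
        row = "".join("o" if counts[v] >= level else " " for v in range(low, high + 1))
        rows.append(row.rstrip())
    return rows

low, high = min(paper), max(paper)
span = high - low + 1

for row in dot_plot_rows(paper):
    print(row)
print("-" * span)                     # horizontal axis
print(f"{low}{high:>{span - 2}}")     # label the two ends of the axis
```

Each column is one percentage value, so the tallest stack (four dots at 70) shows where the data pile up.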

Stem-and-Leaf Plots

One of the shortcomings of dot plots is that they do not show the actual values of the data. You have to read or infer them from the graph. From the previous example, you might have been able to guess that the lowest value is 34%, but you would have to look in the data table itself to know for sure. A stem-and-leaf plot is a similar plot in which it is much easier to read the actual data values. In a stem-and-leaf plot, each data value is split into two parts: the stem and the leaf. In this example, it makes sense to use the tens digits for the stems and the ones digits for the leaves. The stems are on the left of a dividing line as follows:

Once the stems are decided, the leaves representing the ones digits are listed in numerical order from left to right:

It is important to explain the meaning of the data in the plot for someone who is viewing it without seeing the original data. For example, you could place the following sentence at the bottom of the chart:

Note: $5|69$ means 56% and 59% are the two values in the 50's.

If you could rotate this plot on its side, you would see the similarities with the dot plot. The general shape and center of the plot are easily found, and we know exactly what each point represents. This plot also shows the slight skewing to the left that we suspected from the dot plot. Stem plots can be difficult to create, depending on the numerical qualities and the spread of the data. If the data values contain more than two digits, you will need to remove some of the information by rounding. A data set that has large gaps between values can also make the stem plot hard to create and less useful when interpreting the data.
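The construction described above (tens digits as stems, ones digits as leaves in increasing order) can be sketched in Python as follows. The function name `stem_and_leaf` is hypothetical, and the sketch assumes non-negative whole-number data.

```python
from collections import defaultdict

# Paper packaging recycling rates (%) from the dot plot example.
paper = [34, 40, 40, 42, 56, 59, 62, 63, 66, 70, 70, 70, 70, 76, 76, 83, 83, 83, 98]

def stem_and_leaf(values):
    """Return lines like '5|69': tens digits are stems, ones digits are leaves."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(str(v % 10))
    # Include every stem between the smallest and largest, even one with no leaves.
    return [f"{stem}|{''.join(leaves.get(stem, []))}"
            for stem in range(min(leaves), max(leaves) + 1)]

for line in stem_and_leaf(paper):
    print(line)
print("Note: 5|69 means 56% and 59%")
```

Because the leaves are sorted within each stem, the median can be found by counting leaves from either end of the plot.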

Example: Consider the following populations of counties in California.

Butte - 220,748

Calaveras - 45,987

Del Norte - 29,547

Fresno - 942,298

Humboldt - 132,755

Imperial - 179,254

San Francisco - 845,999

Santa Barbara - 431,312

To construct a stem-and-leaf plot, we need to either round or truncate to two digits.

Value Value Rounded Value Truncated
149 15 14
657 66 65
188 19 18

$2|2$ represents $220,000 - 229,999$ when data has been truncated

$2|2$ represents $215,000 - 224,999$ when data has been rounded.
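The difference between rounding and truncating to two digits can be checked with a short Python sketch. The helper name `two_digit_forms` is hypothetical. It assumes positive whole numbers with at least two digits and rounds halves up.

```python
def two_digit_forms(value):
    """Return (rounded, truncated) two-digit versions of a positive integer."""
    unit = 10 ** (len(str(value)) - 2)      # place value of the second digit
    truncated = value // unit               # simply drop the remaining digits
    rounded = (value + unit // 2) // unit   # round half up to the nearest unit
    return rounded, truncated

# The three values from the table above.
for value in (149, 657, 188):
    print(value, two_digit_forms(value))

# County populations from the example, rounded to two digits.
counties = {
    "Butte": 220748, "Calaveras": 45987, "Del Norte": 29547, "Fresno": 942298,
    "Humboldt": 132755, "Imperial": 179254, "San Francisco": 845999,
    "Santa Barbara": 431312,
}
for name, population in counties.items():
    unit = 10 ** (len(str(population)) - 2)
    rounded, _ = two_digit_forms(population)
    print(f"{name} - {rounded * unit:,}")
```

One edge case to keep in mind: a value such as 995 rounds up to 100, which no longer fits in two digits.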

If we decide to round the above data, we have:

Butte - 220,000

Calaveras - 46,000

Del Norte - 30,000

Fresno - 940,000

Humboldt - 130,000

Imperial - 180,000

San Francisco - 850,000

Santa Barbara - 430,000

And the stem-and-leaf plot will be as follows:

where:

$2|2$ represents $215,000 - 224,999$, because the data has been rounded.

Source: California State Association of Counties http://www.counties.org/default,asp?id=399

Back-to-Back Stem Plots

Stem plots can also be a useful tool for comparing two distributions when placed next to each other. These are commonly called back-to-back stem plots.

In the previous example, we looked at recycling in paper packaging. Here are the same countries and their percentages of recycled material used to manufacture glass packaging:

Percentage of the glass packaging used in a country that is recycled. Source: National Geographic, January 2008. Volume 213 No.1, pg 86-87.
Country % of Glass Packaging Recycled
Cyprus 4
United States 21
Poland 27
Greece 34
Portugal 39
Spain 41
Australia 44
Ireland 56
Italy 56
Finland 56
France 59
Estonia 64
New Zealand 72
Netherlands 76
Germany 81
Austria 86
Japan 96
Belgium 98
Sweden 100

In a back-to-back stem plot, one of the distributions simply works off the left side of the stems. In this case, the spread of the glass distribution is wider, so we will have to add a few extra stems. Even if there are no data values in a stem, you must include it to preserve the spacing, or you will not get an accurate picture of the shape and spread.

We have already mentioned that the spread was larger in the glass distribution, and it is easy to see this in the comparison plot. You can also see that the glass distribution is more symmetric and is centered lower (around the mid-50's), which seems to indicate that overall, these countries manufacture a smaller percentage of glass from recycled material than they do paper. It is interesting to note in this data set that Sweden actually imports glass from other countries for recycling, so its effective percentage is actually more than 100.
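A back-to-back stem plot can be sketched in Python as shown here. The function name `back_to_back` is hypothetical. The sketch puts glass on the left and paper on the right, an arbitrary choice, and uses the values from the single-variable paper and glass tables above.

```python
paper = [34, 40, 40, 42, 56, 59, 62, 63, 66, 70, 70, 70, 70, 76, 76, 83, 83, 83, 98]
glass = [4, 21, 27, 34, 39, 41, 44, 56, 56, 56, 59, 64, 72, 76, 81, 86, 96, 98, 100]

def back_to_back(left, right):
    """Return rows 'left leaves|stem|right leaves' sharing one column of stems."""
    def group(values):
        groups = {}
        for v in sorted(values):
            groups.setdefault(v // 10, []).append(str(v % 10))
        return groups

    left_groups, right_groups = group(left), group(right)
    first = min(min(left_groups), min(right_groups))
    last = max(max(left_groups), max(right_groups))
    width = max(len(leaf_list) for leaf_list in left_groups.values())

    rows = []
    for stem in range(first, last + 1):  # keep empty stems to preserve spacing
        # Left-hand leaves grow outward from the stem, so reverse them.
        left_leaves = "".join(reversed(left_groups.get(stem, [])))
        right_leaves = "".join(right_groups.get(stem, []))
        rows.append(f"{left_leaves:>{width}}|{stem:>2}|{right_leaves}")
    return rows

print("Glass |stem| Paper")
for row in back_to_back(glass, paper):
    print(row)
```

The empty row for stem 1 is kept on purpose. Dropping it would hide the gap between Cyprus and the rest of the glass distribution.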

Displaying Bivariate Data

Scatterplots and Line Plots

Bivariate simply means two variables. All our previous work was with univariate, or single-variable data. The goal of examining bivariate data is usually to show some sort of relationship or association between the two variables.

Example: We have looked at recycling rates for paper packaging and glass. It would be interesting to see if there is a predictable relationship between the percentages of each material that a country recycles. Following is a data table that includes both percentages.

Country % of Paper Packaging Recycled % of Glass Packaging Recycled
Estonia 34 64
New Zealand 40 72
Poland 40 27
Cyprus 42 4
Portugal 56 39
United States 59 21
Italy 62 56
Spain 63 41
Australia 66 44
Greece 70 34
Finland 70 56
Ireland 70 55
Netherlands 70 76
Sweden 70 100
France 76 59
Germany 83 81
Austria 83 44
Belgium 83 98
Japan 98 96

Figure: Paper and Glass Packaging Recycling Rates for 19 countries

Scatterplots

We will place the paper recycling rates on the horizontal axis and those for glass on the vertical axis. Next, we will plot a point that shows each country's rate of recycling for the two materials. This series of disconnected points is referred to as a scatterplot.

Recall that one of the things you saw from the stem-and-leaf plot is that, in general, a country's recycling rate for glass is lower than its paper recycling rate. On the next graph, we have plotted a line that represents the paper and glass recycling rates being equal. If all the countries had the same paper and glass recycling rates, each point in the scatterplot would be on the line. Because most of the points are actually below this line, you can see that the glass rate is lower than would be expected if they were similar.
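The claim that most points lie below the line of equal rates can be checked by counting. The sketch below uses the (paper, glass) pairs from the two-variable table above. The variable names are illustrative.

```python
# (paper %, glass %) for each country, from the two-variable table.
rates = {
    "Estonia": (34, 64), "New Zealand": (40, 72), "Poland": (40, 27),
    "Cyprus": (42, 4), "Portugal": (56, 39), "United States": (59, 21),
    "Italy": (62, 56), "Spain": (63, 41), "Australia": (66, 44),
    "Greece": (70, 34), "Finland": (70, 56), "Ireland": (70, 55),
    "Netherlands": (70, 76), "Sweden": (70, 100), "France": (76, 59),
    "Germany": (83, 81), "Austria": (83, 44), "Belgium": (83, 98),
    "Japan": (98, 96),
}

# A point is below the line glass = paper when its glass rate is the smaller one.
below = [country for country, (paper, glass) in rates.items() if glass < paper]
above = [country for country, (paper, glass) in rates.items() if glass > paper]
on_line = [country for country, (paper, glass) in rates.items() if glass == paper]

print(f"below the line: {len(below)}")
print(f"above the line: {len(above)} {above}")
print(f"on the line:    {len(on_line)}")
```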

With univariate data, we initially characterize a data set by describing its shape, center, and spread. For bivariate data, we will also discuss three important characteristics: shape, direction, and strength. These characteristics will inform us about the association between the two variables. The easiest way to describe these traits for this scatterplot is to think of the data as a cloud. If you draw an ellipse around the data, the general trend is that the ellipse is rising from left to right.

Data that are oriented in this manner are said to have a positive linear association. That is, as one variable increases, the other variable also increases. In this example, it is mostly true that countries with higher paper recycling rates have higher glass recycling rates. Lines that rise in this direction have a positive slope, and lines that trend downward from left to right have a negative slope. If the ellipse cloud were trending down in this manner, we would say the data had a negative linear association. For example, we might expect this type of relationship if we graphed a country's glass recycling rate with the percentage of glass that ends up in a landfill. As the recycling rate increases, the landfill percentage would have to decrease.

The ellipse cloud also gives us some information about the strength of the linear association. If there were a strong linear relationship between the glass and paper recycling rates, the cloud of data would be much longer than it is wide. Long and narrow ellipses mean a strong linear association, while shorter and wider ones show a weaker linear relationship. In this example, there are some countries for which the glass and paper recycling rates do not seem to be related.

New Zealand, Estonia, and Sweden (circled in yellow) have much lower paper recycling rates than their glass recycling rates, and Austria (circled in green) is an example of a country with a much lower glass recycling rate than its paper recycling rate. These data points are spread away from the rest of the data enough to make the ellipse much wider, weakening the association between the variables.
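This section describes direction and strength only visually. One common numerical summary, not developed in this section, is the Pearson correlation coefficient $r$. Its sign gives the direction, and its distance from 0 reflects the strength of the linear association. The sketch below computes it by hand for the two-variable table. The function name `pearson_r` is an illustrative choice.

```python
from math import sqrt

# Paper and glass rates in the row order of the two-variable table.
paper = [34, 40, 40, 42, 56, 59, 62, 63, 66, 70, 70, 70, 70, 70, 76, 83, 83, 83, 98]
glass = [64, 72, 27, 4, 39, 21, 56, 41, 44, 34, 56, 55, 76, 100, 59, 81, 44, 98, 96]

def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

r = pearson_r(paper, glass)
print(f"r = {r:.2f}")
```

A value of $r$ that is positive but well short of 1 matches the picture described above: a rising cloud that is widened by the countries far from the general trend.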

On the Web

http://tinyurl.com/y8vcm5y Guess the correlation.

Line Plots

Example: The following data set shows the change in the total amount of municipal waste generated in the United States during the 1990's:

Year Municipal Waste Generated (Millions of Tons)
1990 269
1991 294
1992 281
1993 292
1994 307
1995 323
1996 327
1997 327
1998 340

Figure: Total Municipal Waste Generated in the US by Year in Millions of Tons. Source: http://www.zerowasteamerica.org/MunicipalWasteManagementReport1998.htm

In this example, the time in years is considered the explanatory variable, or independent variable, and the amount of municipal waste is the response variable, or dependent variable. It is not only the passage of time that causes our waste to increase. Other factors, such as population growth, economic conditions, and societal habits and attitudes also contribute as causes. However, it would not make sense to view the relationship between time and municipal waste in the opposite direction.

When one of the variables is time, it will almost always be the explanatory variable. Because time is a continuous variable, and we are very often interested in the change a variable exhibits over a period of time, there is some meaning to the connection between the points in a plot involving time as an explanatory variable. In this case, we use a line plot. A line plot is simply a scatterplot in which we connect successive chronological observations with a line segment to give more information about how the data values are changing over a period of time. Here is the line plot for the US Municipal Waste data:

It is easy to see general trends from this type of plot. For example, we can spot the year in which the most dramatic increase occurred (from 1990 to 1991) by looking for the steepest line segment. We can also spot the years after which the waste output decreased or remained about the same (1991 and 1996). It would be interesting to investigate some possible reasons for the behaviors of these individual years.
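The slope of each line segment is simply the change from one year to the next, so these observations can be confirmed numerically. Here is a minimal Python sketch using the table above. The variable names are illustrative.

```python
# Municipal waste generated in the US (millions of tons), from the table.
waste = {1990: 269, 1991: 294, 1992: 281, 1993: 292, 1994: 307,
         1995: 323, 1996: 327, 1997: 327, 1998: 340}

years = sorted(waste)
# Change over each one-year segment, keyed by the segment's starting year.
changes = {year: waste[nxt] - waste[year] for year, nxt in zip(years, years[1:])}

steepest = max(changes, key=changes.get)
not_increasing = [year for year, change in changes.items() if change <= 0]

for year, change in changes.items():
    print(f"{year} to {year + 1}: {change:+d}")
print("largest increase starts in", steepest)
print("no increase in the year after", not_increasing)
```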

Lesson Summary

Bar graphs are used to represent categorical data in a manner that looks similar to, but is not the same as, a histogram. Pie (or circle) graphs are also useful ways to display categorical variables, especially when it is important to show how percentages of an entire data set fit into individual categories.

A dot plot is a convenient way to represent univariate numerical data by plotting individual dots along a single number line to represent each value. Dot plots are especially useful in giving a quick impression of the shape, center, and spread of the data set, but are tedious to create by hand when dealing with large data sets. Stem-and-leaf plots show similar information with the added benefit of showing the actual data values.

Bivariate data can be represented using a scatterplot to show what, if any, association there is between the two variables. Usually one of the variables, the explanatory (independent) variable, can be identified as having an impact on the value of the other variable, the response (dependent) variable. The explanatory variable should be placed on the horizontal axis, and the response variable should be on the vertical axis. Each point is plotted individually on a scatterplot.

If there is an association between the two variables, it can be identified as being strong if the points form a very distinct shape with little variation from that shape in the individual points. It can be identified as being weak if the points appear more randomly scattered. If the values of the response variable generally increase as the values of the explanatory variable increase, the data have a positive association. If the response variable generally decreases as the explanatory variable increases, the data have a negative association.

In a line graph, there is significance to the change between consecutive points, so these points are connected. Line graphs are often used when the explanatory variable is time.

Points to Consider

• What characteristics of a data set make it easier or harder to represent using dot plots, stem-and-leaf plots, or histograms?
• Which plots are most useful to interpret the ideas of shape, center, and spread?
• What effects does the shape of a data set have on the statistical measures of center and spread?

For a description of how to draw a stem-and-leaf plot, as well as how to derive information from one (14.0), see APUS07, Stem-and-Leaf Plot (8:08).

Review Questions

1. Computer equipment contains many elements and chemicals that are either hazardous or potentially valuable when recycled. The following data set shows the contents of a typical desktop computer weighing approximately 27 kg. Some of the more hazardous substances, like mercury, have been included in the 'other' category, because they occur in relatively small amounts that are still dangerous and toxic.
Material Kilograms
Plastics 6.21
Aluminum 3.83
Iron 5.54
Copper 2.12
Tin 0.27
Zinc 0.60
Nickel 0.23
Barium 0.05
Other elements and chemicals 6.44

Figure: Weight of materials that make up the total weight of a typical desktop computer. Source: http://dste.puducherry.gov.in/envisnew/INDUSTRIAL%20SOLID%20WASTE.htm

(a) Create a bar graph for this data.

(b) Complete the chart below to show the approximate percentage of the total weight for each material.

Material Kilograms Approximate Percentage of Total Weight
Plastics 6.21
Aluminum 3.83
Iron 5.54
Copper 2.12
Tin 0.27
Zinc 0.60
Nickel 0.23
Barium 0.05
Other elements and chemicals 6.44

(c) Create a circle graph for this data.

2. The following table gives the percentages of municipal waste recycled by state in the United States, including the District of Columbia, in 1998. Data was not available for Idaho or Texas.
State Percentage
Alabama 23
Arizona 18
Arkansas 36
California 30
Connecticut 23
Delaware 31
District of Columbia 8
Florida 40
Georgia 33
Hawaii 25
Illinois 28
Indiana 23
Iowa 32
Kansas 11
Kentucky 28
Louisiana 14
Maine 41
Maryland 29
Massachusetts 33
Michigan 25
Minnesota 42
Mississippi 13
Missouri 33
Montana 5
New Hampshire 25
New Jersey 45
New Mexico 12
New York 39
North Carolina 26
North Dakota 21
Ohio 19
Oklahoma 12
Oregon 28
Pennsylvania 26
Rhode Island 23
South Carolina 34
South Dakota 42
Tennessee 40
Utah 19
Vermont 30
Virginia 35
Washington 48
West Virginia 20
Wisconsin 36
Wyoming 5

(a) Create a dot plot for this data.

(b) Discuss the shape, center, and spread of this distribution.

(c) Create a stem-and-leaf plot for the data.

(d) Use your stem-and-leaf plot to find the median percentage for this data.

3. Identify the important features of the shape of each of the following distributions.

Questions 4-7 refer to the following dot plots:

4. Identify the overall shape of each distribution.
5. How would you characterize the center(s) of these distributions?
6. Which of these distributions has the smallest standard deviation?
7. Which of these distributions has the largest standard deviation?
8. In question 2, you looked at the percentage of waste recycled in each state. Do you think there is a relationship between the percentage recycled and the total amount of waste that a state generates? Here are the data, including both variables.
State Percentage Total Amount of Municipal Waste in Thousands of Tons
Alabama 23 5549
Arizona 18 5700
Arkansas 36 4287
California 30 45000
Connecticut 23 2950
Delaware 31 1189
District of Columbia 8 246
Florida 40 23617
Georgia 33 14645
Hawaii 25 2125
Illinois 28 13386
Indiana 23 7171
Iowa 32 3462
Kansas 11 4250
Kentucky 28 4418
Louisiana 14 3894
Maine 41 1339
Maryland 29 5329
Massachusetts 33 7160
Michigan 25 13500
Minnesota 42 4780
Mississippi 13 2360
Missouri 33 7896
Montana 5 1039
New Hampshire 25 1200
New Jersey 45 8200
New Mexico 12 1400
New York 39 28800
North Carolina 26 9843
North Dakota 21 510
Ohio 19 12339
Oklahoma 12 2500
Oregon 28 3836
Pennsylvania 26 9440
Rhode Island 23 477
South Carolina 34 8361
South Dakota 42 510
Tennessee 40 9496
Utah 19 3760
Vermont 30 600
Virginia 35 9000
Washington 48 6527
West Virginia 20 2000
Wisconsin 36 3622
Wyoming 5 530

(a) Identify the variables in this example, and specify which one is the explanatory variable and which one is the response variable.

(b) How much municipal waste was created in Illinois?

(c) Draw a scatterplot for this data.

(d) Describe the direction and strength of the association between the two variables.

9. The following line graph shows the recycling rates of two different types of plastic bottles in the US from 1995 to 2001.

(a) Explain the general trends for both types of plastics over these years.

(b) What was the total change in PET bottle recycling from 1995 to 2001?

(c) Can you think of a reason to explain this change?

(d) During what years was this change the most rapid?

References

National Geographic, January 2008. Volume 213 No.1

$^1$ http://www.etoxics.org/site/PageServer?pagename=svtc_global_ewaste_crisis

Technology Notes: Scatterplots on the TI-83/84 Graphing Calculator

Press [STAT][ENTER], and enter the following data, with the explanatory variable in L1 and the response variable in L2. Next, press [2ND][STAT-PLOT] to enter the STAT-PLOTS menu, and choose the first plot.

Change the settings to match the following screenshot:

Date Created:

Feb 23, 2012

Dec 15, 2014