2.2: Common Graphs and Data Plots
Learning Objectives
- Identify and translate data sets to and from a bar graph and a pie graph.
- Identify and translate data sets to and from a dot plot.
- Identify and translate data sets to and from a stem-and-leaf plot.
- Identify and translate data sets to and from a scatterplot and a line graph.
- Identify graph distribution shapes as skewed or symmetric and understand the basic implication of these shapes.
- Compare distributions of univariate data (shape, center, spread, and outliers).
Introduction
In this section we will continue to investigate the different types of graphs that can be used to interpret a data set. In addition to a few more ways to represent single-variable numerical data, we will also cover a couple of methods for displaying categorical variables and an introduction to using a scatterplot and a line graph to show the relationship between two variables. Continued emphasis will be placed on what can be learned about the data by describing the shape, center, and spread of the distributions. We will also begin to compare the different graphical representations in terms of what additional information each can or cannot give about a data set.
Categorical Variables: Bar Graphs and Pie Graphs
E-Waste and Bar Graphs
We live in an age of unprecedented access to increasingly sophisticated and affordable personal technology. Cell phones, computers, and televisions now improve so rapidly that, while they may still be in working condition, the drive to make use of the latest technological breakthroughs leads many to discard usable electronic equipment. Much of that equipment ends up in a landfill, where the chemicals from batteries and other electronics add toxins to the environment. Approximately of the electronics discarded in the United States is also exported to third world countries where it is disposed of under generally hazardous conditions by unprotected workers. The following table shows the tonnage of the most common types of electronic equipment discarded in the United States in 2005.
Electronic Equipment | Thousands of Tons Discarded |
---|---|
Cathode Ray Tube (CRT) TV’s | |
CRT Monitors | |
Printers, Keyboards, Mice | |
Desktop Computers | |
Laptop Computers | |
Projection TV’s | |
Cell Phones | |
LCD Monitors |
Figure: Electronics Discarded in the US (2005). Source: National Geographic, January 2008, Volume 213, No. 1, pg. 73.
The type of electronic equipment is a categorical variable and therefore this data can easily be represented using the bar graph below:
While this looks very similar to a histogram, the bars in a bar graph are usually separated slightly. Histograms are used to show a range of continuous, numerical data. Even if we “pushed” the bars together, the space between them would have no meaning; the graph is just a series of disjoint categories.
Please note that discussions of shape, center, and spread have no meaning for a bar graph and it is not, in fact, even appropriate to refer to this graph as a distribution. For example, some students misinterpret a graph like this by saying it is skewed right. If we rearranged the categories in a different order, the same data set could be made to look skewed left. Do not try to infer any of these concepts from a bar graph!
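The point about rearranging categories can be checked with a small sketch. The function name `text_bar_chart` and the counts below are hypothetical, chosen only for illustration; they are not the e-waste figures from the table.

```python
def text_bar_chart(counts, scale=1):
    """Return one line per category: the label, a bar of '#' marks, and the value."""
    width = max(len(label) for label in counts)
    lines = []
    for label, value in counts.items():
        bar = "#" * round(value / scale)
        lines.append(f"{label.ljust(width)} | {bar} {value}")
    return lines

# Hypothetical counts, NOT the values from the e-waste table.
counts = {"Category A": 12, "Category B": 5, "Category C": 8}
print("\n".join(text_bar_chart(counts)))

# Sorting the categories by size changes which way the graph seems to "lean",
# but the underlying data set is exactly the same.
reordered = dict(sorted(counts.items(), key=lambda item: item[1]))
print("\n".join(text_bar_chart(reordered)))
```

Because the order of the categories is arbitrary, any apparent "skew" in the printed bars is an artifact of that order and says nothing about the data.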
Pie Graphs
Usually, data that can be represented in a bar graph can also be shown using a pie graph (also commonly called a circle graph or pie chart). In this representation, we convert the count into a percentage so we can show each category relative to the total. Each percentage is then converted into a proportionate sector of the circle. To make this conversion, multiply the percentage, written as a decimal, by 360, which is the total number of degrees in a circle. For example, a category that makes up 25% of the total gets a sector of 0.25 × 360 = 90 degrees.
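The conversion from counts to percentages to sector angles can be sketched as follows. The function name `pie_sectors` and the counts are hypothetical and are not the e-waste data.

```python
def pie_sectors(counts):
    """Map each category to (percentage of total, sector angle in degrees)."""
    total = sum(counts.values())
    sectors = {}
    for label, value in counts.items():
        fraction = value / total  # share of the whole, as a decimal
        sectors[label] = (100 * fraction, 360 * fraction)
    return sectors

# Hypothetical counts, NOT the values from the e-waste table.
counts = {"Category A": 50, "Category B": 30, "Category C": 20}
for label, (percent, angle) in pie_sectors(counts).items():
    print(f"{label}: {percent:.1f}% -> {angle:.1f} degrees")
```

The angles always add up to 360 degrees, which is a quick way to check a table of sector measures computed by hand.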
Here is a table with the percentages and the approximate angle measure of each sector:
Electronic Equipment | Thousands of Tons Discarded | Percentage of Total Discarded | Angle Measure of Circle Sector |
---|---|---|---|
Cathode Ray Tube (CRT) TV’s | |||
CRT Monitors | |||
Printers, Keyboards, Mice | |||
Desktop Computers | |||
Laptop Computers | |||
Projection TV’s | |||
Cell Phones | |||
LCD Monitors |
And here is the completed pie graph:
Numerical Variables: Dot Plots
A dot plot is one of the simplest ways to represent numerical data. After choosing an appropriate scale on the axes, each data point is plotted as a single dot. Multiple points at the same value are stacked on top of each other using equal spacing to help convey the shape and center.
For example, here is data from the percentage of paper packaging manufactured from recycled materials for a select group of countries.
Country | % of Paper Packaging Recycled |
---|---|
Estonia | |
New Zealand | |
Poland | |
Cyprus | |
Portugal | |
United States | |
Italy | |
Spain | |
Australia | |
Greece | |
Finland | |
Ireland | |
Netherlands | |
Sweden | |
France | |
Germany | |
Austria | |
Belgium | |
Japan |
The dot plot for this data would look like this:
Notice that this data is centered around a manufacturing rate using recycled materials of between and It is spread from up to , and appears very roughly symmetric, perhaps even slightly skewed left. Dot plots have the advantage of showing all the data points and giving a quick and easy snapshot of the shape, center, and spread. Dot plots are not much help when there is little repetition in the data. They can also be very tedious to create by hand for large data sets, though computer software can make quick and easy work of creating dot plots from such data sets.
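As a rough illustration of how software stacks the dots, here is a minimal text-based sketch. It assumes whole-number data; the function name `text_dot_plot` and the values are hypothetical and are not the recycling percentages from the table.

```python
from collections import Counter

def text_dot_plot(values):
    """Return rows of a dot plot (top row first), ending with an axis-label row."""
    counts = Counter(values)
    low, high = min(values), max(values)
    tallest = max(counts.values())
    rows = []
    for level in range(tallest, 0, -1):
        # Place a dot in this row for every value repeated at least `level` times.
        row = "".join(
            " * " if counts[v] >= level else "   " for v in range(low, high + 1)
        )
        rows.append(row.rstrip())
    rows.append("".join(f"{v:^3}" for v in range(low, high + 1)).rstrip())
    return rows

# Hypothetical whole-number data, NOT the values from the table.
print("\n".join(text_dot_plot([3, 4, 4, 5, 5, 5, 6, 8])))
```

Every value between the minimum and maximum gets a column, even if no data point falls there, so gaps in the data remain visible.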
Numerical Variables: Stem-and-Leaf Plots
One of the shortcomings of dot plots is that they do not show the actual values of the data; you have to read or infer them from the graph. From the previous example, you might have been able to guess that the lowest value is , but you would have to look in the data table itself to know for sure. A stem-and-leaf plot is a similar plot in which it is much easier to read the actual data values. In a stem-and-leaf plot, each data value is represented by two digits: the stem and the leaf. In this example it makes sense to use the tens digits for the stems and the ones digits for the leaves. The stems are on the left of a dividing line as follows:
Once the stems are decided, the leaves representing the ones digits are listed in numerical order from left to right.
It is important to explain the meaning of the data in the plot for someone who is viewing it without seeing the original data. For example, you could place the following sentence at the bottom of the chart:
Note: means and are the two values in the .
If you could rotate this plot on its side, you would see the similarities with the dot plot. The general shape and center of the plot are easily found, and we know exactly what each point represents. This plot also shows the slight skewing to the left that we suspected from the dot plot. Stem plots can be difficult to create depending on the numerical qualities and the spread of the data. If the data values contain more than two digits, you will need to remove some of the information by rounding. Data that has large gaps between values can also make the stem plot hard to create and less useful when interpreting the data.
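The construction described above, with tens digits as stems and ones digits as leaves listed in order, can be sketched as follows. The sketch assumes whole numbers from 0 to 99; the function name `stem_and_leaf` and the values are hypothetical and are not the recycling data.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return the lines of a stem-and-leaf plot for whole numbers from 0 to 99."""
    leaves = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)  # tens digit, ones digit
        leaves[stem].append(leaf)
    low, high = min(leaves), max(leaves)
    # Every stem in the range gets a line, even when it has no leaves.
    return [
        f"{stem} | {' '.join(str(leaf) for leaf in leaves.get(stem, []))}".rstrip()
        for stem in range(low, high + 1)
    ]

# Hypothetical values, NOT the percentages from the table.
print("\n".join(stem_and_leaf([23, 25, 41, 47, 48])))
print("Note: 2 | 3 5 means 23 and 25 are the two values in the 20s.")
```

The final printed line plays the role of the explanatory note recommended above for readers who do not have the original data.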
Back-to-Back Stem Plots
Stem plots can also be a useful tool for comparing two distributions when placed next to each other, in what is commonly called a "back-to-back" stem plot.
In the previous example we looked at recycling in paper packaging. Here is data from the same countries and their percentages of recycled material used to manufacture glass packaging.
Country | % of Glass Packaging Recycled |
---|---|
Cyprus | |
United States | |
Poland | |
Greece | |
Portugal | |
Spain | |
Australia | |
Ireland | |
Italy | |
Finland | |
France | |
Estonia | |
New Zealand | |
Netherlands | |
Germany | |
Austria | |
Japan | |
Belgium | |
Sweden |
In a back-to-back stem plot, one of the distributions simply works off the left side of the stems. In this case, the spread of the glass distribution is wider, so we will have to add a few extra stems. Even if there is no data in a stem, you must include it to preserve the spacing or you will not get an accurate picture of the shape and spread.
We had already mentioned that the spread was larger in the glass distribution and it is easy to see this in the comparison plot. You can also see that the glass distribution is more symmetric and is centered lower (around the mid-) which seems to indicate that overall, these countries manufacture a smaller percentage of glass from recycled material than they do paper. It is interesting to note in this data set that Sweden actually imports glass from other countries for recycling, so their effective percentage is actually more than
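A back-to-back stem plot can be sketched in the same text-based way. The sketch assumes whole numbers from 0 to 99; the function name `back_to_back` and both data sets are hypothetical and are not the paper and glass percentages.

```python
def back_to_back(left_values, right_values):
    """Return lines of a back-to-back stem plot for whole numbers from 0 to 99."""
    def leaves_by_stem(values):
        table = {}
        for value in sorted(values):
            stem, leaf = divmod(value, 10)
            table.setdefault(stem, []).append(str(leaf))
        return table

    left, right = leaves_by_stem(left_values), leaves_by_stem(right_values)
    # Shared stems cover both distributions; empty stems are kept for spacing.
    stems = range(min(min(left), min(right)), max(max(left), max(right)) + 1)
    # Left-hand leaves read outward from the stem, so their order is reversed.
    left_text = {s: " ".join(reversed(left.get(s, []))) for s in stems}
    width = max(len(text) for text in left_text.values())
    return [
        f"{left_text[s].rjust(width)} | {s} | {' '.join(right.get(s, []))}".rstrip()
        for s in stems
    ]

# Hypothetical values, NOT the percentages from the tables.
print("\n".join(back_to_back([12, 15, 31], [21, 24, 38])))
```

Both distributions share one column of stems, which is what makes the comparison of center and spread possible at a glance.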
Bivariate Data: Scatterplots and Line Plots
Bivariate simply means two variables. All our previous work was with univariate, or single-variable data. The goal of examining bivariate data is usually to show some sort of relationship or association between the two variables. In the previous example, we looked at recycling rates for paper packaging and glass. It would be interesting to see if there is a predictable relationship between the percentages of each material that a country recycles. Here is a data table that includes both percentages.
Country | % of Paper Packaging Recycled | % of Glass Packaging Recycled |
---|---|---|
Estonia | ||
New Zealand | ||
Poland | ||
Cyprus | ||
Portugal | ||
United States | ||
Italy | ||
Spain | ||
Australia | ||
Greece | ||
Finland | ||
Ireland | ||
Netherlands | ||
Sweden | ||
France | ||
Germany | ||
Austria | ||
Belgium | ||
Japan |
Figure: Paper and Glass Packaging Recycling Rates for 19 countries
Scatterplots
We will place the paper recycling rates on the horizontal axis, and the glass on the vertical axis. Next, plot a point that shows each country’s rate of recycling for the two materials. This series of disconnected points is referred to as a scatterplot.
What can we learn from plotting the data in this manner? Remember that one of the things you saw from the stem-and-leaf plot is that in general, a country’s recycling rate for glass is lower than its paper recycling rate. On the next graph we have plotted a line that represents paper and glass recycling rates being equal. If all the countries had the same rates, each point in the scatterplot would be on the line. Because most of the points are actually below this line, you can see that the glass rate is lower than would be expected if they were similar.
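The comparison against the line of equal rates amounts to checking, for each point, whether the vertical value is less than, equal to, or greater than the horizontal value. A minimal sketch follows; the function name `compare_to_equal_line` and the pairs are hypothetical and are not the recycling rates from the table.

```python
def compare_to_equal_line(points):
    """Count the points that fall below, on, and above the line y = x."""
    below = sum(1 for x, y in points if y < x)
    on = sum(1 for x, y in points if y == x)
    above = sum(1 for x, y in points if y > x)
    return {"below": below, "on": on, "above": above}

# Hypothetical (paper %, glass %) pairs, NOT the values from the table.
rates = [(70, 60), (65, 40), (50, 55), (80, 80), (60, 30)]
print(compare_to_equal_line(rates))
```

With paper on the horizontal axis, a point below the line is a country whose glass rate is lower than its paper rate.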
In univariate data, we are interested primarily in the ideas of shape, center, and spread to initially characterize a data set. For bivariate data, we will also discuss three important characteristics that are slightly different: shape, direction, and strength. These inform us about the association between the two variables. We will save formal discussions of these ideas, as well as statistics to quantify them, for a later chapter, but direction and strength are easy to introduce in this example. The easiest way to describe these traits for this scatterplot is to think of the data as a “cloud.” If you draw an ellipse around the data, the general trend is that the ellipse is rising from left to right.
Data that is oriented in this manner is said to have a positive linear association. That is, as one variable increases, the other variable also increases. In this example, it is mostly true that countries with higher paper recycling rates have higher glass recycling rates. This is similar to the concept of slope in algebra. Lines that rise from left to right have a positive slope, and lines that trend downward from left to right have a negative slope. If the ellipse cloud were trending downward from left to right, we would say the data had a negative linear association. For example, we might expect this type of relationship if we graphed a country’s glass recycling rate against the percentage of glass that ends up in a landfill. As the recycling rate increases, the landfill percentage would have to decrease.
The ellipse cloud also gives us some information about the strength of the linear association. If there were a strong linear relationship between glass and paper recycling rates, the cloud of data would be much longer than it is wide. Long and narrow ellipses mean a strong linear association; shorter and wider ones show a weaker linear relationship. In this example, there are some countries in which the glass and paper recycling rates do not seem to be related.
New Zealand, Estonia, and Sweden (circled in yellow) have much lower paper recycling rates than their glass rates, and Austria (circled in green) is an example of a country with a much lower glass rate than their paper rate. These data points are spread away from the rest of the data enough to make the ellipse much wider, therefore weakening the association between the variables.
Explanatory and Response Variables
In this example, there was really no compelling reason to put paper on the horizontal axis and glass on the vertical. We could have learned the same information about the plot if we had switched those variables. In many data sets, however, the variables are related in such a way that one variable appears to have an impact on the other. In the last lesson, we examined countries that are the top consumers of bottled water per person. If we compared this to the amount of plastics that these countries are disposing of in landfills, it is natural to think that a higher rate of drinking bottled water could lead to a response in the amount of plastic waste created in that country. In this case we would refer to the bottled water consumed as the explanatory variable (also referred to in science and math as the independent variable). The explanatory variable should be placed on the horizontal axis. The amount of plastic waste is called the response variable (also referred to in science and math as the dependent variable), which should be placed on the vertical axis. There are most likely other variables involved, like the total population, recycling rate, and consumption of other plastics, so we are not implying that bottled water consumption is the sole cause of change in plastic waste. Without actual data it is difficult to even comment on the strength of the relationship, but it makes sense to look at the general relationship in these terms. It is very natural to think of this as a cause and effect relationship, though you will learn in a later chapter that it is very dangerous to assume such a relationship without performing a properly controlled statistical experiment.
Line Plots
The following data set shows the change in the total amount of municipal waste generated in the United States during the 1990’s.
Year | Municipal Waste Generated (Millions of Tons) |
---|---|
1990 | |
1991 | |
1992 | |
1993 | |
1994 | |
1995 | |
1996 | |
1997 | |
1998 |
Figure: Total Municipal Waste Generated in the US by Year in Millions of Tons. Source: http://www.zerowasteamerica.org/MunicipalWasteManagementReport1998.htm
In this example, the time in years is the explanatory variable and the amount of municipal waste is the response variable. It is not the passage of time that causes our waste to increase. Other factors such as population growth, economic conditions, and societal habits and attitudes contribute as causes. But it would not make sense to view the relationship between time and municipal waste in the opposite direction.
When one of the variables is time, it will almost always be the explanatory variable. Because time is a continuous variable and we are very often interested in the change a variable exhibits over a period of time, there is some meaning to the connection between the points in a plot involving time as an explanatory variable. In this case we use a line plot. A line plot is simply a scatterplot in which we connect successive chronological observations with a line segment to give more information about how the data is changing over a period of time. Here is the line plot for the US Municipal Waste data:
It is easy to see general trends from this type of plot. For example, we can spot the year in which the most dramatic increase occurred by looking at the steepest line (1990). We can also spot the years in which the waste output decreased and/or remained about the same (1991 and 1996). It would be interesting to investigate some possible reasons for the behaviors of these individual years.
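Reading the steepest segment off a line plot corresponds to finding the largest change between consecutive observations. The sketch below labels each segment by its starting year, which appears to be the convention used in the paragraph above. The function name `yearly_changes` and the totals are hypothetical and are not the municipal waste figures from the table.

```python
def yearly_changes(series):
    """Return {year: change from that year to the next observed year}."""
    years = sorted(series)
    return {a: series[b] - series[a] for a, b in zip(years, years[1:])}

# Hypothetical totals, NOT the values from the municipal waste table.
waste = {2001: 100, 2002: 112, 2003: 111, 2004: 115}
changes = yearly_changes(waste)

# The steepest upward segment is the one with the largest increase.
steepest = max(changes, key=changes.get)
# Segments that are flat or falling have a change of zero or less.
flat_or_down = [year for year, change in changes.items() if change <= 0]

print(changes)
print("steepest increase starts in", steepest)
print("flat or decreasing segments start in", flat_or_down)
```

A cutoff of exactly zero is a simplification; deciding what counts as "about the same" on a real graph is a judgment call.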
Scatterplots and Line Plots on the Graphing Calculator
Scatterplots
Enter the data from the scatterplot example of recycling rates. Place the paper rates in and the glass rates in
Next, press 2ND [STAT-PLOT] to enter the STAT-PLOTS menu and choose the first plot.
Change the settings to match the following screenshot:
This selects a scatterplot with the explanatory variable in and the response variable in . In order to see the points better, you should choose either the square or the plus sign for the mark. Finally, set an appropriate Window to match the data. In this case, we looked at our lowest and highest percentages in each variable, and added a bit of room to create a pleasant window. Press [GRAPH] to see the result, which is shown below.
Line Plots
Your graphing calculator will also draw a line plot and the process is almost identical to that for creating a scatterplot. Enter the data from the US Municipal waste example into your lists. The only change that you need to make is to choose a line plot in the Plot1 menu.
Set an appropriate window, and graph the resulting plot.
Lesson Summary
Bar graphs are used to represent categorical data in a manner that looks similar to, but is not the same as, a histogram. Pie (or circle) graphs are also useful ways to display categorical variables, especially when it is important to show how percentages of an entire data set fit into individual categories. A dot plot is a convenient way to represent univariate numerical data by plotting individual dots along a single number line to represent each value. Dot plots are especially useful in giving a quick impression of the shape, center, and spread of the data set, but are tedious to create by hand when dealing with large data sets. Stem-and-leaf plots show similar information with the added benefit of showing the actual data values. Bivariate data can be represented using a scatterplot to show what, if any, association there is between the two variables. Usually one of the variables, the explanatory (independent) variable, can be identified as having an impact on the value of the other variable, the response (dependent) variable. The explanatory variable should be placed on the horizontal axis, and the response variable should be placed on the vertical axis. Each point is plotted individually on a scatterplot. If there is an association between the two variables, it can be identified as being strong if the points form a very distinct shape with little variation from that shape in the individual points, or weak if the points appear more randomly scattered. If the values of the response variable generally increase as the values of the explanatory variable increase, the data has a positive association. If the response variable generally decreases as the explanatory variable increases, the data has a negative association. In a line graph, there is significance to the change between consecutive points, so those points are connected. Line graphs are often used when the explanatory variable is time.
Points to Consider
- What characteristics of a data set make it easier or harder to represent it using dot plots, stem-and-leaf plots, or histograms?
- Which plots are most useful to interpret the ideas of shape, center, and spread?
- What effects does the shape of a data set have on the statistical measures of center and spread?
Review Questions
- Computer equipment contains many elements and chemicals that are either hazardous or potentially valuable when recycled. The following data set shows the contents of a typical desktop computer weighing approximately . Some of the more hazardous substances, like mercury, have been included in the “other” category because they occur in relatively small amounts that are still dangerous and toxic.
Material | Kilograms |
---|---|
Plastics | |
Lead | |
Aluminum | |
Iron | |
Copper | |
Tin | |
Zinc | |
Nickel | |
Barium | |
Other elements and chemicals |
Figure: Weight of materials that make up the total weight of a typical desktop computer. Source: http://dste.puducherry.gov.in/envisnew/INDUSTRIAL%20SOLID%20WASTE.htm
(a) Create a bar graph for this data.
(b) Complete the chart below to show the approximate percent of the total weight for each material.
Material | Kilograms | Approximate Percentage of Total Weight |
---|---|---|
Plastics | ||
Lead | ||
Aluminum | ||
Iron | ||
Copper | ||
Tin | ||
Zinc | ||
Nickel | ||
Barium | ||
Other elements and chemicals |
(c) Create a circle graph for this data.
- The following table gives the percentages of municipal waste recycled by state in the United States, including the District of Columbia, in 1998. Data was not available for Idaho or Texas.
State | Percentage |
---|---|
Alabama | |
Alaska | |
Arizona | |
Arkansas | |
California | |
Colorado | |
Connecticut | |
Delaware | |
District of Columbia | |
Florida | |
Georgia | |
Hawaii | |
Illinois | |
Indiana | |
Iowa | |
Kansas | |
Kentucky | |
Louisiana | |
Maine | |
Maryland | |
Massachusetts | |
Michigan | |
Minnesota | |
Mississippi | |
Missouri | |
Montana | |
Nebraska | |
Nevada | |
New Hampshire | |
New Jersey | |
New Mexico | |
New York | |
North Carolina | |
North Dakota | |
Ohio | |
Oklahoma | |
Oregon | |
Pennsylvania | |
Rhode Island | |
South Carolina | |
South Dakota |