2.1: Histograms and Frequency Distributions
Learning Objectives
- Read and make frequency tables for a data set.
- Identify and translate data sets to and from a histogram, a relative frequency histogram, and a frequency polygon.
- Identify histogram distribution shapes as skewed or symmetric and understand the basic implications of these shapes.
- Identify and translate data sets to and from an ogive plot (cumulative distribution function).
Introduction
Charts and graphs of various types, when created carefully, can convey important information about a data set at a glance, without requiring you to calculate, or even know about, various statistical measures. This chapter will concentrate on some of the more common visual presentations of data.
Frequency Tables
The earth has seemed so large in scope for thousands of years that it is only recently that many people have begun to take seriously the idea that we live on a planet of limited and dwindling resources. This is something that residents of the Galapagos Islands are also beginning to understand. Because of its isolation and lack of resources to support large, modernized populations of humans, the problems that we face on a global level are magnified in the Galapagos. Basic human resources such as water, food, fuel, and building materials must all be brought in to the islands. More problematically, the waste products must either be disposed of in the islands, or shipped somewhere else at a prohibitive cost. As the human population grows exponentially, the Islands are confronted with the problem of what to do with all the waste. In most communities in the United States, it is easy for many to put out the trash on the street corner each week and perhaps never worry about where that trash is going. In the Galapagos, the desire to protect the fragile ecosystem from the impacts of human waste is more urgent and is resulting in a new focus on renewing, reducing, and reusing materials as much as possible. There have been recent positive efforts to encourage recycling programs.
The Recycling Center on Santa Cruz in the Galapagos turns all the recycled glass into pavers that are used for the streets in Puerto Ayora.
It is not easy to bury tons of trash in solid volcanic rock. The sooner we realize that we are in the same position of limited space and that we have a need to preserve our global ecosystem, the more chance we have to save not only the uniqueness of the Galapagos Islands, but that of our own communities. All of the information in this chapter is focused around the issues and consequences of our recycling habits, or lack thereof!
Example: Water, Water, Everywhere!
Bottled water consumption worldwide has grown, and continues to grow, at a phenomenal rate. According to the Earth Policy Institute, 154 billion liters were produced in 2004. While there are places in the world where safe water supplies are unavailable, most of the growth in consumption has been due to other reasons. The largest consumer of bottled water is the United States, which arguably could be the country with the best access to safe, convenient, and reliable sources of tap water. The large volume of toxic waste that is generated by the plastic bottles and the small fraction of the plastic that is recycled create a considerable environmental hazard. In addition, huge volumes of carbon emissions are created when these bottles are manufactured using oil and transported great distances by oil-burning vehicles.
Example: Take an informal poll of your class. Ask each member of the class how many beverage bottles they use, on average, in a week. Once you collect this data, the first step is to organize it so it is easier to understand. A frequency table is a common starting point. Frequency tables simply display each value of the variable, and the number of occurrences (the frequency) of each of those values. In this example, the variable is the number of plastic beverage bottles of water consumed each week.
Consider the following raw data:
6, 4, 7, 7, 8, 5, 3, 6, 8, 6, 5, 7, 7, 5, 2, 6, 1, 3, 5, 4, 7, 4, 6, 7, 6, 6, 7, 5, 4, 6, 5, 3
Here are the correct frequencies using the imaginary data presented above:
Figure: Imaginary Class Data on Water Bottle Usage
Number of Plastic Beverage Bottles per Week | Frequency |
---|---|
1 | 1 |
2 | 1 |
3 | 3 |
4 | 4 |
5 | 6 |
6 | 8 |
7 | 7 |
8 | 2 |
When creating a frequency table, it is often helpful to use tally marks as a running total to avoid missing a value or over-representing another.
Number of Plastic Beverage Bottles per Week | Tally | Frequency |
---|---|---|
1 | \begin{align*}{\color{red} \vert }\end{align*} | 1 |
2 | \begin{align*}{\color{red} \vert }\end{align*} | 1 |
3 | \begin{align*}{\color{red} \vert \vert \vert }\end{align*} | 3 |
4 | \begin{align*}{\color{red} \vert \vert \vert \vert }\end{align*} | 4 |
5 | \begin{align*}{\color{red} \bcancel{ \vert \vert \vert \vert } \ \vert }\end{align*} | 6 |
6 | \begin{align*}{\color{red} \bcancel{ \vert \vert \vert \vert } \ \vert \vert \vert }\end{align*} | 8 |
7 | \begin{align*}{\color{red} \bcancel{ \vert \vert \vert \vert } \ \vert \vert }\end{align*} | 7 |
8 | \begin{align*}{\color{red} \vert \vert }\end{align*} | 2 |
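The tallying process can also be carried out by a short program. The following is a minimal sketch in Python (a language choice of ours, not something this chapter depends on) that counts the occurrences of each value in the class poll data. The variable names `bottles` and `frequency` are our own.

```python
from collections import Counter

# Raw class-poll data copied from the text (32 responses).
bottles = [6, 4, 7, 7, 8, 5, 3, 6, 8, 6, 5, 7, 7, 5, 2, 6,
           1, 3, 5, 4, 7, 4, 6, 7, 6, 6, 7, 5, 4, 6, 5, 3]

# Counter keeps a running count of how many times each value occurs,
# which is the same job the tally marks do by hand.
frequency = Counter(bottles)

# Print one row per value, in increasing order, with a simple text tally.
for value in sorted(frequency):
    print(f"{value} | {'|' * frequency[value]} | {frequency[value]}")
```

A useful check, by hand or by program, is that the frequencies add up to the number of responses collected (here, 32).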
The following data set shows the countries in the world that consume the most bottled water per person per year.
Country | Liters of Bottled Water Consumed per Person per Year |
---|---|
Italy | 183.6 |
Mexico | 168.5 |
United Arab Emirates | 163.5 |
Belgium and Luxembourg | 148.0 |
France | 141.6 |
Spain | 136.7 |
Germany | 124.9 |
Lebanon | 101.4 |
Switzerland | 99.6 |
Cyprus | 92.0 |
United States | 90.5 |
Saudi Arabia | 87.8 |
Czech Republic | 87.1 |
Austria | 82.1 |
Portugal | 80.3 |
Figure: Bottled Water Consumption per Person in Leading Countries in 2004. Source: http://www.earth-policy.org/Updates/2006/Update51_data.htm
These data values have been measured at the ratio level. There is some flexibility required in order to create meaningful and useful categories for a frequency table. The values range from 80.3 liters to 183.6 liters. By examining the data, it seems appropriate for us to create our frequency table using classes that are each 10 liters wide. We will skip the tally marks in this case, because the data values are already in numerical order, and it is easy to see how many are in each classification.
A bracket, '[' or ']', indicates that the endpoint of the interval is included in the class. A parenthesis, '(' or ')', indicates that the endpoint is not included. It is common practice in statistics to place a value that falls on the border between two classes in the higher of the two classes. For example, \begin{align*}[80-90)\end{align*} means this classification includes everything from 80 up to, but not including, 90. The value 90 is included in the next class, \begin{align*}[90 -100)\end{align*}.
Liters per Person | Frequency |
---|---|
\begin{align*}[80-90)\end{align*} | 4 |
\begin{align*}[90-100)\end{align*} | 3 |
\begin{align*}[100-110)\end{align*} | 1 |
\begin{align*}[110-120)\end{align*} | 0 |
\begin{align*}[120-130)\end{align*} | 1 |
\begin{align*}[130-140)\end{align*} | 1 |
\begin{align*}[140-150)\end{align*} | 2 |
\begin{align*}[150-160)\end{align*} | 0 |
\begin{align*}[160-170)\end{align*} | 2 |
\begin{align*}[170-180)\end{align*} | 0 |
\begin{align*}[180-190)\end{align*} | 1 |
Figure: Completed Frequency Table for World Bottled Water Consumption Data (2004)
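The grouping into classes of the form [start, start + 10) can be sketched in Python as follows. This is only an illustration under our own naming (`class_start`, `WIDTH`, and `table` are not from the text). It rounds each value down to the nearest multiple of the class width, so a value that falls exactly on a border lands in the higher class.

```python
import math
from collections import Counter

# Liters per person per year (2004), copied from the table in the text.
liters = [183.6, 168.5, 163.5, 148.0, 141.6, 136.7, 124.9, 101.4,
          99.6, 92.0, 90.5, 87.8, 87.1, 82.1, 80.3]

WIDTH = 10  # class width chosen in the text

def class_start(value, width=WIDTH):
    """Return the lower endpoint of the class [start, start + width) containing value."""
    return math.floor(value / width) * width

counts = Counter(class_start(x) for x in liters)

# Walk through every class from 80 to 180 so that empty classes appear with frequency 0.
table = {start: counts.get(start, 0) for start in range(80, 190, WIDTH)}
for start, freq in table.items():
    print(f"[{start}-{start + WIDTH}) | {freq}")
```

Listing the empty classes explicitly matters: leaving them out would hide the gaps in the data, such as the absence of any country in the 110 to 120 range.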
Histograms
Once you can create a frequency table, you are ready to create your first graphical representation, called a histogram. Let's revisit our data about student bottled beverage habits.
Number of Plastic Beverage Bottles per Week | Frequency |
---|---|
1 | 1 |
2 | 1 |
3 | 3 |
4 | 4 |
5 | 6 |
6 | 8 |
7 | 7 |
8 | 2 |
Here is the same data in a histogram:
In this case, the horizontal axis represents the variable (number of plastic bottles of water consumed), and the vertical axis is the frequency, or count. The height of each vertical bar shows the number of people in that class. For example, in the range of consuming \begin{align*}[1 -2)\end{align*} bottles, there is only one person, so the height of the bar is at 1. We can see from the graph that the most common class of bottles used by people each week is the \begin{align*}[6-7)\end{align*} range, or six bottles per week.
A histogram is for numerical data. With histograms, the different sections are referred to as bins. Think of a column, or bin, as a vertical container that collects all the data for that range of values. If a value occurs on the border between two bins, it is commonly agreed that this value will go in the larger class, or the bin to the right. It is important when drawing a histogram to be certain that there are enough bins so that the last data value is included. Often this means you have to extend the horizontal axis beyond the value of the last data point. In this example, if we had stopped the graph at 8, we would have missed that data, because the 8's actually appear in the bin between 8 and 9. Very often, when you see histograms in newspapers, magazines, or online, they may instead label the midpoint of each bin. Some graphing software will also label the midpoint of each bin, unless you specify otherwise.
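The rule that a border value goes in the bin to the right, and the need to extend the axis past the last data value, can be illustrated with a small Python sketch. The helper `bin_of` and the list `edges` are hypothetical names of our own. The sketch assumes bins of the form [left, right), as described above.

```python
from bisect import bisect_right

# Bin edges 1, 2, ..., 9: the axis extends one step past the largest data value (8).
edges = list(range(1, 10))

def bin_of(value, edges):
    """Return (left, right) for the bin [left, right) holding value, or None if it is off the axis."""
    i = bisect_right(edges, value) - 1
    if i < 0 or i >= len(edges) - 1:
        return None
    return edges[i], edges[i + 1]

print(bin_of(8, edges))       # (8, 9): a border value goes to the bin on the right
print(bin_of(8, edges[:-1]))  # None: with the axis stopped at 8, the 8's have no bin
```

Graphing software does not always follow the same convention for the final bin, so it is worth checking how a particular tool treats values that sit on a bin edge.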
On the Web
http://illuminations.nctm.org/ActivityDetail.aspx?ID=78 Here you can change the bin width and explore how it affects the shape of the histogram.
Relative Frequency Histogram
A relative frequency histogram is just like a regular histogram, but instead of labeling the frequencies on the vertical axis, we use the percentage of the total data that is present in that bin. For example, there is only one data value in the first bin. This represents \begin{align*}\frac{1}{32}\end{align*}, or approximately 3%, of the total data. Thus, the vertical bar for the bin extends upward to 3%.
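As a quick check of the arithmetic, the relative frequency of every bin can be computed by dividing its count by the total. The Python sketch below uses the frequencies from the class poll table. The names `frequency`, `total`, and `relative` are ours.

```python
# Frequencies from the class-poll frequency table in the text.
frequency = {1: 1, 2: 1, 3: 3, 4: 4, 5: 6, 6: 8, 7: 7, 8: 2}

total = sum(frequency.values())  # 32 students

# Relative frequency of a bin = its count divided by the total, written as a percentage.
relative = {value: 100 * count / total for value, count in frequency.items()}

for value, percent in relative.items():
    print(f"{value} | {percent:.3f}%")
```

The first bin comes out to 3.125%, which is the "approximately 3%" quoted above, and the percentages for all of the bins together account for 100% of the data.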
Frequency Polygons
A frequency polygon is similar to a histogram, but instead of using bins, a polygon is created by plotting the frequencies and connecting those points with a series of line segments.
To create a frequency polygon for the bottle data, we first find the midpoint of each class, plot a point at that midpoint at the height of the bin's frequency, and then connect the points with line segments. To close the polygon against the horizontal axis, also plot a point at a frequency of 0 at the midpoint of the class just above the highest class in the data, and another at the midpoint of the class just below the lowest class.
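The points of the frequency polygon for the bottle data can be listed with a short Python sketch. It assumes bins of width 1 of the form [value, value + 1), and it assumes that the two extra closing points sit on the horizontal axis at a frequency of 0. The names `points` and `polygon` are our own.

```python
# Frequencies from the class-poll frequency table; each key is the left edge of a bin.
frequency = {1: 1, 2: 1, 3: 3, 4: 4, 5: 6, 6: 8, 7: 7, 8: 2}
BIN_WIDTH = 1

# One point per bin: (midpoint of the bin, frequency of the bin).
points = [(left + BIN_WIDTH / 2, count) for left, count in sorted(frequency.items())]

# Close the polygon with a zero-frequency point one bin below the first
# midpoint and one bin above the last midpoint.
first_mid = points[0][0]
last_mid = points[-1][0]
polygon = [(first_mid - BIN_WIDTH, 0)] + points + [(last_mid + BIN_WIDTH, 0)]

for x, y in polygon:
    print(f"({x}, {y})")
```

Connecting these points in order with line segments gives the polygon, which starts at (0.5, 0) and ends at (9.5, 0).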
Here is a frequency polygon constructed directly from the previously-shown histogram:
Here is the frequency polygon in finished form:
Frequency polygons are helpful in showing the general overall shape of a distribution of data. They can also be useful for comparing two sets of data. Imagine how confusing two histograms would look graphed on top of each other!
Example: It would be interesting to compare bottled water consumption in two different years. Two frequency polygons would help give an overall picture of how the years are similar, and how they are different. In the following graph, two frequency polygons, one representing 1999, and the other representing 2004, are overlaid. 1999 is in red, and 2004 is in green.
It appears there was a shift to the right in all the data, which reflects the fact that all of the countries significantly increased their consumption. The first peak, in the lower-consuming countries, is almost identical in the two frequency polygons, but it moved up by 20 liters per person in 2004. In 1999, there was a middle peak, but that group shifted significantly to the right in 2004 (by between 40 and 60 liters per person). The frequency polygon is the first type of graph we have learned about that makes this type of comparison easier.
Cumulative Frequency Histograms and Ogive Plots
Very often, it is helpful to know how the data accumulate over the range of the distribution. To do this, we will add to our frequency table by including the cumulative frequency, which is how many of the data points are in all the classes up to and including a particular class.
Number of Plastic Beverage Bottles per Week | Frequency | Cumulative Frequency |
---|---|---|
1 | 1 | 1 |
2 | 1 | 2 |
3 | 3 | 5 |
4 | 4 | 9 |
5 | 6 | 15 |
6 | 8 | 23 |
7 | 7 | 30 |
8 | 2 | 32 |
Figure: Cumulative Frequency Table for Bottle Data
For example, the cumulative frequency for 5 bottles per week is 15, because 15 students consumed 5 or fewer bottles per week. Notice that the cumulative frequency for the last class is the same as the total number of students in the data. This should always be the case.
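The cumulative column is simply a running total of the frequency column. A minimal Python sketch, using our own variable names, is shown here.

```python
from itertools import accumulate

# Values of the variable and their frequencies, from the class-poll table.
values = [1, 2, 3, 4, 5, 6, 7, 8]
frequency = [1, 1, 3, 4, 6, 8, 7, 2]

# accumulate() yields the running total: each entry is the sum of all
# frequencies up to and including the current class.
cumulative = list(accumulate(frequency))

for v, f, c in zip(values, frequency, cumulative):
    print(f"{v} | {f} | {c}")
```

The last running total equals the sum of all the frequencies, which is the check described above.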
If we drew a histogram of the cumulative frequencies, or a cumulative frequency histogram, it would look as follows:
A relative cumulative frequency histogram would be the same, except that the vertical bars would represent the relative cumulative frequencies of the data:
Number of Plastic Beverage Bottles per Week | Frequency | Cumulative Frequency | Relative Cumulative Frequency (%) |
---|---|---|---|
1 | 1 | 1 | 3.1 |
2 | 1 | 2 | 6.3 |
3 | 3 | 5 | 15.6 |
4 | 4 | 9 | 28.1 |
5 | 6 | 15 | 46.9 |
6 | 8 | 23 | 71.9 |
7 | 7 | 30 | 93.8 |
8 | 2 | 32 | 100 |
Figure: Relative Cumulative Frequency Table for Bottle Data
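The relative cumulative column can be reproduced by dividing each running total by the number of data values. The Python sketch below is one way to do this. The helper `percent` is a hypothetical name of ours, and the choice to round halves upward is an assumption made so that the output agrees with the table (for example, 6.25 becomes 6.3).

```python
from decimal import Decimal, ROUND_HALF_UP
from itertools import accumulate

# Frequencies from the class-poll table, for 1 through 8 bottles per week.
frequency = [1, 1, 3, 4, 6, 8, 7, 2]
total = sum(frequency)

def percent(count, total):
    """Return count/total as a percentage, rounded half-up to one decimal place."""
    exact = Decimal(100 * count) / Decimal(total)
    return exact.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)

relative_cumulative = [percent(c, total) for c in accumulate(frequency)]

for bottles, p in zip(range(1, 9), relative_cumulative):
    print(f"{bottles} | {p}")
```

Python's built-in `round` rounds exact halves to the nearest even digit, so `round(6.25, 1)` gives 6.2 rather than 6.3. This is why the sketch uses the `decimal` module with an explicit rounding rule.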
Remembering what we did with the frequency polygon, we can remove the bins to create a new type of plot. In the frequency polygon, we connected the midpoints of the bins. In a relative cumulative frequency plot, we use the point on the right side of each bin.
The reason for this should make a lot of sense: when we read this plot, each point should represent the percentage of the total data that lies below that position on the horizontal axis, just like the cumulative entries in the frequency table. For example, the point that is plotted at 4 corresponds to 15.6%, because that is the percentage of the data that is less than or equal to 3. It does not include the 4's, because they are in the bin to the right of that point. This is also why we plot a point at 1 on the horizontal axis and at 0% on the vertical axis. None of the data is lower than 1, and similarly, all of the data is below 9. Here is the final version of the plot:
This plot is commonly referred to as an ogive plot. The name ogive comes from a particular pointed arch originally present in Arabic architecture and later incorporated in Gothic cathedrals. Here is a picture of a cathedral in Ecuador with a close-up of an ogive-type arch:
If a distribution is symmetric and mound shaped, then its ogive plot will look just like the shape of one half of such an arch.
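The points of the ogive plot for the bottle data can be generated with the following Python sketch. It assumes bins of width 1, places each point at the right edge of its bin as described above, and starts from 0% at the left edge of the first bin. The names `left_edges` and `ogive` are our own.

```python
from itertools import accumulate

# Left edge and frequency of each bin, from the class-poll table.
left_edges = [1, 2, 3, 4, 5, 6, 7, 8]
frequency = [1, 1, 3, 4, 6, 8, 7, 2]
BIN_WIDTH = 1
total = sum(frequency)

# Start at the left edge of the first bin, where none of the data has accumulated yet.
ogive = [(left_edges[0], 0.0)]

# Each later point sits at the RIGHT edge of a bin, at the relative
# cumulative frequency reached by the end of that bin.
for left, running_total in zip(left_edges, accumulate(frequency)):
    ogive.append((left + BIN_WIDTH, 100 * running_total / total))

for x, pct in ogive:
    print(f"({x}, {pct}%)")
```

The point at 4 comes out to 15.625%, matching the 15.6% discussed in the text, and the final point at 9 reaches 100%. Because frequencies are never negative, the plotted percentages never decrease from left to right.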
Shape, Center, Spread
In the first chapter, we introduced measures of center and spread as important descriptors of a data set. The shape of a distribution of data is very important as well. Shape, center, and spread should always be your starting point when describing a data set.
Referring to our imaginary student poll on using plastic beverage containers, we notice that the data are spread out from 1 to 8. The graph for the data illustrates this concept, and the range quantifies it. Look back at the graph and notice that there is a large concentration of students in the 5, 6, and 7 region. This would lead us to believe that the center of this data set is somewhere in this area. We use the mean and/or median to measure central tendency, but it is also important that you see that the center of the distribution is near the large concentration of data. Recognizing this is a matter of looking at the shape of the distribution.
Shape is harder to describe with a single statistical measure, so we will describe it in less quantitative terms. A very important feature of this data set, as well as many that you will encounter, is that it has a single large concentration of data that appears like a mountain. A data set that is shaped in this way is typically referred to as mound-shaped. Mound-shaped data will usually look like one of the following three pictures:
Think of these graphs as frequency polygons that have been smoothed into curves. In statistics, we refer to these graphs as density curves. The most important feature of a density curve is symmetry. The first density curve above is symmetric and mound-shaped. Notice the second curve is mound-shaped, but the center of the data is concentrated on the left side of the distribution. The right side of the data is spread out across a wider area. This type of distribution is referred to as skewed right. It is the direction of the long, spread out section of data, called the tail, that determines the direction of the skewing. For example, in the third curve, the left tail of the distribution is stretched out, so this distribution is skewed left. Our student bottle data set has this skewed-left shape.
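Although shape is described here in words, the measures of center from the first chapter can offer a rough numerical hint about skew. The Python sketch below computes the range, mean, and median of the class poll data. Comparing the mean with the median is a rule of thumb that we are adding for illustration. It is not a definition of skewness, and it does not hold for every data set.

```python
from statistics import mean, median

# Raw class-poll data copied from the text (32 responses).
bottles = [6, 4, 7, 7, 8, 5, 3, 6, 8, 6, 5, 7, 7, 5, 2, 6,
           1, 3, 5, 4, 7, 4, 6, 7, 6, 6, 7, 5, 4, 6, 5, 3]

data_range = max(bottles) - min(bottles)  # spread: 8 - 1 = 7
mean_value = mean(bottles)                # 171 / 32 = 5.34375
median_value = median(bottles)            # middle of the sorted data: 6.0

print("range: ", data_range)
print("mean:  ", mean_value)
print("median:", median_value)

# Rule of thumb only: a long left tail tends to pull the mean below the median.
if mean_value < median_value:
    print("Mean is below the median, which is consistent with a skewed-left shape.")
```

Here the few students who use only 1 or 2 bottles pull the mean down to about 5.3, while the median stays at 6, in the middle of the large concentration of data.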
Lesson Summary
A frequency table is useful to organize data into classes according to the number of occurrences, or frequency, of each class. Relative frequency shows the percentage of data in each class. A histogram is a graphical representation of a frequency table (either actual or relative frequency). A frequency polygon is created by plotting the midpoint of each bin at its frequency and connecting the points with line segments. Frequency polygons are useful for viewing the overall shape of a distribution of data, as well as comparing multiple data sets. For any distribution of data, you should always be able to describe the shape, center, and spread. A data set that is mound shaped can be classified as either symmetric or skewed. Distributions that are skewed left have the bulk of the data concentrated on the higher end of the distribution, and the lower end, or tail, of the distribution is spread out to the left. A skewed-right distribution has a large portion of the data concentrated in the lower values of the variable, with the tail spread out to the right. A relative cumulative frequency plot, or ogive plot, shows how the data accumulate across the different values of the variable.
Points to Consider
- What characteristics of a data set make it easier or harder to represent it using frequency tables, histograms, or frequency polygons?
- What characteristics of a data set make representing it using frequency tables, histograms, frequency polygons, or ogive plots more or less useful?
- What effects does the shape of a data set have on the statistical measures of center and spread?
- How do you determine the most appropriate classification to use for a frequency table or the bin width to use for a histogram?
Review Questions
- Lois was gathering data on the plastic beverage bottle consumption habits of her classmates, but she ran out of time as class was ending. When she arrived home, something had spilled in her backpack and smudged the data for the 2's. Fortunately, none of the other values was affected, and she knew there were 30 total students in the class. Complete her frequency table.
Number of Plastic Beverage Bottles per Week | Tally | Frequency |
---|---|---|
1 | \begin{align*}{\color{red} \vert \vert }\end{align*} | |
2 | | |
3 | \begin{align*}{\color{red} \vert \vert \vert }\end{align*} | |
4 | \begin{align*}{\color{red} \vert \vert }\end{align*} | |
5 | \begin{align*}{\color{red} \vert \vert \vert }\end{align*} | |
6 | \begin{align*}{\color{red} \bcancel{ \vert \vert \vert \vert } \ \vert \vert }\end{align*} | |
7 | \begin{align*}{\color{red} \bcancel{ \vert \vert \vert \vert } \ \vert }\end{align*} | |
8 | \begin{align*}{\color{red} \vert }\end{align*} | |
- The following frequency table contains exactly one data value that is a positive multiple of ten. What must that value be?
Class | Frequency |
---|---|
\begin{align*}[0 - 5)\end{align*} | 4 |
\begin{align*}[5 - 10)\end{align*} | 0 |
\begin{align*}[10 - 15)\end{align*} | 2 |
\begin{align*}[15 - 20)\end{align*} | 1 |
\begin{align*}[20 - 25)\end{align*} | 0 |
\begin{align*}[25 - 30)\end{align*} | 3 |
\begin{align*}[30 - 35)\end{align*} | 0 |
\begin{align*}[35 - 40)\end{align*} | 1 |
(a) 10
(b) 20
(c) 30
(d) 40
(e) There is not enough information to determine the answer.
- The following table includes the data from the same group of countries from the earlier bottled water consumption example, but is for the year 1999, instead.
Country | Liters of Bottled Water Consumed per Person per Year |
---|---|
Italy | 154.8 |
Mexico | 117.0 |
United Arab Emirates | 109.8 |
Belgium and Luxembourg | 121.9 |
France | 117.3 |
Spain | 101.8 |
Germany | 100.7 |
Lebanon | 67.8 |
Switzerland | 90.1 |
Cyprus | 67.4 |
United States | 63.6 |
Saudi Arabia | 75.3 |
Czech Republic | 62.1 |
Austria | 74.6 |
Portugal | 70.4 |
Figure: Bottled Water Consumption per Person in Leading Countries in 1999. Source: http://www.earth-policy.org/Updates/2006/Update51_data.htm
(a) Create a frequency table for this data set.
(b) Create the histogram for this data set.
(c) How would you describe the shape of this data set?
- The following table shows the potential energy that could be saved by manufacturing each type of material using the maximum percentage of recycled materials, as opposed to using all new materials.
Manufactured Material | Energy Saved (millions of BTU's per ton) |
---|---|
Aluminum Cans | 206 |
Copper Wire | 83 |
Steel Cans | 20 |
LDPE Plastics (e.g., trash bags) | 56 |
PET Plastics (e.g., beverage bottles) | 53 |
HDPE Plastics (e.g., household cleaner bottles) | 51 |
Personal Computers | 43 |
Carpet | 106 |
Glass | 2 |
Corrugated Cardboard | 15 |
Newspaper | 16 |
Phone Books | 11 |
Magazines | 11 |
Office Paper | 10 |
Figure: Amount of energy saved by manufacturing different materials using the maximum percentage of recycled material as opposed to using all new material. Source: National Geographic, January 2008. Volume 213 No., pg 82-83.
(a) Complete the frequency table below, including the actual frequency, the relative frequency (round to the nearest tenth of a percent), and the relative cumulative frequency.
(b) Create a relative frequency histogram from your table in part (a).
(c) Draw the corresponding frequency polygon.
(d) Create the ogive plot.
(e) Comment on the shape, center, and spread of this distribution as it relates to the original data. (Do not actually calculate any specific statistics).
(f) Add up the relative frequency column. What is the total? What should it be? Why might the total not be what you would expect?
(g) There is a portion of your ogive plot that should be horizontal. Explain what is happening with the data in this area that creates this horizontal section.
(h) What does the steepest part of an ogive plot tell you about the distribution?
On the Web
http://www.earth-policy.org/Updates/2006/Update51_data.htm
http://en.wikipedia.org/wiki/Ogive
Technology Notes: Histograms on the TI-83/84 Graphing Calculator
To draw a histogram on your TI-83/84 graphing calculator, you must first enter the data in a list. In the home screen, press [2ND][{], and then enter the data separated by commas (see the screen below). When all the data have been entered, press [2ND][}][STO], and then press [2ND][L1][ENTER].
Now you are ready to plot the histogram. Press [2ND][STAT PLOT] to enter the STAT-PLOTS menu. You can plot up to three statistical plots at one time. Choose Plot1. Turn the plot on, change the type of plot to a histogram (see sample screen below), and choose L1. Enter '1' for the Freq by pressing [2ND][A-LOCK] to turn off alpha lock, which is normally on in this menu, because most of the time you would want to enter a variable here. An alternative would be to enter the values of the variables in L1 and the frequencies in L2 as we did in Chapter 1.
Finally, we need to set a window. Press [WINDOW] and enter an appropriate window to display the plot. In this case, 'XSCL' is what determines the bin width. Also notice that the maximum \begin{align*}x\end{align*} value needs to go up to 9 to show the last bin, even though the data values stop at 8. Enter all of the values shown below.
Press [GRAPH] to display the histogram. If you press [TRACE] and then use the left or right arrows to trace along the graph, notice how the calculator uses the notation to properly represent the values in each bin.