2.1: Histograms and Frequency Distributions
Learning Objectives
- Read and make frequency tables for a data set.
- Identify and translate data sets to and from a histogram, a relative frequency histogram, and a frequency polygon.
- Identify histogram distribution shapes as skewed or symmetric and understand the basic implications of these shapes.
- Identify and translate data sets to and from an ogive plot (cumulative distribution function).
Introduction
In chapter 1, we focused on describing data using summary statistics. While this is very useful in analyzing and learning important characteristics of a data set, it is also very important and informative to represent data in some visual format. This is in fact the form in which most people are used to encountering data while engaged in such things as reading newspapers, magazines, food labels, or watching television. Charts and graphs of various types, when created carefully, can provide instantaneous important information about a data set without calculating, or even having knowledge of, various statistical measures. This chapter will concentrate on some of the more common visual presentations of data.
Frequency Tables
A Real Context: Recycling Issues
The earth has seemed so large in scope for thousands of years that it is only recently that many have begun to take seriously the idea that we live on a planet of limited and dwindling resources that is, in a sense, an island in the middle of space. This is something that residents of the Galapagos Islands are also beginning to understand on a much more dramatic level. Because of the islands' isolation and lack of resources to support large, modernized populations of humans, the problems that we face on a global level are magnified in the Galapagos, as well as in other island cultures. Basic human resources such as water, food, fuel, and building materials must all be brought in to the islands. More problematically, the waste products must either be disposed of in the islands or shipped somewhere else at a prohibitive cost. As the human population grows exponentially, the islands are confronted with the problem of what to do with all the waste. In most communities in the United States, it is easy for many to put out the trash on the street corner each week and perhaps never worry about where that trash is going. In the Galapagos, the desire to protect the fragile ecosystem from the impacts of human waste is more urgent and is resulting in a new focus on renewing, reducing, and reusing materials as much as possible. There have been recent positive efforts to encourage recycling programs.
The Recycling Center on Santa Cruz in the Galapagos turns all the recycled glass into pavers that are used for the streets in Puerto Ayora.
It is not easy to bury tons of trash in solid volcanic rock. The sooner we realize that we are in the same position of limited space and a need to preserve our global ecosystem, the more chance we have to save not only the uniqueness of the Galapagos Islands, but that of our own communities. All of the data in this chapter is focused around the issues and consequences of our recycling habits, or lack thereof!
Water, Water, Everywhere!
Bottled water consumption worldwide has grown, and continues to grow at a phenomenal rate. According to the Earth Policy Institute, gallons were produced in 2004. While there are places in the world where safe water supplies are unavailable, most of the growth in consumption has been due to other reasons. The largest consumer of bottled water is the United States, which arguably could be the country with the best access to safe, convenient, and reliable sources of tap water. The large volume of toxic waste that is generated and the small fraction of it that is recycled create a considerable environmental hazard. In addition, huge volumes of carbon emissions are created when these bottles are manufactured using oil and transported great distances by oil burning vehicles.
One reason for the large increase in bottled beverages has been an increased focus on health and fitness, which has spilled over into all aspects of life. Ask your teacher whether they ever had water bottles in their classes when they were students.
Take an informal poll of your class. Ask each member of the class, on average, how many beverage bottles they use in a week. Once you collect this data the first step is to organize it in some way that makes it easier to understand. A frequency table is a common starting point. Frequency tables simply display each value of the variable, and the number of occurrences (the frequency) of each of those values. In this example, the variable is the number of plastic beverage bottles consumed each week. You could use your class data, but let’s use an imaginary class. Here is the raw data:
Because the data is only limited to the numbers through , it is very simple to create a frequency table using those values. For example, here is a table you could use to collect data from your classmates:
Number of Plastic Beverage Bottles per Week | Frequency |
---|---|
Here are the correct frequencies using the imaginary data presented above:
Figure: Imaginary Class Data on Water Bottle Usage
Number of Plastic Beverage Bottles per Week | Frequency |
---|---|
While this data set is rather simple and small, you can see how much easier it is to interpret the data in this form. One caution about translating raw data into a more helpful visual form is that it is very easy to make a mistake, especially with a larger data set. In this case, it is often helpful to use tally marks as a running total to help construct the table and avoid missing a value or over-representing another.
Number of Plastic Beverage Bottles per Week | Tally | Frequency |
---|---|---|
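The tallying step can also be checked by machine. The following is a minimal sketch, not part of the original lesson: the list `responses` is hypothetical poll data invented for illustration (it is not the imaginary class data above), and Python's `collections.Counter` does the counting.

```python
from collections import Counter

# Hypothetical poll responses (NOT the class data from the text):
# bottles used per week by each of ten students.
responses = [3, 1, 4, 3, 5, 3, 1, 4, 2, 3]

# Counter tallies how many times each value occurs.
frequency = Counter(responses)

# Print one row per value, with tally marks as a visual cross-check.
print(f"{'Bottles':>7} | {'Tally':<8} | Frequency")
for value in sorted(frequency):
    count = frequency[value]
    print(f"{value:>7} | {'|' * count:<8} | {count}")

# The frequencies must add up to the number of responses collected.
assert sum(frequency.values()) == len(responses)
```

The final check mirrors the caution above: if the frequencies do not add up to the number of responses, a value was missed or counted twice.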
This data set could almost be considered categorical and was easy to translate into a frequency table. In many situations, you will need to create your own categories, or classifications. The following data set shows the countries in the world that consume the most bottled water per person per year.
Country | Liters of Bottled Water Consumed per Person per Year |
---|---|
Italy | |
Mexico | |
United Arab Emirates | |
Belgium and Luxembourg | |
France | |
Spain | |
Germany | |
Lebanon | |
Switzerland | |
Cyprus | |
United States | |
Saudi Arabia | |
Czech Republic | |
Austria | |
Portugal |
Figure: Bottled Water Consumption per Person in Leading Countries in 2004. Source: http://www.earth-policy.org/Updates/2006/Update51_data.htm
This data has been measured at the ratio level (see levels of measurement in chapter one), so there is some flexibility required in order to create meaningful and useful categories for a frequency table. The values range from , up to . By examining the data, it might seem appropriate for us to create a frequency table by (, etc.) We will skip the tally marks in this case because the data is already in numerical order and it is easy to see how many are in each classification.
Liters per Person | Frequency |
---|---|
Figure: Completed Frequency Table for World Bottled Water Consumption Data (2004)
Notice the mathematical notation used for each classification. A bracket [ or ] indicates that the endpoint of the interval is included in the class. A parenthesis ( or ) indicates that the endpoint is not included. What do you do with a number that is in between two classifications? For example, it is unlikely, but possible, that a country consumed exactly of bottled water per person. It is intuitive to include this in the, not the , but how would we label the categories? If you wrote and , it would seem as if belongs in both classes. But if you wrote, what would you do with ? It is common practice in statistics to include a number that borders two classes in the larger of the two. So, means this classification includes everything from that gets infinitely close to, but not equal to . Even if the bracket notation is not used, you should always place such values in the higher classification.
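The boundary rule can be written down as a short sketch. The function name `class_interval` and the class width of 10 starting at 0 are assumptions chosen only for illustration; they are not taken from the table above.

```python
import math

def class_interval(value, width, start=0):
    """Return the half-open class [lower, upper) that contains value."""
    # floor() sends a boundary value to the class that begins at it,
    # i.e. the higher of the two neighbouring classes.
    index = math.floor((value - start) / width)
    lower = start + index * width
    return lower, lower + width

# Hypothetical classes of width 10 starting at 0 (illustrative only).
print(class_interval(9.99, 10))   # (0, 10)
print(class_interval(10, 10))     # (10, 20): the boundary value goes up
```

Because each class includes its lower endpoint and excludes its upper one, every value belongs to exactly one class.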
Histograms, Not Bar Graphs!
Once you can create a frequency table, you are ready to create our first graphical representation, called a histogram. Let’s revisit our data about student bottled beverage habits.
Number of Plastic Beverage Bottles per Week | Frequency |
---|---|
Figure: Completed Frequency Table for Water Bottle Data
Here is the same data in a histogram:
In this case the horizontal axis represents the variable (number of plastic bottles) and the vertical axis is the frequency or count. Each vertical bar represents the number of people in each class of ranges of bottles (e.g. etc.) We can see from the graph that the most common class of bottles used by people each week is the range, or six bottles per week.
A Histogram is Not a Bar Graph!
Please avoid a common mistake of beginning statistics students: do not call this a bar graph! As you will learn later, bar graphs are only for categorical data. A histogram is for numerical data and most often will describe continuous data. With histograms, the different sections are referred to as "bins" rather than "bars." Think of the column, or "bin," as a vertical container that collects all the data for that range of values.
Just like the frequency table, if a value occurs on the border between two bins, it is commonly agreed that this value will go in the larger class, or the bin to the right. When students are drawing histograms, they sometimes make the error of looking at the last value in the data and stopping their horizontal axis at this point. In this example, if we had stopped the graph at 8, we would have missed that data because the 8's actually appear in the bin between 8 and 9. Very often when you see histograms in newspapers, magazines, or online, they may instead label the midpoint of each bin. Some graphing software will also label the midpoints of each bin unless you specify otherwise.
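The need to extend the axis one bin past the largest value can be seen in a small sketch. The helper `histogram_counts` and its sample data are hypothetical (the data is not the class data); the only detail borrowed from the text is a maximum value of 8 that falls in the bin between 8 and 9. The sketch assumes whole-number data and bin widths.

```python
def histogram_counts(data, width=1):
    """Count values in bins [edge, edge + width), extending past the maximum."""
    low = min(data)
    # Number of bins needed so that the maximum lands inside the last bin.
    n_bins = int((max(data) - low) // width) + 1
    edges = [low + i * width for i in range(n_bins + 1)]
    counts = [0] * n_bins
    for x in data:
        counts[int((x - low) // width)] += 1
    return edges, counts

# Hypothetical data (not the class data from the text); the maximum is 8.
edges, counts = histogram_counts([1, 2, 2, 5, 8, 8])
print(edges[-1])    # 9: the axis extends past the largest value
print(counts[-1])   # 2: both 8's sit in the last bin, [8, 9)
```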
Histograms on the Graphing Calculator
To draw a histogram on your TI-83-family graphing calculator, you must first enter the data in a list. In chapter 1 you used the List Editor. Here is another way to enter data into a list:
In the home screen, press 2ND [{] and then enter the data separated by commas (see the screen below). When all the data has been entered, press 2ND [}], then [STO], then 2ND [L1].
Now you are ready to plot the histogram. Press 2ND [STAT PLOT] to enter the STAT-PLOTS menu. You can plot up to three statistical plots at one time; choose Plot 1. Turn the plot ON, change the type of plot to a histogram (see sample screen below), and choose L1. Enter “1” for the Freq by pressing 2ND [A-LOCK] to turn off alpha lock, which is normally on in this menu because most of the time you would want to enter a name here. An alternative would be to enter the values of the variables in L1 and the frequencies in L2 as we did in chapter 1.
Finally, we need to set a window. Press [WINDOW] and enter an appropriate window to display the plot. In this case XSCL is what determines the bin width. Also notice that the maximum value needs to go up to to show the last bin, even though the data stops at .
Press [GRAPH] to display the histogram. If you press [TRACE] and then use the left or right arrows to trace along the graph, notice how the calculator uses the notation to properly represent the values in each bin.
It’s All Relative!!
A relative frequency histogram is just like a regular histogram, but instead of labeling the frequencies on the vertical axis, we use the percentage of the total data that is present in that bin. This way the numbers reflect the amount relative to the entire data set.
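As a rough sketch of the calculation, each relative frequency is the class frequency divided by the total, expressed as a percentage. The dictionary `frequency` below is a hypothetical table made up for illustration, not the class data.

```python
# Hypothetical frequency table: value -> count (not the class data).
frequency = {1: 2, 2: 1, 3: 4, 4: 2, 5: 1}
total = sum(frequency.values())

# Relative frequency = class frequency / total, expressed as a percentage.
relative = {value: 100 * count / total for value, count in frequency.items()}

for value, percent in relative.items():
    print(f"{value}: {percent:.1f}%")
```

The shape of the histogram does not change; only the scale on the vertical axis does.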
Frequency Polygons
A frequency polygon is similar to a histogram, but instead of using bins, a polygon is created by plotting the frequencies and connecting those points with a series of line segments.
To create a frequency polygon for the bottle data, we first find the midpoints of each classification, plot a point at the frequency for each bin at the midpoint, and then connect the points with line segments. To make a polygon with the horizontal axis, plot the midpoint for the class one greater than the maximum for the data, and one less than the minimum.
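The construction just described can be sketched as follows. The function `polygon_points` and the three bins with frequencies 2, 5, and 3 are assumptions used only to illustrate the steps; the bins are assumed to be equally wide.

```python
def polygon_points(edges, counts):
    """Return (midpoint, frequency) pairs, closed to the axis at both ends."""
    width = edges[1] - edges[0]
    midpoints = [(left + right) / 2 for left, right in zip(edges, edges[1:])]
    points = list(zip(midpoints, counts))
    # Add zero-frequency points one class below and one class above.
    first = (midpoints[0] - width, 0)
    last = (midpoints[-1] + width, 0)
    return [first] + points + [last]

# Hypothetical bins [1,2), [2,3), [3,4) with frequencies 2, 5, 3.
print(polygon_points([1, 2, 3, 4], [2, 5, 3]))
```

Connecting the returned points in order with line segments gives the polygon; the first and last points are the ones that bring it down to the horizontal axis.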
Here is the frequency polygon constructed directly from the histogram.
And here is the frequency polygon in finished form.
Frequency polygons are helpful in showing the general overall shape of a distribution of data. They can also be useful for comparing two sets of data. Imagine how confusing two histograms would look graphed on top of each other!
For example, we looked at the bottled water consumption of the leading countries in the year 2004 and you will work with the data from 1999 at the end of the lesson, but it would be nice to be able to compare the two distributions of data. A frequency polygon would help give an overall picture of how these results are similar and different. In the following graph, the two frequency polygons are overlaid, 1999 in red, and 2004 in green. Can you see any important differences in the way the graph is shaped?
First of all, it appears as if there was a shift to the right in all the data, which is explained by the fact that all of the countries have significantly increased their consumption. The first peak in the lower-consuming countries is almost identical but has increased by per person. In 1999 there was a middle peak, but that group showed an even more dramatic increase in 2004 and has shifted significantly to the right (by between and per person). The frequency polygon is the first type of graph we have learned that makes this type of comparison easier, and we will learn others in later lessons.
The Mantra of Descriptive Statistics: Shape, Center, Spread
In the first chapter we introduced measures of center and spread as important indicators of a data set. We now have the tools to include the shape of a distribution of data as being very important as well. The “big three”: Shape, Center, and Spread should always be your starting point when describing a data set. If a statistician had to wear a uniform, it should probably say: shape, center, and spread.
If you look back at our imaginary student poll on using plastic beverage containers, a first glance would allow us to conclude that the data is spread out from up to . The graph illustrates this concept, and we have a statistic that we used in the first chapter to quantify it: the range. Notice also that there is a larger concentration of students in the and region. This would lead us to believe that the center of this data set is somewhere in that area. We also used statistical measures to quantify this concept such as the mean and the median, but it is important that you “see” the idea of the center of the distribution as being near the large concentration of data.
Shape is harder to describe with a single statistical measure, so we will describe it in less quantitative terms. A very important feature of this data set, as well as many that you will encounter is that it has a single large concentration of data that appears like a mountain. Data that is shaped in this way is typically referred to as mound-shaped. Mound-shaped data will usually look like one of the following three pictures:
Think of these graphs as frequency polygons that have been smoothed into curves. In statistics, we refer to these graphs as density curves. Though the true definition of a density curve will come in a later chapter, we should start to get used to the correct terminology now. The most important feature of the first density curve is symmetry. A concise description of the shape of this distribution, therefore, would be symmetric and mound shaped. Notice that the second curve is mound shaped, but the center of the data is concentrated on the left side of the distribution. The right side of the data is spread out across a wider area. This type of distribution is referred to as skewed right. Be careful! Many beginning statistics students think it would intuitively make sense to refer to the side with the concentration of the data as the direction of the skewing. Instead, it is the direction of the long, spread-out section of data, called the tail, that determines the direction of the skewing. For example, in the third curve, the left tail of the distribution is stretched out, so this distribution is skewed left. Our student bottle data has this skewed left shape.
Cumulative Frequency Histograms and Ogive Plots
Very often it is helpful to know how much of the data accumulates over the range of the distribution. To do this, we will add to our frequency table by including the cumulative frequency, which is how many of the data points are in all the classes up to and including that class.
Number of Plastic Beverage Bottles per Week | Frequency | Cumulative Frequency |
---|---|---|
Figure: Cumulative Frequency Table for Bottle Data
For example, the cumulative frequency for per week is because students consumed or fewer bottles per week. Notice that the cumulative frequency for the last class is the same as the total number of students in the data. This should always be the case.
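A running total is all that is needed to build the cumulative column. In this sketch the lists `values` and `frequencies` are hypothetical (they are not the bottle data), and `itertools.accumulate` from the Python standard library produces the running totals.

```python
from itertools import accumulate

# Hypothetical frequencies for classes in increasing order (not the class data).
values = [1, 2, 3, 4, 5]
frequencies = [2, 1, 4, 2, 1]

# accumulate() produces running totals: 2, 2+1, 2+1+4, ...
cumulative = list(accumulate(frequencies))

for value, freq, cum in zip(values, frequencies, cumulative):
    print(f"{value} | {freq} | {cum}")

# The last cumulative frequency equals the total number of data points.
assert cumulative[-1] == sum(frequencies)
```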
If we drew a histogram of the cumulative frequencies, or a cumulative frequency histogram, it would look as follows:
A relative cumulative frequency histogram would be the same plot, only using the relative frequencies:
Number of Plastic Beverage Bottles per Week | Frequency | Cumulative Frequency | Relative Cumulative Frequency |
---|---|---|---|
Figure: Cumulative Frequency Table for Bottle Data
Remembering what we did with a frequency polygon, we can remove the bins to create a new type of plot. In the frequency polygon, we used the midpoint of the bin width. It is slightly different for a relative cumulative frequency plot. This time we will plot the points on the right side of each bin.
The reason for this should make a lot of sense: when we read this plot, each point should represent the percentage of the total data that is less than or equal to that value, just like the frequency table. For example, the point that is plotted at , corresponds to because that is the percentage of the data that is greater than or equal to and less than . It does not include the because they are in the bin to the right of that point. This is why we plot a point at on the horizontal axis and and on the vertical axis. None of the data is lower than , and similarly all of the data is below . Here is the final version of the plot.
"Relative cumulative frequency plot" is quite a mouthful! This plot is commonly referred to as an Ogive Plot. The name ogive comes from a particular shaped arch originally present in Arabic architecture and later incorporated in Gothic cathedrals. Here is a picture of a cathedral in Ecuador with a close-up of an ogive type arch.
If the distribution is symmetric and mound shaped, then the ogive plot will look just like the shape of one half of such an arch.
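The points of an ogive plot, as constructed in this section, can be computed with a short sketch: one point at the left edge of the first bin with 0%, then one point at the right edge of each bin with the relative cumulative frequency. The function name `ogive_points` and the bins and frequencies used are assumptions for illustration only.

```python
from itertools import accumulate

def ogive_points(edges, counts):
    """Return (right edge, cumulative percent) pairs, starting at 0%."""
    total = sum(counts)
    cumulative = accumulate(counts)
    points = [(edges[0], 0.0)]          # no data lies below the first edge
    for right_edge, running in zip(edges[1:], cumulative):
        points.append((right_edge, 100 * running / total))
    return points

# Hypothetical bins [1,2), [2,3), [3,4), [4,5) with frequencies 1, 3, 4, 2.
for x, percent in ogive_points([1, 2, 3, 4, 5], [1, 3, 4, 2]):
    print(x, percent)
```

The last point always reaches 100%, since all of the data lies below the right edge of the final bin.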
Lesson Summary
A frequency table is useful to organize data into classes according to the number of occurrences in each class, or frequency. Relative frequency shows the percentage of data in each class. A graphical representation of a frequency table (either actual or relative frequencies) that uses bins to show the amount in each class is called a histogram. Though it looks very similar, a bar graph is only used for categorical variables. A frequency polygon is created by plotting the midpoints of each bin at their frequencies and connecting the points with line segments. Frequency polygons are useful for viewing the overall shape of a distribution of data as well as comparing multiple data sets. For any distribution of data you should always be able to describe the shape, center, and spread. Data that is mound shaped can be classified as either symmetric or skewed. Distributions that are skewed left have the bulk of the data concentrated on the higher end of the distribution, and the lower end, or tail, of the distribution is spread out to the left. A skewed right distribution has a large portion of the data concentrated in the lower values of the variable with a tail spread out to the right. An ogive plot, or relative cumulative frequency plot, shows how the data accumulates across the different values of the variable.
Points to Consider
- What characteristics of a data set make it easier or harder to represent it using frequency tables, histograms, or frequency polygons?
- What characteristics of a data set make representing it using frequency tables, histograms, frequency polygons, or ogives more or less useful?
- What effects does the shape of a data set have on the statistical measures of center and spread?
- How do you determine the most appropriate classification to use for a frequency table or bin width to use for a histogram?
Review Questions
- Lois was gathering data on the plastic beverage bottle consumption habits of her classmates, but she ran out of time as class was ending. When she arrived home, something had spilled in her backpack and smudged the data for the . Fortunately, none of the other values was affected and she knew there were total students in the class. Complete her frequency table.
Number of Plastic Beverage Bottles per Week | Tally | Frequency |
---|---|---|
- The following frequency table contains exactly one data value that is a positive multiple of ten. What must that value be?
Class | Frequency |
---|---|
(a)
(b)
(c)
(d)
(e) There is not enough information to determine the answer.
- The following table includes the data from the same group of countries from the earlier bottled water consumption example, but is for the year 1999 instead.
Country | Liters of Bottled Water Consumed per Person per Year |
---|---|
Italy | |
Mexico | |
United Arab Emirates | |
Belgium and Luxembourg | |
France | |
Spain | |
Germany | |
Lebanon | |
Switzerland | |
Cyprus | |
United States | |
Saudi Arabia | |
Czech Republic | |
Austria | |
Portugal |
Figure: Bottled Water Consumption per Person in Leading Countries in 1999. Source: http://www.earth-policy.org/Updates/2006/Update51_data.htm
(a) Create a frequency table for this data set.
(b) Create the histogram for this data set.
(c) How would you describe the shape of this data set?
- The following table shows the potential energy that could be saved by manufacturing each type of material using the maximum percentage of recycled materials, as opposed to using all new materials.
Manufactured Material | Energy Saved (millions of BTU’s per ton) |
---|---|
Aluminum Cans | |
Copper Wire | |
Steel Cans | |
LDPE Plastics (e.g. trash bags) | |
PET Plastics (e.g. beverage bottles) | |
HDPE Plastics (e.g. household cleaner bottles) | |
Personal Computers | |
Carpet | |
Glass | |
Corrugated Cardboard | |
Newspaper | |
Phone Books | |
Magazines | |
Office Paper |
Figure: Amount of energy saved by manufacturing different materials using the maximum percentage of recycled material as opposed to using all new material. Source: National Geographic, January 2008, Volume 213, No. 1, pp. 82-83
(a) Complete the frequency table below including the actual frequency, the relative frequency (round to the nearest tenth of a percent), and the relative cumulative frequency.
(b) Create a relative frequency histogram from your table in part a.
(c) Draw the corresponding frequency polygon.
(d) Create the ogive plot.
(e) Comment on the shape, center, and spread of this distribution as it relates to the original data (Do not actually calculate any specific statistics).
(f) Add up the relative frequency column. What is the total? What should it be? Why might the total not be what you would expect?
(g) There is a portion of your ogive plot that should be horizontal. Explain what is happening with the data in this area that creates this horizontal section.
(h) What does the steepest part of an ogive plot tell you about the distribution?
Review Answers
- There are tally marks, which means that the remaining students must have been “2”s.
Number of Plastic Beverage Bottles per Week | Tally | Frequency |
---|---|---|
Liters of Water per Person | Frequency |
---|---|
Figure: Completed Frequency Table for World Bottled Water Consumption Data (1999)
(b)
Student answers may vary if they choose a different bin width for their histogram.
(c) This data set does appear to have some characteristics of being skewed right. There also appear to be two distinct mounds. This shape is called “bimodal”.
- (a)
Class | Frequency | Relative Frequency (%) | Cumulative Frequency | Relative Cumulative Frequency (%) |
---|---|---|---|---|
(b)
(c)
(d)
(e) This distribution is skewed to the right, which means that most of the materials are concentrated in the area of saving up to BTU’s by using recycled materials and there are just a few materials (copper wire, carpet, and aluminum cans) that use inordinately large amounts of energy to create from raw materials.
(f) . The total should be all of the data, or . The reason for the difference is rounding error.
(g) The horizontal portion of the ogive is where there is no data present, so the amount of accumulated data does not change.
(h) Because the ogive shows the increase in the percentage of data, the steepest section (in this case between and ) is where most of the data is located and the accumulation of data is therefore changing at the most rapid pace.
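The rounding effect mentioned in answer (f) is easy to reproduce. The sketch below uses a deliberately simple hypothetical table of three classes with one data value each; it is not the recycling data, and it only illustrates why rounded relative frequencies may not add up to exactly 100%.

```python
# Hypothetical table: three classes with one data value each
# (not the recycling data).
frequencies = [1, 1, 1]
total = sum(frequencies)

# Round each relative frequency to the nearest tenth of a percent.
rounded = [round(100 * f / total, 1) for f in frequencies]

print(rounded)                  # each entry is 33.3
print(round(sum(rounded), 1))   # 99.9, not 100.0
```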