
# Histograms

## Visual representation of data on a histogram and the many graphs associated with this display

In this Concept, you will learn about displaying and interpreting data using two kinds of graphs: histograms and ogives.

### Watch This

For a description of how to make a histogram from given data (14.0), see onlinestatbook, Graphing Distributions: Histograms (6:21).

Citation: Online Statistics Education: A Multimedia Course of Study (http://onlinestatbook.com/). Project Leader: David M. Lane, Rice University.

### Guidance

The earth has seemed so large in scope for thousands of years that it is only recently that many people have begun to take seriously the idea that we live on a planet of limited and dwindling resources. This is something that residents of the Galapagos Islands are also beginning to understand. Because of its isolation and lack of resources to support large, modernized populations of humans, the problems that we face on a global level are magnified in the Galapagos. Basic human resources such as water, food, fuel, and building materials must all be brought in to the islands. More problematically, the waste products must either be disposed of in the islands, or shipped somewhere else at a prohibitive cost. As the human population grows exponentially, the Islands are confronted with the problem of what to do with all the waste. In most communities in the United States, it is easy for many to put out the trash on the street corner each week and perhaps never worry about where that trash is going. In the Galapagos, the desire to protect the fragile ecosystem from the impacts of human waste is more urgent and is resulting in a new focus on renewing, reducing, and reusing materials as much as possible. There have been recent positive efforts to encourage recycling programs.

It is not easy to bury tons of trash in solid volcanic rock. The sooner we realize that we are in the same position of limited space and that we have a need to preserve our global ecosystem, the more chance we have to save not only the uniqueness of the Galapagos Islands, but that of our own communities. All of the information in this chapter is focused around the issues and consequences of our recycling habits, or lack thereof!

Water, Water, Everywhere!

Bottled water consumption worldwide has grown, and continues to grow, at a phenomenal rate. According to the Earth Policy Institute, 154 billion liters (about 41 billion gallons) were consumed in 2004. While there are places in the world where safe water supplies are unavailable, most of the growth in consumption has been due to other reasons. The largest consumer of bottled water is the United States, which arguably could be the country with the best access to safe, convenient, and reliable sources of tap water. The large volume of toxic waste that is generated by the plastic bottles and the small fraction of the plastic that is recycled create a considerable environmental hazard. In addition, huge volumes of carbon emissions are created when these bottles are manufactured using oil and transported great distances by oil-burning vehicles.

#### Example A

Take an informal poll of your class. Ask each member of the class, on average, how many beverage bottles they use in a week. Once you collect this data, the first step is to organize it so it is easier to understand. A frequency table is a common starting point. Frequency tables simply display each value of the variable, and the number of occurrences (the frequency) of each of those values. In this example, the variable is the number of plastic beverage bottles of water consumed each week.

Consider the following raw data:

6, 4, 7, 7, 8, 5, 3, 6, 8, 6, 5, 7, 7, 5, 2, 6, 1, 3, 5, 4, 7, 4, 6, 7, 6, 6, 7, 5, 4, 6, 5, 3

Figure: Imaginary Class Data on Water Bottle Usage

Here are the correct frequencies using the imaginary data presented above:

Completed Frequency Table for Water Bottle Data

| Number of Plastic Beverage Bottles per Week | Frequency |
| --- | --- |
| 1 | 1 |
| 2 | 1 |
| 3 | 3 |
| 4 | 4 |
| 5 | 6 |
| 6 | 8 |
| 7 | 7 |
| 8 | 2 |

When creating a frequency table, it is often helpful to use tally marks as a running total to avoid missing a value or over-representing another.

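Frequency table using tally marks

| Number of Plastic Beverage Bottles per Week | Tally | Frequency |
| --- | --- | --- |
| 1 | ${\color{red} \vert }$ | 1 |
| 2 | ${\color{red} \vert }$ | 1 |
| 3 | ${\color{red} \vert \vert \vert }$ | 3 |
| 4 | ${\color{red} \vert \vert \vert \vert }$ | 4 |
| 5 | ${\color{red} \bcancel{ \vert \vert \vert \vert } \ \vert }$ | 6 |
| 6 | ${\color{red} \bcancel{ \vert \vert \vert \vert } \ \vert \vert \vert }$ | 8 |
| 7 | ${\color{red} \bcancel{ \vert \vert \vert \vert } \ \vert \vert }$ | 7 |
| 8 | ${\color{red} \vert \vert }$ | 2 |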
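
The tallying described above can also be done by a short program. The following is a minimal sketch in Python, assuming the raw class data from Example A; the variable names `bottles` and `frequency` are chosen only for this illustration.

```python
from collections import Counter

# Raw class data from Example A (bottles used per week by each student).
bottles = [6, 4, 7, 7, 8, 5, 3, 6, 8, 6, 5, 7, 7, 5, 2, 6,
           1, 3, 5, 4, 7, 4, 6, 7, 6, 6, 7, 5, 4, 6, 5, 3]

# Counter tallies how many times each value occurs.
frequency = Counter(bottles)

# Print the table in order of the variable, with a simple text tally.
for value in sorted(frequency):
    count = frequency[value]
    print(f"{value}  {'|' * count:<8}  {count}")
```

Adding up the frequency column is a quick check that no value was missed: the total should equal the number of students polled.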

The following data set shows the countries in the world that consume the most bottled water per person per year.

| Country | Liters of Bottled Water Consumed per Person per Year |
| --- | --- |
| Italy | 183.6 |
| Mexico | 168.5 |
| United Arab Emirates | 163.5 |
| Belgium and Luxembourg | 148.0 |
| France | 141.6 |
| Spain | 136.7 |
| Germany | 124.9 |
| Lebanon | 101.4 |
| Switzerland | 99.6 |
| Cyprus | 92.0 |
| United States | 90.5 |
| Saudi Arabia | 87.8 |
| Czech Republic | 87.1 |
| Austria | 82.1 |
| Portugal | 80.3 |

Figure: Bottled Water Consumption per Person in Leading Countries in 2004.

These data values have been measured at the ratio level. There is some flexibility required in order to create meaningful and useful categories for a frequency table. The values range from 80.3 liters to 183.6 liters. By examining the data, it seems appropriate for us to create our frequency table in groups of 10 liters. We will skip the tally marks in this case, because the data values are already in numerical order, and it is easy to see how many are in each classification.

A bracket, '[' or ']', indicates that the endpoint of the interval is included in the class. A parenthesis, '(' or ')', indicates that the endpoint is not included. It is common practice in statistics to place a value that falls on the border between two classes in the higher of the two classes. For example, $[80-90)$ means this class includes everything from 80 up to, but not including, 90. The value 90 is included in the next class, $[90-100)$.

| Liters per Person | Frequency |
| --- | --- |
| $[80-90)$ | 4 |
| $[90-100)$ | 3 |
| $[100-110)$ | 1 |
| $[110-120)$ | 0 |
| $[120-130)$ | 1 |
| $[130-140)$ | 1 |
| $[140-150)$ | 2 |
| $[150-160)$ | 0 |
| $[160-170)$ | 2 |
| $[170-180)$ | 0 |
| $[180-190)$ | 1 |

Figure: Completed Frequency Table for World Bottled Water Consumption Data (2004)
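
To see how the class intervals work in practice, here is a small Python sketch that sorts the 2004 consumption values into classes of width 10, where each class includes its left endpoint and excludes its right endpoint. The helper `class_start` is a hypothetical name introduced only for this example.

```python
import math
from collections import Counter

# 2004 consumption values (liters per person per year) from the table above.
liters_2004 = [183.6, 168.5, 163.5, 148.0, 141.6, 136.7, 124.9, 101.4,
               99.6, 92.0, 90.5, 87.8, 87.1, 82.1, 80.3]

WIDTH = 10  # class width chosen in the text


def class_start(value, width=WIDTH):
    """Lower endpoint of the class [start, start + width) containing value.

    Hypothetical helper for illustration. Rounding down means a value on a
    border, such as 90, lands in the higher class [90-100).
    """
    return math.floor(value / width) * width


counts = Counter(class_start(v) for v in liters_2004)

# List every class from [80-90) to [180-190), including the empty ones.
for start in range(80, 190, WIDTH):
    print(f"[{start}-{start + WIDTH})  {counts[start]}")
```

Looping over every class start, rather than only the classes that appear in the data, keeps the empty classes in the table with a frequency of 0.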

Histograms

Once you can create a frequency table, you are ready to create your first graphical representation, called a histogram. Let's revisit our data about student bottled beverage habits.

Completed Frequency Table for Water Bottle Data

| Number of Plastic Beverage Bottles per Week | Frequency |
| --- | --- |
| 1 | 1 |
| 2 | 1 |
| 3 | 3 |
| 4 | 4 |
| 5 | 6 |
| 6 | 8 |
| 7 | 7 |
| 8 | 2 |

Here is the same data in a histogram:

In this case, the horizontal axis represents the variable (number of plastic bottles of water consumed), and the vertical axis is the frequency, or count. Each vertical bar represents the number of people in one class, or range, of bottle counts. For example, in the range of consuming $[1-2)$ bottles, there is only one person, so the height of the bar is at 1. We can see from the graph that the most common class of bottles used by people each week is the $[6-7)$ range, or six bottles per week.

A histogram is for numerical data. With histograms, the different sections are referred to as bins. Think of a column, or bin, as a vertical container that collects all the data for that range of values. If a value occurs on the border between two bins, it is commonly agreed that this value will go in the larger class, or the bin to the right. It is important when drawing a histogram to be certain that there are enough bins so that the last data value is included. Often this means you have to extend the horizontal axis beyond the value of the last data point. In this example, if we had stopped the graph at 8, we would have missed that data, because the 8's actually appear in the bin between 8 and 9. Very often, when you see histograms in newspapers, magazines, or online, they may instead label the midpoint of each bin. Some graphing software will also label the midpoint of each bin, unless you specify otherwise.
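
As a rough illustration of the bin convention above, the following Python sketch counts the class bottle data into bins of width 1 and draws a text histogram. The helper name `bin_counts` is made up for this example, and it assumes whole-number data and bin widths.

```python
# Class bottle data from Example A.
bottles = [6, 4, 7, 7, 8, 5, 3, 6, 8, 6, 5, 7, 7, 5, 2, 6,
           1, 3, 5, 4, 7, 4, 6, 7, 6, 6, 7, 5, 4, 6, 5, 3]


def bin_counts(data, low, high, width=1):
    """Count values into bins [edge, edge + width) covering [low, high).

    Hypothetical helper for illustration; assumes integer data and width.
    A value on a border goes to the bin on its right.
    """
    edges = list(range(low, high, width))
    counts = [0] * len(edges)
    for x in data:
        if low <= x < high:  # values outside the axis are dropped
            counts[(x - low) // width] += 1
    return edges, counts


# Axis extended to 9 so that the bin [8-9) exists and holds the 8's.
edges, counts = bin_counts(bottles, 1, 9)
for edge, count in zip(edges, counts):
    print(f"[{edge}-{edge + 1})  {'#' * count}")

# Stopping the axis at 8 leaves the 8's without a bin.
_, truncated = bin_counts(bottles, 1, 8)
print(sum(counts), "values shown with the axis to 9;",
      sum(truncated), "with the axis to 8")
```

Comparing the two totals shows why the horizontal axis has to extend past the largest data value.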

On the Web

http://illuminations.nctm.org/ActivityDetail.aspx?ID=78 Here you can change the bin width and explore how it affects the shape of the histogram.

Relative Frequency Histogram

A relative frequency histogram is just like a regular histogram, but instead of labeling the frequencies on the vertical axis, we use the percentage of the total data that is present in that bin. For example, there is only one data value in the first bin. This represents $\frac{1}{32}$ , or approximately 3%, of the total data. Thus, the vertical bar for the bin extends upward to 3%.
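
The relative frequency of each bin is its frequency divided by the total number of data values, expressed as a percentage. A minimal Python sketch, assuming the frequency table for the bottle data (the names `frequency` and `relative` are illustrative only):

```python
# Frequency table for the bottle data: value -> number of students.
frequency = {1: 1, 2: 1, 3: 3, 4: 4, 5: 6, 6: 8, 7: 7, 8: 2}

total = sum(frequency.values())  # 32 students in all

# Relative frequency, as a percentage of the total data, for each bin.
relative = {value: 100 * count / total for value, count in frequency.items()}

for value, pct in relative.items():
    print(f"{value}  {count_label}" if False else f"{value}  {pct:.3f}%")
```

The unrounded percentages add up to exactly 100. Once they are rounded for display, the total of the rounded values can differ slightly from 100.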

Frequency Polygons

A frequency polygon is similar to a histogram, but instead of using bins, a polygon is created by plotting the frequencies and connecting those points with a series of line segments.

To create a frequency polygon for the bottle data, we first find the midpoint of each class, plot a point at that midpoint at the height of the bin's frequency, and then connect the points with line segments. To close the polygon against the horizontal axis, also plot a point at zero frequency at the midpoint of the empty class just above the maximum of the data, and another at the midpoint of the empty class just below the minimum.

Here is a frequency polygon constructed directly from the previously-shown histogram:

Here is the frequency polygon in finished form:

Frequency polygons are helpful in showing the general overall shape of a distribution of data. They can also be useful for comparing two sets of data. Imagine how confusing two histograms would look graphed on top of each other!
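
The construction of the frequency polygon for the bottle data can be sketched in Python as a list of points to connect. This is only an illustration under the assumption that every bin has width 1, so the midpoint of the bin starting at a value is that value plus 0.5; the names `points` and `polygon` are made up for the example.

```python
# Frequency table for the bottle data: left edge of bin -> frequency.
frequency = {1: 1, 2: 1, 3: 3, 4: 4, 5: 6, 6: 8, 7: 7, 8: 2}

WIDTH = 1  # every bin is [value, value + 1)

# One point per bin: (midpoint of the bin, frequency of the bin).
points = [(value + WIDTH / 2, count)
          for value, count in sorted(frequency.items())]

# Close the polygon with zero-frequency points at the midpoints of the
# empty bins just below the minimum and just above the maximum.
first_mid = points[0][0]
last_mid = points[-1][0]
polygon = [(first_mid - WIDTH, 0)] + points + [(last_mid + WIDTH, 0)]

for x, y in polygon:
    print(x, y)
```

Joining consecutive points in `polygon` with line segments gives the finished shape, which starts and ends on the horizontal axis.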

#### Example B

It would be interesting to compare bottled water consumption in two different years. Two frequency polygons would help give an overall picture of how the years are similar, and how they are different. In the following graph, two frequency polygons, one representing 1999, and the other representing 2004, are overlaid. 1999 is in red, and 2004 is in green.

It appears there was a shift to the right in all the data, which is explained by realizing that all of the countries have significantly increased their consumption. The first peak in the lower-consuming countries is almost identical in the two frequency polygons, but it increased by 20 liters per person in 2004. In 1999, there was a middle peak, but that group shifted significantly to the right in 2004 (by between 40 and 60 liters per person). The frequency polygon is the first type of graph we have learned about that makes this type of comparison easier.
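
The claim that every country increased its consumption can be checked directly against the two data tables in this Concept (the 2004 table above and the 1999 table in the Explore More section). The following Python sketch is one way to do that; the dictionary names are chosen only for this example.

```python
# Liters of bottled water per person per year, from the two tables.
liters_1999 = {
    "Italy": 154.8, "Mexico": 117.0, "United Arab Emirates": 109.8,
    "Belgium and Luxembourg": 121.9, "France": 117.3, "Spain": 101.8,
    "Germany": 100.7, "Lebanon": 67.8, "Switzerland": 90.1, "Cyprus": 67.4,
    "United States": 63.6, "Saudi Arabia": 75.3, "Czech Republic": 62.1,
    "Austria": 74.6, "Portugal": 70.4,
}
liters_2004 = {
    "Italy": 183.6, "Mexico": 168.5, "United Arab Emirates": 163.5,
    "Belgium and Luxembourg": 148.0, "France": 141.6, "Spain": 136.7,
    "Germany": 124.9, "Lebanon": 101.4, "Switzerland": 99.6, "Cyprus": 92.0,
    "United States": 90.5, "Saudi Arabia": 87.8, "Czech Republic": 87.1,
    "Austria": 82.1, "Portugal": 80.3,
}

# Change from 1999 to 2004 for each country, rounded to one decimal place.
increase = {country: round(liters_2004[country] - liters_1999[country], 1)
            for country in liters_2004}

for country, change in sorted(increase.items(), key=lambda item: item[1]):
    print(f"{country:<24} +{change}")

print("Every country increased:", all(c > 0 for c in increase.values()))
```

Sorting by the size of the change makes it easy to see which countries account for the largest shifts to the right.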

Cumulative Frequency Histograms and Ogive Plots

Very often, it is helpful to know how the data accumulate over the range of the distribution. To do this, we will add to our frequency table by including the cumulative frequency, which is how many of the data points are in all the classes up to and including a particular class.

| Number of Plastic Beverage Bottles per Week | Frequency | Cumulative Frequency |
| --- | --- | --- |
| 1 | 1 | 1 |
| 2 | 1 | 2 |
| 3 | 3 | 5 |
| 4 | 4 | 9 |
| 5 | 6 | 15 |
| 6 | 8 | 23 |
| 7 | 7 | 30 |
| 8 | 2 | 32 |

Figure: Cumulative Frequency Table for Bottle Data

#### Example C

The cumulative frequency for 5 bottles per week is 15, because 15 students consumed 5 or fewer bottles per week. Notice that the cumulative frequency for the last class is the same as the total number of students in the data. This should always be the case.
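
A cumulative frequency column is a running total of the frequency column. Here is a minimal Python sketch using the bottle data; the list names are assumptions made for this illustration.

```python
from itertools import accumulate

# Bottle data: values of the variable and their frequencies, in order.
values = [1, 2, 3, 4, 5, 6, 7, 8]
freqs = [1, 1, 3, 4, 6, 8, 7, 2]

# accumulate() yields the running total after each class.
cumulative = list(accumulate(freqs))

for value, freq, running in zip(values, freqs, cumulative):
    print(f"{value}  {freq}  {running}")

# The last cumulative frequency must equal the total number of students.
print("Total matches:", cumulative[-1] == sum(freqs))
```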

If we drew a histogram of the cumulative frequencies, or a cumulative frequency histogram, it would look as follows:

A relative cumulative frequency histogram would be the same, except that the vertical bars would represent the relative cumulative frequencies of the data:

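| Number of Plastic Beverage Bottles per Week | Frequency | Cumulative Frequency | Relative Cumulative Frequency (%) |
| --- | --- | --- | --- |
| 1 | 1 | 1 | 3.1 |
| 2 | 1 | 2 | 6.3 |
| 3 | 3 | 5 | 15.6 |
| 4 | 4 | 9 | 28.1 |
| 5 | 6 | 15 | 46.9 |
| 6 | 8 | 23 | 71.9 |
| 7 | 7 | 30 | 93.8 |
| 8 | 2 | 32 | 100 |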

Figure: Relative Cumulative Frequency Table for Bottle Data

Remembering what we did with the frequency polygon, we can remove the bins to create a new type of plot. In the frequency polygon, we connected the midpoints of the bins. In a relative cumulative frequency plot, we use the point on the right side of each bin.

The reason for this should make sense: when we read this plot, each point should represent the percentage of the total data that lies below the value where it is plotted, which matches the relative cumulative frequency of the bin to its left in the table. For example, the point that is plotted at 4 corresponds to 15.6%, because that is the percentage of the data that is less than or equal to 3. It does not include the 4's, because they are in the bin to the right of that point. This is also why we plot a point at 1 on the horizontal axis and at 0% on the vertical axis: none of the data is lower than 1. Similarly, all of the data is below 9, so the last point is plotted at 9 and 100%. Here is the final version of the plot:

This plot is commonly referred to as an ogive plot. The name ogive comes from a particular pointed arch originally present in Arabic architecture and later incorporated in Gothic cathedrals. Here is a picture of a cathedral in Ecuador with a close-up of an ogive-type arch:

If a distribution is symmetric and mound shaped, then its ogive plot will look just like the shape of one half of such an arch.
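
The points of the ogive plot for the bottle data can be computed as follows. This Python sketch assumes bins of width 1, plots each relative cumulative frequency at the right edge of its bin, and starts at 0% at the left edge of the first bin. The helper `round_half_up` is a hypothetical name, included so that the rounding matches the table above.

```python
import math
from itertools import accumulate

values = [1, 2, 3, 4, 5, 6, 7, 8]   # left edge of each bin
freqs = [1, 1, 3, 4, 6, 8, 7, 2]    # frequency of each bin
total = sum(freqs)


def round_half_up(x, places=1):
    """Round x to the given number of places, with halves rounded up.

    Hypothetical helper for illustration; intended for non-negative x.
    """
    scale = 10 ** places
    return math.floor(x * scale + 0.5) / scale


# Start at 0% at the left edge of the first bin.
ogive = [(values[0], 0.0)]

# Then one point at the right edge of each bin.
for value, running in zip(values, accumulate(freqs)):
    ogive.append((value + 1, round_half_up(100 * running / total)))

for x, pct in ogive:
    print(f"{x}  {pct}%")
```

Reading the list, the point at 4 carries the percentage of students who used 3 or fewer bottles, and the final point sits at 9 and 100%.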

In the first chapter, we introduced measures of center and spread as important descriptors of a data set. The shape of a distribution of data is very important as well. Shape, center, and spread should always be your starting point when describing a data set.

Referring to our imaginary student poll on using plastic beverage containers, we notice that the data are spread out from 1 to 8, so the bins of the histogram run from 1 to 9. The graph for the data illustrates this spread, and the range quantifies it. Look back at the graph and notice that there is a large concentration of students in the 5, 6, and 7 region. This would lead us to believe that the center of this data set is somewhere in this area. We use the mean and/or median to measure central tendency, but it is also important that you see that the center of the distribution is near the large concentration of data. Seeing this is a matter of looking at the shape of the distribution.

Shape is harder to describe with a single statistical measure, so we will describe it in less quantitative terms. A very important feature of this data set, as well as many that you will encounter, is that it has a single large concentration of data that appears like a mountain. A data set that is shaped in this way is typically referred to as mound-shaped. Mound-shaped data will usually look like one of the following three pictures:

Think of these graphs as frequency polygons that have been smoothed into curves. In statistics, we refer to these graphs as density curves. The most important feature of a density curve is symmetry. The first density curve above is symmetric and mound-shaped. Notice the second curve is mound-shaped, but the center of the data is concentrated on the left side of the distribution. The right side of the data is spread out across a wider area. This type of distribution is referred to as skewed right. It is the direction of the long, spread out section of data, called the tail, that determines the direction of the skewing. For example, in the third curve, the left tail of the distribution is stretched out, so this distribution is skewed left. Our student bottle data set has this skewed-left shape.
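
One common rule of thumb, which this Concept does not state and which does not hold for every data set, is that a long left tail tends to pull the mean below the median. The following Python sketch checks whether the bottle data behave this way; treat it as a quick numerical companion to the visual judgment of shape, not as a definition of skewness.

```python
from statistics import mean, median

# Class bottle data from Example A.
bottles = [6, 4, 7, 7, 8, 5, 3, 6, 8, 6, 5, 7, 7, 5, 2, 6,
           1, 3, 5, 4, 7, 4, 6, 7, 6, 6, 7, 5, 4, 6, 5, 3]

center_mean = mean(bottles)
center_median = median(bottles)

print("mean:  ", center_mean)
print("median:", center_median)

# Rule of thumb (an assumption, not a guarantee): mean < median
# suggests the longer tail is on the left.
print("mean below median:", center_mean < center_median)
```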

### Guided Practice

There is some question as to whether caloric content listed on food products is under-reported. Look at the following table of kinds of food products (Food) and the percentage difference between measured calories and labeled calories per item (Per Item).

Caloric Data on food items.

| Food | Per Item |
| --- | --- |
| noodles and alfredo sauce | 2 |
| cheese curls | -28 |
| green beans | -6 |
| mixed fruits | 8 |
| cereal | 6 |
| fig bars | -1 |
| crumb cake | 13 |
| crackers | 15 |
| blue cheese dressing | -4 |
| imperial chicken | -4 |
| vegetable soup | -18 |
| cheese | 10 |
| chocolate pudding | 5 |
| sausage biscuit | 3 |
| lasagna | -7 |
| lentil soup | -0.5 |
| pasta with shrimp and tomato sauce | -10 |
| chocolate mousse | 6 |
| meatless sandwich | 41 |
| lemon pound cake | 2 |
| banana cake | 25 |
| brownie | 39 |
| butterscotch bar | 16.5 |
| blondie | 17 |
| oat bran snack bar | 28 |
| granola bar | -3 |
| apricot bar | 14 |
| carrot muffin | 42 |
| chinese chicken | 15 |
| gyoza | 60 |
| jelly diet candy-reds flavor | 250 |
| jelly diet candy-fruit flavor | 145 |
| Florentine manicotti | 6 |
| egg foo young | 80 |

Draw a histogram of the percentage difference between observed and reported calories per item.

Solution:

First, break up the percentage differences in calories per item into intervals. Glancing at the data, an interval length of 30 looks like it will work well, with the first interval running from -30 to 0. Count how many food items fall into each interval, and then graph this frequency as the height of the bar on the vertical axis. For example, there are 10 food items that have a percentage difference in calories between -30 and 0, so we draw a bar with a height of 10 for that interval.
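
The counting step in this solution can be sketched in Python. The example below assumes intervals of length 30 starting at -30, with each interval including its left endpoint and excluding its right endpoint; the helper `interval_start` is a hypothetical name used only here.

```python
import math
from collections import Counter

# Percentage differences per item, in the order listed in the table.
percent_diff = [2, -28, -6, 8, 6, -1, 13, 15, -4, -4, -18, 10, 5, 3, -7,
                -0.5, -10, 6, 41, 2, 25, 39, 16.5, 17, 28, -3, 14, 42, 15,
                60, 250, 145, 6, 80]

START = -30  # left endpoint of the first interval
WIDTH = 30   # interval length chosen in the solution


def interval_start(value, start=START, width=WIDTH):
    """Left endpoint of the interval [a, a + width) that contains value."""
    return start + math.floor((value - start) / width) * width


counts = Counter(interval_start(v) for v in percent_diff)

# Show every interval up to the one holding the largest value.
for a in range(START, 270, WIDTH):
    print(f"[{a}, {a + WIDTH})  {'#' * counts[a]}  {counts[a]}")
```

The bar heights for the histogram are the counts printed for each interval, including the intervals that contain no food items.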

### Explore More

1. Lois was gathering data on the plastic beverage bottle consumption habits of her classmates, but she ran out of time as class was ending. When she arrived home, something had spilled in her backpack and smudged the data for the 2's. Fortunately, none of the other values was affected, and she knew there were 30 total students in the class. Complete her frequency table.

| Number of Plastic Beverage Bottles per Week | Tally | Frequency |
| --- | --- | --- |
| 1 | ${\color{red} \vert \vert}$ | |
| 2 | | |
| 3 | ${\color{red} \vert \vert \vert}$ | |
| 4 | ${\color{red} \vert \vert }$ | |
| 5 | ${\color{red} \vert \vert \vert }$ | |
| 6 | ${\color{red}\bcancel{ \vert \vert \vert \vert } \ \vert \vert }$ | |
| 7 | ${\color{red}\bcancel{\vert \vert \vert \vert }\ \vert }$ | |
| 8 | ${\color{red} \vert }$ | |

1. The following frequency table contains exactly one data value that is a positive multiple of ten. What must that value be?

| Class | Frequency |
| --- | --- |
| $[0 - 5)$ | 4 |
| $[5 - 10)$ | 0 |
| $[10 - 15)$ | 2 |
| $[15 - 20)$ | 1 |
| $[20 - 25)$ | 0 |
| $[25 - 30)$ | 3 |
| $[30 - 35)$ | 0 |
| $[35 - 40)$ | 1 |

a. 10

b. 20

c. 30

d. 40

e. There is not enough information to determine the answer.

1. The following table includes the data from the same group of countries from the earlier bottled water consumption example, but is for the year 1999, instead.

| Country | Liters of Bottled Water Consumed per Person per Year |
| --- | --- |
| Italy | 154.8 |
| Mexico | 117.0 |
| United Arab Emirates | 109.8 |
| Belgium and Luxembourg | 121.9 |
| France | 117.3 |
| Spain | 101.8 |
| Germany | 100.7 |
| Lebanon | 67.8 |
| Switzerland | 90.1 |
| Cyprus | 67.4 |
| United States | 63.6 |
| Saudi Arabia | 75.3 |
| Czech Republic | 62.1 |
| Austria | 74.6 |
| Portugal | 70.4 |

Figure: Bottled Water Consumption per Person in Leading Countries in 1999.

a. Create a frequency table for this data set.

b. Create the histogram for this data set.

c. How would you describe the shape of this data set?

1. The following table shows the potential energy that could be saved by manufacturing each type of material using the maximum percentage of recycled materials, as opposed to using all new materials.

| Manufactured Material | Energy Saved (millions of BTUs per ton) |
| --- | --- |
| Aluminum Cans | 206 |
| Copper Wire | 83 |
| Steel Cans | 20 |
| LDPE Plastics (e.g., trash bags) | 56 |
| PET Plastics (e.g., beverage bottles) | 53 |
| HDPE Plastics (e.g., household cleaner bottles) | 51 |
| Personal Computers | 43 |
| Carpet | 106 |
| Glass | 2 |
| Corrugated Cardboard | 15 |
| Newspaper | 16 |
| Phone Books | 11 |
| Magazines | 11 |
| Office Paper | 10 |

Amount of energy saved by manufacturing different materials using the maximum percentage of recycled material as opposed to using all new material. Source: National Geographic, January 2008. Volume 213 No., pg 82-83.

a. Construct a frequency table, including the actual frequency, the relative frequency (round to the nearest tenth of a percent), and the relative cumulative frequency. Assume a bin width of 25 million BTUs.

b. Create a relative frequency histogram from your table in part a.

c. Draw the corresponding frequency polygon.

d. Create the ogive plot.

e. Comment on the shape, center, and spread of this distribution as it relates to the original data. (Do not actually calculate any specific statistics).

f. Add up the relative frequency column. What is the total? What should it be? Why might the total not be what you would expect?

g. There is a portion of your ogive plot that should be horizontal. Explain what is happening with the data in this area that creates this horizontal section.

h. What does the steepest part of an ogive plot tell you about the distribution?

1. The figure above is a histogram of the salaries of CEOs.

a. Are there any outliers? For any outlier, give a value for the salary and explain why you think it is an outlier.

b. What is the salary that occurs most often? Roughly, how many CEOs report having this salary?

c. Roughly, how many CEOs report having a salary of \$500,000?

2. Forbes, November 8, 1993, “America’s Best Small Companies” provided data on the salaries (including bonuses) and ages of the chief executive officers of small companies. Below is a table of the ages of the CEOs of the first 60 ranked companies. Create a histogram for the age of the CEO. Provide a summary of the data set based on your histogram.

| Age of CEO | Frequency |
| --- | --- |
| 30 - 35 | 2 |
| 36 - 40 | 3 |
| 41 - 45 | 6 |
| 46 - 50 | 14 |
| 51 - 55 | 12 |
| 56 - 60 | 12 |
| 61 - 65 | 7 |
| 66 - 70 | 2 |
| 71 - 75 | 2 |

1. What characteristics of a data set make it easier or harder to represent it using frequency tables, histograms, or frequency polygons?
2. What characteristics of a data set make representing it using frequency tables, histograms, frequency polygons, or ogive plots more or less useful?
3. What effects does the shape of a data set have on the statistical measures of center and spread?
4. How do you determine the most appropriate classification to use for a frequency table or the bin width to use for a histogram?

Technology Notes: Histograms on the TI-83/84 Graphing Calculator

To draw a histogram on your TI-83/84 graphing calculator, you must first enter the data in a list. In the home screen, press [2ND][{], and then enter the data separated by commas (see the screen below). When all the data have been entered, press [2ND][}][STO], and then press [2ND][L1][ENTER].

Now you are ready to plot the histogram. Press [2ND][STAT PLOT] to enter the STAT-PLOTS menu. You can plot up to three statistical plots at one time. Choose Plot1. Turn the plot on, change the type of plot to a histogram (see sample screen below), and choose L1. Enter '1' for the Freq by pressing [2ND][A-LOCK] to turn off alpha lock, which is normally on in this menu, because most of the time you would want to enter a variable here. An alternative would be to enter the values of the variables in L1 and the frequencies in L2 as we did in Chapter 1.

Finally, we need to set a window. Press [WINDOW] and enter an appropriate window to display the plot. In this case, 'XSCL' is what determines the bin width. Also notice that the maximum $x$ value needs to go up to 9 to show the last bin, even though the data values stop at 8. Enter all of the values shown below.

Press [GRAPH] to display the histogram. If you press [TRACE] and then use the left or right arrows to trace along the graph, notice how the calculator uses the notation to properly represent the values in each bin.

### Vocabulary

bar chart

A bar chart is a graphic display of categorical variables that uses bars to represent the frequency, or count, of each category.

bar graph

A bar graph is a plot made of bars whose heights (vertical bars) or lengths (horizontal bars) represent the frequencies of each category, with space between each bar.

bell curve

A normal distribution curve is also known as a bell curve.

bell shaped

A bell shaped histogram is a histogram with a prominent ‘mound’ in the center and similar tapering to the left and right.

binning

Binning involves separating your data into separate classes or categories.

bins

Bins are groups of data plotted on the x-axis.

class limits

Class limits are, collectively, the upper and lower limit of an interval.

class mark

A class mark is the middle value, or average, of the class limits.

extreme outliers

Extreme outliers are data points that lie more than 3 times the spread of the middle half of your data (the interquartile range) above the upper quartile or below the lower quartile.

frequency density

The vertical axis of a histogram is labelled frequency density.

frequency distribution table

A frequency distribution table lists the data values, as well as the number of times each value appears in the data set.

frequency polygon

A frequency polygon is a graph constructed by using lines to join the midpoints of each interval, or bin.

Frequency table

A frequency table is a table that summarizes a data set by stating the number of times each value occurs within the data set.

Histogram

A histogram is a display that indicates the frequency of specified ranges of continuous data values on a graph in the form of immediately adjacent bars.

Interval

An interval is a range of data in a data set.

left-skewed distribution

A left-skewed distribution has a peak to the right of the distribution and data values that taper off to the left.

mild outliers

Mild outliers are data points that lie more than 1.5 times the spread of the middle half of your data (the interquartile range) above the upper quartile or below the lower quartile.

multimodal

When a set of data has more than 2 values that occur with the same greatest frequency, the set is called multimodal.

normally distributed

If data is normally distributed, the data set creates a symmetric histogram that looks like a bell.

Outlier

In statistics, an outlier is a data value that is far from other data values.

Range

The range of a data set is the difference between the smallest value and the greatest value in the data set.

relative cumulative frequency plot (ogive plot)

A relative cumulative frequency plot, or ogive plot, shows how the data accumulate across the different values of the variable.

relative frequency histogram

A relative frequency histogram is just like a regular histogram, except that the vertical axis shows the percentage of the total data in each bin instead of the frequency.

right-skewed distribution

A right-skewed distribution has a peak to the left of the distribution and data values that taper off to the right.

shape

The shape of a histogram can lead to valuable conclusions about the trend(s) of the data.

skewed

As with the horizontal skewing of a histogram, stem plots with an obvious skew toward one end or the other tend to indicate an increased number of outliers either less than or greater than the mode.

symmetric

In statistics, a distribution is considered symmetric if its left and right sides are mirror images of each other about the center.

symmetric histogram

For a symmetric histogram, the values of the mean, median, and mode are all the same and are all located at the center of the distribution.

undefined bimodal

An undefined bimodal histogram has a shape that is not specifically defined, but we can note regardless that it is bimodal, having two separated classes or intervals equally representing the maximum frequency of the distribution.

uniform

A uniform shaped histogram indicates data that is very consistent; the frequency of each class is very similar to that of the others.

unimodal

If a data set has only 1 value that occurs most often, the set is called unimodal.