Histograms and Frequency Distributions
Activity: The Effect of Bin Width on the Shape of a Histogram
Describing the shape of a histogram is not always straightforward, and there is not always one correct answer. Usually a combination of terms is needed, producing phrases like “approximately normal with an outlier.” To make matters more ambiguous, the way the data is grouped often affects the shape of the histogram.
Materials: The instructor needs a graphing calculator and an overhead display mechanism. Students can follow along with their graphing calculators for practice, but this will slow down the activity.
- Gather several sets of data from the students. Choose sets that you would expect to have different shapes. Their heights will be approximately normal, but the number of pets that they have will most likely be skewed with outliers. So as not to use class time collecting and entering data, students can provide the information the day before on index cards and the instructor can enter the data into the calculator before class. The lists can be transferred to the students’ calculators with a cord if the students are to follow along.
- Make a histogram with one of the data sets. Start with a set that will be easy to describe. Display the histogram and ask students to describe the shape. Display the same data set with different bin widths and compare the resulting histograms. Sometimes the shape appears to be quite different.
- Repeat with the other sets of data.
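For instructors who want to preview the effect before class, the comparison can be sketched in Python instead of on a graphing calculator. This is only an illustration: the pet counts are made-up data, not collected from students, and the helper name `bin_counts` is hypothetical. It regroups the same data set with bin widths of 1, 4, and 7 and prints a rough text histogram for each.

```python
import numpy as np

# Hypothetical data: number of pets reported by 20 students (not from the text).
pets = np.array([0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 5, 6, 7, 12])


def bin_counts(data, width):
    """Count values in bins of the given integer width, starting at 0.

    Assumes non-negative integer data and an integer bin width.
    """
    # Extend the edges past the largest value so every value lands in a bin.
    edges = np.arange(0, data.max() + width + 1, width)
    counts, edges = np.histogram(data, bins=edges)
    return counts, edges


for width in (1, 4, 7):
    counts, edges = bin_counts(pets, width)
    print(f"bin width {width}")
    for left, count in zip(edges[:-1], counts):
        low = int(left)
        high = low + width - 1  # largest integer that falls in this bin
        print(f"  {low:>2}-{high:<2} | " + "*" * int(count))
```

With the narrowest bins the large value sits visibly apart from the rest of the data, while with the widest bins it is absorbed into a neighboring bar, which is the kind of change in apparent shape the activity asks students to notice.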
Additional Topics for Discussion:
- Find the mean, median, and standard deviation for each data set. Note that in the case of outliers and skewed data, the mean is pulled toward the outlier or tail. How is the standard deviation affected by outliers and skewed data?
- This is a good time to discuss how subjective statistics can be, and how data can be manipulated to seem to support various points of view.
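The first discussion question can be illustrated with a short Python sketch using the standard `statistics` module. The data set below is invented for the illustration, and the single large value added to it is an assumed outlier; the class data will behave similarly but with different numbers.

```python
import statistics

# Hypothetical small data set, and the same set with one large outlier added.
without_outlier = [1, 2, 2, 3, 3, 3, 4, 4, 5]
with_outlier = without_outlier + [23]

for label, data in (("without outlier", without_outlier),
                    ("with outlier", with_outlier)):
    mean = statistics.mean(data)
    median = statistics.median(data)
    stdev = statistics.stdev(data)  # sample standard deviation (divides by n - 1)
    print(f"{label:>15}: mean={mean:.2f}  median={median:.2f}  stdev={stdev:.2f}")
```

In this made-up example the median does not move when the outlier is added, the mean is pulled toward it, and the standard deviation grows several times larger.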
Common Graphs and Data Plots
Technology Project: The Right Graph for the Data (with Excel)
Learning to create these different graphs is not terribly challenging for students. Choosing the best graph or data plot to display a specific set of data is the most important skill students will take away from this lesson. This task also gives the students practice using a powerful and commonly used tool: the spreadsheet.
1. Find three sets of data, each with at least elements. Collect this data yourself or get it from a reliable source. Cite the source of your data or describe your collection method.
- One set of data will be categorical, to be used in a bar and pie graph.
- One set of data will be bivariate, and will be used in a scatter plot. Choose two variables that you believe will have a fairly strong association.
- One set of data will be bivariate with the explanatory variable being time. This data will be displayed in a line plot.
2. Each data set will be entered into columns on a different page of a spreadsheet program. The first cell in each column should contain a title. Select the data and insert the proper graph or plot for each of the three data sets.
3. Write a paragraph describing the plots and graphs using vocabulary from this section of the text. What have you learned about the data sets from the visual displays that you made?
This assignment reverses the typical situation: here students are looking for data to fit a specific graph. It still gives the students the opportunity to match data sets with visual displays. If time allows, have the students present their graphs and plots to the class. Orally describing their work to others will make it more meaningful for them.
Box and Whiskers Plots
Activity: Stem-and-Leaf Plot to Box-and-Whiskers Plot
Students familiarized themselves with stem-and-leaf plots in the previous section. A stem-and-leaf plot is basically a histogram made of the ones digits of the numbers in the data set. Stem-and-leaf plots are a good representation of small data sets because the actual values are retained, while also giving a visual representation of the data. Students have seen variations of this method for representing data many times before, and they understand it well. The box-and-whiskers plot is a new concept; it is based on position instead of value. Students will need some experience with this type of display before they will be able to gain a good understanding of the data from a box-and-whiskers plot.
- Select some stem-and-leaf plots with different shapes that you have made in the past or have seen in the text or elsewhere. (The instructor can make the selection or leave it up to the students.)
- Describe the shape, center, and spread of the data.
- For each stem-and-leaf plot make a box-and-whiskers plot of the same data.
- Does seeing the data displayed in a different way make you want to change your description of the data’s shape, center, and spread?
- Which plot would be easier to make for a large data set? In what circumstances would you choose to use the stem-and-leaf plot? The box-and-whiskers plot?
This activity will give students practice reading the numbers from stem-and-leaf plots and making box-and-whiskers plots. Most importantly, though, it will teach the students how to interpret box-and-whiskers plots and get them thinking about the strengths and weaknesses of the different types of visual displays of data that they have learned to make, so they can choose the best method in any situation.
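The step from one display to the other can be sketched in Python as follows. The scores are invented for illustration, the function names are hypothetical, and the quartiles are computed as the medians of the lower and upper halves of the ordered data (leaving out the middle value when the count is odd). That is one common convention; the text or a particular calculator may use a slightly different one.

```python
import statistics
from collections import defaultdict

# Hypothetical two-digit test scores (not from the text).
scores = [52, 61, 64, 67, 70, 72, 73, 75, 78, 78, 81, 83, 85, 88, 94]


def stem_and_leaf(data):
    """Map each tens digit (stem) to the sorted list of ones digits (leaves)."""
    stems = defaultdict(list)
    for value in sorted(data):
        stems[value // 10].append(value % 10)
    return dict(stems)


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) for a box-and-whiskers plot."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + n % 2:]  # skip the middle value when n is odd
    return (ordered[0], statistics.median(lower), statistics.median(ordered),
            statistics.median(upper), ordered[-1])


plot = stem_and_leaf(scores)
for stem in range(min(plot), max(plot) + 1):
    leaves = " ".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem} | {leaves}")

print("five-number summary:", five_number_summary(scores))
```

The stem-and-leaf output keeps every value, while the five-number summary keeps only five positions in the ordered data, which is the trade-off the discussion questions above ask students to weigh.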
Investigation: The Effect of an Outlier on Measures of Spread
The most important measure of spread used in statistics is by far the standard deviation/variance. Students need to realize that the standard deviation, like the mean, is not resistant to outliers or skewed data.
Use the reservoir data for California given in this lesson.
- Calculate the interquartile range for the data. Remove the outlier from the set and calculate the interquartile range again.
- Calculate the standard deviation of this sample. Remove the outlier and calculate the standard deviation again.
- Calculate the percent change for each measure of spread.
- Use the percent changes calculated above to evaluate how well the interquartile range and standard deviation represent the original data (before the outlier was removed).
- The interquartile range is the better representation of spread for this set of data. In the case of the standard deviation, it does not seem reasonable for one value to have such a large effect on a single summary statistic.
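The calculations in this investigation can be sketched in Python as shown here. The values are invented stand-ins, not the California reservoir data from the lesson, and the helper names are hypothetical. Quartiles are taken as the medians of the lower and upper halves of the ordered data, which is an assumed convention that may differ slightly from the one in the text. Percent change is computed as (new − old) / old × 100.

```python
import statistics

# Hypothetical data with one clear outlier (NOT the reservoir data from the lesson).
original = [10, 12, 13, 15, 16, 18, 19, 21, 60]
trimmed = [x for x in original if x != 60]  # same data with the outlier removed


def iqr(data):
    """Interquartile range, with quartiles as medians of the two halves."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    q1 = statistics.median(ordered[:half])
    q3 = statistics.median(ordered[half + n % 2:])  # skip middle value if n is odd
    return q3 - q1


def percent_change(old, new):
    """Percent change from the old value to the new value."""
    return (new - old) / old * 100


iqr_change = percent_change(iqr(original), iqr(trimmed))
sd_change = percent_change(statistics.stdev(original), statistics.stdev(trimmed))

print(f"IQR:   {iqr(original):.2f} -> {iqr(trimmed):.2f}  ({iqr_change:.1f}%)")
print(f"stdev: {statistics.stdev(original):.2f} -> "
      f"{statistics.stdev(trimmed):.2f}  ({sd_change:.1f}%)")
```

For this made-up set, removing the single outlier changes the standard deviation by a much larger percentage than it changes the interquartile range.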
Many calculations in statistics can only be done with a standard deviation, so the standard deviation must be used even if the data is heavily skewed or there are significant outliers. In these situations statisticians may choose to trim the data set, or leave off outliers. This investigation will help students see why this is a reasonable