Normal Distributions
The Characteristics of a Normal Distribution
Think about the popcorn example in the introduction to this chapter. The amount of popcorn popping starts small, gets larger, and then decreases as time goes by, much like the graph below.
This graph is considered a normal curve, or a "bell curve" because it looks like a bell. Many data sets look similar to this when plotted. But are they all exactly the same? The answer is no; they may be centered at different values, and some may be more spread out than others. While still having this similar bell shape, there may be slight differences in the exact shape.
Shape
If we graphed the data from each of the examples in the introduction, the distributions would be mound-shaped and mostly symmetric. A normal distribution is a perfectly symmetric, mound-shaped distribution.
Because so many real data sets closely approximate a normal distribution, we can use the idealized normal curve to learn a great deal about such data. A real data set is never exactly symmetric, so, as with theoretical probability, a true normal distribution arises only from an infinite collection of data. Also, it is important to note that the normal distribution describes a continuous random variable.
Center
Due to the exact symmetry of a normal curve, the center of a normal distribution, or a data set that approximates a normal distribution, is located at the highest point of the distribution, and all the statistical measures of center we have already studied (the mean, median, and mode) are equal.
It is also important to realize that this center peak divides the data into two equal parts.
Spread
Let’s go back to our popcorn example. The bag advertises a certain time, beyond which you risk burning the popcorn. From experience, the manufacturers know when most of the popcorn will stop popping, but there is still a chance that there are those rare kernels that will require more (or less) time to pop than the time advertised by the manufacturer. The directions usually tell you to stop when the time between popping is a few seconds, but aren’t you tempted to keep going so you don’t end up with a bag full of un-popped kernels? Because this is a real, and not theoretical, situation, there will be a time when the popcorn will stop popping and start burning, but there is always a chance, no matter how small, that one more kernel will pop if you keep the microwave going. In an idealized normal distribution of a continuous random variable, the distribution continues infinitely in both directions.
Because of this infinite spread, the range would not be a useful statistical measure of spread. The most common way to measure the spread of a normal distribution is with the standard deviation, or the typical distance away from the mean. Because of the symmetry of a normal distribution, the standard deviation indicates how far away from the maximum peak the data will be. Here are two normal distributions with the same center (mean):
The first distribution pictured above has a smaller standard deviation, and so more of the data are heavily concentrated around the mean than in the second distribution. Also, in the first distribution, there are fewer data values at the extremes than in the second distribution. Because the second distribution has a larger standard deviation, the data are spread farther from the mean value, with more of the data appearing in the tails.
The Empirical Rule
Because of the similar shape of all normal distributions, we can measure the percentage of data that is a certain distance from the mean no matter what the standard deviation of the data set is. The following graph shows a normal distribution with \begin{align*}\mu=0\end{align*} and \begin{align*}\sigma=1\end{align*}. This curve is called a standard normal curve. In this case, the values of \begin{align*}x\end{align*} represent the number of standard deviations away from the mean.
Notice that vertical lines are drawn at points that are exactly one standard deviation to the left and right of the mean. We have consistently described standard deviation as a measure of the typical distance away from the mean. How much of the data is actually within one standard deviation of the mean? To answer this question, think about the space, or area, under the curve. The entire data set, or 100% of it, is contained under the whole curve. What percentage would you estimate is between the two lines? To help estimate the answer, we can use a graphing calculator. Graph a standard normal distribution over an appropriate window.
Now press [2ND][DISTR], go to the DRAW menu, and choose 'ShadeNorm('. Insert '\begin{align*}-1\end{align*}, 1' after the 'ShadeNorm(' command and press [ENTER]. It will shade the area within one standard deviation of the mean.
The calculator also gives a very accurate estimate of the area. We can see from the rightmost screenshot above that approximately 68% of the area is within one standard deviation of the mean. If we venture to 2 standard deviations away from the mean, how much of the data should we expect to capture? Change the arguments of the 'ShadeNorm(' command to '\begin{align*}-2\end{align*}, 2' to find out.
Notice from the shading that almost all of the distribution is shaded, and the percentage of data is close to 95%. If you were to venture to 3 standard deviations from the mean, 99.7%, or virtually all of the data, is captured, which tells us that very little of the data in a normal distribution is more than 3 standard deviations from the mean.
Notice that the calculator actually makes it look like the entire distribution is shaded because of the limitations of the screen resolution, but as we have already discovered, there is still some area under the curve further out than that. These three approximate percentages, 68%, 95%, and 99.7%, are extremely important and are part of what is called the Empirical Rule.
The Empirical Rule states that the percentages of data in a normal distribution within 1, 2, and 3 standard deviations of the mean are approximately 68%, 95%, and 99.7%, respectively.
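The three percentages can also be checked without a graphing calculator. The following is a minimal sketch in Python, assuming only the standard library; the function name `area_within` is a hypothetical helper, not something from the lesson. It relies on the fact that, for a standard normal variable, the area between \begin{align*}-k\end{align*} and \begin{align*}k\end{align*} equals the error function evaluated at \begin{align*}k/\sqrt{2}\end{align*}.

```python
import math


def area_within(k: float) -> float:
    """Fraction of a normal distribution within k standard deviations of the mean."""
    # For a standard normal variable Z, P(-k < Z < k) = erf(k / sqrt(2)).
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    # Prints roughly 68%, 95%, and 99.7%, the Empirical Rule values.
    print(f"within {k} standard deviation(s): {area_within(k):.1%}")
```

Because every normal distribution has the same shape once it is measured in standard deviations, the same three areas apply no matter what the mean and standard deviation are.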
\begin{align*}z\end{align*}-Scores
A \begin{align*}z\end{align*}-score is a measure of the number of standard deviations a particular data point is away from the mean. For example, let’s say the mean score on a test for your statistics class was an 82, with a standard deviation of 7 points. If your score was an 89, it is exactly one standard deviation to the right of the mean; therefore, your \begin{align*}z\end{align*}-score would be 1. If, on the other hand, you scored a 75, your score would be exactly one standard deviation below the mean, and your \begin{align*}z\end{align*}-score would be \begin{align*}-1\end{align*}. All values that are below the mean have negative \begin{align*}z\end{align*}-scores, while all values that are above the mean have positive \begin{align*}z\end{align*}-scores. A \begin{align*}z\end{align*}-score of \begin{align*}-2\end{align*} would represent a value that is exactly 2 standard deviations below the mean, so in this case, the value would be \begin{align*}82 - 14 = 68\end{align*}.
To calculate a \begin{align*}z\end{align*}-score for which the numbers are not so obvious, you take the deviation and divide it by the standard deviation.
\begin{align*}z = \frac{\text{Deviation}}{\text{Standard Deviation}}\end{align*}
You may recall that deviation is the mean value of the variable subtracted from the observed value, so in symbolic terms, the \begin{align*}z\end{align*}-score would be:
\begin{align*}z=\frac{x-\mu}{\sigma}\end{align*}
As previously stated, since \begin{align*}\sigma\end{align*} is always positive, \begin{align*}z\end{align*} will be positive when \begin{align*}x\end{align*} is greater than \begin{align*}\mu\end{align*} and negative when \begin{align*}x\end{align*} is less than \begin{align*}\mu\end{align*}. A \begin{align*}z\end{align*}-score of zero means that the data value is equal to the mean. The value of \begin{align*}z\end{align*} represents the number of standard deviations the given value of \begin{align*}x\end{align*} is above or below the mean.
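The formula translates directly into a short function. This is an illustrative sketch in Python; the name `z_score` and the check that the standard deviation is positive are assumptions added for the example. It reuses the test with a mean of 82 and a standard deviation of 7.

```python
def z_score(x: float, mu: float, sigma: float) -> float:
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    if sigma <= 0:
        raise ValueError("the standard deviation must be positive")
    return (x - mu) / sigma


print(z_score(89, 82, 7))  # 1.0: one standard deviation above the mean
print(z_score(75, 82, 7))  # -1.0: one standard deviation below the mean
print(z_score(68, 82, 7))  # -2.0: two standard deviations below the mean
```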
Calculating the \begin{align*}z\end{align*}-Score
What is the \begin{align*}z\end{align*}-score for an \begin{align*}A\end{align*} on the test described above, which has a mean score of 82? (Assume that an \begin{align*}A\end{align*} is a 93.)
The \begin{align*}z\end{align*}-score can be calculated as follows:
\begin{align*}z&=\frac{x-\mu}{\sigma}\\ z & = \frac{93-82}{7}\\ z&=\frac{11}{7} \approx 1.57\end{align*}
If we know that the test scores from the last example are distributed normally, then a \begin{align*}z\end{align*}-score can tell us something about how our test score relates to the rest of the class. From the Empirical Rule, we know that about 68% of the students would have scored between a \begin{align*}z\end{align*}-score of \begin{align*}-1\end{align*} and 1, or between a 75 and an 89, on the test. If 68% of the data is between these two values, then that leaves the remaining 32% in the tail areas. Because of symmetry, half of this, or 16%, would be in each individual tail.
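The tail reasoning can be checked numerically. The sketch below assumes Python 3.8 or later, where the standard library provides `statistics.NormalDist`; the variable names are chosen only for illustration. It models the test scores as exactly normal with a mean of 82 and a standard deviation of 7, which is the idealization the paragraph describes.

```python
from statistics import NormalDist

# Idealized model of the test scores: mean 82, standard deviation 7.
scores = NormalDist(mu=82, sigma=7)

middle = scores.cdf(89) - scores.cdf(75)  # area between z = -1 and z = 1
lower_tail = scores.cdf(75)               # area below z = -1
upper_tail = 1 - scores.cdf(89)           # area above z = 1

print(f"between 75 and 89: {middle:.1%}")
print(f"below 75: {lower_tail:.1%}")
print(f"above 89: {upper_tail:.1%}")
```

The two tails come out equal, as symmetry requires, and each is just under the 16% estimated from the Empirical Rule because 68% is itself a rounded figure.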
Application of a \begin{align*}z\end{align*}-Score
On a college entrance exam, the mean was 70, and the standard deviation was 8. If Helen’s \begin{align*}z\end{align*}-score was \begin{align*}-1.5\end{align*}, what was her exam score?
\begin{align*}z&=\frac{x-\mu}{\sigma}\\ \therefore z \cdot \sigma & = x-\mu\\ x&=\mu+z\cdot\sigma\\ x&=70+(-1.5)(8)\\ x&=58\end{align*}
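The rearranged formula \begin{align*}x=\mu+z\cdot\sigma\end{align*} can be sketched as a small Python function. The name `value_from_z` is a hypothetical helper used only for this illustration.

```python
def value_from_z(z: float, mu: float, sigma: float) -> float:
    """Recover the data value x from its z-score: x = mu + z * sigma."""
    return mu + z * sigma


# Helen's exam: mean 70, standard deviation 8, z-score -1.5.
print(value_from_z(-1.5, 70, 8))  # 58.0
```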
Assessing Normality
The best way to determine if a data set approximates a normal distribution is to look at a visual representation. Histograms and box plots can be useful indicators of normality, but they are not always definitive. It is often easier to tell if a data set is not normal from these plots.
If a data set is skewed right, it means that the right tail is significantly longer than the left. Similarly, skewed left means the left tail has more weight than the right. A bimodal distribution, on the other hand, has two modes, or peaks. For instance, with a histogram of the heights of American 30-year-old adults, you will see a bimodal distribution: one mode for males and one mode for females.
There is a plot we can use to determine if a distribution is normal called a normal probability plot or normal quantile plot. To make this plot by hand, first order your data from smallest to largest. Then, determine the quantile of each data point. Finally, using a table of standard normal probabilities, determine the closest z-score for each quantile. Plot these z-scores against the actual data values. To make a normal probability plot using your calculator, enter your data into a list, then use the last type of graph in the STAT PLOT menu, as shown below:
If the data set is normal, then this plot will be perfectly linear. The closer to being linear the normal probability plot is, the more closely the data set approximates a normal distribution.
Look below at the histogram and the normal probability plot for the same data.
The histogram is fairly symmetric and mound-shaped and appears to display the characteristics of a normal distribution. When the \begin{align*}z\end{align*}-scores of the quantiles of the data are plotted against the actual data values, the normal probability plot appears strongly linear, indicating that the data set closely approximates a normal distribution. The following example will allow you to see how a normal probability plot is made in more detail.
Assessing Normality of a Specific Data Set
The following data set tracked high school seniors' involvement in traffic accidents. The participants were asked the following question: “During the last 12 months, how many accidents have you had while you were driving (whether or not you were responsible)?”
Year | Percentage of high school seniors who said they were involved in no traffic accidents |
---|---|
1991 | 75.7 |
1992 | 76.9 |
1993 | 76.1 |
1994 | 75.7 |
1995 | 75.3 |
1996 | 74.1 |
1997 | 74.4 |
1998 | 74.4 |
1999 | 75.1 |
2000 | 75.1 |
2001 | 75.5 |
2002 | 75.5 |
2003 | 75.8 |
Figure: Percentage of high school seniors who said they were involved in no traffic accidents. Source: Sourcebook of Criminal Justice Statistics.
Here is a histogram and a box plot of this data:
The histogram appears to show a roughly mound-shaped and symmetric distribution. The box plot does not appear to be significantly skewed, but the various sections of the plot also do not appear to be overly symmetric, either. In the following chart, the data has been reordered from smallest to largest, the quantiles have been determined, and the closest corresponding z-scores have been found using a table of standard normal probabilities.
Year | Percentage | Quantile | \begin{align*}z\end{align*}-score |
---|---|---|---|
1996 | 74.1 | \begin{align*}\frac{1}{13}=0.077\end{align*} | \begin{align*}-1.43\end{align*} |
1997 | 74.4 | \begin{align*}\frac{2}{13}=0.154\end{align*} | \begin{align*}-1.02\end{align*} |
1998 | 74.4 | \begin{align*}\frac{3}{13}=0.231\end{align*} | \begin{align*}-0.74\end{align*} |
1999 | 75.1 | \begin{align*}\frac{4}{13}=0.308\end{align*} | \begin{align*}-0.50\end{align*} |
2000 | 75.1 | \begin{align*}\frac{5}{13}=0.385\end{align*} | \begin{align*}-0.29\end{align*} |
1995 | 75.3 | \begin{align*}\frac{6}{13}=0.462\end{align*} | \begin{align*}-0.10\end{align*} |
2001 | 75.5 | \begin{align*}\frac{7}{13}=0.538\end{align*} | \begin{align*}0.10\end{align*} |
2002 | 75.5 | \begin{align*}\frac{8}{13}=0.615\end{align*} | \begin{align*}0.29\end{align*} |
1991 | 75.7 | \begin{align*}\frac{9}{13}=0.692\end{align*} | \begin{align*}0.50\end{align*} |
1994 | 75.7 | \begin{align*}\frac{10}{13}=0.769\end{align*} | \begin{align*}0.74\end{align*} |
2003 | 75.8 | \begin{align*}\frac{11}{13}=0.846\end{align*} | \begin{align*}1.02\end{align*} |
1993 | 76.1 | \begin{align*}\frac{12}{13}=0.923\end{align*} | \begin{align*}1.43\end{align*} |
1992 | 76.9 | \begin{align*}\frac{13}{13}=1\end{align*} | \begin{align*}3.49\end{align*} |
Figure: Table of quantiles and corresponding \begin{align*}z\end{align*}-scores for senior no-accident data.
Here is a plot of the percentages versus the \begin{align*}z\end{align*}-scores of their quantiles, or the normal probability plot:
Remember that you can simplify this process by entering the percentages into list \begin{align*}L_1\end{align*} in your calculator and selecting the normal probability plot option (the last type of plot) in STAT PLOT.
While not perfectly linear, this plot does have a strong linear pattern, and we would, therefore, conclude that the distribution is reasonably normal.
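The steps for building the plot by hand (order the data, find each quantile, then look up the matching \begin{align*}z\end{align*}-score) can be sketched in Python. This is an illustration under stated assumptions: it requires Python 3.8 or later for `statistics.NormalDist`, the function name `normal_plot_points` is hypothetical, and the cap of 3.49 for the final point is borrowed from the largest entry used in the table, because a quantile of exactly 1 has no finite \begin{align*}z\end{align*}-score. Computed values may differ in the last digit from a lookup in a printed table.

```python
from statistics import NormalDist

# Percentages of seniors reporting no accidents, 1991 through 2003.
percentages = [75.7, 76.9, 76.1, 75.7, 75.3, 74.1, 74.4,
               74.4, 75.1, 75.1, 75.5, 75.5, 75.8]


def normal_plot_points(data, max_z=3.49):
    """Return (value, quantile, z) triples for a normal probability plot."""
    ordered = sorted(data)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    points = []
    for i, value in enumerate(ordered, start=1):
        quantile = i / n
        # inv_cdf is undefined at a quantile of exactly 1, so the last point
        # is capped at max_z (an assumption mirroring the table's 3.49).
        z = std_normal.inv_cdf(quantile) if quantile < 1 else max_z
        points.append((value, quantile, z))
    return points


for value, quantile, z in normal_plot_points(percentages):
    print(f"{value:5.1f}  {quantile:.3f}  {z:6.2f}")
```

Plotting the first column against the third gives the normal probability plot; the closer the points fall to a straight line, the more nearly normal the data.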
Technology Note: Investigating the Normal Distribution on a TI-83/84 Graphing Calculator
We can graph a normal curve for a probability distribution on the TI-83/84 calculator. To do so, first press [Y=]. To create a normal distribution, we will draw an idealized curve using something called a density function. The command is called 'normalpdf(', and it is found by pressing [2ND][DISTR][1]. Enter an X to represent the random variable, followed by the mean and the standard deviation, all separated by commas. For this example, choose a mean of 5 and a standard deviation of 1.
Adjust your window to match the following settings and press [GRAPH].
Press [2ND][QUIT] to go to the home screen. We can draw a vertical line at the mean to show it is in the center of the distribution by pressing [2ND][DRAW] and choosing 'Vertical'. Enter the mean, which is 5, and press [ENTER].
Remember that even though the graph appears to touch the \begin{align*}x\end{align*}-axis, it is actually just very close to it.
In your Y= Menu, enter the following to graph 3 different normal distributions, each with a different standard deviation:
This makes it easy to see the change in spread when the standard deviation changes.
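The same comparison can be made numerically. The sketch below assumes Python 3.8 or later, where `statistics.NormalDist(...).pdf` plays the role of the calculator's 'normalpdf(' command. The helper name `peak_and_tail` and the three standard deviations (0.5, 1, and 2) are assumptions chosen for illustration, not the values from the calculator screen.

```python
from statistics import NormalDist


def peak_and_tail(sigma, mean=5.0, offset=2.0):
    """Return the curve's height at the mean and at a point `offset` units away."""
    curve = NormalDist(mu=mean, sigma=sigma)
    return curve.pdf(mean), curve.pdf(mean + offset)


for sigma in (0.5, 1, 2):  # assumed standard deviations
    peak, tail = peak_and_tail(sigma)
    print(f"sigma={sigma}: height at mean={peak:.4f}, height at mean+2={tail:.4f}")
```

As the standard deviation grows, the peak gets lower and the curve is higher out in the tails, which is the change in spread visible on the calculator graph.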
Example
Example 1
On a nationwide math test, the mean was 65 and the standard deviation was 10. If Robert scored 81, what was his \begin{align*}z\end{align*}-score?
\begin{align*}z&=\frac{x-\mu}{\sigma}\\ z&=\frac{81-65}{10}\\ z&=\frac{16}{10}\\ z&=1.6\end{align*}
Robert's \begin{align*}z\end{align*}-score is 1.6, which means that he scored 1.6 standard deviations above the mean.
Review
- Which of the following data sets is most likely to be normally distributed? For the other choices, explain why you believe they would not follow a normal distribution.
- The hand span (measured from the tip of the thumb to the tip of the extended \begin{align*}5^{\text{th}}\end{align*} finger) of a random sample of high school seniors
- The annual salaries of all employees of a large shipping company
- The annual salaries of a random sample of 50 CEOs of major companies, 25 women and 25 men
- The dates of 100 pennies taken from a cash drawer in a convenience store
- The grades on a statistics mid-term for a high school are normally distributed, with \begin{align*}\mu=81\end{align*} and \begin{align*}\sigma=6.3\end{align*}. Calculate the \begin{align*}z\end{align*}-scores for each of the following exam grades: 65, 83, 93, and 100. Draw and label a sketch for each grade.
- Assume that the mean weight of 1-year-old girls in the USA is normally distributed, with a mean of about 9.5 kilograms and a standard deviation of approximately 1.1 kilograms. Without using a calculator, estimate the percentage of 1-year-old girls who meet the following conditions. Draw a sketch and shade the proper region for each problem.
- Less than 8.4 kg
- Between 7.3 kg and 11.7 kg
- More than 12.8 kg
- For a standard normal distribution, place the following in order from smallest to largest.
- The percentage of data below 1
- The percentage of data below \begin{align*}-1\end{align*}
- The mean
- The standard deviation
- The percentage of data above 2
- The 2007 AP Statistics examination scores were not normally distributed, with \begin{align*}\mu=2.8\end{align*} and \begin{align*}\sigma=1.34\end{align*}. What is the approximate \begin{align*}z\end{align*}-score that corresponds to an exam score of 5? (The scores range from 1 to 5.)
- 0.786
- 1.46
- 1.64
- 2.20
- A \begin{align*}z\end{align*}-score cannot be calculated because the distribution is not normal.
- How can we use normal distributions to make meaningful conclusions about samples and experiments?
- How do we calculate probabilities and areas under the normal curve that are not covered by the Empirical Rule?
- What are the other types of distributions that can occur in different probability situations?
- The heights of \begin{align*}5^{\text{th}}\end{align*} grade boys in the USA are approximately normally distributed, with a mean height of 143.5 cm and a standard deviation of about 7.1 cm. What is the probability that a randomly chosen \begin{align*}5^{\text{th}}\end{align*} grade boy would be taller than 157.7 cm?
- A statistics class bought some sprinkle (or jimmies) doughnuts for a treat and noticed that the number of sprinkles seemed to vary from doughnut to doughnut, so they counted the sprinkles on each doughnut. Here are the results: 241, 282, 258, 223, 133, 335, 322, 323, 354, 194, 332, 274, 233, 147, 213, 262, 227, and 366.
- Create a histogram, dot plot, or box plot for this data. Comment on the shape, center and spread of the distribution.
- Find the mean and standard deviation of the distribution of sprinkles. Complete the following chart by standardizing all the values:
\begin{align*}\mu= \ \ \sigma= \end{align*}
Number of Sprinkles | Quantile | \begin{align*}z\end{align*}-score |
---|---|---|
241 | ||
282 | ||
258 | ||
223 | ||
133 | ||
335 | ||
322 | ||
323 | ||
354 | ||
194 | ||
332 | ||
274 | ||
233 | ||
147 | ||
213 | ||
262 | ||
227 | ||
366 |
Figure: A table to be filled in for the sprinkles question.
- Create a normal probability plot from your results.
- Based on this plot, comment on the normality of the distribution of sprinkle counts on these doughnuts.
References: SUNY at Albany
- Draw each of the following distributions accurately on one set of axes.
Distribution | Form | Mean | Standard Deviation |
---|---|---|---|
A | Normal | 30 | 5 |
B | Normal | 35 | 3 |
C | Normal | 24 | 12 |
- In a school, the children’s heights follow a normal distribution with an average of 55 inches and a variance of 9 square inches.
- What is \begin{align*}\mu\end{align*}?
- What is \begin{align*}\sigma\end{align*}?
- Is the curve \begin{align*}N(55,3)\end{align*} or \begin{align*}N(55,9)\end{align*}?
- The \begin{align*}N(36,9)\end{align*} distribution has a mean=? and SD=?
- The \begin{align*}N(9,36)\end{align*} distribution has a mean=? and SD=?
- For each of the following, calculate the standardized score (or z-score) for the value x:
- \begin{align*}\mu=0, \sigma=1, x=2\end{align*}
- \begin{align*}\mu=9, \sigma=5, x=3\end{align*}
- \begin{align*}\mu=9, \sigma=4, x=0\end{align*}
- \begin{align*}\mu=-9, \sigma=14, x=-20\end{align*}
- Draw the curve corresponding to each of the following random variables and then shade the area corresponding to the given probability. You do NOT have to compute the probability.
- X is a normal random variable with mean = 80 and standard deviation = 5. \begin{align*}P(70 < X < 90)\end{align*}
- X is a normal random variable with mean of 20 and standard deviation of 10. \begin{align*}P(-10 < X < 15)\end{align*}
- State the empirical rule.
- Use the empirical rule to determine what percentage of a normally distributed population is more than 3 standard deviations below the mean.
- Suppose that adult women’s heights are normally distributed with a mean of 65 inches and a standard deviation of 2 inches.
- Use the empirical rule to determine what percent of adult women have heights between 65 inches and 67 inches.
- Use the empirical rule to determine the proportion of adult women who have heights greater than 69 inches.
- Using the empirical rule, what is the probability that a randomly selected adult woman is more than 63 inches tall?
- What is the area under the curve between 59 inches and 67 inches?
- Given a group of data with mean 70 and standard deviation 12, at least what percent of the data will fall between 70 and 94?
- A bell-shaped data set has a mean of -690 and a standard deviation of 25. What percentage of the data should lie between -752 and -648?
- A bell-shaped data set has a mean of 890. If 68% of the data lie between 850 and 930, what is the standard deviation?
- If a group of data is bell-shaped with a mean of -25 and a standard deviation of 65.3, what interval should contain at least 95% of the data?
- Consider the following data set. Do you think it is a sample from a normally distributed population? Explain. \begin{align*} &24.0 &&7.9 &&1.5 &&0.0 &&0.3 &&0.4 &&8.1 &&4.3 &&0.0 &&0.5\\ &3.6 &&2.9 &&0.4 &&2.6 &&0.1 &&3.6 &&2.9 &&0.4 &&2.6 &&0.1\\ &16.6 &&1.4 &&23.8 &&25.1 &&1.6 &&12.2 &&14.8 &&0.4 &&3.7 &&4.2 \end{align*}
- Consider the following data set. Do you think it is a sample from a normally distributed population? Explain. \begin{align*} &26 &&24 &&22 &&25 &&23 &&24 &&25 &&23 &&25 &&22\\ &21 &&26 &&22 &&23 &&24 &&25 &&24 &&25 &&24 &&25\\ &26 &&21 &&22 &&24 &&24 \end{align*}
Review (Answers)
To view the Review answers, open this PDF file and look for section 5.1.