1.3: Measures of Center
Learning Objectives
- Calculate the mode, median, and mean for a set of data, and understand the differences between each measure of center.
- Identify the symbols and know the formulas for sample and population means.
- Determine the values in a data set that are outliers.
- Identify the values to be removed from a data set for an \begin{align*}n\%\end{align*} trimmed mean.
- Calculate the midrange, weighted mean, percentiles, and quartiles for a data set.
Introduction
This lesson is an overview of some of the basic statistics used to measure the center of a set of data.
Measures of Central Tendency
Once data are collected, it is useful to summarize the data set by identifying a value around which the data are centered. Three commonly used measures of center are the mode, the median, and the mean.
Mode
The mode is defined as the most frequently occurring value in a data set. The mode is most useful in situations that involve categorical (qualitative) data that are measured at the nominal level. In the last chapter, we referred to the data with the Galapagos tortoises and noted that the variable 'Climate Type' was such a measurement. For this example, the mode is the value 'humid'.
Example: The students in a statistics class were asked to report the number of children that live in their house (including brothers and sisters temporarily away at college). The data are recorded below:
1, 3, 4, 3, 1, 2, 2, 2, 1, 2, 2, 3, 4, 5, 1, 2, 3, 2, 1, 2, 3, 6
In this example, the mode could be a useful statistic that would tell us something about the families of statistics students in our school. In this case, 2 is the mode, as it is the most frequently occurring number of children in the sample (8 of the 22 students), telling us that more students in the class come from households with 2 children than from households of any other size.
If there were seven 3-child households and seven 2-child households, we would say the data set has two modes. In other words, the data would be bimodal. When a data set is described as being bimodal, it is clustered about two different modes. Technically, if there were more than two, they would all be the mode. However, the more of them there are, the more trivial the mode becomes. In these cases, we would most likely search for a different statistic to describe the center of such data.
If there is an equal number of each data value, the mode is not useful in helping us understand the data, and thus, we say the data set has no mode.
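The three cases described above (a single mode, a bimodal data set, and no mode) can be sketched in Python. This is a minimal illustration rather than part of the original lesson; the function name `modes` and the two small test lists are our own, and the function assumes the data set is not empty.

```python
from collections import Counter

def modes(data):
    """Return a sorted list of modes, or [] when every value is equally frequent."""
    counts = Counter(data)
    highest = max(counts.values())
    # If several distinct values all occur equally often, we say there is no mode.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(value for value, c in counts.items() if c == highest)

# Number of children per household, from the class example above.
children = [1, 3, 4, 3, 1, 2, 2, 2, 1, 2, 2, 3, 4, 5, 1, 2, 3, 2, 1, 2, 3, 6]

print(modes(children))         # one mode
print(modes([2, 2, 3, 3, 4]))  # two modes (bimodal)
print(modes([1, 2, 3, 4]))     # every value occurs once: no mode
```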
Mean
Another measure of central tendency is the arithmetic average, or mean. This value is calculated by adding all the data values and dividing the sum by the total number of data points. The mean is the numerical balancing point of the data set.
We can illustrate this physical interpretation of the mean. Below is a graph of the class data from the last example.
If you have snap cubes like you used to use in elementary school, you can make a physical model of the graph, using one cube to represent each student’s family and a row of six cubes at the bottom to hold them together, like this:
There are 22 students in this class, and the total number of children in all of their houses is 55, so the mean of this data is \begin{align*}\frac{55}{22}=2.5\end{align*}
It turns out that the model that you created balances at 2.5. In the pictures below, you can see that a block placed at 3 causes the graph to tip left, while one placed at 2 causes the graph to tip right. However, if you place the block at 2.5, it balances perfectly!
Symbolically, the formula for the sample mean is as follows:
\begin{align*}\overline{x}= \frac{\sum_{i=1}^n x_i}{n} = \frac{x_1+x_2+\ldots+x_n}{n}\end{align*}
where:
\begin{align*}x_i\end{align*} is the \begin{align*}i^{\text{th}}\end{align*} data value of the sample.
\begin{align*}n\end{align*} is the sample size.
The mean of the population is denoted by the Greek letter, \begin{align*}\mu\end{align*}.
\begin{align*}\overline{x}\end{align*} is a statistic, since it is a measure of a sample, and \begin{align*}\mu\end{align*} is a parameter, since it is a measure of a population. \begin{align*}\overline{x}\end{align*} is an estimate of \begin{align*}\mu\end{align*}.
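As a quick check of the sample mean formula and the balancing-point idea, here is a short Python sketch using the class data. The names `mean`, `x_bar`, and `deviations` are our own choices for illustration. If the mean really is the balancing point, the distances of the data values above and below it should cancel out.

```python
def mean(data):
    """Sample mean: the sum of the values divided by the number of values."""
    return sum(data) / len(data)

children = [1, 3, 4, 3, 1, 2, 2, 2, 1, 2, 2, 3, 4, 5, 1, 2, 3, 2, 1, 2, 3, 6]

x_bar = mean(children)

# Signed distance of each value from the mean; these cancel at the balancing point.
deviations = [x - x_bar for x in children]

print(x_bar)
print(sum(deviations))
```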
Median
The median is simply the middle number in an ordered set of data.
Suppose a student took five statistics quizzes and received the following grades:
80, 94, 75, 96, 90
To find the median, you must put the data in order. The median will be the data point that is in the middle. Placing the data in order from least to greatest yields: 75, 80, 90, 94, 96.
The middle number in this case is the third grade, or 90, so the median of this data is 90.
When there is an even number of data points, no single data point is in the middle. In this case, we take the average (mean) of the two middle numbers.
Example: Consider the following quiz scores: 91, 83, 97, 89
Place them in numeric order: 83, 89, 91, 97.
The second and third numbers straddle the middle of this set. The mean of these two numbers is 90, so the median of the data is 90.
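The procedure for both the odd and even cases can be written as a short Python function. This is a sketch for illustration; the name `median` is our own, and Python's standard library also offers `statistics.median`, which follows the same rule.

```python
def median(data):
    """Middle value of the ordered data; mean of the two middle values if n is even."""
    ordered = sorted(data)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of values: exactly one value sits in the middle.
        return ordered[mid]
    # Even number of values: average the two values that straddle the middle.
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([80, 94, 75, 96, 90]))  # five quiz grades
print(median([91, 83, 97, 89]))      # four quiz scores
```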
Mean vs. Median
Both the mean and the median are important and widely used measures of center. Consider the following example: Suppose you got an 85 and a 93 on your first two statistics quizzes, but then you had a really bad day and got a 14 on your next quiz!
The mean of your three grades would be 64, while the median would be 85. Which is a better measure of your performance? The middle number in the set, 85, does not change whether the lowest grade is an 84 or a 14. However, when you add the three numbers to find the mean, the sum will be much smaller if the lowest grade is a 14.
Outliers and Resistance
The mean and the median are so different in this example because there is one grade that is extremely different from the rest of the data. In statistics, we call such extreme values outliers. The mean is affected by the presence of an outlier; however, the median is not. A statistic that is not affected by outliers is called resistant. We say that the median is a resistant measure of center, and the mean is not resistant. In a sense, the median is able to resist the pull of a far away value, but the mean is drawn to such values. It cannot resist the influence of outlier values. As a result, when we have a data set that contains an outlier, it is often better to use the median to describe the center, rather than the mean.
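To see resistance in action, the following Python sketch recomputes both statistics for the three quiz grades, once with the outlier (14) and once with a more typical low grade (84). It uses the `mean` and `median` functions from Python's standard `statistics` module; the variable names are our own.

```python
from statistics import mean, median

with_outlier = [85, 93, 14]     # the "really bad day" scenario
without_outlier = [85, 93, 84]  # the same quizzes with a typical low grade

mean_with_14 = mean(with_outlier)
median_with_14 = median(with_outlier)
mean_with_84 = mean(without_outlier)
median_with_84 = median(without_outlier)

# The mean moves a great deal; the median does not move at all.
print(mean_with_14, median_with_14)
print(mean_with_84, median_with_84)
```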
Example: In 2005, the CEO of Yahoo, Terry Semel, was paid almost $231,000,000 (see http://www.forbes.com/static/execpay2005/rank.html). This is certainly not typical of what the average worker at Yahoo could expect to make. Instead of using the mean salary to describe how Yahoo pays its employees, it would be more appropriate to use the median salary of all the employees.
You will often see medians used to describe the typical value of houses in a given area, as the presence of a very few extremely large and expensive homes could make the mean appear misleadingly large.
Other Measures of Center
Midrange
The midrange (sometimes called the midextreme) is found by taking the mean of the maximum and minimum values of the data set.
Example: Consider the following quiz grades: 75, 80, 90, 94, and 96. The midrange would be:
\begin{align*}\frac{75+96}{2}= \frac{171}{2} = 85.5\end{align*}
Since it is based on only the two most extreme values, the midrange is not commonly used as a measure of central tendency.
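For completeness, the midrange calculation is a one-line function in Python. This is a small sketch, and the name `midrange` is our own.

```python
def midrange(data):
    """Mean of the smallest and largest values in the data set."""
    return (min(data) + max(data)) / 2

print(midrange([75, 80, 90, 94, 96]))
```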
Trimmed Mean
Recall that the mean is not resistant to the effects of outliers. Many students ask their teacher to “drop the lowest grade.” The argument is that everyone has a bad day, and one extreme grade that is not typical of the rest of their work should not have such a strong influence on their mean grade. The problem is that this can work both ways; it could also be true that a student who is performing poorly most of the time could have a really good day (or even get lucky) and get one extremely high grade. We wouldn’t blame this student for not asking the teacher to drop the highest grade! Attempting to more accurately describe a data set by removing the extreme values is referred to as trimming the data. To be fair, though, a valid trimmed statistic must remove both the extreme maximum and minimum values. So, while some students might disapprove, to calculate a trimmed mean, you remove the maximum and minimum values and divide by the number of values that remain.
Example: Consider the following quiz grades: 75, 80, 90, 94, 96.
A trimmed mean would remove the largest and smallest values, 75 and 96, and divide by 3.
\begin{align*}&\xcancel{75},80,90,94,\xcancel{96}\\ &\frac{80+90+94}{3}=88\end{align*}
\begin{align*}n\%\end{align*} Trimmed Mean
Instead of removing just the minimum and maximum values in a larger data set, a statistician may choose to remove a certain percentage of the extreme values. This is called an \begin{align*}n\%\end{align*} trimmed mean. To perform this calculation, remove the specified percent of the number of values from the data, half on each end. For example, in a data set that contains 100 numbers, to calculate a 10% trimmed mean, remove 10% of the data, 5% from each end. In this simplified example, the five smallest and the five largest values would be discarded, and the sum of the remaining numbers would be divided by 90.
Example: In real data, it is not always so straightforward. To illustrate this, let’s return to our data from the number of children in a household and calculate a 10% trimmed mean. Here is the data set:
1, 3, 4, 3, 1, 2, 2, 2, 1, 2, 2, 3, 4, 5, 1, 2, 3, 2, 1, 2, 3, 6
Placing the data in order yields the following:
1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 5, 6
Ten percent of 22 values is 2.2, so we could remove 2 numbers, one from each end (2 total, or approximately 9% trimmed), or we could remove 2 numbers from each end (4 total, or approximately 18% trimmed). Some statisticians would calculate both of these and then use proportions to find an approximation for 10%. Others might argue that 9% is closer, so we should use that value. For our purposes, and to stay consistent with the way we handle similar situations in later chapters, we will always opt to remove more numbers than necessary. The logic behind this is simple. You are claiming to remove 10% of the numbers. If you cannot remove exactly 10%, then you either have to remove more or fewer. We would prefer to err on the side of caution and remove at least the percentage reported. This is not a hard and fast rule and is a good illustration of how many concepts in statistics are open to individual interpretation. Some statisticians even say that the only correct answer to every question asked in statistics is, “It depends!”
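The convention adopted above (remove at least the stated percentage, half from each end) can be sketched in Python. This is one possible reading of the rule, not a standard definition: the function name `trimmed_mean` is our own, and we assume the percentage is given as a whole number so that the rounding up can be done with integer arithmetic.

```python
def trimmed_mean(data, percent):
    """Mean after trimming at least `percent` percent of the values, half from each end.

    Assumes `percent` is a whole number (for example, 10 for a 10% trimmed mean).
    """
    ordered = sorted(data)
    n = len(ordered)
    # Values to drop from EACH end: half of percent% of n, rounded up.
    # -(-a // b) is integer ceiling division, which avoids floating-point rounding.
    k = -(-n * percent // 200)
    kept = ordered[k:n - k]
    if not kept:
        raise ValueError("trimming removed every value")
    return sum(kept) / len(kept)

children = [1, 3, 4, 3, 1, 2, 2, 2, 1, 2, 2, 3, 4, 5, 1, 2, 3, 2, 1, 2, 3, 6]

# 10% of 22 is 2.2, so two values come off each end, leaving 18 values.
print(trimmed_mean(children, 10))

# Five quiz grades with one value removed from each end (a 40% trim).
print(trimmed_mean([75, 80, 90, 94, 96], 40))
```

For the household data, the values removed are 1, 1, 5, and 6, so the remaining 18 values sum to 42 and the trimmed mean is about 2.33, slightly below the untrimmed mean of 2.5.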
Weighted Mean
The weighted mean is a method of calculating the mean where instead of each data point contributing equally to the mean, some data points contribute more than others. This could be because they appear more often or because a decision was made to increase their importance (give them more weight). The most common type of weight to use is the frequency, which is the number of times each number is observed in the data. When we calculated the mean for the children living at home, we could have used a weighted mean calculation. The calculation would look like this:
\begin{align*}\frac{(5)(1)+(8)(2)+(5)(3)+(2)(4)+(1)(5)+(1)(6)}{22}=\frac{55}{22}=2.5\end{align*}
The symbolic representation of this is as follows:
\begin{align*}\overline{x}=\frac{\sum_{i=1}^n f_ix_i}{\sum_{i=1}^n f_i}\end{align*}
where:
\begin{align*}x_i\end{align*} is the \begin{align*}i^{\text{th}}\end{align*} distinct data value.
\begin{align*}f_i\end{align*} is the number of times that data value occurs (its frequency, or weight).
\begin{align*}n\end{align*} is the number of distinct data values.
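The weighted mean formula translates directly into Python. The sketch below uses the frequency table for the household data; the names `weighted_mean`, `values`, and `frequencies` are our own.

```python
def weighted_mean(values, weights):
    """Sum of weight * value, divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

values = [1, 2, 3, 4, 5, 6]       # distinct numbers of children
frequencies = [5, 8, 5, 2, 1, 1]  # how many households reported each number

print(weighted_mean(values, frequencies))
```

Because the weights here are frequencies, the result matches the ordinary mean of the 22 individual data values.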
Percentiles and Quartiles
A percentile is a statistic that identifies the percentage of the data that is less than the given value. The most commonly used percentile is the median. Because it is in the numeric middle of the data, half of the data is below the median. Therefore, we could also call the median the \begin{align*}50^{\text{th}}\end{align*} percentile. A \begin{align*}40^{\text{th}}\end{align*} percentile would be a value for which 40% of the numbers in the data set are less than that value.
Example: To check a child’s physical development, pediatricians use height and weight charts that help them to know how the child compares to children of the same age. A child whose height is in the \begin{align*}70^{\text{th}}\end{align*} percentile is taller than 70% of children of the same age.
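Using the definition given in this section (the percentage of the data that is less than a given value), a percentile can be computed with a short Python function. This is a sketch of that definition only; the name `percent_below` is our own, and other textbooks and software packages use slightly different conventions.

```python
def percent_below(data, value):
    """Percentage of the data values that are strictly less than `value`."""
    below = sum(1 for x in data if x < value)
    return 100 * below / len(data)

children = [1, 3, 4, 3, 1, 2, 2, 2, 1, 2, 2, 3, 4, 5, 1, 2, 3, 2, 1, 2, 3, 6]

# 13 of the 22 households have fewer than 3 children.
print(percent_below(children, 3))
```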
Two very commonly used percentiles are the \begin{align*}25^{\text{th}}\end{align*} and \begin{align*}75^{\text{th}}\end{align*} percentiles. The median, \begin{align*}25^{\text{th}}\end{align*}, and \begin{align*}75^{\text{th}}\end{align*} percentiles divide the data into four parts. Because of this, the \begin{align*}25^{\text{th}}\end{align*} percentile is notated as \begin{align*}Q_1\end{align*} and is called the lower quartile, and the \begin{align*}75^{\text{th}}\end{align*} percentile is notated as \begin{align*}Q_3\end{align*} and is called the upper quartile. The median is a middle quartile and is sometimes referred to as \begin{align*}Q_2\end{align*}.
Example: Let's return to the previous data set, which is as follows:
1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 5, 6
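Recall that the median (\begin{align*}50^{\text{th}}\end{align*} percentile) is 2. The quartiles can be thought of as the medians of the upper and lower halves of the data. Here, each half contains 11 values, so \begin{align*}Q_1\end{align*} is the \begin{align*}6^{\text{th}}\end{align*} value in the ordered list, which is 2, and \begin{align*}Q_3\end{align*} is the \begin{align*}17^{\text{th}}\end{align*} value, which is 3.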
In this case, there are an odd number of values in each half. If there were an even number of values, then we would follow the procedure for medians and average the middle two values of each half. Look at the set of data below:
The median in this set is 90. Because it is the middle number, it is not technically part of either the lower or upper halves of the data, so we do not include it when calculating the quartiles. However, not all statisticians agree that this is the proper way to calculate the quartiles in this case. As we mentioned in the last section, some things in statistics are not quite as universally agreed upon as in other branches of mathematics. The exact method for calculating quartiles is another one of these topics. To read more about some alternate methods for calculating quartiles in certain situations, click on the subsequent link.
On the Web
http://mathforum.org/library/drmath/view/60969.html
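The quartile method used in this lesson (take the medians of the lower and upper halves, leaving the median itself out of both halves when the number of values is odd) can be sketched in Python as follows. The function names are our own, the sketch assumes at least two data values, and, as noted above, other methods can give slightly different answers. The five quiz grades from earlier in the lesson are used here as an assumed example of a data set with an odd number of values.

```python
def median(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(data):
    """Return (Q1, median, Q3) using the median-of-halves method."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # With an odd number of values, skip the middle value so it is in neither half.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median(lower), median(ordered), median(upper)

children = [1, 3, 4, 3, 1, 2, 2, 2, 1, 2, 2, 3, 4, 5, 1, 2, 3, 2, 1, 2, 3, 6]
quiz_grades = [75, 80, 90, 94, 96]

print(quartiles(children))     # 22 values: each half has 11 values
print(quartiles(quiz_grades))  # 5 values: the median is left out of both halves
```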
Lesson Summary
When examining a set of data, we use descriptive statistics to provide information about where the data are centered. The mode is a measure of the most frequently occurring number in a data set and is most useful for categorical data and data measured at the nominal level. The mean and median are two of the most commonly used measures of center. The mean, or average, is the sum of the data points divided by the total number of data points in the set. In a data set that is a sample from a population, the sample mean is denoted by \begin{align*}\overline{x}\end{align*}. The population mean is denoted by \begin{align*}\mu\end{align*}. The median is the numeric middle of a data set. If there are an odd number of data points, this middle value is easy to find. If there is an even number of data values, the median is the mean of the middle two values. An outlier is a number that has an extreme value when compared with most of the data. The median is resistant. That is, it is not affected by the presence of outliers. The mean is not resistant, and therefore, the median tends to be a more appropriate measure of center to use in examples that contain outliers. Because the mean is the numerical balancing point for the data, it is an extremely important measure of center that is the basis for many other calculations and processes necessary for making useful conclusions about a set of data.
Another measure of center is the midrange, which is the mean of the maximum and minimum values. In an \begin{align*}n\%\end{align*} trimmed mean, you remove a certain \begin{align*}n\end{align*} percentage of the data (half from each end) before calculating the mean. A weighted mean involves multiplying individual data values by their frequencies or percentages before adding them and then dividing by the total of the frequencies (weights).
A percentile is a data value for which the specified percentage of the data is below that value. The median is the \begin{align*}50^{\text{th}}\end{align*} percentile. Two well-known percentiles are the \begin{align*}25^{\text{th}}\end{align*} percentile, which is called the lower quartile, \begin{align*}Q_1\end{align*}, and the \begin{align*}75^{\text{th}}\end{align*} percentile, which is called the upper quartile, \begin{align*}Q_3\end{align*}.
Points to Consider
- How do you determine which measure of center best describes a particular data set?
- What are the effects of outliers on the various measures of spread?
- How can we represent data visually using the various measures of center?
Multimedia Links
For a discussion of four measures of central tendency (5.0), see American Public University, Data Distributions - Measures of a Center (6:24).
For an explanation and examples of mean, median and mode (10.0), see keithpeterb, Mean, Mode and Median from Frequency Tables (7:06).
Review Questions
- In Lois’ \begin{align*}2^{\text{nd}}\end{align*} grade class, all of the students are between 45 and 52 inches tall, except one boy, Lucas, who is 62 inches tall. Which of the following statements is true about the heights of all of the students?
  - The mean height and the median height are about the same.
  - The mean height is greater than the median height.
  - The mean height is less than the median height.
  - More information is needed to answer this question.
  - None of the above is true.
- Enrique has a 91, 87, and 95 for his statistics grades for the first three quarters. His mean grade for the year must be a 93 in order for him to be exempt from taking the final exam. Assuming grades are rounded following valid mathematical procedures, what is the lowest whole number grade he can get for the \begin{align*}4^{\text{th}}\end{align*} quarter and still be exempt from taking the exam?
- How many data points should be removed from each end of a sample of 300 values in order to calculate a 10% trimmed mean?
  - 5
  - 10
  - 15
  - 20
  - 30
- In the previous question, after removing the correct number of values and summing those remaining, what would you divide by to calculate the mean?
- The chart below shows the data from the Galapagos tortoise preservation program with just the number of individual tortoises that were bred in captivity and reintroduced into their native habitat.
Island or Volcano | Number of Individuals Repatriated |
---|---|
Wolf | 40 |
Darwin | 0 |
Alcedo | 0 |
Sierra Negra | 286 |
Cerro Azul | 357 |
Santa Cruz | 210 |
Española | 1293 |
San Cristóbal | 55 |
Santiago | 498 |
Pinzón | 552 |
Pinta | 0 |
Figure: Approximate Distribution of Giant Galapagos Tortoises in 2004 (“Estado Actual De Las Poblaciones de Tortugas Terrestres Gigantes en las Islas Galápagos,” Marquez, Wiedenfeld, Snell, Fritts, MacFarland, Tapia, y Naranjo, Ecología Aplicada, Vol. 3, Num. 1,2, pp. 98-111).
For this data, calculate each of the following:
(a) mode
(b) median
(c) mean
(d) a 10% trimmed mean
(e) midrange
(f) upper and lower quartiles
(g) the percentile for the number of Santiago tortoises reintroduced
- In the previous question, why is the answer to (c) significantly higher than the answer to (b)?
On the Web
http://edhelper.com/statistics.htm
http://en.wikipedia.org/wiki/Arithmetic_mean
Java Applets helpful to understand the relationship between the mean and the median:
http://www.ruf.rice.edu/~lane/stat_sim/descriptive/index.html
http://www.shodor.org/interactivate/activities/PlopIt/
Technology Notes: Calculating the Mean on the TI-83/84 Graphing Calculator
Step 1: Entering the data
On the home screen, press [2ND][{], and then enter the following data separated by commas. When you have entered all the data, press [2ND][}][STO][2ND][L1][ENTER]. You will see the screen on the left below:
1, 3, 4, 3, 1, 2, 2, 2, 1, 2, 2, 3, 4, 5, 1, 2, 3, 2, 1, 2, 3, 6
Step 2: Computing the mean
On the home screen, press [2ND][LIST] to enter the LIST menu, press the right arrow twice to go to the MATH menu (the middle screen above), and either arrow down and press [ENTER] or press [3] for the mean. Finally, press [2ND][L1][)] to insert L1 and press [ENTER] (see the screen on the right above).
Calculating Weighted Means on the TI-83/84 Graphing Calculator
Use the data of the number of children in a family. In list L1, enter the number of children, and in list L2, enter the frequencies, or weights.
The data should be entered as shown in the left screen below:
Press [2ND][LIST] to enter the LIST menu (LIST is the second function of the [STAT] key), press the right arrow twice to go to the MATH menu (the middle screen above), and either arrow down and press [ENTER] or press [3] for the mean. Finally, press [2ND][L1][,][2ND][L2][)][ENTER], and you will see the screen on the right above. Note that the mean is 2.5, as before.
Calculating Medians and Quartiles on the TI-83/84 Graphing Calculator
The median and quartiles can also be calculated using a graphing calculator. You may have noticed earlier that median is available in the MATH submenu of the LIST menu (see below).
While there is a way to access each quartile individually, we will usually want them both, so we will access them through the one-variable statistics in the STAT menu.
You should still have the data in L1 and the frequencies, or weights, in L2, so press [STAT], and then arrow over to CALC (the left screen below) and press [ENTER] or press [1] for '1-Var Stats', which returns you to the home screen (see the middle screen below). Press [2ND][L1][,][2ND][L2] for the data and frequency lists (see third screen). When you press [ENTER], look at the bottom left-hand corner of the screen (fourth screen below). You will notice there is an arrow pointing downward to indicate that there is more information. Scroll down to reveal the quartiles and the median (final screen below).
Remember that \begin{align*}Q_1\end{align*} corresponds to the \begin{align*}25^{\text{th}}\end{align*} percentile, and \begin{align*}Q_3\end{align*} corresponds to the \begin{align*}75^{\text{th}}\end{align*} percentile.