Measures of Center

From the FlexBook® textbook CK-12 Probability and Statistics - Advanced (CK-12 Foundation).

Learning Objectives

  • Calculate the mode, median, and mean for a set of data, and understand the differences between each measure of center.
  • Identify the symbols and know the formulas for sample and population means.
  • Determine the values in a data set that are outliers.
  • Identify the values to be removed from a data set for an n\% trimmed mean.
  • Calculate the midrange, weighted mean, percentiles, and quartiles.

Introduction

This lesson is an overview of some of the basic statistics used to measure the center of a set of data.

Mode, Mean, and Median

In the last lesson, you learned that it makes sense to summarize a data set by identifying a value around which the data is centered. Three commonly used statistics that quantify the idea of center are the mode, median and mean.

Mode

The mode is defined as the most frequently occurring number in a data set. While many elementary school children learn the mode as their first introduction to measures of center, as you delve deeper into statistics, you will most likely encounter it less frequently. The mode really only has significance for data measured at the most basic of levels. The mode is most useful in situations that involve categorical (qualitative) data that is measured at the nominal level. In the last chapter, we referred to the data with the Galapagos tortoises and noted that the variable “Climate Type” was such a measurement. For this example, the mode is the value “humid.”

Example:

The students in a statistics class were asked to report the number of children that live in their house (including brothers and sisters temporarily away at college). The data is recorded below:

1, 3, 4, 3, 1, 2, 2, 2, 1, 2, 2, 3, 4, 5, 1, 2, 3, 2, 1, 2, 3, 6

In this example, the mode could be a useful statistic that would tell us something about the families of statistics students in our school. In this case, 2 is the mode as it is the most frequently occurring number of children in the sample, telling us that a large number of students in our class have 2 children in their home.

Notice how careful we are to NOT apply this to a larger population and assume that this will be true for any population other than our class! In a later chapter, you will learn how to correctly select a sample that could represent a broader population.

Two Issues with the Mode

  1. If more than one number ties for the most frequent, then the mode is usually all of those numbers. For example, if there were seven 3-child households and seven with 2 children, we would say that the mode is, “2 and 3.” When data is described as being bimodal, it is clustered about two different modes. Technically, if there were more than two, they would all be the mode. However, the more of them there are, the more trivial the mode becomes. In those cases, we would most likely search for a different statistic to describe the center of such data.
  2. If each data value occurs an equal number of times, we usually say, “There is no mode.” Again, this is a case where the mode is not at all useful in helping us to understand the behavior of the data.
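The two issues above can be sketched in a few lines of Python. This is only an illustration, not part of the original lesson: the function name `modes` is our own, and returning an empty list for the “no mode” case is one possible convention.

```python
from collections import Counter

def modes(values):
    """Return every value tied for the highest frequency.

    Returns an empty list when all values occur equally often,
    which is the "there is no mode" case described above.
    """
    if not values:
        return []
    counts = Counter(values)
    top = max(counts.values())
    # If several distinct values all share the same count, report no mode.
    if len(counts) > 1 and all(c == top for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == top)

# Number of children in each student's house (data from the example above)
children = [1, 3, 4, 3, 1, 2, 2, 2, 1, 2, 2, 3, 4, 5, 1, 2, 3, 2, 1, 2, 3, 6]

print(modes(children))         # [2]
print(modes([2, 2, 3, 3, 5]))  # [2, 3], a bimodal set
print(modes([1, 2, 3, 4]))     # [], every value occurs once
```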

Do You Mean the Average?

You are probably comfortable calculating averages. The average is a measure of center that statisticians call the mean. Most students learn early on in their studies that you calculate the mean by adding all of the numbers and dividing by the number of numbers. While you are expected to be able to perform this calculation, most real data sets that statisticians deal with are so large that they very rarely calculate a mean by hand. It is much more critical that you understand why the mean is such an important measure of center. The mean is actually the numerical “balancing point” of the data set.

We can illustrate this physical interpretation of the mean. Below is a graph of the class data from the last example.

If you have snap cubes like you used to use in elementary school, you can make a physical model of the graph, using one cube to represent each student’s family and a row of six cubes at the bottom to hold them together like this:

There are 22 students in this class and the total number of children in all of their houses is 55, so the mean of this data is 55 \div 22 = 2.5. Statisticians use the symbol \overline {X} to represent the mean when X is the symbol for a single measurement. It is pronounced “x bar.”

It turns out that the model that you created balances at 2.5. In the pictures below, you can see that a block placed at 3 causes the graph to tip left, while one placed at 2 causes the graph to tip right. However, if you place the block at about 2.5, it balances perfectly!
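If you do not have snap cubes handy, the same balancing idea can be checked numerically. The sketch below is our own illustration (the helper name `net_balance` is hypothetical): it adds up the signed distances of every data value from a chosen pivot. A result of zero means the two sides balance.

```python
children = [1, 3, 4, 3, 1, 2, 2, 2, 1, 2, 2, 3, 4, 5, 1, 2, 3, 2, 1, 2, 3, 6]

mean = sum(children) / len(children)
print(mean)  # 2.5

def net_balance(values, pivot):
    """Sum of signed distances from the pivot.

    Negative: more weight sits to the left of the pivot (tips left).
    Positive: more weight sits to the right of the pivot (tips right).
    Zero: the data balances at the pivot.
    """
    return sum(v - pivot for v in values)

print(net_balance(children, 3))    # -11, tips left
print(net_balance(children, 2))    # 11, tips right
print(net_balance(children, 2.5))  # 0.0, balanced at the mean
```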

Technology Note: Use the TI-83/84, and Mean it!

As was already mentioned, once you understand how to calculate a mean, and unless you need practice with your arithmetic skills, you rarely calculate them by hand. Here is how to calculate a mean with the TI-83/4 family of graphing calculators.

Step 1: Entering the data

On the home screen, press [2nd] [ { ], then enter the data separated by commas. When you have entered all the data, press [2nd] [ } ] [sto] [2nd] [L1] [enter]. You will see the screen on the left below:

1, 3, 4, 3, 1, 2, 2, 2, 1, 2, 2, 3, 4, 5, 1, 2, 3, 2, 1, 2, 3, 6

Step 2: Computing the mean

On the home screen, press [2nd] [LIST] to enter the list menu, press the left arrow ([leftarrow]) once to go to the MATH menu (the middle screen above), and either arrow down or choose 3 for the mean. Finally, press [2nd] [L1] [ ) ] to insert L1 and press [enter] (see the screen on the right above).

Right Down the Middle: The Median

The median is simply the middle number in a set of data. Think of 5 students seated in a row in statistics class:

Aliyah Bob Catalina David Elaine

Which student is sitting in the middle? If there were only four students, what would be the middle of the row? These are the same issues you face when calculating the numeric middle of a data set using the median.

Let’s say that Ron has taken five quizzes in his statistics class and received the following grades:

80, 94, 75, 90, 96

Before finding the median, you must put the data in order. The median is the numeric middle. Placing the data in order from least to greatest yields:

75, 80, 90, 94, 96

The middle number in this case is the third grade, or 90, so the median of this data is 90. Notice that just by coincidence, this was also the third quiz that he took, but this will usually not be the case.

Of course, when there is an even number of numbers, there is no true value in the middle. In this case we take the two middle numbers and find their mean. If there are four students sitting in a row, the middle of the row is halfway between the second and third students.

Example

Take Rhonda’s quiz grades:

91, 83, 97, 89

Place them in numeric order:

83, 89, 91, 97

The second and third numbers “straddle” the middle of this set. The mean of these two numbers is 90, so the median of the data is 90.
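The procedure for both cases (an odd or an even number of values) can be written as a short Python function. This is a minimal sketch of the steps described above, with a function name of our own choosing; Python's built-in `statistics.median` performs the same calculation.

```python
def median(values):
    """Middle value of the sorted data; mean of the two middle values if n is even."""
    ordered = sorted(values)      # the data must be put in order first
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]       # odd count: a single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: average the two middle values

print(median([80, 94, 75, 90, 96]))  # 90   (Ron's grades)
print(median([91, 83, 97, 89]))      # 90.0 (Rhonda's grades)
```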

Mean vs. Median

Both the mean and the median are important and widely used measures of center. So you might wonder why we need them both. There is an important difference between them that can be explained by the following example.

Let’s say that you get an 85 and a 93 on your first two statistics quizzes, but then you had a really bad day and got a 14 on your next quiz!!!

The mean of your three grades would be a 64! What would the median be? Which is a better measure of your performance? As you can see, the middle number in the set is an 85. That middle does not change if the lowest grade is an 84, or if the lowest grade is a 14. However, when you add the three numbers to find the mean, the sum will be much smaller if the lowest grade is a 14. If you divide a much smaller sum by 3, the mean will also be much smaller.

Outliers and Resistance

So, why are the mean and median so different in this example? It is because there is one grade that is extremely different from the rest of the data. In statistics, we call such extreme values outliers. The mean is affected by the presence of an outlier; however, the median is not. A statistic that is not affected by outliers is called resistant. We say that the median is a resistant measure of center, and the mean is not resistant. In a sense, the median is able to resist the pull of a far away value, but the mean is drawn to such values. It cannot resist the influence of outlier values. Remember the balancing point example? If you created another number that was far away, you would be forced to move the block toward it to make it stay balanced.
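To see resistance in action, you can compare the two statistics on the quiz grades from the example, once with the low grade of 14 and once with a low grade of 84. The sketch below uses Python's standard `statistics` module; the variable names are our own.

```python
import statistics

bad_day = [85, 93, 14]   # quiz grades including the outlier
good_day = [85, 93, 84]  # the same grades if the lowest had been an 84

# The median is 85 in both cases: it resists the outlier.
median_bad = statistics.median(bad_day)
median_good = statistics.median(good_day)

# The mean is pulled toward the outlier: 64 versus about 87.3.
mean_bad = statistics.mean(bad_day)
mean_good = statistics.mean(good_day)

print(median_bad, median_good)
print(mean_bad, round(mean_good, 1))
```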

As a result, when we have a data set that contains an outlier, it is often better to use the median to describe the center, rather than the mean. For example, in 2005 the CEO of Yahoo, Terry Semel, was paid almost \$231 million (see http://www.forbes.com/static/execpay2005/rank.html). This is certainly not typical of what the “average” worker at Yahoo could expect to make. Instead of using the mean salary to describe how Yahoo pays its employees, it would be more appropriate to use the median salary of all the employees. You will often see medians used to describe the typical value of houses in a given area, as the presence of a very few extremely large and expensive homes could make the mean appear misleadingly large.

Population Mean vs. Sample Mean

Now that we understand some basic concepts about the mean, it is important to be able to represent and understand the mean symbolically. When you are calculating the mean as a statistic from a finite sample of data, we call this the sample mean. As we have already mentioned, the symbol for the mean is the symbol for a single measurement with a bar over it, so for measurements written as x the sample mean is \overline{x}. Written symbolically then, the formula for a sample mean is:

\overline{x} = \frac{\sum x_i} {n} = \frac{x_1 + x_2 + \cdots + x_n} {n}

You may remember seeing the symbol \sum before on a calculator or in another mathematics class. It is called “sigma,” the Greek capital S. In mathematics, we use this symbol as a shortcut for “the sum of”. So, the formula is the sum of all the data values (x_1, x_2, etc.) divided by the number of observations (n).

Recall that the mean of an entire population is a parameter. The symbol for a population mean is another Greek letter, \mu. It is the lowercase Greek m and is called “mu” (pronounced “mew”, like the sound a cat makes). In this case the symbolic representation would be:

\mu = \frac{\sum X_i} {N} = \frac{X_1 + X_2 + \cdots + X_N} {N}

The formula is very much the same, because we calculate the mean the same way, but we typically use capital X for the individuals in the population and capital N to represent the size of the population.

In general, statisticians say that \overline{x}, the mean of a portion of the population, is an estimate of \mu, the mean of the population, which is usually unknown. In this course you will learn to determine how good that estimate is.

Other Measures of Center

There are many other lesser-known measures of center that can prove useful in describing certain data sets. We will highlight a few of them in this section.

Midrange

The midrange (sometimes called the midextreme), is found by taking the mean of the maximum and minimum values of the data set.

In a previous example we used the following data from Ron’s grades:

75, 80, 90, 94, 96

The midrange would be:

\frac{(75 + 96)} {2} = \frac{171} {2} = 85.5

One of the reasons that the midrange is not commonly used is that it is only based on two values of the data set, and not just any two, but the values that are most likely to be outliers! It would be like basing your class grade on only two assessments and ignoring all the other work you may have done. Even if it works out as a higher grade for you, many of your accomplishments would be meaningless!
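A quick Python sketch shows both the calculation and the weakness just described. The function name `midrange` is our own, and the changed grade in the second call is a made-up value used only to show how sensitive the midrange is to one extreme score.

```python
def midrange(values):
    """Mean of the smallest and largest values in the data set."""
    return (min(values) + max(values)) / 2

ron = [75, 80, 90, 94, 96]
print(midrange(ron))                   # 85.5

# Hypothetical: replace only the lowest grade with a very bad day.
print(midrange([15, 80, 90, 94, 96]))  # 55.5
```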

Trimmed Mean

Remember that the mean is not resistant to the effects of outliers. Many students ask their teacher to “drop the lowest grade.” The argument is that everyone has a bad day, and one extreme grade that is not typical of the rest of their work should not have such a strong influence on their mean grade. The problem is that this can work both ways; it could also be true that a student who is performing poorly most of the time could have a really good day (or even get lucky) and get one extremely high grade. We wouldn’t blame this student for not asking the teacher to drop the highest grade! Attempting to more accurately describe a data set by removing the extreme values is referred to as trimming the data. To be fair though, a valid trimmed statistic must remove both the extreme maximum and minimum values. So, while some students might disapprove, to calculate a trimmed mean, you remove the maximum and minimum values, add the values that remain, and divide that sum by the number of values that remain.

Let’s go back to Ron’s grades again:

75, 80, 90, 94, 96

A trimmed mean would remove the largest and smallest values, 75 and 96, and divide by 3.

\xcancel{75}, 80, 90, 94, \xcancel{96}

\frac{80 + 90 + 94} {3} = \frac{264} {3} = 88

n% Trimmed Mean

Instead of removing just the minimum and maximums in a larger data set, a statistician may choose to remove a certain percentage of the extreme values. This is called an n\% trimmed mean. To perform this calculation, you would remove the specified percent of the number of values from the data, half on each end. For example, in a data set that contained 100 numbers, if a researcher wanted to calculate a 10\% trimmed mean, she would need to remove 10\% of the data, or 5\% from each end. In this simplified example, the five smallest and the five largest values would be discarded and the sum of the remaining numbers would be divided by 90.

In “real” data, it is not always so straightforward. To illustrate this, let’s return to our data from the number of children in a household and calculate a 10\% trimmed mean. Here is the data set:

1, 3, 4, 3, 1, 2, 2, 2, 1, 2, 2, 3, 4, 5, 1, 2, 3, 2, 1, 2, 3, 6

Placing the data in order yields:

1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 5, 6

With 22 values, 10\% of them is 2.2, so we could remove 2 numbers, one from each end (2 total, or approximately 9\% trimmed), or we could remove 2 numbers from each end (4 total, or approximately 18\% trimmed). Some statisticians would calculate both of these and then use proportions to find an approximation for 10\%. Others might argue that 9\% is closer, so we should use that value. For our purposes, and to stay consistent with the way we handle similar situations in later chapters, we will always opt to remove more numbers than necessary. The logic behind this is simple. You are claiming to remove 10\% of the numbers. If you cannot remove exactly 10\%, then you have to remove either more or less. We would prefer to err on the side of caution and remove at least the percentage reported. This is not a hard and fast rule and is a good illustration of how many concepts in statistics are open to individual interpretation. Some statisticians even say that the only correct answer to every question asked in statistics is, “it depends”!
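The convention chosen in this lesson (always remove at least the stated percentage, split evenly between the two ends) can be sketched in Python as follows. The function name `trimmed_mean` and the rounding-up rule in the code are our reading of that convention; other textbooks and software packages may round differently.

```python
import math

def trimmed_mean(values, percent):
    """n% trimmed mean, removing at least `percent` percent of the data.

    Half of the trimmed values come from each end. When the count to
    remove from each end is not a whole number, it is rounded UP, which
    follows this lesson's convention (an assumption, not a universal rule).
    """
    ordered = sorted(values)
    n = len(ordered)
    k = math.ceil(n * percent / 200)  # number of values removed from EACH end
    if 2 * k >= n:
        raise ValueError("trimming would remove every value")
    kept = ordered[k:n - k]
    return sum(kept) / len(kept)

children = [1, 3, 4, 3, 1, 2, 2, 2, 1, 2, 2, 3, 4, 5, 1, 2, 3, 2, 1, 2, 3, 6]

# 10% of 22 is 2.2, so two values are removed from each end (4 total),
# leaving 18 values whose sum is 42.
print(round(trimmed_mean(children, 10), 2))  # 2.33
```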

Weighted Mean

The weighted mean is a method of calculating the mean when some of the data values count more than others, for example because they occur more than once. The most common type of weight to use is the frequency, which is the number of times each number is observed in the data. With frequencies as weights, the calculation gives the same result as the standard mean: each distinct data value is multiplied by its weight first, then the products are added and the result is divided by the sum of the weights.

When we calculated the mean for the children living at home, we could have used a weighted mean calculation. The calculation would look like this:

\frac{1\cdot5 + 2\cdot8 + 3\cdot5 + 4\cdot2 + 5\cdot1 + 6\cdot1} {22} = \frac{55} {22} = 2.5
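The same weighted calculation can be sketched in Python. The function and variable names here are our own; the values and frequencies are the ones from the children-at-home data.

```python
def weighted_mean(values, weights):
    """Sum of value * weight, divided by the sum of the weights."""
    total = sum(v * w for v, w in zip(values, weights))
    return total / sum(weights)

num_children = [1, 2, 3, 4, 5, 6]  # distinct data values
frequency = [5, 8, 5, 2, 1, 1]     # how many households reported each value

print(weighted_mean(num_children, frequency))  # 2.5
```

Because the weights are frequencies, the result matches the ordinary mean of the 22 original values.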

Technology Note: Weighted Means on the TI 83 or 84 Graphing Calculator

Weighted means are easy to calculate using a graphing calculator. We can use list L1 for the number of children, and in list L2 we will enter the frequencies, or weights.

Enter the data as shown in the left screen below:

For weighted means, we use the same procedure, but enter the two lists in the mean computation. Press [2nd] [LIST] to enter the list menu, press the left arrow [leftarrow] to go to the MATH menu (the middle screen above), and either arrow down or choose 3 for the mean. Finally, press [2nd] [L1] [comma] [2nd] [L2] [ ) ] [enter] and you will see the screen on the right above. Note that the mean is 2.5, as before.

Percentiles and Quartiles

A percentile is a statistic that identifies the percentage of the data that is less than the given value. The most commonly used percentile is the median. Because it is in the numeric middle of the data, half of the data is below the median. Therefore, we could also call the median the 50^{th} percentile. The 40^{th} percentile would be a value for which 40\% of the numbers are less than that value. Your first exposure to percentiles was most likely as a baby! To check a child’s physical development, pediatricians use height and weight charts that help them to know how the child compares to children of the same age. A child whose height is in the 70^{th} percentile is taller than 70\% of the children of the same age.
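Using the definition above (the percentage of the data that is less than a given value), the percentile of a value can be sketched in Python. The function name `percent_below` is our own, and statistical software often uses slightly different definitions, so treat this as an illustration of this lesson's definition only.

```python
def percent_below(values, x):
    """Percentage of the data values that are strictly less than x."""
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)

ron = [75, 80, 90, 94, 96]

# Three of Ron's five grades (75, 80, 90) are below 94.
print(percent_below(ron, 94))  # 60.0
```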

Two very commonly used percentiles are the 25^{th} and 75^{th} percentiles. Because they divide the data into quarters (when taken together with the median), they are referred to as the lower and upper quartiles. They are sometimes abbreviated Q_1 and Q_3. A quartile divides the data into 4 approximately equal groups. Technically, the median is a “middle” quartile and is sometimes referred to as Q_2. Some also refer to the minimum value in a data set as Q_0 and the maximum as Q_4.

Returning to a previous data set:

1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 5, 6

Recall that the median (50^{th} percentile) is 2. The quartiles can be thought of as the medians of the upper and lower halves of the data.

In this case, there are an odd number of numbers in each half. If there were an even number of numbers, then we would follow the procedure for medians and average the middle two numbers of each half. Look at the following set of data:

The median in this set is 90. Because it is the middle number, it is not technically part of either the lower or upper halves of the data, so we do not include it when calculating the quartiles. However, not all statisticians agree that this is the proper way to calculate the quartiles in this case. As we mentioned in the last section, some things in statistics are not quite as universally agreed upon as in other branches of mathematics. The exact method for calculating quartiles is another one of those topics. To read more about some alternate methods for calculating quartiles in certain situations, see the following website:

http://mathforum.org/library/drmath/view/60969.html
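The “median of each half” method described in this section can be sketched in Python. This follows the convention used here (when the number of values is odd, the median is left out of both halves); as noted above, other methods exist and can give different answers. The function names are our own.

```python
def median(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + (n % 2):]  # skip the middle value when n is odd
    return median(lower), median(ordered), median(upper)

children = [1, 3, 4, 3, 1, 2, 2, 2, 1, 2, 2, 3, 4, 5, 1, 2, 3, 2, 1, 2, 3, 6]
print(quartiles(children))              # (2, 2.0, 3)

# Ron's quiz grades from earlier: an odd number of values, so the
# median (90) is excluded from both halves.
print(quartiles([75, 80, 90, 94, 96]))  # (77.5, 90, 95.0)
```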

Technology Note: Medians and Quartiles on the Graphing Calculator

The median and quartiles can also be calculated using the graphing calculator. You may have noticed earlier that median is available in the MATH submenu of the [LIST] menu (see below).

While there is a way to access each quartile individually, we will usually want them both, so we will access them through the one-variable statistics in the [STAT] menu.

You should still have the data in [L1] and the frequencies or weights in [L2], so press [stat], then arrow over to [CALC] (the left screen below) and choose 1-var Stat, which returns you to the Home Screen (see the middle screen below). Enter [2nd] [L1] [comma] [2nd] [L2] for the data and frequency lists (see third screen). When you press [enter], look at the bottom left hand corner of the screen (fourth screen below). You will notice there is an arrow pointing downward to indicate that there is more information. Scroll down to reveal the quartiles and the median (final screen below).

Remember that Q_1 corresponds to the 25^{th} percentile and Q_3 is the 75^{th} percentile.

Lesson Summary

When examining a set of data, we use descriptive statistics to provide information about where the data is centered. The mode is a measure of the most frequently occurring number in a data set and is most useful for categorical data and data measured at the nominal level. The mean and median are two of the most commonly used measures of center. The mean, or average, is the sum of the data points divided by the total number of data points in the set. In a data set that is a sample from a population, the sample mean is notated as \overline{x}. When the entire population is involved, the population mean is \mu. The median is the numeric middle of a data set. If there is an odd number of data values, this middle value is easy to find. If there is an even number of data values, however, the median is the mean of the middle two values. The median is resistant, that is, it is not affected by the presence of outliers. An outlier is a number that has an extreme value when compared with most of the data. The mean is not resistant, and therefore the median tends to be a more appropriate measure of center to use in examples that contain outliers. Because the mean is the numerical balancing point for the data, it is an extremely important measure of center that is the basis for many other calculations and processes necessary for making useful conclusions about a set of data.

Other measures of center include the midrange, which is the mean of the maximum and minimum values. In an n\% trimmed mean, you remove a certain percentage of the data (half from each end) before calculating the mean. A weighted mean involves multiplying individual data values by their frequencies or percentages before adding them and then dividing by the total of the weights.

A percentile is a data value for which the specified percentage of the data is below that value. The median is the 50^{th} percentile. Two well-known percentiles are the 25^{th} percentile, which is called the lower quartile (LQ or Q_1), and the 75^{th} percentile, which is called the upper quartile (UQ or Q_3).

Points to Consider

  1. How do you determine which measure of center best describes a particular data set?
  2. What are the effects of outliers on the various measures of spread?
  3. How can we represent data visually using the various measures of center?

Review Questions

  1. In Lois’ 2^{nd}-grade class, all of the students are between 45 and 52 inches tall, except one boy, Lucas, who is 62 inches tall. Which of the following statements is true about the heights of all of the students?
    1. The mean height and the median height are about the same
    2. The mean height is greater than the median height.
    3. The mean height is less than the median height.
    4. More information is needed to answer this question.
    5. None of the above is true.
  2. Enrique has a 91, 87, and 95 for his statistics grades for the first three quarters. His mean grade for the year must be a 93 in order for him to be exempt from taking the final exam. Assuming grades are rounded following valid mathematical procedures, what is the lowest whole number grade he can get for the 4^{th} quarter and still be exempt from taking the exam?
  3. How many data points should be removed from each end of a sample of 300 values in order to calculate a 10\% trimmed mean?
    1. 5
    2. 10
    3. 15
    4. 20
    5. 30
  4. In the last example, after removing the correct numbers and summing those remaining, what would you divide by to calculate the mean?
  5. The chart below shows the data from the Galapagos tortoise preservation program with just the number of individual tortoises that were bred in captivity and reintroduced into their native habitat.
Island or Volcano    Number of Individuals Repatriated
Wolf                 40
Darwin               0
Alcedo               0
Sierra Negra         286
Cerro Azul           357
Santa Cruz           210
Española             1293
San Cristóbal        55
Santiago             498
Pinzón               552
Pinta                0

Figure: Approximate Distribution of Giant Galapagos Tortoises in 2004 ("Estado Actual De Las Poblaciones de Tortugas Terrestres Gigantes en las Islas Galápagos," Marquez, Wiedenfeld, Snell, Fritts, MacFarland, Tapia, y Nanjoa, Ecologia Aplicada, Vol. 3, Num. 1,2, pp. 98-11).

For this data, calculate each of the following:

(a) mode

(b) median

(c) mean

(d) a 10\% trimmed mean

(e) midrange

(f) upper and lower quartiles

(g) The percentile for the number of Santiago tortoises reintroduced.

  6. In the previous question, why is the answer to (c) significantly higher than the answer to (b)?

Review Answers

  1. There is an outlier that is larger than most of the data. This outlier will “pull” the mean towards it while the median tends to stay in the center of the data, clustered somewhere between 45 and 52.
  2. His mean for all four quarters would need to be at least 92.5 in order to receive the necessary grade. Multiplying 92.5 by 4, yields 370 as the necessary total. His existing grades total to 273. 370-273 = 97.
  3. 10\% of 300 is 30, therefore, we would remove 15 numbers from each end.
  4. 270
  5. (a) 0
     (b) 210
     (c) 299.2
     (d) 222 (10\% of 11 data points is really 1.1, so we decided to remove two points, or about 18\%)
     (e) 646.5
     (f) Q_1: 0, Q_3: 498
     (g) 72.7\%
  6. There is one extreme point, 1293, which causes the mean to be greater than the median.

Vocabulary

Mode
The most frequently occurring number in a data set.
Mean
The average, or the sum of the values in a data set divided by the number of values.
Median
The numeric middle of a data set.
Outlier
An extreme value in a data set.
Resistance
A property of a statistic in which it is not affected by extreme values (outliers).
Midrange
The mean of the minimum and maximum values in a data set.
n\% Trimmed Mean
A mean in which n\% of the original data (equal amounts from either end) is removed before calculating the mean.
Weighted Mean
A mean in which some values contribute more to the sum than others. Each value is multiplied by its weight, or frequency, and then the sum of those products is divided by the total of the weights or frequencies.
Percentiles
A value in a data set in which the given percentage of the data is below that value.
Quartiles
The values that divide a data set into four roughly equal groups. The lower quartile is the 25^{th} percentile, and the upper quartile is the 75^{th} percentile.
Numerical (or Quantitative) Variable
A variable in which the count is the attribute of interest.
Discrete Variable
A numerical variable that only exhibits a finite set of values at given intervals.
Continuous Variable
A numerical variable that can be any of an infinite range of values.
Sample
A smaller, representative subset of the population.
Parameter
A statistical measure or number that summarizes the entire population.
Statistic
A measure or number that summarizes the individuals in a sample.
Sampling Error
The inaccuracy that results from estimating using a sample, rather than the entire population.
