
# 1.1: Definitions of Statistical Terminology

Difficulty Level: At Grade; Created by: CK-12

## Learning Objectives

• Distinguish between quantitative and categorical variables.
• Distinguish between continuous and discrete variables.
• Understand the concept of a population and the reason for using a sample.
• Distinguish between a statistic and a parameter.

## Introduction

In this lesson, students will be introduced to some basic vocabulary of statistics and learn how to distinguish between different types of variables. We will use the real-world example of information about the Giant Galapagos Tortoise.

Galapagos Tortoise on Santa Cruz.

## The Galapagos Tortoises

The Galapagos Islands, off the coast of Ecuador in South America, are famous for the amazing diversity and uniqueness of life they possess. One of the most famous Galapagos residents is the Galapagos Giant Tortoise, which is found nowhere else on earth. Charles Darwin’s visit to the islands in the $19^{th}$ Century and his observations of the tortoises were extremely important in the development of his theory of evolution.

Galapagos Map.

The tortoises lived on nine of the Galapagos Islands and each island developed its own unique species of tortoise. In fact, on the largest island, there are four volcanoes and each volcano has its own species. When the islands were first discovered, the tortoise population was estimated at around $250,000$. Unfortunately, once European ships and settlers started arriving, those numbers began to plummet. Because the tortoises could survive for long periods of time without food or water, expeditions would stop at the islands and take the tortoises to sustain their crews with fresh meat and other supplies for the long voyages. Settlers brought in domesticated animals like goats and pigs that destroyed the tortoises’ habitat. Today, two of the islands have lost their species, a third island has no remaining tortoises in the wild, and the total tortoise population is estimated to be around $15,000$. The good news is that there have been massive efforts to protect the tortoises. Extensive programs to eliminate the threats to their habitat, as well as to breed and reintroduce populations into the wild, have shown some promise.

Approximate distribution of Giant Galapagos Tortoises in 2004, Estado Actual De Las Poblaciones de Tortugas Terrestres Gigantes en las Islas Galápagos, Marquez, Wiedenfeld, Snell, Fritts, MacFarland, Tapia, y Nanjoa, Scologia Aplicada, Vol. 3, Num. 1,2, pp. 98 11.

| Island or Volcano | Species | Climate Type | Shell Shape | Estimate of Total Population | Population Density (per $km^2$) | Number of Individuals Repatriated |
|---|---|---|---|---|---|---|
| Wolf | becki | semi-arid | intermediate | $1,139$ | $228$ | $40$ |
| Darwin | microphyes | semi-arid | dome | $818$ | $205$ | $0$ |
| Alcedo | vandenburghi | humid | dome | $6,320$ | $799$ | $0$ |
| Sierra Negra | guntheri | humid | flat | $694$ | $122$ | $286$ |
| Cerro Azul | vicina | humid | dome | $2,574$ | $155$ | $357$ |
| Santa Cruz | nigrita | humid | dome | $3,391$ | $730$ | $210$ |
| Española | hoodensis | arid | saddle | $869$ | $200$ | $1,293$ |
| San Cristóbal | chathamensis | semi-arid | dome | $1,824$ | $559$ | $55$ |
| Santiago | darwini | humid | intermediate | $1,165$ | $124$ | $498$ |
| Pinzón | ephippium | arid | saddle | $532$ | $134$ | $552$ |
| Pinta | abingdoni | arid | saddle | $1$ | Does not apply | $0$ |

Tortoise With Dome-shaped Shell on Santa Cruz Island.

## Classifying Variables

Statisticians refer to the entire group that is being studied as a population. In this example, the population is all Galapagos Tortoises. Each member of the population is called a unit. In this example, each unit is an individual tortoise. It is not necessary for a population, or the units, to be living things like tortoises or people. An airline employee could be studying the population of jet planes in her company by studying individual planes.

A researcher studying Galapagos Tortoises would be interested in collecting information about different characteristics of the tortoises. Those characteristics are called variables. Each column of the previous figure contains a variable. In the first two columns, the tortoises are grouped according to the island (or volcano) where they live and the scientific names for each species. When a characteristic can be neatly placed into well-defined groups, or categories that do not depend on order, it is called a categorical variable (some statisticians use the word qualitative).

The last three columns of the previous figure provide information in which the count, or quantity, of the characteristic is most important. For example, we are interested in the total number of each species of tortoise, or how many individuals there are per square kilometer. This type of variable is called numerical (or quantitative). Note that repatriation is the process of raising tortoises and releasing them into the wild once they are grown, which protects them from the local predators that prey on hatchlings. The figure below explains the remaining variables in the previous figure and labels them as categorical or numerical.

Explanation of Remaining Variables.

| Variable | Explanation | Type |
|---|---|---|
| Climate Type | Many of the islands and volcanic habitats have three distinct climate types. | Categorical |
| Shell Shape | Over many years, the different species of tortoise have developed different shaped shells as an adaptation to assist them in eating vegetation that varies in height from island to island. | Categorical |
| Number of tagged individuals | The number of tortoises that were captured and marked by scientists to study their health and assist in estimating the total population. | Numerical |
| Number of Individuals Repatriated | There are two tortoise breeding centers on the islands. Through those programs, many tortoises have been raised and then reintroduced into the wild. | Numerical |
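
The classification of variables can be sketched in a few lines of Python. The example below copies four rows from the tortoise data and labels each column as categorical or numerical. The key names and the helper `variable_type` are assumptions made for illustration, and using the stored data type as the test is a simplification: a number used only as a label (a team number, for instance) would still be categorical.

```python
# Four rows copied from the tortoise table; the key names are illustrative.
tortoises = [
    {"island": "Wolf", "species": "becki", "climate_type": "semi-arid",
     "shell_shape": "intermediate", "population_estimate": 1139,
     "density_per_km2": 228, "repatriated": 40},
    {"island": "Alcedo", "species": "vandenburghi", "climate_type": "humid",
     "shell_shape": "dome", "population_estimate": 6320,
     "density_per_km2": 799, "repatriated": 0},
    {"island": "Española", "species": "hoodensis", "climate_type": "arid",
     "shell_shape": "saddle", "population_estimate": 869,
     "density_per_km2": 200, "repatriated": 1293},
    {"island": "Pinzón", "species": "ephippium", "climate_type": "arid",
     "shell_shape": "saddle", "population_estimate": 532,
     "density_per_km2": 134, "repatriated": 552},
]


def variable_type(values):
    """Label a column 'numerical' if every value is a number, else 'categorical'.

    Simplifying assumption: the stored type reflects the meaning of the variable.
    """
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        return "numerical"
    return "categorical"


# Build one label per variable (column) by collecting that column from every unit (row).
types = {name: variable_type([row[name] for row in tortoises])
         for name in tortoises[0]}

for name, kind in types.items():
    print(f"{name:20s} {kind}")
```

Each dictionary plays the role of one row of the table, and each key plays the role of a variable.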

Variables can be further classified as either discrete or continuous. A discrete numerical variable can take only specific, separated values. For example, the number of tortoises reintroduced into the wild must be a whole number. (How would you introduce half of a tortoise?!) But don’t get the wrong idea! It is possible for a variable to have fractional values and still be discrete. Shoe sizes, for example, are discrete because their values occur at set increments: $7, 7 \frac{1}{2}, 8, 8\frac{1}{2}$, and so on. You should also know that all categorical variables are discrete.

On the other hand, the population density, which means the average number of tortoises per square kilometer, could be any positive number. This is an example of a continuous variable. Even though the numbers in the table have been rounded, the number of square kilometers can, in theory, be any value depending on the size of the habitat. The average (or mean) rainfall in a city is also a continuous variable: within a reasonable range of values, all amounts of rainfall are possible. However, someone measuring that rainfall may record it only to the nearest centimeter, and the recorded values might then be treated as discrete. Practically speaking, every measuring instrument has limited precision, so the values you actually record for a continuous variable fall on a discrete scale.
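
The rainfall point can be illustrated with a short sketch. The rainfall amounts below are made-up values, not measurements, and the names `rainfall_cm` and `recorded_cm` are chosen only for this example.

```python
# Hypothetical rainfall amounts in centimeters (made-up values for illustration).
rainfall_cm = [12.37, 12.41, 11.86, 12.02, 13.49]

# Recording to the nearest centimeter moves every amount onto a grid of whole numbers.
recorded_cm = [round(amount) for amount in rainfall_cm]

print("measured:", rainfall_cm)
print("recorded:", recorded_cm)

# Five different amounts collapse to only a few distinct recorded values.
print("distinct recorded values:", sorted(set(recorded_cm)))
```

The underlying quantity can take any value in a range, but the recorded values can only land on whole centimeters.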

## Population vs. Sample

We have already defined a population as the total group being studied. Most of the time, it is extremely difficult or very costly to collect all the information about a population. In the Galapagos, how would you count ALL the tortoises of one species? It would be very difficult and perhaps even destructive to search every square meter of the habitat to be sure that you counted every tortoise. In an example closer to home, it is very expensive (and maybe even impossible!) to get accurate and complete information about all the residents of the United States to help effectively address the needs of a changing population. This is why a complete count (a census) is attempted only every ten years.

Because of these problems, it is common to use a smaller, representative group from the population called a sample.

You may recall the tortoise data included a variable for the estimate of the population size. This number was found using a sample and is actually just an approximation of the true number of tortoises. When a researcher wanted to find an estimate for the population of a species of tortoise, she would go into the field and locate and mark a number of tortoises. She would then use statistical techniques that we will discover later in this text to obtain an estimate for the total number of tortoises in the population. In statistics, we call the actual number of tortoises a parameter. The number of tortoises in the sample, or any other number that describes the individuals in the sample (like their length, or weight, or age), is called a statistic. In general, each statistic is an estimate of a parameter, whose value is not known exactly.

The table below shows the actual data for the species of tortoise found on Darwin Volcano, on Isabela Island. (Note: the word “data” is the plural of the word “datum”, which means the result of a single measurement.) The number of captured individuals is a statistic because it describes the sample. The actual population size is a parameter that we are trying to estimate.

Tortoise Data for Darwin Volcano, Isabela Island.

| Number of Individuals Captured | Population Estimate | Population Estimate Interval |
|---|---|---|
| $160$ | $818$ | $561-1075$ |
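
The difference between a parameter and a statistic can be sketched with a simulation. Everything here is an assumption made for illustration: the population is 818 randomly generated shell lengths (not real measurements), the mean is just one possible summary, and the sample size of 160 simply echoes the table. This is not the estimation technique the researchers used.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch gives the same result each run

# Hypothetical population: 818 made-up shell lengths in centimeters (not real data).
population = [rng.gauss(100, 15) for _ in range(818)]

# Parameter: a number describing the whole population (usually unknown in practice).
parameter = statistics.mean(population)

# Statistic: the same summary computed from a sample of 160 units.
sample = rng.sample(population, 160)
statistic = statistics.mean(sample)

print(f"parameter (population mean): {parameter:.1f}")
print(f"statistic (sample mean):     {statistic:.1f}")
```

In a real study only the sample is available, so the statistic is all the researcher can compute directly.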

## Errors in Sampling

Unfortunately, there is a downside to using sampling. We have to accept that estimates using a sample have a chance of being inaccurate or even downright wrong! This cannot be avoided unless we sample the entire population. You can see this in the Darwin Volcano data above. The actual data include not only an estimate, but also an interval of the likely true values for the population parameter. The researcher has to accept that chance variation in the sample can change the population estimate. A statistician would not say that the parameter is a specific number like $818$, but would most likely report something like the following:

“I am fairly confident that the true number of tortoises is actually between $561$ and $1075$.”

This range of values is the unavoidable result of using a sample, and not due to some mistake that was made in the process of collecting and analyzing the sample. In general, the potential difference between the true parameter and the statistic obtained from using a sample is called sampling error. It is also possible that the researchers made mistakes in their sampling methods in a way that led to a sample that does not accurately represent the true population. For example, they could have picked an area to search for tortoises where a large number tend to congregate (near a food or water source perhaps). If this sample were used to estimate the number of tortoises in all locations, it may lead to a population estimate that is too high. This type of systematic error in sampling is called bias. Statisticians go to great lengths to avoid the many potential sources of bias. We will investigate this in more detail in a later chapter.
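
The contrast between sampling error and bias can be sketched with a simulation. The habitat below is entirely hypothetical: 200 plots with made-up tortoise counts, where the 40 plots near water hold more tortoises. The helper `estimate_total` is an illustrative name, and scaling up the mean count per plot is only one simple way to estimate a total.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Hypothetical habitat: 40 plots near water with many tortoises, 160 plots with few.
near_water = [rng.randint(8, 12) for _ in range(40)]
elsewhere = [rng.randint(0, 4) for _ in range(160)]
plots = near_water + elsewhere
true_total = sum(plots)  # the parameter we pretend not to know


def estimate_total(sampled_plots):
    """Scale the mean count per sampled plot up to the whole habitat."""
    return statistics.mean(sampled_plots) * len(plots)


# Sampling error: estimates from random samples scatter around the true total.
random_estimates = [estimate_total(rng.sample(plots, 20)) for _ in range(1000)]

# Bias: searching only near water gives estimates that are too high every time.
biased_estimates = [estimate_total(rng.sample(near_water, 20)) for _ in range(1000)]

print("true total:", true_total)
print("random samples, lowest and highest estimate:",
      round(min(random_estimates)), round(max(random_estimates)))
print("mean of random-sample estimates:", round(statistics.mean(random_estimates)))
print("mean of biased-sample estimates:", round(statistics.mean(biased_estimates)))
```

Repeating the random sampling many times averages out the sampling error, but no amount of repetition corrects the biased method.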

## Lesson Summary

In statistics, the total group being studied is called the population. The individuals (people, animals, or things) in the population are called units. The characteristics of those individuals that are of interest to us are called variables. Variables are generally of two types: numerical (quantitative) and categorical (qualitative).

Quantitative variables can be further categorized as discrete variables, which can only take set, separated values (often, but not always, whole numbers), and continuous variables, which can take any value within a range.

Because of the difficulties of obtaining information about all units in a population, it is common to use a small, representative subset of the population called a sample. An actual value of a population variable (for example, number of tortoises, average weight of all tortoises, etc.) is called a parameter. An estimate of a parameter from a sample is called a statistic.

Whenever a sample is used instead of the entire population, we have to accept that our results are merely estimates and therefore have some chance of being incorrect. This is called sampling error.

## Points to Consider

1. How do we summarize, display, and compare categorical and numerical data differently?
2. What are the best ways to display categorical and numerical data?
3. Is it possible for a variable to be considered both categorical and numerical?
4. How can you compare the effects of one categorical variable on another or one quantitative variable on another?

## Review Questions

1. In each of the following situations, identify the population, the units, each variable, and tell if the variable is categorical or quantitative. If it is quantitative, then identify it further as either discrete or continuous.
    1. A quality control worker with Sweet-tooth Candy weighs every $100^{th}$ candy bar to make sure it is very close to the published weight.
        1. POPULATION:
        2. UNITS:
        3. VARIABLE:
        4. TYPE:
    2. Doris decides to clean her sock drawer out and sorts her socks into piles by color.
        1. POPULATION:
        2. UNITS:
        3. VARIABLE:
        4. TYPE:
    3. A researcher is studying the effect of a new drug treatment for diabetes patients. She performs an experiment on $200$ randomly chosen individuals with Type II diabetes. Because she believes that men and women may respond differently, she records each person’s gender, as well as their change in sugar level after taking the drug for a month.
        1. POPULATION:
        2. UNITS:
        3. VARIABLE 1:
        4. TYPE:
        5. VARIABLE 2:
        6. TYPE:
2. In Physical Education class, the teacher has the students count off by twos to divide them into teams. Is the resulting team number a categorical or quantitative variable?
3. A school is studying their students' test scores by grade. Explain how the characteristic “grade” could be considered either a categorical or a numerical variable.

## Review Answers

1. For each situation:
    1. Candy bars:
        1. POPULATION: All candy bars made by the company
        2. UNITS: each individual candy bar
        3. VARIABLE: weight of the candy bars
        4. TYPE: Quantitative. It is continuous. The weights could be any weight reasonably close to the desired weight due to variation in the number and weight of individual candies. Note: if the worker decided to sort the candy bars as acceptable, too light, or too heavy, the same scenario could include a categorical variable.
    2. Socks:
        1. POPULATION: All of Doris’ socks
        2. UNITS: each sock
        3. VARIABLE: color of socks
        4. TYPE: Categorical
    3. Diabetes study:
        1. POPULATION: All people with Type II diabetes
        2. UNITS: each individual diabetes patient
        3. VARIABLE 1: change in sugar level ($+$ or $-$)
        4. TYPE: Quantitative, continuous
        5. VARIABLE 2: gender
        6. TYPE: Categorical
2. An argument could be made that by definition, it could be a discrete quantitative variable, but this is really a categorical variable. Students are either on one team or another. The use of the digits “$1$” and “$2$” to put the students in groups has no significant numerical meaning. The teacher could have just as easily had the students say "blue" and "red."
3. This variable could be easily described as categorical, as students are in one of the four classes (Freshman, Sophomore, Junior, Senior), but it could also be appropriate to think of those classes as grades $9-12$. The numbers do signify order and therefore could be considered to have numerical significance. If so, it would be a discrete numerical variable.

Feb 23, 2012

Jul 03, 2014