1.1: Definitions of Statistical Terminology
Learning Objectives
 Distinguish between quantitative and categorical variables.
 Distinguish between continuous and discrete variables.
 Understand the concept of a population and the reason for using a sample.
 Distinguish between a statistic and a parameter.
Introduction
In this lesson, students will be introduced to some basic statistical vocabulary and learn how to distinguish between different types of variables. We will use the real-world example of information about the Giant Galapagos Tortoise.
Galapagos Tortoise on Santa Cruz.
The Galapagos Tortoises
The Galapagos Islands, off the coast of Ecuador in South America, are famous for the amazing diversity and uniqueness of life they possess. One of the most famous Galapagos residents is the Galapagos Giant Tortoise, which is found nowhere else on earth. Charles Darwin’s visit to the islands in the 19th
Galapagos Map.
The tortoises lived on nine of the Galapagos Islands, and each island developed its own unique species of tortoise. In fact, on the largest island, there are four volcanoes and each volcano has its own species. When the islands were first discovered, the tortoise population was estimated at around 250,000.
Island or Volcano  Species  Climate Type  Shell Shape  Estimate of Total Population  Population Density (per km²)  Number of Individuals Repatriated

Wolf  becki  semiarid  intermediate  1,139  228  40
Darwin  microphyes  semiarid  dome  818  205
Alcedo  vandenburghi  humid  dome  6,320  799
Sierra Negra  guntheri  humid  flat  694  122  286
Cerro Azul  vicina  humid  dome  2,574  155  357
Santa Cruz  nigrita  humid  dome  3,391  730  210
Española  hoodensis  arid  saddle  869  200  1,293
San Cristóbal  chathamensis  semiarid  dome  1,824  559  55
Santiago  darwini  humid  intermediate  1,165  124  498
Pinzón  ephippium  arid  saddle  532  134  552
Pinta  abingdoni  arid  saddle  1  Does not apply
Tortoise With Dome-shaped Shell on Santa Cruz Island.
Classifying Variables
Statisticians refer to the entire group that is being studied as a population. In this example, the population is all Galapagos Tortoises. Each member of the population is called a unit. In this example, each unit is an individual tortoise. It is not necessary for a population, or the units, to be living things like tortoises or people. An airline employee could be studying the population of jet planes in her company by studying individual planes.
A researcher studying Galapagos Tortoises would be interested in collecting information about different characteristics of the tortoises. Those characteristics are called variables. Each column of the previous table contains a variable. In the first two columns, the tortoises are grouped according to the island (or volcano) where they live and the scientific name of each species. When a characteristic can be neatly placed into well-defined groups, or categories that do not depend on order, it is called a categorical variable (some statisticians use the word qualitative).
The last three columns of the previous table provide information in which the count, or quantity, of the characteristic is most important. For example, we are interested in the total number of each species of tortoise, or how many individuals there are per square kilometer. This type of variable is called numerical (or quantitative). Note that repatriation is the process of raising tortoises and releasing them into the wild once they are grown, so that they avoid the local predators that prey on hatchlings. The table below explains the remaining variables in the previous table and labels them as categorical or numerical.
Variable  Explanation  Type 

Climate Type  Many of the islands and volcanic habitats have three distinct climate types.  Categorical 
Shell Shape  Over many years, the different species of tortoise have developed different shaped shells as an adaptation to assist them in eating vegetation that varies in height from island to island.  Categorical 
Number of tagged individuals  The number of tortoises that were captured and marked by scientists to study their health and assist in estimating the total population.  Numerical 
Number of Individuals Repatriated  There are two tortoise breeding centers on the islands. Through those programs, many tortoises have been raised and then reintroduced into the wild.  Numerical 
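As a rough illustration of the distinction, the sketch below stores four rows of the tortoise table in Python and summarizes one categorical and one numerical variable. The field names (`climate`, `shell`, `population`) and the helper `summarize` are made up for this example; the values are copied from the table of tortoise data above.

```python
from collections import Counter

# Four rows copied from the tortoise table; the field names are illustrative.
species_rows = [
    {"island": "Wolf", "climate": "semiarid", "shell": "intermediate", "population": 1139},
    {"island": "Darwin", "climate": "semiarid", "shell": "dome", "population": 818},
    {"island": "Alcedo", "climate": "humid", "shell": "dome", "population": 6320},
    {"island": "Española", "climate": "arid", "shell": "saddle", "population": 869},
]


def summarize(rows):
    """Count a categorical variable and average a numerical one."""
    # Categorical: the natural summary is how many rows fall in each group.
    climate_counts = Counter(row["climate"] for row in rows)
    # Numerical: arithmetic such as a mean is meaningful.
    mean_population = sum(row["population"] for row in rows) / len(rows)
    return climate_counts, mean_population


climate_counts, mean_population = summarize(species_rows)
print(dict(climate_counts))  # {'semiarid': 2, 'humid': 1, 'arid': 1}
print(mean_population)       # 2286.5
```

Averaging the climate types would be meaningless, which is one practical way to recognize a categorical variable.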
Numerical variables can be further classified as either discrete or continuous. A discrete numerical variable can take only specific, separated values. For example, the number of tortoises reintroduced into the wild must be a whole number. (How would you introduce half of a tortoise?!) But don’t get the wrong idea! It is possible for a variable to have fractional values and still be discrete. Shoe sizes, for example, are discrete because their values occur at set increments: \begin{align*}7, 7 \frac{1}{2}, 8, 8\frac{1}{2}\end{align*}, and so on.
On the other hand, the population density, which means the average number of tortoises per square kilometer, could be any positive number. This is an example of a continuous variable. Even though the numbers in the table have been rounded, the number of square kilometers can, in theory, be any value depending on the size of the habitat. The average (or mean) rainfall in a city is a continuous variable. Within a reasonable range of values, all amounts of rainfall are possible. However, someone measuring that rainfall may only measure to the nearest centimeter, and it might then be considered discrete. Practically speaking, anytime you measure a variable that can only be measured in discrete values, you are effectively using a variable that is not truly continuous.
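The distinction can be sketched in code. The helper `on_grid` below is a hypothetical illustration, not a standard statistical test: it treats a set of values as discrete if every value is a whole multiple of a chosen step. Shoe sizes fit a step of 1/2, while a density computed as a count divided by an area generally fits no such step until it is rounded. The area of 7 square kilometers is invented for the example.

```python
from fractions import Fraction


def on_grid(values, step):
    """Return True if every value is a whole multiple of `step`."""
    step = Fraction(step)
    return all((Fraction(v) / step).denominator == 1 for v in values)


# Shoe sizes are fractional yet discrete: they move in steps of 1/2.
shoe_sizes = [Fraction(7), Fraction(15, 2), Fraction(8), Fraction(17, 2)]
print(on_grid(shoe_sizes, Fraction(1, 2)))  # True

# A density is a count divided by an area, so it can fall between grid points.
# The area (7 km^2) is a made-up number used only for illustration.
density = Fraction(818, 7)  # tortoises per km^2
print(on_grid([density], Fraction(1, 2)))  # False

# Rounding to the nearest whole number puts the measurement back on a grid.
print(on_grid([round(density)], 1))  # True
```

The last line mirrors the rainfall example: the underlying quantity is continuous, but the recorded measurement is effectively discrete.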
Population vs. Sample
We have already defined a population as the total group being studied. Most of the time, it is extremely difficult or very costly to collect all the information about a population. In the Galapagos, how would you count ALL the tortoises of one species? It would be very difficult and perhaps even destructive to search every square meter of the habitat to be sure that you counted every tortoise. In an example closer to home, it is very expensive (and maybe even impossible!) to get accurate and complete information about all the residents of the United States to help effectively address the needs of a changing population. This is why a complete count (a census) is only attempted every ten years.
Because of these problems, it is common to use a smaller, representative group from the population called a sample.
You may recall the tortoise data included a variable for the estimate of the population size. This number was found using a sample and is actually just an approximation of the true number of tortoises. When a researcher wanted to find an estimate for the population of a species of tortoise, she would go into the field and locate and mark a number of tortoises. She would then use statistical techniques that we will discover later in this text to obtain an estimate for the total number of tortoises in the population. In statistics, we call the actual number of tortoises a parameter. The number of tortoises in the sample, or any other number that describes the individuals in the sample (like their length, or weight, or age), is called a statistic. In general, each statistic is an estimate of a parameter, whose value is not known exactly.
The table below shows the actual data for the species of tortoise found on Volcano Darwin, on Isabela Island. (Note: the word “data” is the plural of the word “datum”, which means the result of a single measurement.) The number of captured individuals is a statistic, as it describes the sample. The actual population size is a parameter that we are trying to estimate.
Number of Individuals Captured  Population Estimate  Population Estimate Interval 

160  818  561–1075
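To make the distinction between a parameter and a statistic concrete, here is a small simulation. Everything in it is invented for illustration: the population of 818 simulated shell lengths, the range of lengths, and the sample size of 160 (chosen only to echo the counts in the table above). It is not the estimation technique the researchers used.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical population: shell lengths in cm for 818 simulated tortoises.
population = [rng.uniform(60.0, 130.0) for _ in range(818)]

# Parameter: describes the whole population; usually unknown in practice.
parameter = statistics.mean(population)

# Statistic: the same calculation applied to a sample of 160 units.
sample = rng.sample(population, 160)
statistic = statistics.mean(sample)

print(f"parameter (population mean): {parameter:.1f} cm")
print(f"statistic (sample mean):     {statistic:.1f} cm")
```

The two printed values should be close but not identical. In a real study only the second one is available, and it is used as an estimate of the first.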
Errors in Sampling
Unfortunately, there is a downside to using sampling. We have to accept that estimates using a sample have a chance of being inaccurate or even downright wrong! This cannot be avoided unless we sample the entire population. You can see this in the table above: the actual data include not only an estimate, but also an interval of likely true values for the population parameter. The researcher has to accept that there could be variations in the sample due to chance, which lead to changes in the population estimate. A statistician would not say that the parameter is a specific number like 915, but would instead say something like the following:
“I am fairly confident that the true number of tortoises is actually between 561 and 1,075.”
This range of values is the unavoidable result of using a sample, and not due to some mistake that was made in the process of collecting and analyzing the sample. In general, the potential difference between the true parameter and the statistic obtained from using a sample is called sampling error. It is also possible that the researchers made mistakes in their sampling methods in a way that led to a sample that does not accurately represent the true population. For example, they could have picked an area to search for tortoises where a large number tend to congregate (near a food or water source perhaps). If this sample were used to estimate the number of tortoises in all locations, it may lead to a population estimate that is too high. This type of systematic error in sampling is called bias. Statisticians go to great lengths to avoid the many potential sources of bias. We will investigate this in more detail in a later chapter.
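The difference between sampling error and bias can be sketched with a simulation. The setup is entirely hypothetical: a habitat divided into 100 plots, with tortoise counts invented so that the 10 plots near water hold more animals. Random samples of plots scatter around the true total, while a sample taken only near water overshoots it systematically.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed for repeatability

# Hypothetical habitat: 100 plots; the first 10 are near water and more crowded.
plots = [rng.randint(15, 25) for _ in range(10)] + [rng.randint(0, 10) for _ in range(90)]
true_total = sum(plots)  # the parameter


def estimate_total(sampled_counts, n_plots=100):
    """Scale the mean count per sampled plot up to the whole habitat."""
    return statistics.mean(sampled_counts) * n_plots


# Sampling error: 1,000 random samples of 10 plots each.
random_estimates = [estimate_total(rng.sample(plots, 10)) for _ in range(1000)]

# Bias: always survey the 10 plots near water.
biased_estimate = estimate_total(plots[:10])

print("true total:", true_total)
print("random samples, mean estimate:", round(statistics.mean(random_estimates)))
print("random samples, spread (st. dev.):", round(statistics.stdev(random_estimates)))
print("near-water sample estimate:", round(biased_estimate))
```

Individual random estimates miss the true total in both directions, but they center on it. The near-water estimate is too high every time, and taking more samples of the same kind would not correct it.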
Lesson Summary
In statistics, the total group being studied is called the population. The individuals (people, animals, or things) in the population are called units. The characteristics of those individuals of interest to us are called variables. Those variables generally are of two types, numerical or quantitative, and categorical or qualitative.
Quantitative variables can be further categorized as those that can take only specific, separated values (discrete variables) and those that can take any value within a range (continuous variables).
Because of the difficulties of obtaining information about all units in a population, it is common to use a small, representative subset of the population called a sample. An actual value of a population variable (for example, number of tortoises, average weight of all tortoises, etc.) is called a parameter. An estimate of a parameter from a sample is called a statistic.
Whenever a sample is used instead of the entire population, we have to accept that our results are merely estimates and therefore have some chance of being incorrect. This is called sampling error.
Points to Consider
 How do we summarize, display, and compare categorical and numerical data differently?
 What are the best ways to display categorical and numerical data?
 Is it possible for a variable to be considered both categorical and numerical?
 How can you compare the effects of one categorical variable on another or one quantitative variable on another?
Review Questions
 In each of the following situations, identify the population, the units, each variable, and tell if the variable is categorical or quantitative. If it is quantitative, then identify it further as either discrete or continuous.
 A quality control worker with Sweettooth Candy weighs every 100th candy bar to make sure it is very close to the published weight.
 POPULATION:
 UNITS:
 VARIABLE:
 TYPE:
 Doris decides to clean her sock drawer out and sorts her socks into piles by color.
 POPULATION:
 UNITS:
 VARIABLE:
 TYPE:
 A researcher is studying the effect of a new drug treatment for diabetes patients. She performs an experiment on 200 randomly chosen individuals with Type II diabetes. Because she believes that men and women may respond differently, she records each person’s gender, as well as their change in sugar level after taking the drug for a month.
 POPULATION:
 UNITS:
 VARIABLE 1:
 TYPE:
 VARIABLE 2:
 TYPE:
 In Physical Education class, the teacher has the students count off by twos to divide them into teams. Is this a categorical or quantitative variable?
 A school is studying their students' test scores by grade. Explain how the characteristic “grade” could be considered either a categorical or a numerical variable.
Review Answers

 POPULATION: All candy bars made by the company
 UNITS: each individual candy bar
 VARIABLE: weight of the candy bars
 TYPE: Quantitative. It is continuous. The weights could be any weight reasonably close to the desired weight due to variation in the number and weight of individual candies. Note: if the worker decided to sort the candy bars as acceptable, too light, or too heavy, the same scenario could include a categorical variable.
 POPULATION: All of Doris’ socks
 UNITS: each sock
 VARIABLE: color of socks
 TYPE: Categorical
 POPULATION: All diabetes sufferers
 UNITS: each individual diabetes patient
 VARIABLE 1: change in sugar level (+ or −)
 TYPE: Quantitative, continuous
 VARIABLE 2: gender
 TYPE: Categorical
 An argument could be made that by definition, it could be a discrete quantitative variable, but this is really a categorical variable. Students are either on one team or another. The use of the digits “1” and “2” to put the students in groups has no significant numerical meaning. The teacher could have just as easily had the students say "blue" and "red."
 This variable could be easily described as categorical, as students are in one of the four classes (Freshman, Sophomore, Junior, Senior), but it could also be appropriate to think of those classes as grades 9–12. The numbers do signify order and therefore could be considered to have numerical significance. If so, it would be a discrete numerical variable.
Further Reading
 onlinestatbook.com/
 en.wikipedia.org/wiki/Gal%C3%A1pagos_tortoise
 pes.ucf.k12.pa.us/Themes/Endangered%20Animals/pages/gtortoise5.htm
 Charles Darwin Research Center and Foundation: www.darwinfoundation.org