In this Concept, you will learn many definitions of statistical terminology in order to begin talking about statistics. We will demonstrate the reason for using a sample to learn about a population.
Watch This
For an introduction to the importance of statistics, see onlinestatbook, Introduction to Statistics: Importance of Statistics (2:45).
Citation: Online Statistics Education: A Multimedia Course of Study ( http://onlinestatbook.com/). Project Leader: David M. Lane, Rice University.
For a discussion of populations and samples, as well as parameters and statistics, see onlinestatbook, Introduction to Statistics: Inferential Statistics (6:39).
Guidance
In order to learn some basic vocabulary of statistics and learn how to distinguish between different types of variables, we will use the example of information about the Giant Galapagos Tortoise.
Example A
The Galapagos Islands, off the coast of Ecuador in South America, are famous for the amazing diversity and uniqueness of life they possess. One of the most famous Galapagos residents is the Galapagos Giant Tortoise, which is found nowhere else on earth. Charles Darwin’s visit to the islands in the 19th century and his observations of the tortoises were extremely important in the development of his theory of evolution.
The tortoises lived on nine of the Galapagos Islands, and each island developed its own unique species of tortoise. In fact, on the largest island, there are four volcanoes, and each volcano has its own species. When first discovered, it was estimated that the tortoise population of the islands was around 250,000. Unfortunately, once European ships and settlers started arriving, those numbers began to plummet. Because the tortoises could survive for long periods of time without food or water, expeditions would stop at the islands and take the tortoises to sustain their crews with fresh meat and other supplies for the long voyages. Also, settlers brought in domesticated animals like goats and pigs that destroyed the tortoises' habitat. Today, two of the islands have lost their species, a third island has no remaining tortoises in the wild, and the total tortoise population is estimated to be around 15,000. The good news is there have been massive efforts to protect the tortoises. Extensive programs to eliminate the threats to their habitat, as well as breed and reintroduce populations into the wild, have shown some promise.
Approximate distribution of Giant Galapagos Tortoises in 2004, Estado Actual De Las Poblaciones de Tortugas Terrestres Gigantes en las Islas Galápagos, Marquez, Wiedenfeld, Snell, Fritts, MacFarland, Tapia, y Nanjoa, Ecología Aplicada, Vol. 3, Num. 1,2, pp. 98-111.
Island or Volcano | Species | Climate Type | Shell Shape | Estimate of Total Population | Population Density (per km²) | Number of Individuals Repatriated |
---|---|---|---|---|---|---|
Wolf | becki | semi-arid | intermediate | 1,139 | 228 | 40 |
Darwin | microphyes | semi-arid | dome | 818 | 205 | 0 |
Alcedo | vandenburghi | humid | dome | 6,320 | 799 | 0 |
Sierra Negra | guntheri | humid | flat | 694 | 122 | 286 |
Cerro Azul | vicina | humid | dome | 2,574 | 155 | 357 |
Santa Cruz | nigrita | humid | dome | 3,391 | 730 | 210 |
Española | hoodensis | arid | saddle | 869 | 200 | 1,293 |
San Cristóbal | chathamensis | semi-arid | dome | 1,824 | 559 | 55 |
Santiago | darwini | humid | intermediate | 1,165 | 124 | 498 |
Pinzón | ephippium | arid | saddle | 532 | 134 | 552 |
Pinta | abingdoni | arid | saddle | 1 | Does not apply | 0 |
Repatriation is the process of raising tortoises in captivity and releasing them into the wild once they are grown, so that they avoid the local predators that prey on hatchlings.
Classifying Variables
Statisticians refer to an entire group that is being studied as a population. Each member of the population is called a unit. In this example, the population is all Galapagos Tortoises, and the units are the individual tortoises. It is not necessary for a population or the units to be living things, like tortoises or people. For example, an airline employee could be studying the population of jet planes in her company by studying individual planes.
A researcher studying Galapagos Tortoises would be interested in collecting information about different characteristics of the tortoises. Those characteristics are called variables. Each column of the table above contains a variable. In the first column, the tortoises are labeled according to the island (or volcano) where they live, and in the second column, by the scientific name for their species. When a characteristic can be neatly placed into well-defined groups, or categories, that do not depend on order, it is called a categorical variable, or qualitative variable.
The last three columns of the table above provide information in which the count, or quantity, of the characteristic is most important. We are interested in the total number of each species of tortoise, or how many individuals there are per square kilometer. This type of variable is called a numerical variable, or quantitative variable.
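As an informal illustration of the distinction, the following Python sketch stores three rows of the tortoise table and labels each column by checking whether its values are numbers or text. The helper name `classify_variable` and the type-based rule are assumptions made for this example. A number used only as a label (an ID code, for instance) would still be categorical, so the rule is a rough heuristic rather than a definition.

```python
# Three rows copied from the tortoise table above.
tortoises = [
    {"island": "Wolf", "species": "becki", "climate": "semi-arid",
     "shell_shape": "intermediate", "population": 1139, "repatriated": 40},
    {"island": "Darwin", "species": "microphyes", "climate": "semi-arid",
     "shell_shape": "dome", "population": 818, "repatriated": 0},
    {"island": "Española", "species": "hoodensis", "climate": "arid",
     "shell_shape": "saddle", "population": 869, "repatriated": 1293},
]

def classify_variable(values):
    """Hypothetical helper: label a column by the type of its values."""
    if all(isinstance(v, (int, float)) for v in values):
        return "numerical"
    return "categorical"

# Classify every column (variable) found in the first row.
variable_types = {
    name: classify_variable([row[name] for row in tortoises])
    for name in tortoises[0]
}

for name, kind in variable_types.items():
    print(f"{name}: {kind}")
```

The island, species, climate, and shell shape columns come out as categorical, while the population and repatriation counts come out as numerical.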
Example B
Determine whether each of the variables Climate Type, Shell Shape, Number of Tagged Individuals, and Number of Individuals Repatriated is a numerical or a categorical variable.
Variable | Explanation | Type |
---|---|---|
Climate Type | Many of the islands and volcanic habitats have three distinct climate types. | Categorical |
Shell Shape | Over many years, the different species of tortoises have developed different shaped shells as an adaptation to assist them in eating vegetation that varies in height from island to island. | Categorical |
Number of Tagged Individuals | Tortoises were captured and marked by scientists to study their health and assist in estimating the total population. | Numerical |
Number of Individuals Repatriated | There are two tortoise breeding centers on the islands. Through these programs, many tortoises have been raised and then reintroduced into the wild. | Numerical |
Population vs. Sample
We have already defined a population as the total group being studied. Most of the time, it is extremely difficult or very costly to collect all the information about a population. In the Galapagos, it would be very difficult and perhaps even destructive to search every square meter of the habitat to be sure that you counted every tortoise. In an example closer to home, it is very expensive to get accurate and complete information about all the residents of the United States to help effectively address the needs of a changing population. This is why a complete counting, or census, is only attempted every ten years. Because of these problems, it is common to use a smaller, representative group from the population, called a sample.
You may recall the tortoise data included a variable for the estimate of the population size. This number was found using a sample and is actually just an approximation of the true number of tortoises. If a researcher wanted to find an estimate for the population of a species of tortoises, she would go into the field and locate and mark a number of tortoises. She would then use statistical techniques that we will discuss later in this text to obtain an estimate for the total number of tortoises in the population. In statistics, we call the actual number of tortoises a parameter. Any number that describes the individuals in a sample (length, weight, age) is called a statistic. Each statistic is an estimate of a parameter, whose value may or may not be known.
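To make the difference between a parameter and a statistic concrete, here is a minimal Python sketch. The population of 500 tortoises and its shell-shape counts are invented for illustration and are not taken from the table. The sample size of 40 and the random seed are also arbitrary choices.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population of 500 tortoises, each recorded by shell shape.
# The counts are invented for illustration, not taken from the table.
population = ["dome"] * 300 + ["saddle"] * 125 + ["intermediate"] * 75

# Parameter: proportion of dome shells in the whole population.
parameter = population.count("dome") / len(population)

# Statistic: the same proportion computed from a random sample of 40 units.
sample = rng.sample(population, 40)
statistic = sample.count("dome") / len(sample)

print(f"Parameter (population proportion): {parameter:.2f}")
print(f"Statistic (sample proportion):     {statistic:.2f}")
```

In this made-up setting the parameter can be computed because the whole population is listed. In a real field study only the statistic would be available, and it would serve as the estimate of the unknown parameter.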
Errors in Sampling
We have to accept that estimates derived from using a sample have a chance of being inaccurate. This cannot be avoided unless we measure the entire population. The researcher has to accept that there could be variations in the sample due to chance that lead to changes in the population estimate. A statistician would report the estimate of the parameter in two ways: as a point estimate (e.g., 915) and also as an interval estimate. For example, a statistician would report: “I am fairly confident that the true number of tortoises is actually between 561 and 1,075.” This range of values is the unavoidable result of using a sample, and not due to some mistake that was made in the process of collecting and analyzing the sample. The difference between the true parameter and the statistic obtained by sampling is called sampling error. It is also possible that the researcher made mistakes in her sampling methods in a way that led to a sample that does not accurately represent the true population.
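The chance variation described above can be seen in a small simulation. This is only a sketch under stated assumptions: the population of 1,000 tortoise weights is randomly generated rather than real data, and the helper `sample_mean`, the sample size of 30, and the seed are choices made for this example.

```python
import random

rng = random.Random(7)  # fixed seed so the sketch is repeatable

# Hypothetical population of 1,000 tortoise weights (kg); invented data.
population = [rng.uniform(100, 250) for _ in range(1000)]
parameter = sum(population) / len(population)  # true mean weight

def sample_mean(sample_size):
    """Mean weight of a random sample drawn without replacement."""
    sample = rng.sample(population, sample_size)
    return sum(sample) / sample_size

# Draw five independent samples; each one yields its own statistic.
statistics = [sample_mean(30) for _ in range(5)]
sampling_errors = [s - parameter for s in statistics]

for s, e in zip(statistics, sampling_errors):
    print(f"statistic = {s:.1f} kg, sampling error = {e:+.1f} kg")

# Measuring every unit (a census) leaves no sampling error,
# apart from floating-point rounding.
census_error = sample_mean(len(population)) - parameter
```

Each sample was drawn correctly, yet the five statistics differ from one another and from the parameter. That spread is sampling error, not a mistake in the procedure.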
Example C
What are some possible errors that could be involved in the study of the Galapagos tortoises?
Solution: The researcher could have picked an area to search for tortoises where a large number tend to congregate (near a food or water source, perhaps). If this sample were used to estimate the number of tortoises in all locations, it may lead to a population estimate that is too high.
This type of systematic error in sampling is called bias. Statisticians go to great lengths to avoid the many potential sources of bias. We will investigate this in more detail in a later chapter.
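The scenario in Example C can be sketched in Python. Everything here is hypothetical: the habitat is divided into 20 equal search areas with invented tortoise counts, and the helper `estimate_total` simply scales the average count per searched area up to the whole habitat. It is meant only to show the direction of the bias, not the method used in the actual study.

```python
import random

rng = random.Random(3)  # fixed seed so the sketch is repeatable

# Hypothetical habitat split into 20 equal search areas. The 4 areas near
# water hold 12 tortoises each; the 16 dry areas hold 2 each (invented data).
near_water = [12] * 4
dry = [2] * 16
all_areas = near_water + dry
true_total = sum(all_areas)  # 80 tortoises

def estimate_total(sampled_counts):
    """Scale the average count per sampled area up to all 20 areas."""
    return sum(sampled_counts) / len(sampled_counts) * len(all_areas)

# Biased sample: the researcher searches only where tortoises congregate.
biased_estimate = estimate_total(near_water)

# Random sample: 8 areas chosen by chance from the whole habitat.
random_estimate = estimate_total(rng.sample(all_areas, 8))

print(f"True total:      {true_total}")
print(f"Biased estimate: {biased_estimate:.0f}")
print(f"Random estimate: {random_estimate:.0f}")
```

The biased estimate is 240, three times the true total of 80, and it would be too high every time the study was repeated the same way. The random estimate still varies from sample to sample, but it is not pushed in one direction.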
On the Web
http://www.onlinestatbook.com/
http://www.en.wikipedia.org/wiki/Gal%C3%A1pagos_tortoise
Charles Darwin Research Center and Foundation: http://www.darwinfoundation.org
Vocabulary
In statistics, the total group being studied is called the population. The individuals (people, animals, or things) in the population are called units. The characteristics of those individuals of interest to us are called variables. Those variables are of two types: numerical, or quantitative, and categorical, or qualitative.
Because of the difficulties of obtaining information about all units in a population, it is common to use a small, representative subset of the population, called a sample. An actual value of a population variable (for example, number of tortoises, average weight of all tortoises, etc.) is called a parameter. An estimate of a parameter derived from a sample is called a statistic.
Whenever a sample is used instead of the entire population, we have to accept that our results are merely estimates and therefore have some chance of being incorrect. The difference between the true parameter and the statistic obtained from the sample is called sampling error.
Guided Practice
For each of the following variables, indicate whether the variable is categorical or quantitative (numerical).
a. Importance of political party affiliation to people (very, somewhat, or not very important).
b. Hours spent reading yesterday.
c. Weights of adult men, in pounds.
d. Favorite type of book (fiction, nonfiction).
Solutions:
a. This is categorical data because the information collected will fall into one of the three categories: very, somewhat, or not very important.
b. This is measured by numbers of hours, so it is quantitative data.
c. This is measured in pounds, so it is quantitative data.
d. This is categorical data because the information collected will fall into one of the listed categories, such as fiction or nonfiction.
Practice
For 1-3, identify the population, the units, and each variable, and tell if the variable is categorical or quantitative.
1. A quality control worker with Sweet-Tooth Candy weighs every candy bar to make sure it is very close to the published weight.
2. Doris decides to clean her sock drawer out and sorts her socks into piles by color.
3. A researcher is studying the effect of a new drug treatment for diabetes patients. She performs an experiment on 200 randomly chosen individuals with type II diabetes. Because she believes that men and women may respond differently, she records each person’s gender, as well as the person's change in blood sugar level after taking the drug for a month.
For 4-6, indicate for each of the following characteristics of an individual whether the variable is categorical or quantitative (numerical):
4. Length of arm from elbow to shoulder (in inches)
5. Number of DVDs the person owns
6. Feeling about own height (too tall, too short, about right)
7. In Physical Education class, the teacher has the students count off by twos to divide them into teams. Is this a categorical or quantitative variable?
8. A school is studying its students' test scores by grade. Explain how the characteristic 'grade' could be considered either a categorical or a numerical variable.
9. What are the best ways to display categorical and numerical data?
10. Is it possible for a variable to be considered both categorical and numerical?
11. How can you compare the effects of one categorical variable on another or one quantitative variable on another?