5.6: Predicting with Linear Models
Numerical information appears in all areas of life. You can find it in newspapers, magazines, and journals, on television, and on the Internet. In the last lesson, you saw how to find the equation of a line of best fit. A line of best fit works well when the relationship between the dependent and independent variables is linear. Not all data fit a straight line, though. This lesson presents other methods for estimating data values. These methods are useful for both linear and non-linear relationships.
Linear Interpolation
Linear interpolation is useful when looking for a value between given data points. It can be considered as “filling in the gaps” of a table of data.
The strategy for linear interpolation is to use a straight line to connect the known data points on either side of the unknown point. Linear interpolation is often not accurate for non-linear data. If the points in the data set change by a large amount, linear interpolation may not give a good estimate.
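The strategy above can be written as a single formula. In this sketch, the labels $(x_1, y_1)$ and $(x_2, y_2)$ for the known points on either side of the unknown value are introduced here for illustration and are not notation from the lesson:

```latex
% Slope of the line connecting the two known points
m = \frac{y_2 - y_1}{x_2 - x_1}

% Interpolated estimate at x, where x_1 \le x \le x_2
y = y_1 + m\,(x - x_1)
```

The estimate is only as good as the assumption that the data change at a steady rate between the two known points.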
Linear Extrapolation
Linear extrapolation can help us estimate values that are either higher or lower than the values in the data set. Think of this as “the long-term estimate” of the data.
The strategy for linear extrapolation is to use a subset of the data instead of the entire data set. This is especially true for non-linear data you will encounter in later chapters. For this type of data, it is sometimes useful to extrapolate using the last two or three data points in order to estimate a value higher than the data range.
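A minimal Python sketch of this strategy, assuming we extend the line through only the last two data points. The function name `extrapolate_from_last_two` and the sample numbers are made up for illustration and do not come from this lesson:

```python
def extrapolate_from_last_two(points, x):
    """Estimate y at x beyond the data by extending the line
    through the last two points.

    `points` is a list of (x, y) pairs sorted by x.
    """
    (x0, y0), (x1, y1) = points[-2], points[-1]
    if x <= x1:
        raise ValueError("x is not beyond the last data point")
    slope = (y1 - y0) / (x1 - x0)
    return y1 + slope * (x - x1)


# Made-up illustrative data, not taken from this lesson.
sample = [(1, 2.0), (2, 2.9), (3, 4.1), (4, 5.0)]
estimate = extrapolate_from_last_two(sample, 6)
print(round(estimate, 1))  # 6.8
```

Because only two points determine the line, one unusual value at the end of the data can change the estimate considerably.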
Collecting and Organizing Data
Data can be collected through various means, including surveys or experiments.
A survey is a data collection method used to gather information about individuals’ opinions, beliefs, or habits.
The information collected by the U.S. Census Bureau and the Centers for Disease Control and Prevention are examples of data gathered using surveys. The U.S. Census Bureau collects information about many aspects of the U.S. population.
An experiment is a controlled test or investigation.
Let’s say we are interested in how the median age at first marriage has changed over the past century. The U.S. Census Bureau provides the following information about the median age at first marriage for males and females. Below is the table of data and its corresponding scatter plot.
Year | Median Age of Males | Median Age of Females |
---|---|---|
1890 | 26.1 | 22.0 |
1900 | 25.9 | 21.9 |
1910 | 25.1 | 21.6 |
1920 | 24.6 | 21.2 |
1930 | 24.3 | 21.3 |
1940 | 24.3 | 21.5 |
1950 | 22.8 | 20.3 |
1960 | 22.8 | 20.3 |
1970 | 23.2 | 20.8 |
1980 | 24.7 | 22.0 |
1990 | 26.1 | 23.9 |
2000 | 26.8 | 25.1 |
Median Age of Males and Females at First Marriage by Year
Example: Estimate the median age for the first marriage of a male in the year 1946.
Solution: We will first use the method of interpolation because there is a “gap” needing to be filled. 1946 is between 1940 and 1950, so these are the data points we will use.
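By connecting the two points, $(1940, 24.3)$ and $(1950, 22.8)$, an equation can be found. The slope is $\frac{22.8 - 24.3}{1950 - 1940} = -0.15$, and substituting $(1940, 24.3)$ gives a $y$-intercept of $315.3$, so the line is $y = -0.15x + 315.3$, where $x$ is the year and $y$ is the median age.
To estimate the median age of marriage of males in year 1946, substitute $x = 1946$ into the equation: $y = -0.15(1946) + 315.3 = 23.4$ years.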
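The same calculation can be sketched in Python. The helper name `interpolate` is hypothetical. The two data points are the 1940 and 1950 rows for males from the table above:

```python
def interpolate(x0, y0, x1, y1, x):
    """Estimate y at x on the straight line through (x0, y0) and (x1, y1)."""
    slope = (y1 - y0) / (x1 - x0)
    return y0 + slope * (x - x0)


# Median age of males at first marriage in 1940 and 1950, from the table.
age_1946 = interpolate(1940, 24.3, 1950, 22.8, 1946)
print(round(age_1946, 1))  # 23.4
```

Working from the point $(1940, 24.3)$ instead of the $y$-intercept avoids multiplying by large year values, but it describes the same line.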
Example: The Centers for Disease Control and Prevention (CDC) has the following information on the percentage of pregnant women who smoked, organized by year. Estimate the percentage of pregnant women who smoked in the year 1998.
Year | Percent |
---|---|
1990 | 18.4 |
1991 | 17.7 |
1992 | 16.9 |
1993 | 15.8 |
1994 | 14.6 |
1995 | 13.9 |
1996 | 13.6 |
2000 | 12.2 |
2002 | 11.4 |
2003 | 10.4 |
2004 | 10.2 |
Percent of Pregnant Women Smokers by Year
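Solution: We want to use the information close to 1998 to interpolate the data. The table has no entries between 1996 and 2000, so we connect the points on either side of 1998, $(1996, 13.6)$ and $(2000, 12.2)$, with a straight line and find the equation of that line. The slope is $\frac{12.2 - 13.6}{2000 - 1996} = -0.35$, and substituting $(1996, 13.6)$ gives a $y$-intercept of $712.2$, so the line is $y = -0.35x + 712.2$.
To estimate the percentage of pregnant women who smoked in year 1998, substitute $x = 1998$ into the equation: $y = -0.35(1998) + 712.2 = 12.9$ percent.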
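When a table has gaps, the first step is finding the two rows that surround the value of interest. The sketch below is one possible way to automate that lookup, using the standard-library `bisect` module. The function name `interpolate_from_table` is hypothetical:

```python
from bisect import bisect_left

# (year, percent) pairs from the table above; note the gap between 1996 and 2000.
smoking = [
    (1990, 18.4), (1991, 17.7), (1992, 16.9), (1993, 15.8),
    (1994, 14.6), (1995, 13.9), (1996, 13.6), (2000, 12.2),
    (2002, 11.4), (2003, 10.4), (2004, 10.2),
]


def interpolate_from_table(table, x):
    """Linearly interpolate y at x from (x, y) pairs sorted by x."""
    xs = [px for px, _ in table]
    if not xs[0] <= x <= xs[-1]:
        raise ValueError("x is outside the data range (that would be extrapolation)")
    i = bisect_left(xs, x)  # index of the first table entry with year >= x
    if xs[i] == x:
        return table[i][1]  # x is already in the table
    (x0, y0), (x1, y1) = table[i - 1], table[i]
    return y0 + (y1 - y0) / (x1 - x0) * (x - x0)


estimate_1998 = interpolate_from_table(smoking, 1998)
print(round(estimate_1998, 1))  # 12.9
```

The range check is a design choice: it keeps the function from silently extrapolating when asked about a year outside the table.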
Predicting Using an Equation
When linear interpolation and linear extrapolation do not produce accurate predictions, using the line of best fit (linear regression) may be the best choice. The “by hand” and calculator methods of determining the line of best fit were presented in the last lesson.
Example: The winning times for the women’s 100-meter race are given in the following table. Estimate the winning time in the year 2010. Is this a good estimate?
Winner | Country | Year | Seconds |
---|---|---|---|
Mary Lines | UK | 1922 | 12.8 |
Leni Schmidt | Germ. | 1925 | 12.4 |
Gertrud Glasitsch | Germ. | 1927 | 12.1 |
Tollien Schuurman | Neth. | 1930 | 12.0 |
Helen Stephens | USA | 1935 | 11.8 |
Lulu Mae Hymes | USA | 1939 | 11.5 |
Fanny Blankers-Koen | Neth. | 1943 | 11.5 |
Marjorie Jackson | Austr. | 1952 | 11.4 |
Vera Krepkina | Sov. | 1958 | 11.3 |
Wyomia Tyus | USA | 1964 | 11.2 |
Barbara Ferrell | USA | 1968 | 11.1 |
Ellen Strophal | E. Germ. | 1972 | 11.0 |
Inge Helten | W. Germ. | 1975 | 11.0 |
Marlies Gohr | E. Germ. | 1982 | 10.9 |
Florence Griffith Joyner | USA | 1988 | 10.5 |
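Solution: Start by making a scatter plot of the data. Connect the last two points on the graph, $(1982, 10.9)$ and $(1988, 10.5)$, and find the equation of the line. The slope is $\frac{10.5 - 10.9}{1988 - 1982} \approx -0.067$ seconds per year, so the line is $y = 10.5 - \frac{0.4}{6}(x - 1988)$.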
Winning Times for the Women’s 100-meter Race by Year
Source: http://en.wikipedia.org/wiki/World_Record_progression_100_m_women.
The winning time in year 2010 is estimated to be: $y = 10.5 - \frac{0.4}{6}(2010 - 1988) \approx 9.03$ seconds.
How accurate is this estimate? Probably not very accurate, because 2010 is a long time after 1988, the last year in the data. This example demonstrates the weakness of linear extrapolation. Because it relies on only a few points, its estimates are often less reliable than those from the line of best fit, which uses all of the data. In this particular example, the last data point clearly does not fit the general trend of the data, so the slope of the extrapolation line is much steeper than it should be.
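To see the difference in slope, the sketch below compares the two-point extrapolation with a line of best fit over all fifteen points. It assumes the line of best fit is the ordinary least-squares line, as produced by a calculator's linear regression. The function name `best_fit` is hypothetical:

```python
# (year, seconds) pairs from the winning-times table above.
times = [
    (1922, 12.8), (1925, 12.4), (1927, 12.1), (1930, 12.0), (1935, 11.8),
    (1939, 11.5), (1943, 11.5), (1952, 11.4), (1958, 11.3), (1964, 11.2),
    (1968, 11.1), (1972, 11.0), (1975, 11.0), (1982, 10.9), (1988, 10.5),
]


def best_fit(points):
    """Return (slope, intercept) of the least-squares line through points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Line of best fit through all of the data.
fit_slope, fit_intercept = best_fit(times)
fit_2010 = fit_slope * 2010 + fit_intercept

# Line through only the last two points (1982 and 1988).
(x0, y0), (x1, y1) = times[-2], times[-1]
two_point_slope = (y1 - y0) / (x1 - x0)
two_point_2010 = y1 + two_point_slope * (2010 - x1)

print(f"two-point extrapolation: slope {two_point_slope:.3f}, 2010 estimate {two_point_2010:.2f} s")
print(f"line of best fit:        slope {fit_slope:.3f}, 2010 estimate {fit_2010:.2f} s")
```

Under these assumptions the fitted slope is less than half as steep as the two-point slope, and the 2010 estimate rises from about 9.03 seconds to about 9.96 seconds.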
As a historical note, the last data point corresponds to the winning time for Florence Griffith Joyner in 1988. After her race, she was accused of using performance-enhancing drugs, but the accusation was never proven. In addition, there is a question about the accuracy of the timing: some officials said that the tail wind was not accounted for in this race, even though all the other races of the day were affected by a strong wind.
Practice Set
- What does it mean to interpolate the data? In which cases would this method be useful?
- How is interpolation different from extrapolation? In which cases would extrapolation be more beneficial?
- What was the problem with using the extrapolation method to come up with an equation for the women’s 100-meter winning times?
- Use the Winning Times data and determine an equation for the line of best fit.
- Use the Median Age at First Marriage data to estimate the age at marriage for females in 1946. Fit a line, by hand, to the data before 1970.
- Use the Median Age at First Marriage data to estimate the age at marriage for females in 1984. Fit a line, by hand, to the data from 1970 on in order to estimate this accurately.
- Use the Median Age at First Marriage data to estimate the age at marriage for males in 1995. Use linear interpolation between the 1990 and 2000 data points.
- Use the data from Pregnant Women and Smoking to estimate the percent of pregnant smokers in 1997. Use linear interpolation between the 1996 and 2000 data points.
- Use the data from Pregnant Women and Smoking to estimate the percent of pregnant smokers in 2006. Use linear extrapolation with the final two data points.
- Use the Winning Times data to estimate the winning time for the female 100-meter race in 1920. Use linear extrapolation because the first two or three data points have a different slope than the rest of the data.
- The table below shows the high temperature and the hours of daylight for one day in each month of 2006 in San Diego, California. Using linear interpolation, estimate the high temperature for a day with 13.2 hours of daylight.
Hours of daylight | High temperature (°F) |
---|---|
10.25 | 60 |
11.0 | 62 |
12 | 62 |
13 | 66 |
13.8 | 68 |
14.3 | 73 |
14 | 86 |
13.4 | 75 |
12.4 | 71 |
11.4 | 66 |
10.5 | 73 |
10 | 61 |
- Use the table above to estimate the high temperature for a day with 9 hours of daylight using linear extrapolation. Is the prediction accurate? Find the answer using the line of best fit.