5.4: Predicting with Linear Models
Learning Objectives
- Interpolate using an equation.
- Extrapolate using an equation.
- Predict using an equation.
Introduction
Katja’s sales figures were trending downward quickly at first, and she used a line of best fit to describe the numbers. But now they seem to be decreasing more slowly, and fitting the line less and less accurately. How can she make a more accurate prediction of what next week’s sales will be?
In the last lesson we saw how to find the equation of a line of best fit and how to use this equation to make predictions. The line of “best fit” is a good method if the relationship between the dependent and the independent variables is linear. In this section you will learn other methods that are useful even when the relationship isn’t linear.
Linear Interpolation
We use linear interpolation to fill in gaps in our data—that is, to estimate values that fall in between the values we already know. To do this, we use a straight line to connect the known data points on either side of the unknown point, and use the equation of that line to estimate the value we are looking for.
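The same idea can be written as a single formula. The labels here are our own notation, not part of any data set: call the known points on either side \begin{align*}(x_1, y_1)\end{align*} and \begin{align*}(x_2, y_2)\end{align*}, and let \begin{align*}x\end{align*} be the value in between. Then the estimate is:
\begin{align*}y = y_1+\frac{y_2-y_1}{x_2-x_1}(x-x_1)\end{align*}
The fraction is the slope of the connecting line, so this is the same line we would get by finding the slope and the intercept separately, with the two steps combined.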
Example 1
The following table shows the median ages of first marriage for men and women, as gathered by the U.S. Census Bureau.
Year | Median age of males | Median age of females |
---|---|---|
1890 | 26.1 | 22.0 |
1900 | 25.9 | 21.9 |
1910 | 25.1 | 21.6 |
1920 | 24.6 | 21.2 |
1930 | 24.3 | 21.3 |
1940 | 24.3 | 21.5 |
1950 | 22.8 | 20.3 |
1960 | 22.8 | 20.3 |
1970 | 23.2 | 20.8 |
1980 | 24.7 | 22.0 |
1990 | 26.1 | 23.9 |
2000 | 26.8 | 25.1 |
Estimate the median age for the first marriage of a male in the year 1946.
Solution
We connect the two points on either side of 1946 with a straight line and find its equation. Here’s how that looks on a scatter plot:
We find the equation by plugging in the two data points:
\begin{align*}m &= \frac{22.8-24.3}{1950-1940}=\frac{-1.5}{10}=-0.15\\
y &= -0.15x+b\\
24.3 &= -0.15(1940)+b\\
b &= 315.3\end{align*}
Our equation is \begin{align*}y=-0.15x+315.3\end{align*}
To estimate the median age of marriage of males in the year 1946, we plug \begin{align*}x = 1946\end{align*} into the equation:
\begin{align*}y=-0.15(1946)+315.3=23.4\end{align*}
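If you would like to check this arithmetic with a short program, here is a minimal sketch in Python. The function name `interpolate` and its argument names are our own choices for illustration; the steps mirror the solution above (find the slope, find the intercept, plug in the year).

```python
def interpolate(x1, y1, x2, y2, x):
    """Estimate y at x on the straight line through (x1, y1) and (x2, y2)."""
    slope = (y2 - y1) / (x2 - x1)      # m in the worked solution
    intercept = y1 - slope * x1        # b, from y1 = m * x1 + b
    return slope * x + intercept

# Median age of males at first marriage: 24.3 in 1940 and 22.8 in 1950.
estimate_1946 = interpolate(1940, 24.3, 1950, 22.8, 1946)
print(round(estimate_1946, 1))  # 23.4
```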
Example 2
The Centers for Disease Control and Prevention collects information about the health of the American people and behaviors that might lead to bad health. The following table shows the percentage of women who smoked during pregnancy.
Year | Percent of pregnant women smokers |
---|---|
1990 | 18.4 |
1991 | 17.7 |
1992 | 16.9 |
1993 | 15.8 |
1994 | 14.6 |
1995 | 13.9 |
1996 | 13.6 |
2000 | 12.2 |
2002 | 11.4 |
2003 | 10.4 |
2004 | 10.2 |
Estimate the percentage of pregnant women that were smoking in the year 1998.
Solution
We connect the two points on either side of 1998 with a straight line and find its equation. Here’s how that looks on a scatter plot:
We find the equation by plugging in the two data points:
\begin{align*}m &= \frac{12.2-13.6}{2000-1996}=\frac{-1.4}{4}=-0.35\\
y &= -0.35x+b\\
12.2 &= -0.35(2000)+b\\
b &= 712.2\end{align*}
Our equation is \begin{align*}y=-0.35x+712.2\end{align*}
To estimate the percentage of pregnant women who smoked in the year 1998, we plug \begin{align*}x = 1998\end{align*} into the equation:
\begin{align*}y=-0.35(1998)+712.2=12.9\%\end{align*}
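Picking out the two neighboring points by eye is easy in a short table, but it can also be done automatically. The sketch below is one possible approach, not the only one: the helper name `interpolate_from_table` is hypothetical, and it assumes the table is sorted by year. It uses Python's standard `bisect` module to find the first known year at or after the year we ask about.

```python
from bisect import bisect_left

# (year, percent) pairs copied from the table above; some years are missing.
smoking = [(1990, 18.4), (1991, 17.7), (1992, 16.9), (1993, 15.8),
           (1994, 14.6), (1995, 13.9), (1996, 13.6), (2000, 12.2),
           (2002, 11.4), (2003, 10.4), (2004, 10.2)]

def interpolate_from_table(table, x):
    """Linearly interpolate y at x from a table of (x, y) pairs sorted by x."""
    xs = [px for px, _ in table]
    if not xs[0] <= x <= xs[-1]:
        raise ValueError("x lies outside the table; that would be extrapolation")
    i = bisect_left(xs, x)          # index of the first table x that is >= x
    if xs[i] == x:
        return table[i][1]          # x is already a known data point
    (x1, y1), (x2, y2) = table[i - 1], table[i]
    return y1 + (y2 - y1) / (x2 - x1) * (x - x1)

print(round(interpolate_from_table(smoking, 1998), 1))  # 12.9
```

The function refuses values outside the table on purpose, since estimating beyond the data is a different technique with different risks.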
For non-linear data, linear interpolation is often not accurate enough for our purposes. If the points in the data set change by a large amount in the interval you’re interested in, then linear interpolation may not give a good estimate. In that case, it can be replaced by polynomial interpolation, which uses a curve instead of a straight line to estimate values between points. But that’s beyond the scope of this lesson.
Linear Extrapolation
Linear extrapolation can help us estimate values that are outside the range of our data set. The strategy is similar to linear interpolation: we pick the two data points that are closest to the one we’re looking for, find the equation of the line between them, and use that equation to estimate the coordinates of the missing point.
Example 3
The winning times for the women’s 100-meter race are given in the following table. Estimate the winning time in the year 2010. Is this a good estimate?
Winner | Country | Year | Time (seconds) |
---|---|---|---|
Mary Lines | UK | 1922 | 12.8 |
Leni Schmidt | Germany | 1925 | 12.4 |
Gertrud Glasitsch | Germany | 1927 | 12.1 |
Tollien Schuurman | Netherlands | 1930 | 12.0 |
Helen Stephens | USA | 1935 | 11.8 |
Lulu Mae Hymes | USA | 1939 | 11.5 |
Fanny Blankers-Koen | Netherlands | 1943 | 11.5 |
Marjorie Jackson | Australia | 1952 | 11.4 |
Vera Krepkina | Soviet Union | 1958 | 11.3 |
Wyomia Tyus | USA | 1964 | 11.2 |
Barbara Ferrell | USA | 1968 | 11.1 |
Ellen Strophal | East Germany | 1972 | 11.0 |
Inge Helten | West Germany | 1976 | 11.0 |
Marlies Gohr | East Germany | 1982 | 10.9 |
Florence Griffith Joyner | USA | 1988 | 10.5 |
Solution
We start by making a scatter plot of the data; then we connect the last two points on the graph and find the equation of the line.
\begin{align*}m &= \frac{10.5-10.9}{1988-1982}=\frac{-0.4}{6}=-0.067\\
y &= -0.067x+b\\
10.5 &= -0.067(1988)+b\\
b &= 143.7\end{align*}
Our equation is \begin{align*}y=-0.067x+143.7\end{align*}
The winning time in year 2010 is estimated to be:
\begin{align*}y=-0.067(2010)+143.7= 9.03\end{align*}
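As a quick check, the same calculation can be sketched in Python. The function name `extrapolate` is our own; it takes the two data points nearest the target year and extends the line through them. It keeps the slope unrounded, which still gives 9.03 to two decimal places.

```python
def extrapolate(p1, p2, x):
    """Extend the line through points p1 and p2 to a value x outside their range."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)
    return y2 + slope * (x - x2)

# Last two rows of the table: 10.9 s in 1982 and 10.5 s in 1988.
time_2010 = extrapolate((1982, 10.9), (1988, 10.5), 2010)
print(round(time_2010, 2))  # 9.03
```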
Unfortunately, this estimate actually isn’t very accurate. This example demonstrates the weakness of linear extrapolation; it uses only a couple of points, instead of using all the points like the best fit line method, so it doesn’t give as accurate results when the data points follow a linear pattern. In this particular example, the last data point clearly doesn’t fit in with the general trend of the data, so the slope of the extrapolation line is much steeper than it would be if we’d used a line of best fit. (As a historical note, the last data point corresponds to the winning time for Florence Griffith Joyner in 1988. After her race she was accused of using performance-enhancing drugs, but this fact was never proven. In addition, there was a question about the accuracy of the timing: some officials said that tail-wind was not accounted for in this race, even though all the other races of the day were affected by a strong wind.)
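To see how much the last data point steepens the extrapolation line, we can compare its slope with the slope of a line fitted to all fifteen points. The sketch below assumes the line of best fit is the ordinary least-squares line; a line fitted by eye would differ somewhat. The function name `least_squares_line` is our own.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Years and winning times copied from the table in Example 3.
years = [1922, 1925, 1927, 1930, 1935, 1939, 1943, 1952,
         1958, 1964, 1968, 1972, 1976, 1982, 1988]
times = [12.8, 12.4, 12.1, 12.0, 11.8, 11.5, 11.5, 11.4,
         11.3, 11.2, 11.1, 11.0, 11.0, 10.9, 10.5]

fit_slope, fit_intercept = least_squares_line(years, times)
two_point_slope = (10.5 - 10.9) / (1988 - 1982)

print(f"best-fit slope:  {fit_slope:.3f} s per year")
print(f"two-point slope: {two_point_slope:.3f} s per year")
print(f"best-fit estimate for 2010: {fit_slope * 2010 + fit_intercept:.2f} s")
```

Running this shows a fitted slope that is much shallower than the two-point slope, so the fitted line gives a slower predicted time for 2010 than the extrapolation does.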
Here’s an example of a problem where linear extrapolation does work better than the line of best fit method.
Example 4
A cylinder is filled with water to a height of 73 centimeters. The water is drained through a hole in the bottom of the cylinder and measurements are taken at 2-second intervals. The following table shows the height of the water level in the cylinder at different times.
Time (seconds) | Water level (cm) |
---|---|
0.0 | 73 |
2.0 | 63.9 |
4.0 | 55.5 |
6.0 | 47.2 |
8.0 | 40.0 |
10.0 | 33.4 |
12.0 | 27.4 |
14.0 | 21.9 |
16.0 | 17.1 |
18.0 | 12.9 |
20.0 | 9.4 |
22.0 | 6.3 |
24.0 | 3.9 |
26.0 | 2.0 |
28.0 | 0.7 |
30.0 | 0.1 |
a) Find the water level at time 15 seconds.
b) Find the water level at time 27 seconds.
c) What would be the original height of the water in the cylinder if the water takes 5 extra seconds to drain? (Find the height at time of –5 seconds.)
Solution
Here’s what the line of best fit would look like for this data set:
Notice that the data points don’t really make a line, and so the line of best fit still isn’t a terribly good fit. Just a glance tells us that we’d estimate the water level at 15 seconds to be about 27 cm, which is more than the water level at 14 seconds. That’s clearly not possible! Similarly, at 27 seconds we’d estimate the water to have all drained out, which it clearly hasn’t yet.
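These two problems with the line of best fit can also be checked numerically. The sketch below assumes a least-squares line (a line fitted by eye on a graph would give slightly different numbers) and uses our own function name `least_squares_line`.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

seconds = list(range(0, 32, 2))   # 0, 2, ..., 30 seconds
levels = [73, 63.9, 55.5, 47.2, 40.0, 33.4, 27.4, 21.9,
          17.1, 12.9, 9.4, 6.3, 3.9, 2.0, 0.7, 0.1]

slope, intercept = least_squares_line(seconds, levels)
fit_at_15 = slope * 15 + intercept
fit_at_27 = slope * 27 + intercept

print(f"fitted level at 15 s: {fit_at_15:.1f} cm (measured at 14 s: 21.9 cm)")
print(f"fitted level at 27 s: {fit_at_27:.1f} cm (measured at 28 s: 0.7 cm)")
```

The fitted value at 15 seconds comes out above the measured level at 14 seconds, and the fitted value at 27 seconds is below zero, even though water was still measured in the cylinder at 28 seconds.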
So let’s see what happens if we use linear extrapolation and interpolation instead. First, here are the lines we’d use to interpolate between 14 and 16 seconds, and between 26 and 28 seconds.
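a) The slope of the line between points (14, 21.9) and (16, 17.1) is \begin{align*}m=\frac{17.1-21.9}{16-14}=\frac{-4.8}{2}=-2.4\end{align*}. Substituting the point (14, 21.9) into \begin{align*}y=-2.4x+b\end{align*} gives \begin{align*}b=55.5\end{align*}, so the equation is \begin{align*}y=-2.4x+55.5\end{align*}.
Plugging in \begin{align*}x = 15\end{align*} gives \begin{align*}y=-2.4(15)+55.5=19.5\end{align*} cm.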
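b) The slope of the line between points (26, 2) and (28, 0.7) is \begin{align*}m=\frac{0.7-2}{28-26}=\frac{-1.3}{2}=-0.65\end{align*}. Substituting the point (26, 2) into \begin{align*}y=-0.65x+b\end{align*} gives \begin{align*}b=18.9\end{align*}, so the equation is \begin{align*}y=-0.65x+18.9\end{align*}.
Plugging in \begin{align*}x = 27\end{align*} gives \begin{align*}y=-0.65(27)+18.9=1.35\end{align*} cm.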
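c) Finally, we can use extrapolation to estimate the height of the water at -5 seconds. The slope of the line between points (0, 73) and (2, 63.9) is \begin{align*}m=\frac{63.9-73}{2-0}=\frac{-9.1}{2}=-4.55\end{align*}. The point (0, 73) tells us directly that \begin{align*}b=73\end{align*}, so the equation is \begin{align*}y=-4.55x+73\end{align*}.
Plugging in \begin{align*}x = -5\end{align*} gives \begin{align*}y=-4.55(-5)+73=95.75\end{align*} cm.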
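All three parts use the same two-point method, only with different pairs of neighboring points. Here is a short Python sketch that works through them together. The helper name `line_through` is our own; it returns a function for the line through two given points, which can then be evaluated at any time value.

```python
def line_through(p1, p2):
    """Return a function giving y on the line through points p1 and p2."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)
    return lambda x: y1 + slope * (x - x1)

# a) interpolate between the readings at 14 s and 16 s
level_at_15 = line_through((14, 21.9), (16, 17.1))(15)
# b) interpolate between the readings at 26 s and 28 s
level_at_27 = line_through((26, 2.0), (28, 0.7))(27)
# c) extrapolate backward from the first two readings
level_at_minus_5 = line_through((0, 73), (2, 63.9))(-5)

print(round(level_at_15, 2))       # 19.5
print(round(level_at_27, 2))       # 1.35
print(round(level_at_minus_5, 2))  # 95.75
```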
To make linear interpolation easier in the future, you might want to use the calculator at http://www.ajdesigner.com/phpinterpolation/linear_interpolation_equation.php. Plug in the coordinates of the first known data point in the blanks labeled \begin{align*}x_1\end{align*}
Review Questions
- Use the data from Example 1 (Median age at first marriage) to estimate the age at marriage for females in 1946. Fit a line, by hand, to the data before 1970.
- Use the data from Example 1 (Median age at first marriage) to estimate the age at marriage for females in 1984. Fit a line, by hand, to the data from 1970 on in order to estimate this accurately.
- Use the data from Example 1 (Median age at first marriage) to estimate the age at marriage for males in 1995. Use linear interpolation between the 1990 and 2000 data points.
- Use the data from Example 2 (Pregnant women and smoking) to estimate the percentage of pregnant smokers in 1997. Use linear interpolation between the 1996 and 2000 data points.
- Use the data from Example 2 (Pregnant women and smoking) to estimate the percentage of pregnant smokers in 2006. Use linear extrapolation with the final two data points.
- Use the data from Example 3 (Winning times) to estimate the winning time for the female 100-meter race in 1920. Use linear extrapolation because the first two or three data points have a different slope than the rest of the data.
- The table below shows the highest temperature vs. the hours of daylight for the 15th day of each month in the year 2006 in San Diego, California.
Hours of daylight | High temperature (F) |
---|---|
10.25 | 60 |
11.0 | 62 |
12 | 62 |
13 | 66 |
13.8 | 68 |
14.3 | 73 |
14 | 86 |
13.4 | 75 |
12.4 | 71 |
11.4 | 66 |
10.5 | 73 |
10 | 61 |
(a) What would be a better way to organize this table if you want to make the relationship between daylight hours and temperature easier to see?
(b) Estimate the high temperature for a day with 13.2 hours of daylight using linear interpolation.
(c) Estimate the high temperature for a day with 9 hours of daylight using linear extrapolation. Is the prediction accurate?
(d) Estimate the high temperature for a day with 9 hours of daylight using a line of best fit.
The table below lists life expectancies based on year of birth (US Census Bureau). Use it to answer questions 8-15.
Birth year | Life expectancy in years |
---|---|
1930 | 59.7 |
1940 | 62.9 |
1950 | 68.2 |
1960 | 69.7 |
1970 | 70.8 |
1980 | 73.7 |
1990 | 75.4 |
2000 | 77 |
- Make a scatter plot of the data.
- Use a line of best fit to estimate the life expectancy of a person born in 1955.
- Use linear interpolation to estimate the life expectancy of a person born in 1955.
- Use a line of best fit to estimate the life expectancy of a person born in 1976.
- Use linear interpolation to estimate the life expectancy of a person born in 1976.
- Use a line of best fit to estimate the life expectancy of a person born in 2012.
- Use linear extrapolation to estimate the life expectancy of a person born in 2012.
- Which method gives better estimates for this data set? Why?
The table below lists the high temperature for the first day of the month for the year 2006 in San Diego, California (Weather Underground). Use it to answer questions 16-21.
Month number | Temperature (F) |
---|---|
1 | 63 |
2 | 66 |
3 | 61 |
4 | 64 |
5 | 71 |
6 | 78 |
7 | 88 |
8 | 78 |
9 | 81 |
10 | 75 |
11 | 68 |
12 | 69 |
- Draw a scatter plot of the data.
- Use a line of best fit to estimate the temperature in the middle of the 4th month (month 4.5).
- Use linear interpolation to estimate the temperature in the middle of the 4th month (month 4.5).
- Use a line of best fit to estimate the temperature for month 13 (January 2007).
- Use linear extrapolation to estimate the temperature for month 13 (January 2007).
- Which method gives better estimates for this data set? Why?
- Name a real-world situation where you might want to make predictions based on available data. Would linear extrapolation/interpolation or the best fit method be better to use in that situation? Why?
Texas Instruments Resources
In the CK-12 Texas Instruments Algebra I FlexBook, there are graphing calculator activities designed to supplement the objectives for some of the lessons in this chapter. See http://www.ck12.org/flexr/chapter/9615.