5.4: Predicting with Linear Models
Learning Objectives
- Collect and organize data.
- Interpolate using an equation.
- Extrapolate using an equation.
- Predict using an equation.
Introduction
Numerical information appears in all areas of life. You can find it in newspapers, magazines and journals, on television and on the internet. In the last section, we saw how to find the equation of a line of best fit and how to use this equation to make predictions. The line of best fit is a good method if the relationship between the dependent and the independent variables is linear. In this section, you will learn two other methods for estimating data values, which can be applied to both linear and non-linear relationships. Linear interpolation is useful if the value you are looking for lies between two known points. Linear extrapolation is useful for estimating a value that is either less than or greater than all of the known values.
Collect and Organize Data
Data can be collected through surveys or experimental measurements.
Surveys are used to collect information about a population. Surveys of the population are common in political polling, health, social science and marketing research. A survey may focus on opinions or factual information depending on its purpose.
Experimental measurements are data sets that are collected during experiments.
The data collected by the US Census Bureau (www.census.gov) and the Centers for Disease Control (www.cdc.gov) are examples of data gathered using surveys. The US Census Bureau collects information about many aspects of the US population. The census takes place every ten years and it polls the population of the United States.
Let’s say we are interested in how the median age for first marriages has changed during the \begin{align*}20^{th}\end{align*} century.
Example 1
Median age at first marriage
The US Census gives the following information about the median age at first marriage for males and females.
In 1890, the median age for males was 26.1 and for females it was 22.0.
In 1900, the median age for males was 25.9 and for females it was 21.9.
In 1910, the median age for males was 25.1 and for females it was 21.6.
In 1920, the median age for males was 24.6 and for females it was 21.2.
In 1930, the median age for males was 24.3 and for females it was 21.3.
In 1940, the median age for males was 24.3 and for females it was 21.5.
In 1950, the median age for males was 22.8 and for females it was 20.3.
In 1960, the median age for males was 22.8 and for females it was 20.3.
In 1970, the median age for males was 23.2 and for females it was 20.8.
In 1980, the median age for males was 24.7 and for females it was 22.0.
In 1990, the median age for males was 26.1 and for females it was 23.9.
In 2000, the median age for males was 26.8 and for females it was 25.1.
This is not a very efficient or clear way to display this information. Some better options are organizing the data in a table or a scatter plot.
A table of the data would look like this.
Year | Median Age of Males | Median Age of Females |
---|---|---|
1890 | 26.1 | 22.0 |
1900 | 25.9 | 21.9 |
1910 | 25.1 | 21.6 |
1920 | 24.6 | 21.2 |
1930 | 24.3 | 21.3 |
1940 | 24.3 | 21.5 |
1950 | 22.8 | 20.3 |
1960 | 22.8 | 20.3 |
1970 | 23.2 | 20.8 |
1980 | 24.7 | 22.0 |
1990 | 26.1 | 23.9 |
2000 | 26.8 | 25.1 |
A scatter plot of the data would look like this.
Median Age of Males and Females at First Marriage by Year
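As a minimal sketch of how the same data could be organized in a program, the Python snippet below stores the table from Example 1 as a list of rows and derives the point lists a scatter plot would use. The names `marriage_data`, `format_table`, `male_points` and `female_points` are illustrative choices, not part of the lesson.

```python
# Example 1 data: (year, median age of males, median age of females)
marriage_data = [
    (1890, 26.1, 22.0),
    (1900, 25.9, 21.9),
    (1910, 25.1, 21.6),
    (1920, 24.6, 21.2),
    (1930, 24.3, 21.3),
    (1940, 24.3, 21.5),
    (1950, 22.8, 20.3),
    (1960, 22.8, 20.3),
    (1970, 23.2, 20.8),
    (1980, 24.7, 22.0),
    (1990, 26.1, 23.9),
    (2000, 26.8, 25.1),
]


def format_table(rows):
    """Return the rows as lines of a simple pipe-separated table."""
    lines = ["Year | Males | Females"]
    for year, males, females in rows:
        lines.append(f"{year} | {males:.1f} | {females:.1f}")
    return lines


# One list of (x, y) points per series, as a scatter plot would need
male_points = [(year, males) for year, males, _ in marriage_data]
female_points = [(year, females) for year, _, females in marriage_data]

for line in format_table(marriage_data):
    print(line)
```

Keeping each year and its two values together in one row makes it harder to mismatch a year with the wrong age when the data is later plotted or interpolated.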
The Centers for Disease Control collects information about the health of the American people and behaviors that might lead to poor health. The next example shows the percent of women who smoked during pregnancy.
Example 2
Pregnant women and smoking
The CDC has the following information.
In the year 1990, 18.4 percent of pregnant women smoked.
In the year 1991, 17.7 percent of pregnant women smoked.
In the year 1992, 16.9 percent of pregnant women smoked.
In the year 1993, 15.8 percent of pregnant women smoked.
In the year 1994, 14.6 percent of pregnant women smoked.
In the year 1995, 13.9 percent of pregnant women smoked.
In the year 1996, 13.6 percent of pregnant women smoked.
In the year 2000, 12.2 percent of pregnant women smoked.
In the year 2002, 11.4 percent of pregnant women smoked.
In the year 2003, 10.4 percent of pregnant women smoked.
In the year 2004, 10.2 percent of pregnant women smoked.
Let’s organize this data more clearly in a table and in a scatter plot.
Here is a table of the data.
Year | Percent of pregnant women smokers |
---|---|
1990 | 18.4 |
1991 | 17.7 |
1992 | 16.9 |
1993 | 15.8 |
1994 | 14.6 |
1995 | 13.9 |
1996 | 13.6 |
2000 | 12.2 |
2002 | 11.4 |
2003 | 10.4 |
2004 | 10.2 |
Here is a scatter plot of the data.
Percent of Pregnant Women Smokers by Year
Interpolate Using an Equation
Linear interpolation is often used to fill the gaps in a table. Example one shows the median age of males and females at the time of their first marriage. However, the information is only available at ten-year intervals. We know the median age at first marriage every ten years from 1890 to 2000, but we would like to estimate it for the years in between. Example two gave us the percentage of women who smoked while pregnant. However, no information was collected for 1997, 1998, 1999 and 2001, and we would like to estimate the percentage for these years. Linear interpolation gives you an easy way to do this.
The strategy for linear interpolation is to use a straight line to connect the known data points (we are assuming that the data would be continuous between the two points) on either side of the unknown point. Then we use that equation to estimate the value we are looking for.
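In symbols, the strategy can be sketched as follows. The labels \begin{align*}(x_1, y_1)\end{align*} and \begin{align*}(x_2, y_2)\end{align*} for the two known points, and \begin{align*}x\end{align*} for the place where we want an estimate, are assumed here for illustration.

\begin{align*}\text{Slope} & & m & = \frac{y_2 - y_1} {x_2 - x_1}\\ \text{Estimate} & & y & = y_1 + m(x - x_1)\end{align*}

This is the same line you would get by solving \begin{align*}y = mx + b\end{align*} for \begin{align*}b\end{align*} using one of the known points. It is written in point-slope form so that the estimate can be computed in one step.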
Example 3
Estimate the median age for the first marriage of a male in the year 1946.
Year | Median age of males | Median age of females |
---|---|---|
... | ... | ... |
1940 | 24.3 | 21.5 |
1950 | 22.8 | 20.3 |
... | ... | ... |
The table above shows only the data for the years 1940 and 1950 because we want to estimate a data point between these two years.
We connect the two points on either side of 1946 with a straight line and find its equation.
\begin{align*}\text{Slope} & & m & = \frac{22.8 - 24.3} {1950 - 1940} = \frac{-1.5} {10} = -0.15\\ & & y & = -0.15x + b\\ & & 24.3 & = -0.15(1940) + b\\ & & b & = 315.3\\ \text{Equation} & & y & = -0.15x + 315.3\end{align*}
To estimate the median age at first marriage of males in the year 1946 we plug \begin{align*}x = 1946\end{align*} into the equation.
\begin{align*} y = -0.15(1946) + 315.3 = 23.4 \ years\ old\end{align*}
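The calculation in Example 3 can also be checked with a few lines of Python. This is a minimal sketch; the function name `linear_interpolate` is an illustrative choice and not something defined in the lesson.

```python
def linear_interpolate(p1, p2, x):
    """Estimate y at x on the straight line through points p1 and p2."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)
    return y1 + slope * (x - x1)


# Known points on either side of 1946 (median age of males, Example 3)
estimate_1946 = linear_interpolate((1940, 24.3), (1950, 22.8), 1946)
print(round(estimate_1946, 1))  # 23.4
```

The function works from the two points directly, so it never needs the intercept \begin{align*}b\end{align*}. The result is the same as plugging \begin{align*}x = 1946\end{align*} into the full equation.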
Example 4
Estimate the percentage of pregnant women that were smoking in the year 1998.
Year | Percent of Pregnant Women Smokers |
---|---|
... | ... |
1996 | 13.6 |
2000 | 12.2 |
... | ... |
The table above shows only the data for the years 1996 and 2000 because we want to estimate a data point between these two years.
Connect the points on either side of 1998 with a straight line and find the equation of that line.
\begin{align*}\text{Slope} & & m & = \frac{12.2 - 13.6} {2000 - 1996} = \frac{-1.4} {4} = -0.35\\ & & y & = -0.35x + b\\ & & 12.2 & = -0.35 (2000) + b\\ & & b & = 712.2\\ \text{Equation} & & y & = -0.35x + 712.2\end{align*}
To estimate the percentage of pregnant women who smoked in year 1998 we plug \begin{align*}x = 1998\end{align*} into the equation.
\begin{align*} y = -0.35 (1998) + 712.2 = 12.9\%\end{align*}
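The same idea can fill every gap in the smoking table at once. The sketch below is one possible approach, not a prescribed method: for each missing year it finds the nearest known years on either side and interpolates between them. The names `smoking` and `interpolate_year` are hypothetical.

```python
# Example 2 data: year -> percent of pregnant women who smoked
smoking = {
    1990: 18.4, 1991: 17.7, 1992: 16.9, 1993: 15.8, 1994: 14.6,
    1995: 13.9, 1996: 13.6, 2000: 12.2, 2002: 11.4, 2003: 10.4,
    2004: 10.2,
}


def interpolate_year(data, year):
    """Estimate the value for a year inside the range of the known years."""
    if year in data:
        return data[year]
    # Nearest known years on either side of the missing year.
    # (max/min raise ValueError if `year` is outside the data range.)
    earlier = max(y for y in data if y < year)
    later = min(y for y in data if y > year)
    slope = (data[later] - data[earlier]) / (later - earlier)
    return data[earlier] + slope * (year - earlier)


for missing_year in (1997, 1998, 1999, 2001):
    print(missing_year, round(interpolate_year(smoking, missing_year), 2))
```

For 1998 this gives about 12.9 percent, matching the hand calculation in Example 4.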
For non-linear data, linear interpolation is often not accurate enough for our purposes. If the points in the data set change by a large amount in the interval in which you are interested, then linear interpolation may not give a good estimate. In that case, it can be replaced by polynomial interpolation which uses a curve instead of a straight line to estimate values between points.
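To give a sense of what polynomial interpolation looks like, here is a minimal sketch of one common form, the Lagrange polynomial, which passes a curve exactly through a chosen set of points. The choice of three points (1995, 1996 and 2000) from the smoking data is an assumption made for illustration, as is the function name `lagrange_interpolate`.

```python
def lagrange_interpolate(points, x):
    """Evaluate at x the polynomial that passes through every given point."""
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        term = yi
        for j, (xj, _) in enumerate(points):
            if j != i:
                # Factor equals 1 at x = xi and 0 at x = xj
                term *= (x - xj) / (xi - xj)
        total += term
    return total


# Three known points around the gap (percent of pregnant women who smoked)
curve_points = [(1995, 13.9), (1996, 13.6), (2000, 12.2)]
curve_estimate_1998 = lagrange_interpolate(curve_points, 1998)
print(round(curve_estimate_1998, 2))
```

With these three points the curve gives an estimate for 1998 that is very close to the straight-line estimate, which suggests that linear interpolation is adequate for this part of the data.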
Extrapolating: How to Use it and When Not to Use it
Linear extrapolation can help us estimate values that are either higher or lower than the range of values of our data set. The strategy is similar to linear interpolation: you use only a subset of the data, rather than all of the data. For linear data, the best fit line method of the previous section is generally more accurate because it uses every data point. For non-linear data, it is sometimes useful to extrapolate using the last two or three data points in order to estimate a \begin{align*}y-\end{align*}value beyond the data range. To estimate a value beyond the last point in the data set, we connect the last two data points with a straight line and find its equation. Then we can use this equation to estimate the value we are trying to find. To estimate a value before the first point in the data set, we follow the same procedure, but we use the first two points of our data instead.
Example 5
Winning Times
The winning times for the women’s 100 meter race are given in the following \begin{align*}\text{table}^3\end{align*}. Estimate the winning time in the year 2010. Is this a good estimate?
Winner | Country | Year | Time (seconds) |
---|---|---|---|
Mary Lines | UK | 1922 | 12.8 |
Leni Schmidt | Germany | 1925 | 12.4 |
Gertrud Gladitsch | Germany | 1927 | 12.1 |
Tollien Schuurman | Netherlands | 1930 | 12.0 |
Helen Stephens | USA | 1935 | 11.8 |
Lulu Mae Hymes | USA | 1939 | 11.5 |
Fanny Blankers-Koen | Netherlands | 1943 | 11.5 |
Marjorie Jackson | Australia | 1952 | 11.4 |
Vera Krepkina | Soviet Union | 1958 | 11.3 |
Wyomia Tyus | USA | 1964 | 11.2 |
Barbara Ferrell | USA | 1968 | 11.1 |
Ellen Strophal | East Germany | 1972 | 11.0 |
Inge Helten | West Germany | 1976 | 11.0 |
Marlies Gohr | East Germany | 1982 | 10.9 |
Florence Griffith Joyner | USA | 1988 | 10.5 |
Solution
We start by making a scatter plot of the data. Connect the last two points on the graph and find the equation of the line.
Winning Times for the Women’s 100 meter Race by Year
\begin{align*}\text{Slope} \qquad m & = \frac{10.5 - 10.9} {1988 - 1982} = \frac{-0.4} {6} = -0.067\\ \qquad y & = -0.067x + b\\ \qquad 10.5 & = -0.067 (1988) + b\\ \qquad b & = 143.7\end{align*}
Equation \begin{align*}y = -0.067x + 143.7 \end{align*}
The winning time in year 2010 is estimated to be:
\begin{align*} y = -0.067 (2010) + 143.7 = \underline{9.03 \ seconds}\end{align*}
\begin{align*}^3\end{align*}Source: http://en.wikipedia.org/wiki/World_Record_progression_100_m_women.
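The extrapolation in Example 5 can be sketched in Python as follows. The helper name `line_through` and the variable names are illustrative assumptions. The code keeps the slope at full precision rather than rounding it to -0.067, which does not change the result at two decimal places.

```python
def line_through(p1, p2):
    """Return (slope, intercept) of the straight line through two points."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)
    intercept = y1 - slope * x1
    return slope, intercept


# (year, winning time in seconds) from the table in Example 5
winning_times = [
    (1922, 12.8), (1925, 12.4), (1927, 12.1), (1930, 12.0), (1935, 11.8),
    (1939, 11.5), (1943, 11.5), (1952, 11.4), (1958, 11.3), (1964, 11.2),
    (1968, 11.1), (1972, 11.0), (1976, 11.0), (1982, 10.9), (1988, 10.5),
]

# Extrapolating past the end of the data uses only the last two points
slope, intercept = line_through(winning_times[-2], winning_times[-1])
estimate_2010 = slope * 2010 + intercept
print(round(estimate_2010, 2))  # 9.03
```

Because only two points determine the line, a single unusual data point (here the 1988 time) controls the whole estimate.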
How accurate is this estimate? It is likely not very accurate, because 2010 is a long time from 1988. This example demonstrates the weakness of linear extrapolation: it relies on only two data points, so its estimates are generally less reliable than those from the best fit line method. In this particular example, the last data point clearly does not fit the general trend of the data, so the slope of the extrapolation line is much steeper than it should be. As a historical note, the last data point corresponds to the winning time for Florence Griffith Joyner in 1988. After her race, she was accused of using performance-enhancing drugs, but this was never proven. In addition, the accuracy of the timing has been questioned: some officials said that the tail wind was not accounted for in this race, even though all the other races of the day were affected by a strong wind.
Predict Using an Equation
Linear extrapolation was not a good method to use in the last example. A better method for estimating the winning time in 2010 would be the use of linear regression (i.e. best fit line method) that we learned in the last section. Let’s apply that method to this problem.
Winning Times for the Women’s 100 meter Race by Year
We start by drawing the line of best fit and finding its equation. We use the points (1958, 11.3) and (1982, 10.9), which lie on the line.
\begin{align*}\text{Slope} \qquad m & = \frac{10.9 - 11.3} {1982 - 1958} = \frac{-0.4} {24} = -0.0167\\ \qquad y & = -0.0167x + b\\ \qquad 10.9 & = -0.0167 (1982) + b\\ \qquad b & = 44.0\end{align*}
The equation is \begin{align*}y = -0.0167x + 44.0 \end{align*}
In year 2010, \begin{align*}y= -0.0167 (2010) + 44.0 = 10.43 \ seconds\end{align*}
This shows a much slower decrease in winning times than linear extrapolation. This method (fitting a line to all of the data) is generally more accurate for linear and approximately linear data. However, the line of best fit in this case will not remain useful far into the future. For example, the equation predicts that around the year 2635 the winning time will be about zero seconds, and in the years that follow the time will be negative!
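The prediction and the zero-time year can be checked with a short Python sketch. It assumes the best fit line is the one through the two points named in the text, (1958, 11.3) and (1982, 10.9). The helper `line_through` is a hypothetical name. Because the \begin{align*}x-\end{align*}values are near 2000, the result is sensitive to how the slope and intercept are rounded, so the sketch keeps full precision, and hand calculations with rounded coefficients may give somewhat different numbers.

```python
def line_through(p1, p2):
    """Return (slope, intercept) of the straight line through two points."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)
    intercept = y1 - slope * x1
    return slope, intercept


# Two points taken from the hand-drawn line of best fit (assumed from the text)
fit_slope, fit_intercept = line_through((1958, 11.3), (1982, 10.9))

# Predicted winning time in 2010
fit_estimate_2010 = fit_slope * 2010 + fit_intercept

# Year in which the line predicts a winning time of zero seconds
zero_time_year = -fit_intercept / fit_slope

print(round(fit_estimate_2010, 2))
print(round(zero_time_year))
```

The zero-time year is a reminder that a straight line can describe a trend over a limited range, but it cannot hold indefinitely.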
Lesson Summary
- A survey is a method of collecting information about a population.
- Experimental measurements are data sets that are collected during experiments.
- Linear interpolation is used to estimate a data value between two experimental measurements. To do so, compute the line through the two adjacent measurements, then use that line to estimate the intermediate value.
- Linear extrapolation is used to estimate a data value either above or below the experimental measurements. Again, find the line defined by the two closest points and use that line to estimate the value.
- The most accurate method of estimating data values from a linear data set is to perform linear regression and estimate the value from the best-fit line.
Review Questions
- Use the data from Example one (Median age at first marriage) to estimate the age at marriage for females in 1946. Fit a line, by hand, to the data before 1970.
- Use the data from Example one (Median age at first marriage) to estimate the age at marriage for females in 1984. Fit a line, by hand, to the data from 1970 on in order to estimate this accurately.
- Use the data from Example one (Median age at first marriage) to estimate the age at marriage for males in 1995. Use linear interpolation between the 1990 and 2000 data points.
- Use the data from Example two (Pregnant women and smoking) to estimate the percent of pregnant smokers in 1997. Use linear interpolation between the 1996 and 2000 data points.
- Use the data from Example two (Pregnant women and smoking) to estimate the percent of pregnant smokers in 2006. Use linear extrapolation with the final two data points.
- Use the data from Example five (Winning times) to estimate the winning time for the female 100 meter race in 1920. Use linear extrapolation because the first two or three data points have a different slope than the rest of the data.
- The table below shows the highest temperature vs. the hours of daylight for the \begin{align*}15^{th}\end{align*} day of each month in the year 2006 in San Diego, California. Estimate the high temperature for a day with 13.2 hours of daylight using linear interpolation.
Hours of daylight | High temperature (F) |
---|---|
10.25 | 60 |
11.0 | 62 |
12 | 62 |
13 | 66 |
13.8 | 68 |
14.3 | 73 |
14 | 86 |
13.4 | 75 |
12.4 | 71 |
11.4 | 66 |
10.5 | 73 |
10 | 61 |
- Use the table above to estimate the high temperature for a day with 9 hours of daylight using linear extrapolation. Is the prediction accurate? Find the answer using the line of best fit.
Review Answers
- About 21 years
- 22.8 years
- 26.5 years
- 13.25 percent
- 9.8 percent
- 13.1 seconds
- 70.5 F
- 65 F. The prediction is not very good, since we expect cooler temperatures for fewer daylight hours. The best fit line method of linear regression predicts 58.5 F.