5.6: Fitting Lines to Data
What if you had a graph with many random ordered pairs plotted on it? How could you find the line that best describes those plotted points? After completing this Concept, you'll be able to find the line of best fit for scattered data.
Guidance
Katja has noticed that sales are falling off at her store lately. She plots her sales figures for each week on a graph and sees that the points are trending downward, but they don’t quite make a straight line. How can she predict what her sales figures will be over the next few weeks?
In real-world problems, the relationship between our dependent and independent variables is often approximately linear, but not perfectly so. We may have a number of data points that don’t quite fit on a straight line, but we may still want to find an equation representing those points. In this lesson, we’ll learn how to find linear equations to fit real-world data.
Make a Scatter Plot
A scatter plot is a plot of all the ordered pairs in a table. Even when we expect the relationship we’re analyzing to be linear, we usually can’t expect that all the points will fit perfectly on a straight line. Instead, the points will be “scattered” about a straight line.
There are many reasons why the data might not fall perfectly on a line. Small errors in measurement are one reason; another reason is that the real world isn’t always as simple as a mathematical abstraction, and sometimes math can only describe it approximately.
Example A
Make a scatter plot of the following ordered pairs:
(0, 2); (1, 4.5); (2, 9); (3, 11); (4, 13); (5, 18); (6, 19.5)
Solution
We make a scatter plot by graphing all the ordered pairs on the coordinate plane:
Fit a Line to Data
Notice that the points look like they might be part of a straight line, although they wouldn’t fit perfectly on a straight line. If the points were perfectly lined up, we could just draw a line through any two of them, and that line would go right through all the other points as well. When the points aren’t lined up perfectly, we just have to find a line that is as close to all the points as possible.
Here you can see that we could draw many lines through the points in our data set. However, the red line \begin{align*}A\end{align*} is the line that best fits the points. To prove this mathematically, we would measure the vertical distance from each data point to line \begin{align*}A\end{align*}, square each distance, and add up the squares; then we would show that this sum is smaller than it would be for any other line.
Actually proving this is a lesson for a much more advanced course, so we won’t do it here. And finding the best fit line in the first place is even more complex; instead of doing it by hand, we’ll use a graphing calculator or just “eyeball” the line, as we did above—using our visual sense to guess what line fits best.
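To make the idea of "closeness" concrete, here is a minimal Python sketch that scores a candidate line by adding up the squared vertical distances from the points of Example A to that line. The function name `sum_squared_error` and the two candidate lines (y = 3x + 2 and y = 2x + 3) are assumptions chosen for illustration; they are not lines read from the lesson's graph.

```python
# Data points from Example A.
points = [(0, 2), (1, 4.5), (2, 9), (3, 11), (4, 13), (5, 18), (6, 19.5)]


def sum_squared_error(points, m, b):
    """Sum of squared vertical distances from each point to y = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in points)


# Two hypothetical candidate lines, chosen only for illustration.
close_line = sum_squared_error(points, 3, 2)  # y = 3x + 2
poor_line = sum_squared_error(points, 2, 3)   # y = 2x + 3

print(close_line)  # 3.5
print(poor_line)   # 58.5
```

The smaller score belongs to the line that hugs the points more closely. The line of best fit is the one whose score is the smallest possible.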
For more practice eyeballing lines of best fit, try the Java applet at http://mste.illinois.edu/activity/regression/. Click on the green field to place up to 50 points on it, then use the slider to adjust the slope of the red line to try and make it fit the points. (The thermometer shows how far away the line is from the points, so you want to try to make the thermometer reading as low as possible.) Then click “Show Best Fit” to show the actual best fit line in blue. Refresh the page or click “Reset” if you want to try again. For more of a challenge, try scattering the points in a less obvious pattern.
Write an Equation For a Line of Best Fit
Once you draw the line of best fit, you can find its equation by using two points on the line. Finding the equation of the line of best fit is also called linear regression.
Caution: Make sure you don’t get caught making a common mistake. Sometimes the line of best fit won’t pass straight through any of the points in the original data set. This means that you can’t just use two points from the data set – you need to use two points that are on the line, which might not be in the data set at all.
In Example A, it happens that two of the data points are very close to the line of best fit, so we can just use these points to find the equation of the line: (1, 4.5) and (3, 11).
Start with the slope-intercept form of a line: \begin{align*}y=mx+b\end{align*}
Find the slope: \begin{align*}m=\frac{11-4.5}{3-1}=\frac{6.5}{2}=3.25\end{align*}.
So \begin{align*}y=3.25x+b\end{align*}.
Plug (3, 11) into the equation: \begin{align*}11=3.25(3)+b \Rightarrow b=1.25\end{align*}
So the equation for the line that fits the data best is \begin{align*}y=3.25x+1.25\end{align*}.
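The two-point calculation above can be sketched as a small Python function. The name `line_through` is a hypothetical helper introduced here, and it assumes the two points have different x-values, so the line is not vertical.

```python
def line_through(p1, p2):
    """Return (slope, intercept) of the line through two points.

    Assumes the points have different x-values (the line is not vertical).
    """
    (x1, y1), (x2, y2) = p1, p2
    m = (y2 - y1) / (x2 - x1)  # rise over run
    b = y1 - m * x1            # solve y1 = m*x1 + b for b
    return m, b


m, b = line_through((1, 4.5), (3, 11))
print(m, b)  # 3.25 1.25
```

Any two points on the drawn line would do, whether or not they belong to the data set.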
Perform Linear Regression With a Graphing Calculator
The problem with eyeballing a line of best fit, of course, is that you can’t be sure how accurate your guess is. To get the most accurate equation for the line, we can use a graphing calculator instead. The calculator uses a mathematical algorithm to find the line that minimizes the sum of the squares of the vertical distances from the data points to the line.
Example B
Use a graphing calculator to find the equation of the line of best fit for the following data:
(3, 12), (8, 20), (1, 7), (10, 23), (5, 18), (8, 24), (11, 30), (2, 10)
Solution
Step 1: Input the data in your calculator.
Press [STAT] and choose the [EDIT] option. Input the data into the table by entering the \begin{align*}x-\end{align*}values in the first column and the \begin{align*}y-\end{align*}values in the second column.
Step 2: Find the equation of the line of best fit.
Press [STAT] again and use the right arrow to select [CALC] at the top of the screen.
Choose option number 4, \begin{align*}LinReg (ax+b)\end{align*}, and press [ENTER].
The calculator will display \begin{align*}LinReg (ax+b)\end{align*}.
Press [ENTER] and you will be given the \begin{align*}a-\end{align*} and \begin{align*}b-\end{align*}values.
Here \begin{align*}a\end{align*} represents the slope and \begin{align*}b\end{align*} represents the \begin{align*}y-\end{align*}intercept of the equation. The linear regression line is \begin{align*}y=2.01x+5.94\end{align*}.
Step 3: Draw the scatter plot.
To draw the scatter plot, press [2nd] [Y=] to open [STATPLOT].
Choose Plot 1 and press [ENTER].
Select the On option and set the Type to scatter plot (the one highlighted in black).
Make sure that the \begin{align*}X\end{align*} list and \begin{align*}Y\end{align*} list names match the names of the columns of the table in Step 1.
Choose the box or plus as the mark, since the simple dot may make it difficult to see the points.
Press [GRAPH] and adjust the window size so you can see all the points in the scatter plot.
Step 4: Draw the line of best fit through the scatter plot.
Press [Y=]
Enter the equation of the line of best fit that you just found: \begin{align*}y=2.01x+5.94\end{align*}.
Press [GRAPH].
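If no graphing calculator is at hand, the same line can be computed directly. The sketch below uses the standard closed-form least-squares formulas on the data from Example B. The calculator's internal routine is not described in this lesson, so treat this as one common way to get the result rather than a description of what the calculator does; the function name `least_squares_line` is an assumption.

```python
def least_squares_line(points):
    """Return (slope, intercept) of the least-squares line y = a*x + b."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Slope: sum of (x - mean_x)(y - mean_y) divided by sum of (x - mean_x)^2.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    a = sxy / sxx
    # The least-squares line always passes through (mean_x, mean_y).
    b = mean_y - a * mean_x
    return a, b


# Data from Example B.
data = [(3, 12), (8, 20), (1, 7), (10, 23), (5, 18), (8, 24), (11, 30), (2, 10)]
a, b = least_squares_line(data)
print(round(a, 2), round(b, 2))  # 2.01 5.94
```

The rounded values agree with the \begin{align*}a-\end{align*} and \begin{align*}b-\end{align*}values reported in Step 2.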
Solve Real-World Problems Using Linear Models of Scattered Data
Once we’ve found the line of best fit for a data set, we can use the equation of that line to predict other data points.
Example C
Nadia is training for a 5K race. The following table shows her times for each month of her training program. Find an equation of a line of fit. Predict her running time if her race is in August.
Month | Month number | Average time (minutes) |
---|---|---|
January | 0 | 40 |
February | 1 | 38 |
March | 2 | 39 |
April | 3 | 38 |
May | 4 | 33 |
June | 5 | 30 |
Solution
Let’s make a scatter plot of Nadia’s running times. The independent variable, \begin{align*}x\end{align*}, is the month number and the dependent variable, \begin{align*}y\end{align*}, is the running time. We plot all the points in the table on the coordinate plane, and then sketch a line of fit.
Two points on the line are (0, 42) and (4, 34). We’ll use them to find the equation of the line:
\begin{align*}m &= \frac{34-42}{4-0}=-\frac{8}{4}=-2\\ y &= -2x+b\\ 42 &= -2(0)+b \Rightarrow b=42\\ y &= -2x+42\end{align*}
In a real-world problem, the slope and \begin{align*}y-\end{align*}intercept have a physical significance. In this case, the slope tells us how Nadia’s running time changes each month she trains. Specifically, it decreases by 2 minutes per month. Meanwhile, the \begin{align*}y-\end{align*}intercept tells us that, according to the line, Nadia ran the 5K in about 42 minutes when she started training (month 0).
The problem asks us to predict Nadia’s running time in August. Since June is defined as month number 5, August will be month number 7. We plug \begin{align*}x = 7\end{align*} into the equation of the line of best fit:
\begin{align*}y=-2(7)+42=-14+42=28\end{align*}
The equation predicts that Nadia will run the 5K race in 28 minutes.
In this solution, we eyeballed a line of fit. Using a graphing calculator, we can find this equation for a line of fit instead: \begin{align*}y=-1.89x+41.05\end{align*}
If we plug \begin{align*}x = 7\end{align*} into this equation, we get \begin{align*}y=-1.89(7)+41.05 \approx 27.8\end{align*}. This predicts that Nadia will run her race in about 27.8 minutes. You see that the graphing calculator gives a different equation and a slightly different answer to the question. The calculator’s line fits the data more closely, but the line we drew by hand still gives a good approximation. And of course, there’s no guarantee that Nadia will actually finish the race in that exact time; both answers are estimates, although the calculator’s line is the better fit to the times she has recorded so far.
Vocabulary
- A scatter plot is a plot of all the ordered pairs in a table. Even when we expect the relationship to be linear, the points are usually “scattered” about a straight line rather than lying exactly on it.
- A line of best fit is the line that comes as close as possible to all the points in a scatter plot. Finding the equation of the line of best fit is also called linear regression.
Guided Practice
Peter is testing the burning time of “BriteGlo” candles. The following table shows how long it takes to burn candles of different weights. Assume it’s a linear relation, so we can use a line to fit the data. If a candle burns for 95 hours, what must be its weight in ounces?
Candle weight (oz) | Time (hours) |
---|---|
2 | 15 |
3 | 20 |
4 | 35 |
5 | 36 |
10 | 80 |
16 | 100 |
22 | 120 |
26 | 180 |
Solution
Let’s make a scatter plot of the data. The independent variable, \begin{align*}x\end{align*}, is the candle weight and the dependent variable, \begin{align*}y\end{align*}, is the time it takes the candle to burn. We plot all the points in the table on the coordinate plane, and draw a line of fit.
Two convenient points on the line are (0,0) and (30, 200). Find the equation of the line:
\begin{align*}m &= \frac{200}{30}=\frac{20}{3}\\ y &= \frac{20}{3}x+b\\ 0 &= \frac{20}{3}(0)+b \Rightarrow b=0\\ y &= \frac{20}{3}x\end{align*}
A slope of \begin{align*}\frac{20}{3}=6 \frac{2}{3}\end{align*} tells us that for each extra ounce of candle weight, the burning time increases by \begin{align*}6 \frac{2}{3}\end{align*} hours. A \begin{align*}y-\end{align*}intercept of zero tells us that a candle of weight 0 oz will burn for 0 hours.
The problem asks for the weight of a candle that burns 95 hours; in other words, what’s the \begin{align*}x-\end{align*}value that gives a \begin{align*}y-\end{align*}value of 95? Plugging in \begin{align*}y=95\end{align*}:
\begin{align*}y = \frac{20}{3}x \Rightarrow 95 = \frac{20}{3} x \Rightarrow x = \frac{285}{20}=\frac{57}{4}=14 \frac{1}{4}\end{align*}
A candle that burns 95 hours weighs 14.25 oz.
A graphing calculator gives the linear regression equation as \begin{align*}y=6.1x+5.9\end{align*}, which predicts a weight of about 14.6 oz.
Practice
For problems 1-4, draw the scatter plot and find an equation that fits the data set by hand.
- (57, 45); (65, 61); (34, 30); (87, 78); (42, 41); (35, 36); (59, 35); (61, 57); (25, 23); (35, 34)
- (32, 43); (54, 61); (89, 94); (25, 34); (43, 56); (58, 67); (38, 46); (47, 56); (39, 48)
- (12, 18); (5, 24); (15, 16); (11, 19); (9, 12); (7, 13); (6, 17); (12, 14)
- (3, 12); (8, 20); (1, 7); (10, 23); (5, 18); (8, 24); (2, 10)
- Use the graph from problem 1 to predict the \begin{align*}y-\end{align*}values for two \begin{align*}x-\end{align*}values of your choice that are not in the data set.
- Use the graph from problem 2 to predict the \begin{align*}x-\end{align*}values for two \begin{align*}y-\end{align*}values of your choice that are not in the data set.
- Use the equation from problem 3 to predict the \begin{align*}y-\end{align*}values for two \begin{align*}x-\end{align*}values of your choice that are not in the data set.
- Use the equation from problem 4 to predict the \begin{align*}x-\end{align*}values for two \begin{align*}y-\end{align*}values of your choice that are not in the data set.
For problems 9-11, use a graphing calculator to find the equation of the line of best fit for the data set.
- (57, 45); (65, 61); (34, 30); (87, 78); (42, 41); (35, 36); (59, 35); (61, 57); (25, 23); (35, 34)
- (32, 43); (54, 61); (89, 94); (25, 34); (43, 56); (58, 67); (38, 46); (47, 56); (95, 105); (39, 48)
- (12, 18); (3, 26); (5, 24); (15, 16); (11, 19); (0, 27); (9, 12); (7, 13); (6, 17); (12, 14)
- Graph the best fit line on top of the scatter plot for problem 10. Then pick a data point that’s close to the line, and change its \begin{align*}y-\end{align*}value to move it much farther from the line.
- Calculate the new best fit line with that one point changed; write the equation of that line along with the coordinates of the new point.
- How much did the slope of the best fit line change when you changed that point?
- Graph the scatter plot from problem 11 and change one point as you did in the previous problem.
- Calculate the new best fit line with that one point changed; write the equation of that line along with the coordinates of the new point.
- Did changing that one point seem to affect the slope of the best fit line more or less than it did in the previous problem? What might account for this difference?
- Shiva is trying to beat the samosa-eating record. The current record is 53.5 samosas in 12 minutes. Each day he practices and the following table shows how many samosas he eats each day for the first week of his training.
Day | No. of samosas |
---|---|
1 | 30 |
2 | 34 |
3 | 36 |
4 | 36 |
5 | 40 |
6 | 43 |
7 | 45 |
(a) Draw a scatter plot and find an equation to fit the data.
(b) Will he be ready for the contest if it occurs two weeks from the day he started training?
(c) What are the meanings of the slope and the \begin{align*}y-\end{align*}intercept in this problem?
- Anne is trying to find the elasticity coefficient of a Superball. She drops the ball from different heights and measures the maximum height of the ball after the bounce. The table below shows the data she collected.
Initial height (cm) | Bounce height (cm) |
---|---|
30 | 22 |
35 | 26 |
40 | 29 |
45 | 34 |
50 | 38 |
55 | 40 |
60 | 45 |
65 | 50 |
70 | 52 |
(a) Draw a scatter plot and find the equation.
(b) What height would she have to drop the ball from for it to bounce 65 cm?
(c) What are the meanings of the slope and the \begin{align*}y-\end{align*}intercept in this problem?
(d) Does the \begin{align*}y-\end{align*}intercept make sense? Why isn’t it (0, 0)?
- The following table shows the median California family income from 1995 to 2002 as reported by the US Census Bureau.
Year | Income |
---|---|
1995 | 53,807 |
1996 | 55,217 |
1997 | 55,209 |
1998 | 55,415 |
1999 | 63,100 |
2000 | 63,206 |
2001 | 63,761 |
2002 | 65,766 |
(a) Draw a scatter plot and find the equation.
(b) What would you expect the median annual income of a California family to be in the year 2010?
(c) What are the meanings of the slope and the \begin{align*}y-\end{align*}intercept in this problem?
(d) Inflation in the U.S. is measured by the Consumer Price Index, which increased by 20% between 1995 and 2002. Did the median income of California families keep up with inflation over that time period? (In other words, did it increase by at least 20%?)