5.3: Fitting a Line to Data
Learning Objectives
- Make a scatter plot.
- Fit a line to data and write an equation for that line.
- Perform linear regression with a graphing calculator.
- Solve real-world problems using linear models of scattered data.
Introduction
Often in application problems, the relationship between our dependent and independent variables is linear. That means that the graph of the dependent variable vs. independent variable will be a straight line. In many cases we don’t know the equation of the line but we have data points that were collected from measurements or experiments. The goal of this section is to show how we can find an equation of a line from data points collected from experimental measurements.
Make a Scatter Plot
A scatter plot is a plot of all the ordered pairs in a data set. This means that a scatter plot is a relation, and not necessarily a function. Also, the scatter plot is discrete, as it is a set of distinct points. Even when we expect the relationship we are analyzing to be linear, we should not expect that all the points would fit perfectly on a straight line. Rather, the points will be “scattered” about a straight line. There are many reasons why the data do not fall perfectly on a line, such as measurement error and outliers.
Measurement error is always present as no measurement device is perfectly accurate. In measuring length, for example, a ruler with millimeter markings will be more accurate than a ruler with just centimeter markings.
An outlier is an accurate measurement that does not fit with the general pattern of the data. It is a statistical fluctuation, like rolling a die ten times and getting a six all ten times. It can and will happen, but not very often.
Example 1
Make a scatter plot of the following ordered pairs: (0, 2), (1, 4.5), (2, 9), (3, 11), (4, 13), (5, 18), (6, 19.5)
Solution
We make a scatter plot by graphing all the ordered pairs on the coordinate plane.
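Where no graphing tool is at hand, the plotting step can be roughed out in plain Python. The sketch below is only an illustration: the function name `text_scatter` is an assumption, it rounds each \begin{align*}y\end{align*} value to the nearest whole number to choose a row, and it assumes non-negative integer \begin{align*}x\end{align*} values, which happens to hold for this data set.

```python
# Ordered pairs from Example 1.
points = [(0, 2), (1, 4.5), (2, 9), (3, 11), (4, 13), (5, 18), (6, 19.5)]

def text_scatter(points):
    """Rough text scatter plot; assumes non-negative integer x and non-negative y."""
    max_x = max(x for x, _ in points)
    # Round each y to the nearest whole number to pick the row for its mark.
    marks = {(x, int(y + 0.5)) for x, y in points}
    max_row = max(row for _, row in marks)
    lines = []
    for row in range(max_row, -1, -1):  # largest y value at the top
        cells = ["*" if (x, row) in marks else " " for x in range(max_x + 1)]
        lines.append(f"{row:>3} | " + "  ".join(cells))
    # Horizontal axis and x labels, aligned with the columns above.
    lines.append("    +-" + "---" * (max_x + 1))
    lines.append("      " + "  ".join(str(x) for x in range(max_x + 1)))
    return "\n".join(lines)

print(text_scatter(points))
```

Each `*` marks one ordered pair, so the printed grid should show seven marks rising from lower left to upper right.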
Fit a Line to Data
Notice that the points look like they might be part of a straight line, although they would not fit perfectly on a straight line. If the points were perfectly lined up it would be quite easy to draw a line through all of them and find the equation of that line. However, if the points are “scattered”, we try to find a line that best fits the data.
You see that we can draw many lines through the points in our data set. These lines have equations that are very different from each other. We want to use the line that is closest to all the points on the graph. The best candidate in our graph is the red line \begin{align*}A\end{align*}. We want to minimize the sum of the distances from the points to the line of fit, as you can see in the figure below.
Finding this line mathematically is a complex process and is not usually done by hand. We usually “eye-ball” the line or find it exactly by using a graphing calculator or computer software such as Excel. The line in the graph above is “eye-balled,” which means we drew a line that comes closest to all the points in the scatter plot.
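One way to make “closest to all the points” concrete is to add up the squared vertical distances from each data point to a candidate line; a smaller total means a closer fit. The sketch below does this for the Example 1 data. The two candidate lines are hypothetical, chosen only for illustration, and the function name `sum_squared_distances` is an assumption rather than anything from the text.

```python
# Data from Example 1.
points = [(0, 2), (1, 4.5), (2, 9), (3, 11), (4, 13), (5, 18), (6, 19.5)]

def sum_squared_distances(points, slope, intercept):
    """Add up the squared vertical gaps between each point and y = slope*x + intercept."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points)

# Two hypothetical candidate lines (not taken from the text's figure).
candidates = {"y = 3x + 2": (3, 2), "y = 2x + 5": (2, 5)}

for label, (slope, intercept) in candidates.items():
    print(label, sum_squared_distances(points, slope, intercept))
# prints 3.5 for y = 3x + 2 and 30.5 for y = 2x + 5
```

By this measure the first candidate hugs the data much more closely than the second, which matches what an eye-balled comparison would suggest.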
When we use the line of best fit we are assuming that there is a continuous linear function that will approximate the discrete values of the scatter plot. We can use this function to estimate values that were not measured.
Write an Equation for a Line of Best Fit
Once you draw the line of best fit, you can find its equation by using two points on the line. Finding the equation of the line of best fit is also called linear regression.
Caution: Make sure you don’t get caught making a common mistake. In many instances the line of best fit will not pass through many or any of the points in the original data set. This means that you can’t just use two random points from the data set. You need to use two points that are on the line.
We see that two of the data points, (1, 4.5) and (3, 11), are very close to the line of best fit, so we can use these points to find the equation of the line.
Start with the slope-intercept form of a line \begin{align*}y = mx + b\end{align*}.
Find the slope \begin{align*} m = \frac{11 - 4.5} {3 - 1} = \frac{6.5} {2} = 3.25\end{align*}
Then \begin{align*}y = 3.25x + b\end{align*}
Plug (3, 11) into the equation. \begin{align*}11 = 3.25(3) + b \Rightarrow b = 1.25\end{align*}
The equation for the line that fits the data best is \begin{align*}y = 3.25x + 1.25\end{align*}.
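The two-point calculation above can be checked with a short script. This is a minimal sketch; the helper name `line_through` is hypothetical and simply mirrors the slope-intercept steps just shown.

```python
def line_through(p, q):
    """Return (slope, intercept) of the line through points p and q (x-values must differ)."""
    (x1, y1), (x2, y2) = p, q
    slope = (y2 - y1) / (x2 - x1)
    # Solve y = m*x + b for b using the second point.
    intercept = y2 - slope * x2
    return slope, intercept

m, b = line_through((1, 4.5), (3, 11))
print(f"y = {m}x + {b}")  # y = 3.25x + 1.25
```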
Perform Linear Regression with a Graphing Calculator
Drawing a line of fit can be a good approximation but you can't be sure that you are getting the best results because you are guessing where to draw the line. Two people working with the same data might get two different equations because they would be drawing different lines. To get the most accurate equation for the line, we can use a graphing calculator. The calculator uses a mathematical algorithm to find the line that minimizes the sum of the squares of the vertical distances from the data points to the line.
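For reference, the standard least-squares formulas for a line \begin{align*}y = ax + b\end{align*} fitted to \begin{align*}n\end{align*} data points \begin{align*}(x_i, y_i)\end{align*} are shown below. They are included only as background on what minimizing the sum of the squares leads to; the text does not say how a particular calculator carries out the computation internally.

\begin{align*} a & = \frac{n \sum x_i y_i - \sum x_i \sum y_i} {n \sum x_i^2 - \left(\sum x_i\right)^2}\\ b & = \frac{\sum y_i - a \sum x_i} {n}\end{align*}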
Example 2
Use a graphing calculator to find the equation of the line of best fit for the following data: (3, 12), (8, 20), (1, 7), (10, 23), (5, 18), (8, 24), (11, 30), (2, 10).
Solution
Step 1 Input the data in your calculator.
Press [STAT] and choose the [EDIT] option.
Input the data into the table by entering the \begin{align*}x\end{align*} values in the first column and the \begin{align*}y\end{align*} values in the second column.
Step 2 Find the equation of the line of best fit.
Press [STAT] again and use the right arrow to select [CALC] at the top of the screen.
Choose option number 4: \begin{align*}LinReg(ax + b)\end{align*} and press [ENTER].
The calculator will display \begin{align*}LinReg(ax + b)\end{align*}
Press [ENTER] and you will be given the \begin{align*}a\end{align*} and \begin{align*}b\end{align*} values.
Here \begin{align*}a\end{align*} represents the slope and \begin{align*}b\end{align*} represents the \begin{align*}y-\end{align*}intercept of the equation. The linear regression line is \begin{align*}y= 2.01x + 5.94\end{align*}.
Step 3 Draw the scatter plot.
To draw the scatter plot, press [2nd] [Y=] to open [STATPLOT].
Choose Plot 1 and press [ENTER].
Press the On option and choose the Type as scatter plot (the one highlighted in black).
Make sure that the \begin{align*}X\end{align*} list and \begin{align*}Y\end{align*} list names match the names of the columns of the table in Step 1.
Choose the box or plus as the mark since the simple dot may make it difficult to see the points.
Press [GRAPH] and adjust the window size so you can see all the points in the scatter plot.
Step 4 Draw the line of best fit through the scatter plot.
Press [Y=]
Enter the equation of the line of best fit that you just found \begin{align*}Y_1 =2.01X+ 5.94\end{align*}
Press [GRAPH].
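The calculator's result in Example 2 can be cross-checked in software. The sketch below applies the standard closed-form least-squares formulas to the same data; that choice of method is an assumption, since the text only says the calculator minimizes the sum of the squares, and the function name `fit_line` is illustrative.

```python
# Data from Example 2.
data = [(3, 12), (8, 20), (1, 7), (10, 23), (5, 18), (8, 24), (11, 30), (2, 10)]

def fit_line(points):
    """Least-squares slope a and intercept b for y = a*x + b."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    sum_x2 = sum(x * x for x, _ in points)
    a = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = (sum_y - a * sum_x) / n
    return a, b

a, b = fit_line(data)
print(f"y = {a:.2f}x + {b:.2f}")  # y = 2.01x + 5.94
```

The printed equation agrees with the values of \begin{align*}a\end{align*} and \begin{align*}b\end{align*} reported by the calculator in Step 2.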
Solve Real-World Problems Using Linear Models of Scattered Data
In a real-world problem, we use a data set to find the equation of the line of best fit. We can then use the equation to predict values of the dependent or independent variables. The usual procedure is as follows.
- Make a scatter plot of the given data.
- Draw a line of best fit.
- Find an equation of the line, using either two points on the line or a graphing calculator such as the TI-83/84.
- Use the equation to answer the questions asked in the problem.
Example 3
Gal is training for a 5 K race (a total of 5000 meters, or about 3.1 miles). The following table shows her times for each month of her training program. Assume here that her times will decrease in a straight line with time. (Does that seem like a good assumption?) Find an equation of a line of fit. Predict her running time if her race is in August.
Month | Month number | Average time (minutes) |
---|---|---|
January | 0 | 40 |
February | 1 | 38 |
March | 2 | 39 |
April | 3 | 38 |
May | 4 | 33 |
June | 5 | 30 |
Solution
Let’s make a scatter plot of Gal’s running times. The independent variable, \begin{align*}x\end{align*}, is the month number and the dependent variable, \begin{align*}y\end{align*}, is the running time in minutes. We plot all the points in the table on the coordinate plane.
Draw a line of fit.
Choose two points on the line (0, 41) and (4, 34).
Find the equation of the line.
\begin{align*} m & = \frac{34 - 41} {4 - 0} = - \frac{7} {4} = - 1 \frac{3} {4}\\ y & = -\frac{7} {4} x + b\\ 41 & = - \frac{7} {4} (0) + b \Rightarrow b = 41\\ y & = - \frac{7} {4} x + 41\end{align*}
In a real-world problem, the slope and \begin{align*}y-\end{align*}intercept have a physical significance.
\begin{align*}\text{Slope} = \frac{number\ of\ minutes} {month}\end{align*}
Since the slope is negative, the number of minutes Gal spends running a 5 K race decreases as the months pass. The slope tells us that Gal’s running time decreases by \begin{align*}\frac{7}{4}\end{align*} or 1.75 minutes per month.
The \begin{align*}y-\end{align*}intercept tells us that when Gal started training, she ran a distance of 5 K in 41 minutes, which is just an estimate, since the actual time was 40 minutes.
The problem asks us to predict Gal’s running time in August. Since June is assigned to month number five, then August will be month number seven. We plug \begin{align*}x=7\end{align*} into the equation of the line of best fit.
\begin{align*} y = - \frac{7} {4} (7) + 41 = - \frac{49} {4} + 41 = - \frac{49} {4} + \frac{164} {4} = \frac{115} {4} = 28 \frac{3} {4}\end{align*}
The equation predicts that Gal will be running the 5 K race in 28.75 minutes.
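The arithmetic for the hand-drawn line can be reproduced exactly with fractions. This is a minimal sketch; the function name `predicted_time` is hypothetical, and the two points are the ones read off the line in the solution above.

```python
from fractions import Fraction

# Two points read off the hand-drawn line in Example 3.
x1, y1 = 0, 41
x2, y2 = 4, 34

slope = Fraction(y2 - y1, x2 - x1)  # -7/4 minutes per month
intercept = y1 - slope * x1         # 41, since x1 = 0

def predicted_time(month_number):
    """Running time in minutes predicted by the hand-drawn line."""
    return slope * month_number + intercept

august = predicted_time(7)
print(august, float(august))  # 115/4 28.75
```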
In this solution, we eye-balled a line of best fit. Using a graphing calculator with the data in the table, we find this equation for the line of best fit: \begin{align*}y = -1.89x + 41.05\end{align*}.
If we plug \begin{align*}x = 7\end{align*} into this equation, we get \begin{align*}y = -1.89(7) + 41.05 \approx 27.8\end{align*}. This means that the regression line predicts Gal will run her race in about 27.8 minutes. You see that the graphing calculator gives a different equation and a different answer to the question. The graphing calculator result is more accurate, but the line we drew by hand still gives a good approximation to the result.
Example 4
Baris is testing the burning time of “BriteGlo” candles. The following table shows how long it takes to burn candles of different weights. Assume it’s a linear relation and we can then use a line to fit the data. If a candle burns for 95 hours, what must be its weight in ounces?
Candle weight (oz) | Time (hours) |
---|---|
2 | 15 |
3 | 20 |
4 | 35 |
5 | 36 |
10 | 80 |
16 | 100 |
22 | 120 |
26 | 180 |
Solution
Let’s make a scatter plot of the data. The independent variable, \begin{align*}x\end{align*}, is the candle weight in ounces and the dependent variable, \begin{align*}y\end{align*}, is the time in hours it takes the candle to burn. We plot all the points in the table on the coordinate plane.
Then we draw the line of best fit.
Now pick two points on the line: (0, 0) and (30, 200).
Find the equation of the line:
\begin{align*} m & = \frac{200} {30} = \frac{20} {3}\\ y & = \frac{20} {3} x + b\\ 0 & = \frac{20} {3} (0) + b \Rightarrow b =0\\ y & = \frac{20} {3} x\end{align*}
In this problem the slope is burning time divided by candle weight. A slope of \begin{align*}\frac{20}{3} = 6 \frac{2}{3}\end{align*} tells us for each extra ounce of candle weight, the burning time increases by \begin{align*}6 \frac{2}{3} \ hours\end{align*}.
A \begin{align*}y-\end{align*}intercept of zero tells us that a candle of weight 0 oz will burn for 0 hours.
The problem asks for the weight of a candle that burns 95 hours. We are given the value of \begin{align*}y = 95\end{align*}. We need to use the equation to find the corresponding value of \begin{align*}x\end{align*}.
\begin{align*} y = \frac{20} {3} x \Rightarrow 95 = \frac{20} {3} x \Rightarrow x = \frac{285} {20} = \frac{57} {4} = 14\frac{1} {4}\end{align*}
A candle that burns 95 hours weighs 14.25 oz.
The graphing calculator gives the linear regression equation as \begin{align*}y = 6.1x + 5.9\end{align*} and a result of 14.6 oz.
Notice that we can use the line of best fit to estimate the burning time for a candle of any weight.
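Both directions of this calculation, weight from burning time and burning time from weight, can be sketched in a few lines. The function names are hypothetical, the two lines are the ones given in Example 4, and the 12 oz candle in the last line is an illustrative value that does not appear in the table.

```python
def weight_for_time(hours, slope, intercept=0.0):
    """Solve hours = slope * weight + intercept for weight (in ounces)."""
    return (hours - intercept) / slope

def time_for_weight(weight, slope, intercept=0.0):
    """Burning time in hours predicted for a candle of the given weight."""
    return slope * weight + intercept

# Hand-drawn line from Example 4: y = (20/3) x
print(round(weight_for_time(95, 20 / 3), 2))    # 14.25
# Calculator's regression line from Example 4: y = 6.1x + 5.9
print(round(weight_for_time(95, 6.1, 5.9), 1))  # 14.6
# Hypothetical 12 oz candle on the hand-drawn line
print(round(time_for_weight(12, 20 / 3), 1))    # 80.0
```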
Lesson Summary
- A scatter plot is a plot of all ordered pairs of experimental measurements.
- Measurement error arises from inaccuracies in the measurement device. All measurements of continuous values contain measurement error.
- An outlier is an experimental measurement that does not fit with the general pattern of the data.
- For experimental measurements with a linear relationship, you can draw a line of best fit, which comes as close as possible to all of the points. Finding the line of best fit is called linear regression. A statistics class can teach you the math behind linear regression. For now, you can estimate it visually or use a graphing calculator.
Review Questions
For each data set, draw the scatter plot and find the equation of the line of best fit for the data set by hand.
- (57, 45) (65, 61) (34, 30) (87, 78) (42, 41) (35, 36) (59, 35) (61, 57) (25, 23) (35, 34)
- (32, 43) (54, 61) (89, 94) (25, 34) (43, 56) (58, 67) (38, 46) (47, 56) (39, 48)
- (12, 18) (5, 24) (15, 16) (11, 19) (9, 12) (7, 13) (6, 17) (12, 14)
- (3, 12) (8, 20) (1, 7) (10, 23) (5, 18) (8, 24) (2, 10)
For each data set, use a graphing calculator to find the equation of the line of best fit.
- (57, 45) (65, 61) (34, 30) (87, 78) (42, 41) (35, 36) (59, 35) (61, 57) (25, 23) (35, 34)
- (32, 43) (54, 61) (89, 94) (25, 34) (43, 56) (58, 67) (38, 46) (47, 56) (95, 105) (39, 48)
- (12, 18) (3, 26) (5, 24) (15, 16) (11, 19) (0, 27) (9, 12) (7, 13) (6, 17) (12, 14)
- Shiva is trying to beat the samosa eating record. The current record is 53.5 samosas in 12 minutes. The following table shows how many samosas he eats during his daily practice for the first week of his training. Will he be ready for the contest if it occurs two weeks from the day he started training? What are the meanings of the slope and the \begin{align*}y-\end{align*}intercept in this problem?
Day | No. of Samosas |
---|---|
1 | 30 |
2 | 34 |
3 | 36 |
4 | 36 |
5 | 40 |
6 | 43 |
7 | 45 |
- Nitisha is trying to find the elasticity coefficient of a Superball. She drops the ball from different heights and measures the maximum height of the resulting bounce. The table below shows her data. Draw a scatter plot and find the equation. What is the initial height if the bounce height is 65 cm? What are the meanings of the slope and the \begin{align*}y-\end{align*}intercept in this problem?
Initial height (cm) | Bounce height (cm) |
---|---|
30 | 22 |
35 | 26 |
40 | 29 |
45 | 34 |
50 | 38 |
55 | 40 |
60 | 45 |
65 | 50 |
70 | 52 |
- The following table shows the median California family income from 1995 to 2002 as reported by the US Census Bureau. Draw a scatter plot and find the equation. What would you expect the median annual income of a Californian family to be in year 2010? What are the meanings of the slope and the \begin{align*}y-\end{align*}intercept in this problem?
Year | Income |
---|---|
1995 | 53807 |
1996 | 55217 |
1997 | 55209 |
1998 | 55415 |
1999 | 63100 |
2000 | 63206 |
2001 | 63761 |
2002 | 65766 |
Review Answers
- \begin{align*}y = 0.9x-0.8\end{align*}
- \begin{align*}y = 1.05x + 6.1\end{align*}
- \begin{align*}y = -0.86x + 24.3\end{align*}
- \begin{align*}y=2x + 6\end{align*}
- \begin{align*}y = 0.8x + 3.5 \end{align*}
- \begin{align*}y = 0.96x + 10.83 \end{align*}
- \begin{align*}y = -0.8x + 25\end{align*}
- \begin{align*}y= 2.5x + 27.5\end{align*}. Solution: \begin{align*}y= 57.5\end{align*}. Shiva will beat the record.
- \begin{align*}y = 0.75x -0.5\end{align*}. Slope = the ratio of bounce height to drop height. \begin{align*}y-\end{align*}intercept = how far the ball bounces if it’s dropped from a height of zero. The line is the best fit to the data. We know that dropping the ball from a height of zero should give a bounce of zero, and -0.5 cm is pretty close to zero. Drop height \begin{align*}= 87.3 \ cm\end{align*} when bounce height \begin{align*}= 65 \ cm\end{align*}.
- \begin{align*}y = 1.75x +53.8\end{align*}, where \begin{align*}x =\end{align*} years since 1995 and \begin{align*}y =\end{align*} income in thousands of dollars. Slope = increase in income per year (in thousands). \begin{align*}y-\end{align*}intercept = income in 1995 (in thousands). Income in 2010 is $80,050.