9.2: Least-Squares Regression
Learning Objectives
 Calculate and graph a regression line.
 Predict values using bivariate data plotted on a scatterplot.
 Understand outliers and influential points.
 Perform transformations to achieve linearity.
 Calculate residuals and understand the least-squares property and its relation to the regression equation.
 Plot residuals and test for linearity.
Introduction
In the last section, we learned about the concept of correlation, which we defined as the measure of the linear relationship between two variables. As a reminder, when we have a strong positive correlation, we can expect that if the score on one variable is high, the score on the other variable will also most likely be high. With correlation, we are able to roughly predict the score of one variable when we have the other. Prediction is simply the process of estimating scores of one variable based on the scores of another variable.
In the previous section, we illustrated the concept of correlation through scatterplot graphs. We saw that when variables were correlated, the points on a scatterplot graph tended to follow a straight line. If we could draw this straight line, it would, in theory, represent the change in one variable associated with the change in the other. This line is called the least squares line, or the linear regression line (see figure below).
Calculating and Graphing the Regression Line
Linear regression involves using data to calculate a line that best fits that data and then using that line to predict scores. In linear regression, we use one variable (the predictor variable) to predict the outcome of another (the outcome variable, or criterion variable). To calculate this line, we analyze the patterns between the two variables.
We are looking for a line of best fit, and there are many ways one could define this best fit. Statisticians define this line to be the one that minimizes the sum of the squared vertical distances from the observed data points to the line.
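To determine this line, we want to find the average change in \begin{align*}Y\end{align*} that is associated with a one-unit change in \begin{align*}X\end{align*}. This average change is the slope of the line.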
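As you can see, the regression line is a straight line that expresses the relationship between two variables. When predicting one score by using another, we use an equation such as the following, which is equivalent to the slope-intercept form of the equation for a straight line:
\begin{align*}\hat{Y} = bX + a\end{align*}
where:
\begin{align*}\hat{Y}\end{align*} is the score that we are trying to predict.
\begin{align*}b\end{align*} is the slope of the line, also called the regression coefficient.
\begin{align*}a\end{align*} is the \begin{align*}y\end{align*}-intercept, also called the regression constant.
To calculate the line itself, we need to find the values for \begin{align*}b\end{align*} and \begin{align*}a\end{align*}. The regression coefficient can be calculated from the data as follows:
\begin{align*}b = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - \left(\sum x\right)^2}\end{align*}
where:
\begin{align*}n\end{align*} is the number of observations.
\begin{align*}\sum xy\end{align*} is the sum of the products of each pair of \begin{align*}x\end{align*} and \begin{align*}y\end{align*} values.
\begin{align*}\sum x\end{align*} and \begin{align*}\sum y\end{align*} are the sums of the \begin{align*}x\end{align*} values and the \begin{align*}y\end{align*} values.
\begin{align*}\sum x^2\end{align*} is the sum of the squared \begin{align*}x\end{align*} values.
In addition to calculating the regression coefficient, we also need to calculate the regression constant. The regression constant is also the \begin{align*}y\end{align*}-intercept, the point at which the line crosses the \begin{align*}y\end{align*}-axis. It is calculated as follows:
\begin{align*}a = \frac{\sum y - b \sum x}{n}\end{align*}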
Example: Find the least squares line (also known as the linear regression line or the line of best fit) for the example measuring the verbal SAT scores and GPAs of students that was used in the previous section.
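Student  SAT Score \begin{align*}(X)\end{align*}  GPA \begin{align*}(Y)\end{align*}  \begin{align*}xy\end{align*}  \begin{align*}x^2\end{align*}  \begin{align*}y^2\end{align*}
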
1  595  3.4  2023  354025  11.56 
2  520  3.2  1664  270400  10.24 
3  715  3.9  2789  511225  15.21 
4  405  2.3  932  164025  5.29 
5  680  3.9  2652  462400  15.21 
6  490  2.5  1225  240100  6.25 
7  565  3.5  1978  319225  12.25 
Sum  3970  22.7  13262  2321400  76.01 
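Using these data points, we first calculate the regression coefficient and the regression constant as follows:
\begin{align*}b & = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - \left(\sum x\right)^2} = \frac{(7)(13262) - (3970)(22.7)}{(7)(2321400) - (3970)^2} = \frac{2715}{488900} \approx 0.00555\\
a & = \frac{\sum y - b \sum x}{n} = \frac{22.7 - (0.0055533)(3970)}{7} \approx 0.093\end{align*}
Here the unrounded value of \begin{align*}b\end{align*} is used to calculate \begin{align*}a\end{align*}. The regression line is therefore approximately \begin{align*}\hat{Y} = 0.00555X + 0.093\end{align*}.
Note: If you performed the calculations yourself and did not get exactly the same answers, it is probably due to rounding in the table for \begin{align*}xy\end{align*}. With unrounded products, the slope is about 0.0055.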
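The calculation can also be checked with a short script. The following is a minimal sketch in Python, not part of the original lesson; the function name `least_squares_line` is an illustrative choice. Because it uses the unrounded products of each SAT score and GPA, its output may differ slightly in the last digits from hand calculations based on the rounded sums in the table.

```python
# SAT scores (x) and GPAs (y) for the seven students in the table above.
sat = [595, 520, 715, 405, 680, 490, 565]
gpa = [3.4, 3.2, 3.9, 2.3, 3.9, 2.5, 3.5]


def least_squares_line(xs, ys):
    """Return (b, a) for the least-squares line y_hat = b * x + a."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Regression coefficient (slope).
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Regression constant (y-intercept).
    a = (sum_y - b * sum_x) / n
    return b, a


b, a = least_squares_line(sat, gpa)
print(f"regression coefficient b = {b:.4f}")
print(f"regression constant    a = {a:.3f}")
```

The function takes any two equal-length lists of numbers, so the same sketch can be reused for other bivariate data sets.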
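Now that we have the equation of this line, it is easy to plot on a scatterplot. To plot this line, we simply substitute two values of \begin{align*}X\end{align*} into the equation, calculate the corresponding values of \begin{align*}\hat{Y}\end{align*}, plot the two resulting points, and draw the straight line through them.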
Predicting Values Using Scatterplot Data
One of the uses of a regression line is to predict values. After calculating this line, we are able to predict values by simply substituting a value of a predictor variable, \begin{align*}X\end{align*}, into the regression equation and solving for \begin{align*}\hat{Y}\end{align*}.
For example, say that we wanted to predict the GPA for two students, one who had an SAT score of 500 and the other who had an SAT score of 600. To predict the GPA scores for these two students, we would simply plug the two values of the predictor variable into the equation and solve for \begin{align*}\hat{Y}\end{align*}. The results appear in the last two rows of the table below.
Student  SAT Score \begin{align*}(X)\end{align*}  GPA \begin{align*}(Y)\end{align*}  Predicted GPA \begin{align*}(\hat{Y})\end{align*}

1  595  3.4  3.4 
2  520  3.2  3.0 
3  715  3.9  4.1 
4  405  2.3  2.3 
5  680  3.9  3.9 
6  490  2.5  2.8 
7  565  3.5  3.2 
Hypothetical  600  N/A  3.4
Hypothetical  500  N/A  2.9
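As you can see, we are able to predict the value for \begin{align*}Y\end{align*} for any value of \begin{align*}X\end{align*} within the range of the observed data.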
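The predictions in the table can be reproduced with a few lines of code. This is a sketch only, assuming the seven data points from the example; the helper names `fit_line` and `predict` are hypothetical and chosen for illustration.

```python
# Data from the SAT/GPA example.
sat = [595, 520, 715, 405, 680, 490, 565]
gpa = [3.4, 3.2, 3.9, 2.3, 3.9, 2.5, 3.5]


def fit_line(xs, ys):
    """Return slope b and intercept a of the least-squares line."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return b, a


def predict(x, b, a):
    """Substitute a predictor value x into y_hat = b * x + a."""
    return b * x + a


b, a = fit_line(sat, gpa)

# Predicted GPAs for the two hypothetical students.
for score in (500, 600):
    print(f"SAT {score}: predicted GPA {predict(score, b, a):.1f}")
```

Both hypothetical scores lie between the smallest (405) and largest (715) observed SAT scores, which is where predictions from this line are most trustworthy.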
Outliers and Influential Points
An outlier is an extreme observation that does not fit the general correlation or regression pattern (see figure below). Since it is an unusual observation, the inclusion of an outlier may affect the slope and the \begin{align*}y\end{align*}-intercept of the regression line.
Let’s use our example above to illustrate the effect of a single outlier. Say that we have a student who has a high GPA but who suffered from test anxiety the morning of the SAT verbal test and scored a 410. Using our original regression equation, we would expect the student to have a GPA of about 2.4. But, in reality, the student has a GPA equal to 3.9. The inclusion of this value would change the slope of the regression equation from 0.0055 to 0.0032, which is quite a large difference.
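The effect of this one observation can be checked numerically. The sketch below refits the line with and without the extra student (SAT 410, GPA 3.9); the helper name `slope` is an illustrative choice, not something defined in the lesson.

```python
# Original seven students from the SAT/GPA example.
sat = [595, 520, 715, 405, 680, 490, 565]
gpa = [3.4, 3.2, 3.9, 2.3, 3.9, 2.5, 3.5]


def slope(xs, ys):
    """Return the least-squares regression coefficient b."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    return (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)


b_without = slope(sat, gpa)
# Add the student with test anxiety: low SAT score, high GPA.
b_with = slope(sat + [410], gpa + [3.9])

print(f"slope without the outlier: {b_without:.4f}")
print(f"slope with the outlier:    {b_with:.4f}")
```

A single point far from the overall pattern pulls the fitted line toward itself, which is why the slope drops so sharply.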
There is no set rule when trying to decide whether or not to include an outlier in regression analysis. This decision depends on the sample size, how extreme the outlier is, and the normality of the distribution. As a general rule of thumb, we should consider values that are 1.5 times the interquartile range below the first quartile or above the third quartile as outliers. Extreme outliers are values that are 3.0 times the interquartile range below the first quartile or above the third quartile.
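The rule of thumb above can be sketched in code. The sample values below are hypothetical and are not taken from the lesson, and the helper name `iqr_fences` is illustrative. Quartiles can be computed by several slightly different conventions; this sketch uses the default method of Python's `statistics.quantiles`, so the fences may differ a little from those found with another convention.

```python
import statistics


def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences located k * IQR beyond the quartiles."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical sample with one unusually large value.
sample = [2, 4, 4, 5, 6, 7, 8, 9, 30]

low, high = iqr_fences(sample, k=1.5)
outliers = [v for v in sample if v < low or v > high]

ext_low, ext_high = iqr_fences(sample, k=3.0)
extreme = [v for v in sample if v < ext_low or v > ext_high]

print("outliers:", outliers)
print("extreme outliers:", extreme)
```

Note that this rule examines one variable at a time. An observation can fall inside the fences for both variables and still lie far from the regression line.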
Transformations to Achieve Linearity
Sometimes we find that there is a relationship between \begin{align*}X\end{align*} and \begin{align*}Y\end{align*}, but that it is not best summarized by a straight line.
Since this is not a linear relationship, we cannot immediately fit a regression line to this data. However, we can perform a transformation to achieve a linear relationship. We commonly use transformations in everyday life. For example, the Richter scale, which measures earthquake intensity, and the idea of describing pay raises in terms of percentages are both examples of making transformations of nonlinear data.
Consider the following exponential relationship, and take the log of both sides as shown:
\begin{align*}y & = ab^x\\
\log y & = \log (ab^x)\\
\log y & = \log a + \log b^x\\
\log y & = \log a + x \log b\end{align*}
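In this example, \begin{align*}a\end{align*} and \begin{align*}b\end{align*} are constants, so \begin{align*}\log a\end{align*} and \begin{align*}\log b\end{align*} are constants as well. The last equation is therefore linear in the variables \begin{align*}x\end{align*} and \begin{align*}\log y\end{align*}.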
Thus, you can find a least squares line for these variables.
Let’s take a look at an example to help clarify this concept. Say that we were interested in making a case for investing and examining how much return on investment one would get on $100 over time. Let’s assume that we invested $100 in the year 1900 and that this money accrued 5% interest every year. The table below details how much we would have each decade:
Year  Investment with 5% Each Year 

1900  100 
1910  163 
1920  265 
1930  432 
1940  704 
1950  1147 
1960  1868 
1970  3043 
1980  4956 
1990  8073 
2000  13150 
2010  21420 
If we graphed these data points, we would see that we have an exponential growth curve.
Say that we wanted to fit a linear regression line to these data. First, we would transform these data using logarithmic transformations as follows:
Year  Investment with 5% Each Year  Log of amount 

1900  100  2 
1910  163  2.211893 
1920  265  2.423786 
1930  432  2.635679 
1940  704  2.847572 
1950  1147  3.059465 
1960  1868  3.271358 
1970  3043  3.483251 
1980  4956  3.695144 
1990  8073  3.907037 
2000  13150  4.118930 
2010  21420  4.330823 
If we plotted these transformed data points, we would see that we have a linear relationship as shown below:
We can now perform a linear regression on (year, log of amount). If you enter the data into the TI-83/84 calculator, press [STAT], go to the CALC menu, and use the 'LinReg(ax+b)' command, you find the following relationship:
\begin{align*}Y = 0.021X - 38.2\end{align*}
with \begin{align*}X\end{align*} representing the year and \begin{align*}Y\end{align*} representing the log of the amount.
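The same regression can be run without a calculator. The sketch below is an illustration under stated assumptions: it regenerates the unrounded investment amounts from the 5% growth rule, takes base-10 logs, and fits a least-squares line to (year, log of amount). The helper name `fit_line` is hypothetical.

```python
import math


def fit_line(xs, ys):
    """Return slope b and intercept a of the least-squares line."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return b, a


# $100 invested in 1900, growing 5% per year, recorded every decade.
years = list(range(1900, 2011, 10))
amounts = [100 * 1.05 ** (year - 1900) for year in years]

# Transform the response: log y is linear in x when y = a * b^x.
log_amounts = [math.log10(amount) for amount in amounts]

b, a = fit_line(years, log_amounts)
print(f"log(amount) = {b:.3f} * year + ({a:.1f})")

# Undo the transformation: 10 ** slope recovers the yearly growth factor.
print(f"yearly growth factor = {10 ** b:.2f}")
```

Because the slope of the transformed line equals the log of the growth factor, raising 10 to the slope returns the 5% yearly growth that generated the data.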
Calculating Residuals and Understanding their Relation to the Regression Equation
Recall that the linear regression line is the line that best fits the given data. Ideally, we would like to minimize the distances of all data points to the regression line. These distances are called the error, \begin{align*}e\end{align*}, and are also known as the residual values. The least-squares line is the line for which the sum of the squared residuals is as small as possible.
To find the residual values, we subtract the predicted values from the actual values, so \begin{align*}e = y - \hat{y}\end{align*}.
Example: Calculate the residuals for the predicted and the actual GPAs from our sample above.
Student  SAT Score \begin{align*}(X)\end{align*}  GPA \begin{align*}(Y)\end{align*}  Predicted GPA \begin{align*}(\hat{Y})\end{align*}  Residual Value  Residual Value Squared 

1  595  3.4  3.4  0  0 
2  520  3.2  3.0  0.2  0.04 
3  715  3.9  4.1  \begin{align*}-0.2\end{align*}  0.04 
4  405  2.3  2.3  0  0 
5  680  3.9  3.9  0  0 
6  490  2.5  2.8  \begin{align*}-0.3\end{align*}  0.09 
7  565  3.5  3.2  0.3  0.09 
\begin{align*}\sum (y - \hat{y})^2\end{align*}  0.26 
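The residual table can be reproduced directly from the actual and predicted GPAs. This is a minimal sketch that uses the predicted values as rounded in the table, so the residuals match the table rather than the unrounded regression output.

```python
# Actual GPAs and the rounded predicted GPAs from the table above.
gpa = [3.4, 3.2, 3.9, 2.3, 3.9, 2.5, 3.5]
predicted = [3.4, 3.0, 4.1, 2.3, 3.9, 2.8, 3.2]

# Residual = actual value minus predicted value.
residuals = [y - y_hat for y, y_hat in zip(gpa, predicted)]

# Sum of squared residuals, the quantity that least squares minimizes.
sum_squared = sum(e ** 2 for e in residuals)

for student, e in enumerate(residuals, start=1):
    print(f"student {student}: residual {e:+.1f}, squared {e ** 2:.2f}")
print(f"sum of squared residuals = {sum_squared:.2f}")
```

Students whose actual GPA is below the prediction have negative residuals; squaring removes the sign so that positive and negative errors do not cancel.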
Plotting Residuals and Testing for Linearity
To test for linearity and to determine if we should drop extreme observations (or outliers) from our analysis, it is helpful to plot the residuals. When plotting, we simply plot the \begin{align*}x\end{align*}-value for each observation on the \begin{align*}x\end{align*}-axis and then plot the residual score on the \begin{align*}y\end{align*}-axis. When examining this scatterplot, the data points should appear to have no correlation, with approximately half of the points above 0 and the other half below 0. In addition, the points should be evenly distributed along the \begin{align*}x\end{align*}-axis. Below is an example of what a residual scatterplot should look like if there are no outliers and a linear relationship.
If the scatterplot of the residuals does not look similar to the one shown, we should look at the situation a bit more closely. For example, if more observations are below 0, we may have a positive outlying residual score that is skewing the distribution, and if more of the observations are above 0, we may have a negative outlying residual score. If the points are clustered close to the \begin{align*}y\end{align*}-axis, we could have an \begin{align*}x\end{align*}-value that is an outlier. If this occurs, we may want to consider dropping the observation to see if this would impact the plot of the residuals. If we do decide to drop the observation, we will need to recalculate the original regression line. After this recalculation, we will have a regression line that better fits a majority of the data.
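Some of these checks can be done numerically before drawing the plot. The sketch below, which assumes the SAT/GPA data from the earlier example and uses the hypothetical helper name `fit_line`, computes the unrounded residuals, lists them in order of the \begin{align*}x\end{align*}-value as they would appear on a residual plot, and counts how many fall above and below 0.

```python
sat = [595, 520, 715, 405, 680, 490, 565]
gpa = [3.4, 3.2, 3.9, 2.3, 3.9, 2.5, 3.5]


def fit_line(xs, ys):
    """Return slope b and intercept a of the least-squares line."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return b, a


b, a = fit_line(sat, gpa)
residuals = [y - (b * x + a) for x, y in zip(sat, gpa)]

# Points of the residual plot: x-value on one axis, residual on the other.
for x, e in sorted(zip(sat, residuals)):
    print(f"x = {x}: residual = {e:+.3f}")

n_above = sum(1 for e in residuals if e > 0)
n_below = sum(1 for e in residuals if e < 0)
print(f"{n_above} residuals above 0, {n_below} below 0")
print(f"sum of residuals = {sum(residuals):.6f}")
```

For a least-squares line with an intercept, the residuals always sum to zero up to rounding error, so the informative checks are the balance of points above and below 0 and the absence of any pattern along the \begin{align*}x\end{align*}-axis.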
Lesson Summary
Prediction is simply the process of estimating scores of one variable based on the scores of another variable. We use the least-squares regression line, or linear regression line, to predict the value of a variable.
Using this regression line, we are able to use the slope (the regression coefficient) and the \begin{align*}y\end{align*}-intercept (the regression constant) to predict the scores of a variable. The predictions are represented by the variable \begin{align*}\hat{y}\end{align*}.
When there is an exponential relationship between the variables, we can transform the data by taking the log of the dependent variable to achieve linearity between \begin{align*}x\end{align*} and \begin{align*}\log y\end{align*}. We can then fit a least squares regression line to the transformed data.
The differences between the actual and the predicted values are called residual values. We can construct scatterplots of these residual values to examine outliers and test for linearity.
Multimedia Links
For an introduction to what a least squares regression line represents (12.0), see bionicturtledotcom, Introduction to Linear Regression (5:15).
Review Questions
 A school nurse is interested in predicting scores on a memory test from the number of times that a student exercises per week. Below are her observations:
Student  Exercise Per Week  Memory Test Score 

1  0  15 
2  2  3 
3  2  12 
4  1  11 
5  3  5 
6  1  8 
7  2  15 
8  0  13 
9  3  2 
10  3  4 
11  4  2 
12  1  8 
13  1  10 
14  1  12 
15  2  8 
(a) Plot this data on a scatterplot, with the \begin{align*}x\end{align*}-axis representing the number of times exercising per week and the \begin{align*}y\end{align*}-axis representing memory test score.
(b) Does this appear to be a linear relationship? Why or why not?
(c) What regression equation would you use to construct a linear regression model?
(d) What is the regression coefficient in this linear regression model and what does this mean in words?
(e) Calculate the regression equation for these data.
(f) Draw the regression line on the scatterplot.
(g) What is the predicted memory test score of a student who exercises 3 times per week?
(h) Do you think that a data transformation is necessary in order to build an accurate linear regression model? Why or why not?
(i) Calculate the residuals for each of the observations and plot these residuals on a scatterplot.
(j) Examine this scatterplot of the residuals. Is a transformation of the data necessary? Why or why not?