9.2: Least-Squares Regression
Learning Objectives
- Calculate and graph a regression line.
- Predict values using bivariate data plotted on a scatterplot.
- Understand outliers and influential points.
- Perform transformations to achieve linearity.
- Calculate residuals and understand the least-squares property and its relation to the regression equation.
- Plot residuals and test for linearity.
Introduction
In the last section, we learned about the concept of correlation, which we defined as the measure of the linear relationship between two variables. As a reminder, when we have a strong positive correlation, we can expect that if the score on one variable is high, the score on the other variable will also most likely be high. With correlation, we are able to roughly predict the score of one variable when we have the other. Prediction is simply the process of estimating scores of one variable based on the scores of another variable.
In the previous section, we illustrated the concept of correlation through scatterplot graphs. We saw that when variables were correlated, the points on a scatterplot graph tended to follow a straight line. If we could draw this straight line, it would, in theory, represent the change in one variable associated with the change in the other. This line is called the least squares line, or the linear regression line (see figure below).
Calculating and Graphing the Regression Line
Linear regression involves using data to calculate a line that best fits that data and then using that line to predict scores. In linear regression, we use one variable (the predictor variable) to predict the outcome of another (the outcome variable, or criterion variable). To calculate this line, we analyze the patterns between the two variables.
We are looking for a line of best fit, and there are many ways one could define this best fit. Statisticians define this line to be the one which minimizes the sum of the squared vertical distances from the observed data to the line.
To determine this line, we want to find the average change in \begin{align*}Y\end{align*} that is associated with a one-unit change in \begin{align*}X\end{align*}. After we calculate this average change, we can apply it to any value of \begin{align*}X\end{align*} to get an approximation of \begin{align*}Y\end{align*}. Since the regression line is used to predict the value of \begin{align*}Y\end{align*} for any given value of \begin{align*}X\end{align*}, all predicted values lie on the regression line itself. Therefore, we fit the regression line to the data by making the sum of the squared vertical distances from the data points to the line as small as possible. In the example below, you can see these distances, or residual values, from each of the observations to the regression line. This method of fitting the line so that the sum of the squared differences between the observations and the line is minimal is called the method of least squares, which we will discuss further in the following sections.
As you can see, the regression line is a straight line that expresses the relationship between two variables. When predicting one score by using another, we use an equation such as the following, which is equivalent to the slope-intercept form of the equation for a straight line:
\begin{align*}Y = bX + a\end{align*}
where:
\begin{align*}Y\end{align*} is the score that we are trying to predict.
\begin{align*}X\end{align*} is the score on the predictor variable.
\begin{align*}b\end{align*} is the slope of the line.
\begin{align*}a\end{align*} is the \begin{align*}y\end{align*}-intercept, or the value of \begin{align*}Y\end{align*} when the value of \begin{align*}X\end{align*} is 0.
To calculate the line itself, we need to find the values for \begin{align*}b\end{align*} (the regression coefficient) and \begin{align*}a\end{align*} (the regression constant). The regression coefficient explains the nature of the relationship between the two variables. Essentially, the regression coefficient tells us that a certain change in the predictor variable is associated with a certain change in the outcome, or criterion, variable. For example, if we had a regression coefficient of 10.76, we would say that a change of 1 unit in \begin{align*}X\end{align*} is associated with a change of 10.76 units of \begin{align*}Y\end{align*}. To calculate this regression coefficient, we can use the following formulas:
\begin{align*}b & = \frac{n\sum xy-\sum x \sum y}{n \sum x^2-\left ( \sum x \right )^2}\\ \text{or}\\ b & = (r) \frac{s_Y}{s_X}\end{align*}
where:
\begin{align*}r\end{align*} is the correlation between the variables \begin{align*}X\end{align*} and \begin{align*}Y\end{align*}.
\begin{align*}s_Y\end{align*} is the standard deviation of the \begin{align*}Y\end{align*} scores.
\begin{align*}s_X\end{align*} is the standard deviation of the \begin{align*}X\end{align*} scores.
In addition to calculating the regression coefficient, we also need to calculate the regression constant. The regression constant is also the \begin{align*}y\end{align*}-intercept and is the place where the line crosses the \begin{align*}y\end{align*}-axis. For example, if we had an equation with a regression constant of 4.58, we would conclude that the regression line crosses the \begin{align*}y\end{align*}-axis at 4.58. We use the following formula to calculate the regression constant:
\begin{align*}a = \frac{\sum y - b \sum x}{n} = \bar{y}-b\bar{x}\end{align*}
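The two formulas for \begin{align*}b\end{align*} should give the same slope. The following is a minimal Python sketch of that check. The function names and the five data points are hypothetical, chosen only to exercise the formulas, and sample standard deviations (dividing by \begin{align*}n-1\end{align*}) are assumed.

```python
from math import sqrt

def slope_intercept(xs, ys):
    """Least-squares slope b and intercept a from the raw-sum formulas."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return b, a

def slope_from_correlation(xs, ys):
    """The same slope computed as b = r * s_Y / s_X."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    r = sxy / sqrt(sxx * syy)          # correlation coefficient
    s_x = sqrt(sxx / (n - 1))          # sample standard deviation of X
    s_y = sqrt(syy / (n - 1))          # sample standard deviation of Y
    return r * s_y / s_x

# Hypothetical data, not taken from the lesson.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b, a = slope_intercept(xs, ys)
b_alt = slope_from_correlation(xs, ys)
print(b, a, b_alt)
```

For these points both routes give a slope of 0.6 (up to floating-point error), and the intercept is 2.2.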
Example: Find the least squares line (also known as the linear regression line or the line of best fit) for the example measuring the verbal SAT scores and GPAs of students that was used in the previous section.
Student | SAT Score (\begin{align*}X\end{align*}) | GPA (\begin{align*}Y\end{align*}) | \begin{align*}xy\end{align*} | \begin{align*}x^2\end{align*} | \begin{align*}y^2\end{align*} |
---|---|---|---|---|---|
1 | 595 | 3.4 | 2023 | 354025 | 11.56 |
2 | 520 | 3.2 | 1664 | 270400 | 10.24 |
3 | 715 | 3.9 | 2789 | 511225 | 15.21 |
4 | 405 | 2.3 | 932 | 164025 | 5.29 |
5 | 680 | 3.9 | 2652 | 462400 | 15.21 |
6 | 490 | 2.5 | 1225 | 240100 | 6.25 |
7 | 565 | 3.5 | 1978 | 319225 | 12.25 |
Sum | 3970 | 22.7 | 13262 | 2321400 | 76.01 |
Using these data points, we first calculate the regression coefficient and the regression constant as follows:
\begin{align*}b & = \frac{n \sum xy - \sum x \sum y}{n \sum x^2-\left ( \sum x \right )^2} = \frac{(7)(13,262)-(3,970)(22.7)}{(7)(2,321,400)-3,970^2} = \frac{2715}{488900} \approx 0.0055\\ a & =\frac{\sum y-b \sum x}{n} \approx 0.097\end{align*}
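Note: If you performed the calculations yourself and did not get exactly the same answers, it is probably due to rounding. The \begin{align*}xy\end{align*} column in the table is rounded to whole numbers, and the value of \begin{align*}a\end{align*} is sensitive to how many digits of \begin{align*}b\end{align*} are carried. With the unrounded products, \begin{align*}\sum xy = 13,261.5\end{align*}, \begin{align*}b \approx 0.005546\end{align*}, and \begin{align*}a \approx 0.0974\end{align*}.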
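The calculation above can be reproduced with a short Python sketch. It uses the seven SAT and GPA pairs from the table and computes the products without rounding. The variable names are illustrative.

```python
# SAT scores (X) and GPAs (Y) from the table in this example.
sat = [595, 520, 715, 405, 680, 490, 565]
gpa = [3.4, 3.2, 3.9, 2.3, 3.9, 2.5, 3.5]

n = len(sat)
sum_x = sum(sat)
sum_y = sum(gpa)
sum_xy = sum(x * y for x, y in zip(sat, gpa))   # unrounded products
sum_x2 = sum(x * x for x in sat)

# Regression coefficient (slope) and regression constant (intercept).
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = (sum_y - b * sum_x) / n

print(f"b = {b:.4f}, a = {a:.3f}")
```

Carrying full precision for \begin{align*}b\end{align*} before computing \begin{align*}a\end{align*} matters here, because \begin{align*}b\end{align*} is multiplied by \begin{align*}\sum x = 3,970\end{align*}.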
Now that we have the equation of this line, it is easy to plot on a scatterplot. To plot this line, we simply substitute two values of \begin{align*}X\end{align*} and calculate the corresponding \begin{align*}Y\end{align*} values to get two pairs of coordinates. For this example, we could choose two hypothetical values for \begin{align*}X\end{align*} (say, 400 and 500) and then solve for \begin{align*}Y\end{align*}. Using the unrounded values \begin{align*}b \approx 0.005546\end{align*} and \begin{align*}a \approx 0.0974\end{align*}, this gives the approximate coordinates (400, 2.32) and (500, 2.87). From these pairs of coordinates, we can draw the regression line on the scatterplot.
Predicting Values Using Scatterplot Data
One of the uses of a regression line is to predict values. After calculating this line, we are able to predict values by simply substituting a value of a predictor variable, \begin{align*}X\end{align*}, into the regression equation and solving the equation for the outcome variable, \begin{align*}Y\end{align*}. In our example above, we can predict the students’ GPAs from their SAT scores by substituting the desired values into our regression equation, \begin{align*}Y=0.0055X+0.097\end{align*}.
For example, say that we wanted to predict the GPA for two students, one who had an SAT score of 500 and the other who had an SAT score of 600. To predict the GPA scores for these two students, we would simply plug the two values of the predictor variable into the equation and solve for \begin{align*}Y\end{align*} (see below).
Student | SAT Score (\begin{align*}X\end{align*}) | GPA (\begin{align*}Y\end{align*}) | Predicted GPA (\begin{align*}\hat{Y}\end{align*}) |
---|---|---|---|
1 | 595 | 3.4 | 3.4 |
2 | 520 | 3.2 | 3.0 |
3 | 715 | 3.9 | 4.1 |
4 | 405 | 2.3 | 2.3 |
5 | 680 | 3.9 | 3.9 |
6 | 490 | 2.5 | 2.8 |
7 | 565 | 3.5 | 3.2 |
Hypothetical | 600 | | 3.4 |
Hypothetical | 500 | | 2.9 |
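As you can see, we are able to predict the value of \begin{align*}Y\end{align*} for any value of \begin{align*}X\end{align*} within the range of the observed data (here, SAT scores from 405 to 715). Predictions for values of \begin{align*}X\end{align*} outside this range are less trustworthy, since we have no data showing that the linear pattern continues there.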
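The predictions for the two hypothetical students can be sketched in Python as follows. The helper names `fit_line` and `predict_gpa` are hypothetical, and the range check is an added illustration rather than part of the method itself.

```python
# SAT scores (X) and GPAs (Y) from the example.
sat = [595, 520, 715, 405, 680, 490, 565]
gpa = [3.4, 3.2, 3.9, 2.3, 3.9, 2.5, 3.5]

def fit_line(xs, ys):
    """Return (b, a) for the least-squares line y = b*x + a."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    return b, mean_y - b * mean_x

b, a = fit_line(sat, gpa)

def predict_gpa(sat_score):
    """Return the predicted GPA and whether the score is in the observed range."""
    in_range = min(sat) <= sat_score <= max(sat)
    return b * sat_score + a, in_range

for score in (500, 600, 800):
    gpa_hat, in_range = predict_gpa(score)
    note = "" if in_range else " (extrapolation: outside 405 to 715)"
    print(f"SAT {score}: predicted GPA {gpa_hat:.1f}{note}")
```

The score of 800 is included only to show what extrapolation looks like: the line returns a number, but no observed data support it.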
Outliers and Influential Points
An outlier is an extreme observation that does not fit the general correlation or regression pattern (see figure below). Since it is an unusual observation, the inclusion of an outlier may affect the slope and the \begin{align*}y\end{align*}-intercept of the regression line. When examining a scatterplot graph and calculating the regression equation, it is worth considering whether extreme observations should be included or not. In the following scatterplot, the outlier has approximate coordinates of (30, 6,000).
Let’s use our example above to illustrate the effect of a single outlier. Say that we have a student who has a high GPA but who suffered from test anxiety the morning of the SAT verbal test and scored a 410. Using our original regression equation, we would expect the student to have a GPA of about 2.4. But, in reality, the student has a GPA equal to 3.9. Including this observation would change the slope of the regression line from 0.0055 to 0.0032, which is quite a large difference.
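The change in slope can be checked with a short Python sketch that fits the line twice, once with the original seven students and once with the extra student (410, 3.9) added. The function name `slope` is illustrative.

```python
def slope(xs, ys):
    """Least-squares slope b for the line y = b*x + a."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Original seven students from the example.
sat = [595, 520, 715, 405, 680, 490, 565]
gpa = [3.4, 3.2, 3.9, 2.3, 3.9, 2.5, 3.5]

b_before = slope(sat, gpa)
# Add the student with test anxiety: SAT 410, GPA 3.9.
b_after = slope(sat + [410], gpa + [3.9])

print(round(b_before, 4), round(b_after, 4))  # 0.0055 0.0032
```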
There is no set rule when trying to decide whether or not to include an outlier in regression analysis. This decision depends on the sample size, how extreme the outlier is, and the normality of the distribution. As a general rule of thumb, we should consider values that lie more than 1.5 times the inter-quartile range below the first quartile or above the third quartile as outliers. Extreme outliers are values that lie more than 3.0 times the inter-quartile range below the first quartile or above the third quartile.
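The rule of thumb can be sketched in Python as shown below. Two assumptions are worth flagging: quartiles can be computed by several conventions, and this sketch simply uses the default method of `statistics.quantiles`. The data set is hypothetical.

```python
from statistics import quantiles

def classify_outliers(values):
    """Split values into mild and extreme outliers using IQR fences."""
    q1, _, q3 = quantiles(values, n=4)   # default quartile convention (an assumption)
    iqr = q3 - q1
    mild_lo, mild_hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    extreme_lo, extreme_hi = q1 - 3.0 * iqr, q3 + 3.0 * iqr
    mild, extreme = [], []
    for v in values:
        if v < extreme_lo or v > extreme_hi:
            extreme.append(v)
        elif v < mild_lo or v > mild_hi:
            mild.append(v)
    return mild, extreme

# Hypothetical measurements, not taken from the lesson.
data = [10, 12, 13, 14, 15, 16, 17, 18, 20, 32, 60]
mild, extreme = classify_outliers(data)
print(mild, extreme)  # [32] [60]
```

For this data the quartiles are 13 and 20, so the inter-quartile range is 7. The value 32 lies beyond the 1.5 fence at 30.5, and 60 lies beyond the 3.0 fence at 41.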
Transformations to Achieve Linearity
Sometimes we find that there is a relationship between \begin{align*}X\end{align*} and \begin{align*}Y\end{align*}, but it is not best summarized by a straight line. When looking at the scatterplot graphs of correlation patterns, these relationships would be shown to be curvilinear. While many relationships are linear, there are quite a number that are not, including learning curves (learning more quickly at the beginning, followed by a leveling out) and exponential growth (doubling in size, for example, with each unit of growth). Below is an example of a growth curve describing the growth of a complex society:
Since this is not a linear relationship, we cannot immediately fit a regression line to this data. However, we can perform a transformation to achieve a linear relationship. We commonly use transformations in everyday life. For example, the Richter scale, which reports earthquake magnitude on a logarithmic scale, and the practice of describing pay raises in terms of percentages are both examples of transforming non-linear data.
Consider the following exponential relationship, and take the log of both sides as shown:
\begin{align*}y & = ab^x\\ \log y & = \log (ab^x)\\ \log y & = \log a + \log b^x\\ \log y & = \log a + x \log b\end{align*}
In this example, \begin{align*}a\end{align*} and \begin{align*}b\end{align*} are real numbers (constants), so this is now a linear relationship between the variables \begin{align*}x\end{align*} and \begin{align*}\log y\end{align*}.
Thus, you can find a least squares line for these variables.
Let’s take a look at an example to help clarify this concept. Say that we were interested in making a case for investing and examining how much return on investment one would get on $100 over time. Let’s assume that we invested $100 in the year 1900 and that this money accrued 5% interest every year. The table below details how much we would have each decade:
Year | Investment with 5% Each Year |
---|---|
1900 | 100 |
1910 | 163 |
1920 | 265 |
1930 | 432 |
1940 | 704 |
1950 | 1147 |
1960 | 1868 |
1970 | 3043 |
1980 | 4956 |
1990 | 8073 |
2000 | 13150 |
2010 | 21420 |
If we graphed these data points, we would see that we have an exponential growth curve.
Say that we wanted to fit a linear regression line to these data. First, we would transform these data using logarithmic transformations as follows:
Year | Investment with 5% Each Year | Log of amount |
---|---|---|
1900 | 100 | 2 |
1910 | 163 | 2.211893 |
1920 | 265 | 2.423786 |
1930 | 432 | 2.635679 |
1940 | 704 | 2.847572 |
1950 | 1147 | 3.059465 |
1960 | 1868 | 3.271358 |
1970 | 3043 | 3.483251 |
1980 | 4956 | 3.695144 |
1990 | 8073 | 3.907037 |
2000 | 13150 | 4.118930 |
2010 | 21420 | 4.330823 |
If we plotted these transformed data points, we would see that we have a linear relationship as shown below:
We can now perform a linear regression on (year, log of amount). If you enter the data into the TI-83/84 calculator, press [STAT], go to the CALC menu, and use the 'LinReg(ax+b)' command, you find the following relationship:
\begin{align*}Y \approx 0.02119X-38.26\end{align*}
with \begin{align*}X\end{align*} representing year and \begin{align*}Y\end{align*} representing log of amount.
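For readers without a TI-83/84, the same fit can be sketched in Python. The sketch takes the base-10 log of each amount in the table, fits a least-squares line to (year, log of amount), and then undoes the transformation. The helper name `fit_line` is hypothetical.

```python
from math import log10

years = list(range(1900, 2011, 10))
amounts = [100, 163, 265, 432, 704, 1147, 1868, 3043, 4956, 8073, 13150, 21420]
log_amounts = [log10(v) for v in amounts]

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line(years, log_amounts)

# Undo the log: 10**slope is the yearly growth factor b in y = a * b**x.
growth_factor = 10 ** slope
predicted_1950 = 10 ** (intercept + slope * 1950)

print(f"log10(amount) = {slope:.5f} * year + {intercept:.2f}")
print(f"yearly growth factor = {growth_factor:.3f}")
print(f"predicted amount in 1950 = {predicted_1950:.0f}")
```

The recovered growth factor is close to 1.05, matching the 5% yearly interest used to build the table. Because the years are near 2000, small rounding in the slope shifts predictions noticeably, so it helps to keep several digits.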
Calculating Residuals and Understanding their Relation to the Regression Equation
Recall that the linear regression line is the line that best fits the given data. Ideally, we would like to minimize the distances of all data points to the regression line. These distances are called the error, \begin{align*}e\end{align*}, and are also known as the residual values. As mentioned, we fit the regression line to the data points in a scatterplot using the least-squares method. A good line will have small residuals. Notice in the figure below that the residuals are the vertical distances between the observations and the predicted values on the regression line:
To find the residual values, we subtract the predicted values from the actual values, so \begin{align*}e=y-\hat{y}\end{align*}. For the least-squares line, the sum of all residual values is zero, because the positive and negative residuals cancel each other out exactly. For this reason, it does not make sense to use the sum of the residuals as an indicator of the fit. Instead, we minimize the sum of the squared residuals, or \begin{align*}\sum (y-\hat{y})^2\end{align*}.
Example: Calculate the residuals for the predicted and the actual GPAs from our sample above.
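One way to see why the residuals of the least-squares line sum to zero is to substitute \begin{align*}\hat{y} = bx + a\end{align*} and the formula for the regression constant, \begin{align*}a = \bar{y}-b\bar{x}\end{align*}, into the sum, using \begin{align*}\sum y = n\bar{y}\end{align*} and \begin{align*}\sum x = n\bar{x}\end{align*}:
\begin{align*}\sum (y-\hat{y}) & = \sum (y - a - bx)\\ & = \sum y - na - b \sum x\\ & = n\bar{y} - n(\bar{y}-b\bar{x}) - nb\bar{x}\\ & = 0\end{align*}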
Student | SAT Score \begin{align*}(X)\end{align*} | GPA \begin{align*}(Y)\end{align*} | Predicted GPA \begin{align*}(\hat{Y})\end{align*} | Residual Value | Residual Value Squared |
---|---|---|---|---|---|
1 | 595 | 3.4 | 3.4 | 0 | 0 |
2 | 520 | 3.2 | 3.0 | 0.2 | 0.04 |
3 | 715 | 3.9 | 4.1 | \begin{align*}-0.2\end{align*} | 0.04 |
4 | 405 | 2.3 | 2.3 | 0 | 0 |
5 | 680 | 3.9 | 3.9 | 0 | 0 |
6 | 490 | 2.5 | 2.8 | \begin{align*}-0.3\end{align*} | 0.09 |
7 | 565 | 3.5 | 3.2 | 0.3 | 0.09 |
\begin{align*}\sum (y-\hat{y})^2\end{align*} | | | | | 0.26 |
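The residuals can also be computed without rounding the predicted values, as in the Python sketch below. The helper name `fit_line` is hypothetical. Because the table rounds each predicted GPA to one decimal place, its total of 0.26 differs slightly from the unrounded sum of squared residuals, which comes out near 0.25.

```python
# SAT scores (X) and GPAs (Y) from the example.
sat = [595, 520, 715, 405, 680, 490, 565]
gpa = [3.4, 3.2, 3.9, 2.3, 3.9, 2.5, 3.5]

def fit_line(xs, ys):
    """Return (b, a) for the least-squares line y = b*x + a."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    return b, mean_y - b * mean_x

b, a = fit_line(sat, gpa)

# Residual e = y - y_hat for each student.
residuals = [y - (b * x + a) for x, y in zip(sat, gpa)]
sse = sum(e * e for e in residuals)   # sum of squared residuals

print([round(e, 2) for e in residuals])
print(f"sum of residuals = {sum(residuals):.10f}")
print(f"sum of squared residuals = {sse:.3f}")
```

The sum of the residuals is zero up to floating-point error, which is why the squared residuals, not the raw residuals, are used to judge the fit.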
Plotting Residuals and Testing for Linearity
To test for linearity and to determine if we should drop extreme observations (or outliers) from our analysis, it is helpful to plot the residuals. When plotting, we simply plot the \begin{align*}x\end{align*}-value for each observation on the \begin{align*}x\end{align*}-axis and then plot the residual score on the \begin{align*}y\end{align*}-axis. When examining this scatterplot, the data points should appear to have no correlation, with approximately half of the points above 0 and the other half below 0. In addition, the points should be evenly distributed along the \begin{align*}x\end{align*}-axis. Below is an example of what a residual scatterplot should look like if there are no outliers and a linear relationship.
If the scatterplot of the residuals does not look similar to the one shown, we should look at the situation a bit more closely. For example, if more observations are below 0, we may have a positive outlying residual score that is skewing the distribution, and if more of the observations are above 0, we may have a negative outlying residual score. If the points are clustered close to the \begin{align*}y\end{align*}-axis, we could have an \begin{align*}x\end{align*}-value that is an outlier. If this occurs, we may want to consider dropping the observation to see if this would impact the plot of the residuals. If we do decide to drop the observation, we will need to recalculate the original regression line. After this recalculation, we will have a regression line that better fits a majority of the data.
Lesson Summary
Prediction is simply the process of estimating scores of one variable based on the scores of another variable. We use the least-squares regression line, or linear regression line, to predict the value of a variable.
Using this regression line, we are able to use the slope (the regression coefficient) and the \begin{align*}y\end{align*}-intercept (the regression constant) to predict the scores of a variable. The predictions are represented by the variable \begin{align*}\hat{y}\end{align*}.
When there is an exponential relationship between the variables, we can transform the data by taking the log of the dependent variable to achieve linearity between \begin{align*}x\end{align*} and \begin{align*}\log y\end{align*}. We can then fit a least squares regression line to the transformed data.
The differences between the actual and the predicted values are called residual values. We can construct scatterplots of these residual values to examine outliers and test for linearity.
Multimedia Links
For an introduction to what a least squares regression line represents (12.0), see bionicturtledotcom, Introduction to Linear Regression (5:15).
Review Questions
- A school nurse is interested in predicting scores on a memory test from the number of times that a student exercises per week. Below are her observations:
Student | Exercise Per Week | Memory Test Score |
---|---|---|
1 | 0 | 15 |
2 | 2 | 3 |
3 | 2 | 12 |
4 | 1 | 11 |
5 | 3 | 5 |
6 | 1 | 8 |
7 | 2 | 15 |
8 | 0 | 13 |
9 | 3 | 2 |
10 | 3 | 4 |
11 | 4 | 2 |
12 | 1 | 8 |
13 | 1 | 10 |
14 | 1 | 12 |
15 | 2 | 8 |
(a) Plot this data on a scatterplot, with the \begin{align*}x\end{align*}-axis representing the number of times exercising per week and the \begin{align*}y\end{align*}-axis representing memory test score.
(b) Does this appear to be a linear relationship? Why or why not?
(c) What regression equation would you use to construct a linear regression model?
(d) What is the regression coefficient in this linear regression model and what does this mean in words?
(e) Calculate the regression equation for these data.
(f) Draw the regression line on the scatterplot.
(g) What is the predicted memory test score of a student who exercises 3 times per week?
(h) Do you think that a data transformation is necessary in order to build an accurate linear regression model? Why or why not?
(i) Calculate the residuals for each of the observations and plot these residuals on a scatterplot.
(j) Examine this scatterplot of the residuals. Is a transformation of the data necessary? Why or why not?