
# 9.2: Least-Squares Regression

Difficulty Level: At Grade Created by: CK-12

## Learning Objectives

• Calculate and graph a regression line.
• Predict values using bivariate data plotted on a scatterplot.
• Understand outliers and influential points.
• Perform transformations to achieve linearity.
• Calculate residuals and understand the least-squares property and its relation to the regression equation.
• Plot residuals and test for linearity.

## Introduction

In the last section we learned about the concept of correlation, which we defined as the measure of the linear relationship between two variables. As a reminder, when we have a strong positive correlation, we can expect that if the score on one variable is high, the score on the other variable will also most likely be high. With correlation, we are able to roughly predict the score of one variable when we have the other. Prediction is simply the process of estimating scores of one variable based on the scores of another variable.

In the previous section we illustrated the concept of correlation through scatterplot graphs. We saw that when variables were correlated, the points on this graph tended to follow a straight line. If we could draw this straight line, it would, in theory, represent the change in one variable associated with a change in the other. This line is called the least-squares line or the linear regression line (see figure below).

## Calculating and Graphing the Regression Line

Linear regression involves using existing data to calculate a line that best fits the data and then using that line to predict scores. In linear regression, we use one variable (the predictor variable) to predict the outcome of another (the outcome or the criterion variable). To calculate this line, we analyze the patterns between two variables and use a series of calculations to determine the different parts of the line.

To determine this line, we want to find the average change in $Y$ that is associated with a given change in $X$. After we calculate this average change, we can apply it to any value of $X$ to get an approximation of $Y$. Since the regression line is used to predict the value of $Y$ for any given value of $X$, all predicted values lie on the regression line itself. We therefore fit the line by making the sum of the squared vertical distances from the data points to the line as small as possible. In the example below, you can see the distance from each observation to the regression line; these distances are the residual values. This way of fitting the line is called the method of least squares, which we will discuss further in the following sections.

As you can see, the regression line is a straight line that expresses the relationship between two variables. When predicting one score by using another, we use an equation equivalent to the slope-intercept form of the equation for a straight line:

$Y = bX + a$

where:

$Y =$ the score that we are trying to predict

$b =$ the slope of the line

$a =$ the $Y$ intercept (value of $Y$ when $X=0$)

While the linear regression equation is equivalent to the slope intercept form $y = mx + b$ (swapping $b$ for $m$ and $a$ for $b$), the form above is often used in statistical regression.

To calculate the line itself, we need to find the values for $b$ (the regression coefficient) and $a$ (the regression constant). The regression coefficient describes the nature of the relationship between the two variables: it is the change in the predicted value of $Y$ associated with a one-unit increase in $X$. For example, if we had a regression coefficient of $10.76$, we would say that “a one-unit increase in $X$ is associated with a $10.76$-unit increase in the predicted value of $Y$.” To calculate this regression coefficient we can use the formulas:

$b = \frac{n \sum XY - \sum X \sum Y} {n \sum X^2 - (\sum X)^2}$

or

$b = (r) \frac{s_y} {s_x}$

where:

$r =$ correlation between variables $X$ and $Y$

$s_y =$ standard deviation of the $Y$ scores

$s_x =$ standard deviation of the $X$ scores

In addition to calculating the regression coefficient, we also need to calculate the regression constant. The regression constant is also the $y$-intercept and is the place where the line crosses the $y$-axis. For example, if we had an equation with a regression constant of $4.58$, we would conclude that the regression line crosses the $y$-axis at $4.58$. We use the following formula to calculate the regression constant:

$a = \frac {\sum Y - b \sum X} {n} = \bar Y - b \bar X$
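The two formulas above translate directly into a few lines of code. The following is a minimal Python sketch; the function name `least_squares_line` and the small data set are illustrative assumptions, not part of the lesson.

```python
def least_squares_line(xs, ys):
    """Return (b, a) for the line Y = bX + a using the summation formulas."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    # Regression coefficient (slope)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Regression constant (intercept)
    a = (sum_y - b * sum_x) / n
    return b, a


# Hypothetical points that lie exactly on Y = 2X + 1
b, a = least_squares_line([0, 1, 2, 3], [1, 3, 5, 7])
print(b, a)
```

Because the sample points lie exactly on a line, the fitted slope and intercept should reproduce that line, which makes this a convenient sanity check before using real data.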

Example:

Find the least-squares regression line (also known as the regression line or the line of best fit) for the example measuring the verbal SAT score and GPA that was used in the previous section.

SAT and GPA data including intermediate computations for computing a linear regression.

| Student | SAT Score $(X)$ | GPA $(Y)$ | $XY$ | $X^2$ | $Y^2$ |
|---|---|---|---|---|---|
| 1 | $595$ | $3.4$ | $2023$ | $354025$ | $11.56$ |
| 2 | $520$ | $3.2$ | $1664$ | $270400$ | $10.24$ |
| 3 | $715$ | $3.9$ | $2789$ | $511225$ | $15.21$ |
| 4 | $405$ | $2.3$ | $932$ | $164025$ | $5.29$ |
| 5 | $680$ | $3.9$ | $2652$ | $462400$ | $15.21$ |
| 6 | $490$ | $2.5$ | $1225$ | $240100$ | $6.25$ |
| 7 | $565$ | $3.5$ | $1978$ | $319225$ | $12.25$ |
| Sum | $3970$ | $22.7$ | $13262$ | $2321400$ | $76.01$ |

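Using these data, we first calculate the regression coefficient and the regression constant:

$b = \frac{n \sum XY - \sum X \sum Y} {n \sum X^2 - (\sum X)^2} = \frac{7 \cdot 13{,}262 - 3{,}970 \cdot 22.7} {7 \cdot 2{,}321{,}400 - 3{,}970^2} = \frac{2715} {488900} \approx 0.0056$

$a = \frac{\sum Y - b \sum X} {n} \approx 0.097$

The intercept is sensitive to rounding in $b$. The $XY$ column above is rounded to whole numbers; with the unrounded products, $\sum XY = 13{,}261.5$ and $b \approx 0.005546$, which is the value behind $a \approx 0.097$.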
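As a check on the hand calculation, the sketch below recomputes the slope two ways from the seven SAT/GPA pairs in the table: with the summation formula and with $b = r \cdot s_y / s_x$. The variable names are our own. Because the code uses unrounded products, it should give a slope of roughly $0.00555$ and an intercept of roughly $0.097$.

```python
import math

sat = [595, 520, 715, 405, 680, 490, 565]
gpa = [3.4, 3.2, 3.9, 2.3, 3.9, 2.5, 3.5]
n = len(sat)

mean_x = sum(sat) / n
mean_y = sum(gpa) / n

# Sums of squared deviations and of cross-products
sxx = sum((x - mean_x) ** 2 for x in sat)
syy = sum((y - mean_y) ** 2 for y in gpa)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sat, gpa))

# Slope from the summation formula
sum_xy = sum(x * y for x, y in zip(sat, gpa))
sum_x2 = sum(x * x for x in sat)
b_sums = (n * sum_xy - sum(sat) * sum(gpa)) / (n * sum_x2 - sum(sat) ** 2)

# Slope from b = r * s_y / s_x (sample standard deviations)
r = sxy / math.sqrt(sxx * syy)
s_x = math.sqrt(sxx / (n - 1))
s_y = math.sqrt(syy / (n - 1))
b_corr = r * s_y / s_x

# Intercept from a = mean(Y) - b * mean(X)
a = mean_y - b_sums * mean_x

print(f"b (sums) = {b_sums:.6f}, b (r*sy/sx) = {b_corr:.6f}, a = {a:.4f}")
```

The two slope values should agree up to floating-point error, since the formulas are algebraically equivalent.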

Now that we have the equation of this line, it is easy to plot on a scatterplot. To plot the line, we substitute two values of $X$ into the equation and calculate the corresponding $Y$ values to get two pairs of coordinates. For example, we could choose two hypothetical values for $X$ (say, $400$ and $500$) and solve for $Y$ to identify the coordinates $(400, 2.3)$ and $(500, 2.9)$, approximately. From these two points, we can draw the regression line on the scatterplot.

## Predicting Values Using Scatterplot Data

One of the uses of the regression line is to predict values. After calculating this line, we are able to predict values by simply substituting a value of the predictor variable $(X)$ into the regression equation and solving the equation for the outcome variable $(Y)$. In our example above, we can predict a student’s GPA from their SAT score by plugging the desired value into our regression equation $(\hat{Y} = 0.0056X + 0.097$, with rounded coefficients$)$.

For example, say that we wanted to predict the GPA for two students, one of whom had an SAT score of $500$ and the other of whom had an SAT score of $600$. To predict the GPA scores for these two students, we would simply plug the two values of the predictor variable ($500$ and $600$) into the equation and solve for $Y$ (see below).

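GPA/SAT data including predicted GPA values from the linear regression (computed with unrounded coefficients and rounded to one decimal place).

| Student | SAT Score $(X)$ | GPA $(Y)$ | Predicted GPA $(\hat {Y})$ |
|---|---|---|---|
| 1 | $595$ | $3.4$ | $3.4$ |
| 2 | $520$ | $3.2$ | $3.0$ |
| 3 | $715$ | $3.9$ | $4.1$ |
| 4 | $405$ | $2.3$ | $2.3$ |
| 5 | $680$ | $3.9$ | $3.9$ |
| 6 | $490$ | $2.5$ | $2.8$ |
| 7 | $565$ | $3.5$ | $3.2$ |
| Hypothetical | $600$ | | $3.4$ |
| Hypothetical | $500$ | | $2.9$ |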

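We are able to predict the value of $Y$ for any value of $X$ within the range of the observed $X$ values. Predictions for values of $X$ far outside that range are much less reliable, because we have no data showing that the linear pattern continues there.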
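The prediction step can be sketched in Python as follows. The helper names `fit_line` and `predict` are hypothetical, and the range check is one possible way to guard against extrapolation, not a rule from the lesson. The slope is computed from deviations about the means, which is algebraically equivalent to the summation formula given earlier.

```python
sat = [595, 520, 715, 405, 680, 490, 565]
gpa = [3.4, 3.2, 3.9, 2.3, 3.9, 2.5, 3.5]


def fit_line(xs, ys):
    """Least-squares slope and intercept, computed from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return b, a


def predict(x, b, a, x_min, x_max):
    """Return the predicted Y, refusing to extrapolate outside [x_min, x_max]."""
    if not x_min <= x <= x_max:
        raise ValueError(f"{x} is outside the observed range {x_min}-{x_max}")
    return b * x + a


b, a = fit_line(sat, gpa)
for score in (500, 600):
    predicted = predict(score, b, a, min(sat), max(sat))
    print(score, round(predicted, 1))
```

Rounded to one decimal place, the predictions for scores of $500$ and $600$ should match the hypothetical rows in the table above.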

## Transformations to Achieve Linearity

Sometimes we find that there is a relationship between $X$ and $Y$, but it is not best summarized by a straight line. When looking at the scatterplot graphs of correlation patterns, we called these types of relationships curvilinear. While many relationships are linear, there are quite a number that are not, including learning curves (learning more quickly at the beginning followed by a leveling out) and exponential growth (doubling in size with each unit of growth). Below is an example of a growth curve describing the growth of complex societies.

Since this is not a linear relationship, one might think that we cannot fit a regression line. However, we can perform something called a transformation to achieve a linear relationship. We commonly use transformations in everyday life. For example, the Richter scale for measuring earthquake intensity and the practice of describing pay raises in terms of percentages are both examples of transformations of non-linear data.

Let’s take a closer look at logarithms so that we can understand how they are used in nonlinear transformations. Notice that we can write the numbers $10$, $100$, and $1{,}000$ as $10 = 10^1, 100=10^2, 1{,}000 = 10^3$, etc. We can also write the numbers $2, 4$, and $8$ as $2 = 2^1, 4=2^2, 8 = 2^3$, etc. All of these equations take the form $x = c^a$, where $a$ is the power to which the base $(c)$ must be raised. We call $a$ the logarithm because it is the power to which the base must be raised to yield the number. Applying this equation, we find that $\mathrm{log}_{10} 10 = 1, \mathrm{log}_{10} 100 = 2, \mathrm{log}_{10} 1000 = 3$, etc. and $\mathrm{log}_2 2 = 1, \mathrm{log}_2 4 = 2, \mathrm{log}_2 8 = 3$, etc. Because of these rules, variables that are exponential or multiplicative (in other words, non-linear models) are linear in their logarithmic form.
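The logarithm values above can be reproduced with Python's standard `math` module. This short sketch also illustrates why multiplicative patterns become additive, and therefore linear, after taking logarithms; the variable names are our own.

```python
import math

# Base-10 and base-2 logarithms of the numbers used in the text
for value in (10, 100, 1000):
    print(f"log10({value}) = {math.log10(value)}")
for value in (2, 4, 8):
    print(f"log2({value}) = {math.log2(value)}")

# A logarithm turns multiplication into addition:
# log(100 * 1000) equals log(100) + log(1000)
product_log = math.log10(100 * 1000)
sum_of_logs = math.log10(100) + math.log10(1000)
print(product_log, sum_of_logs)
```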

In order to transform data in the linear regression model, we apply logarithmic transformations to each point in the data set. This is most easily done using either the TI-83 calculator or a computer program such as Microsoft Excel, the Statistical Package for Social Sciences (SPSS) or Statistical Analysis Software (SAS). This transformation produces a linear correlation to which we can fit a linear regression line.

Let’s take a look at an example to help clarify this concept. Say that we were interested in making a case for investing and examining how much return on investment one would get on \$100 over time. Let’s assume that we invested \$100 in the year 1900 and this money accrued $5\%$ interest every year. The table below details how much we would have each decade:

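Table of account growth assuming $5\%$ interest each year.

| Year | Investment with $5\%$ Each Year |
|---|---|
| 1900 | $100$ |
| 1910 | $163$ |
| 1920 | $265$ |
| 1930 | $432$ |
| 1940 | $704$ |
| 1950 | $1147$ |
| 1960 | $1868$ |
| 1970 | $3043$ |
| 1980 | $4956$ |
| 1990 | $8073$ |
| 2000 | $13150$ |
| 2010 | $21420$ |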

If we graphed these data points, we would see that we have an exponential growth curve.

Say that we wanted to fit a linear regression line to these data. First, we would transform these data using logarithmic transformations.

Account growth data and values after a logarithmic transformation.

| Year | Investment with $5\%$ Each Year | Log of amount |
|---|---|---|
| 1900 | $100$ | $2$ |
| 1910 | $163$ | $2.211893$ |
| 1920 | $265$ | $2.423786$ |
| 1930 | $432$ | $2.635679$ |
| 1940 | $704$ | $2.847572$ |
| 1950 | $1147$ | $3.059465$ |
| 1960 | $1868$ | $3.271358$ |
| 1970 | $3043$ | $3.483251$ |
| 1980 | $4956$ | $3.695144$ |
| 1990 | $8073$ | $3.907037$ |
| 2000 | $13150$ | $4.11893$ |
| 2010 | $21420$ | $4.330823$ |

If we graphed these transformed data, we would see that we have a linear relationship.
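To make the transformation concrete, the following Python sketch takes base-10 logarithms of the account values, fits a least-squares line to the transformed data, and then converts a fitted value back to dollars. Measuring time as years since 1900 is our own choice to keep the numbers small; it is not part of the lesson.

```python
import math

years = list(range(1900, 2011, 10))
amounts = [100, 163, 265, 432, 704, 1147, 1868, 3043, 4956, 8073, 13150, 21420]

t = [year - 1900 for year in years]             # years since 1900 (assumed rescaling)
log_amounts = [math.log10(v) for v in amounts]  # the logarithmic transformation

# Least-squares line fitted to the transformed data: log10(amount) = b*t + a
n = len(t)
mean_t = sum(t) / n
mean_log = sum(log_amounts) / n
b = (sum((ti - mean_t) * (li - mean_log) for ti, li in zip(t, log_amounts))
     / sum((ti - mean_t) ** 2 for ti in t))
a = mean_log - b * mean_t

# The slope should be close to log10(1.05), the log of the yearly growth factor
print(b, math.log10(1.05))

# Back-transform a fitted value to dollars, here for 1950 (t = 50)
predicted_1950 = 10 ** (a + b * 50)
print(round(predicted_1950))
```

The fitted slope is the logarithm of the yearly growth factor, so the straight line on the log scale carries the same information as the exponential curve on the original scale.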

## Outliers and Influential Points

An outlier is an extreme observation that does not fit the general correlation or regression pattern (see figure below). Because an outlier is, by definition, an unusual observation, including it may affect the slope and the intercept of the regression line. When examining the scatterplot graph and calculating the regression equation, it is worth considering whether extreme observations should be included.

Let’s use our example above to illustrate the effect of a single outlier. Say that we have a student who has a high GPA but suffered from test anxiety the morning of the SAT verbal test and scored a $410$. Using our original regression equation, we would expect the student to have a GPA of about $2.4$. But in reality, the student has a GPA equal to $3.9$. Including this observation would change the slope of the regression line from $0.0056$ to $0.0032$, which is quite a large difference.
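The effect of this single observation can be checked numerically. The sketch below refits the slope with and without the extra student $(410, 3.9)$; the function name `slope` is our own. With unrounded products, the original slope comes out at roughly $0.0055$ (the text's $0.0056$ uses the rounded $XY$ column), and the slope with the outlier is roughly $0.0032$.

```python
def slope(xs, ys):
    """Regression coefficient b from the summation formula."""
    n = len(xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    return (n * sum_xy - sum(xs) * sum(ys)) / (n * sum_x2 - sum(xs) ** 2)


sat = [595, 520, 715, 405, 680, 490, 565]
gpa = [3.4, 3.2, 3.9, 2.3, 3.9, 2.5, 3.5]

b_without = slope(sat, gpa)
b_with = slope(sat + [410], gpa + [3.9])  # add the outlying student

print(f"without outlier: {b_without:.4f}")
print(f"with outlier:    {b_with:.4f}")
```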

There is no set rule for deciding whether or not to include an outlier in a regression analysis. The decision depends on the sample size, how extreme the outlier is, and the normality of the distribution. As a general rule of thumb, we should consider values that are more than $1.5$ times the inter-quartile range below the first quartile or above the third quartile as outliers. Extreme outliers are values that are more than $3.0$ times the inter-quartile range below the first quartile or above the third quartile.
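The rule of thumb above can be sketched in Python with the standard `statistics` module. The function name `iqr_outliers` and the data are hypothetical. Note also that textbooks and software use slightly different conventions for computing quartiles; this sketch simply uses the default method of `statistics.quantiles`, so borderline values may be classified differently elsewhere.

```python
import statistics


def iqr_outliers(values, multiplier):
    """Return the values lying more than multiplier * IQR outside the quartiles."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low = q1 - multiplier * iqr
    high = q3 + multiplier * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical data with one unusually large value
data = [2, 3, 3, 4, 4, 5, 5, 6, 20]

outliers = iqr_outliers(data, 1.5)  # ordinary outliers
extreme = iqr_outliers(data, 3.0)   # extreme outliers
print(outliers, extreme)
```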

## Calculating Residuals and Understanding their Relation to the Regression Equation

As mentioned earlier in the lesson, the linear regression line is the line that best fits the given data. Ideally, we would like to minimize the distances from all of the data points to the regression line. These distances are called the errors $(e)$, also known as the residual values. As mentioned, we fit the regression line to the data points in a scatterplot using the least-squares method. A “good” line will have small residuals. Notice in the figure below that this calculated difference is actually the vertical distance between the observation and the predicted value on the regression line.

To find the residual values, we subtract the predicted value from the actual value $(e = Y - \hat{Y})$. For a least-squares line, the residuals always sum to $0$, because the positive and negative residuals cancel each other out. For that reason, the plain sum of the residuals is not a useful indicator of how well the line fits. Instead, we minimize the sum of the squared residuals, $\sum (Y- \hat{Y})^2$.

Example:

Calculate the residuals for the predicted and the actual GPA scores from our sample above.

Solution:

SAT/GPA data including residuals.

| Student | SAT Score $(X)$ | GPA $(Y)$ | Predicted GPA $(\hat{Y})$ | Residual Value | Residual Value Squared |
|---|---|---|---|---|---|
| $1$ | $595$ | $3.4$ | $3.4$ | $0$ | $0$ |
| $2$ | $520$ | $3.2$ | $3.0$ | $.2$ | $.04$ |
| $3$ | $715$ | $3.9$ | $4.1$ | $-.2$ | $.04$ |
| $4$ | $405$ | $2.3$ | $2.3$ | $0$ | $0$ |
| $5$ | $680$ | $3.9$ | $3.9$ | $0$ | $0$ |
| $6$ | $490$ | $2.5$ | $2.8$ | $-.3$ | $.09$ |
| $7$ | $565$ | $3.5$ | $3.2$ | $.3$ | $.09$ |
| $\sum (Y- \hat{Y})^2$ | | | | | $.26$ |
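The residuals in the table can be recomputed with a short Python sketch (variable names are our own). Because the code works with unrounded predicted values, its sum of squared residuals comes out slightly smaller than the table's $.26$, which was built from predictions rounded to one decimal place. It also confirms that the residuals of a least-squares line sum to zero, up to floating-point error.

```python
sat = [595, 520, 715, 405, 680, 490, 565]
gpa = [3.4, 3.2, 3.9, 2.3, 3.9, 2.5, 3.5]
n = len(sat)

# Fit the least-squares line
mean_x = sum(sat) / n
mean_y = sum(gpa) / n
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(sat, gpa))
     / sum((x - mean_x) ** 2 for x in sat))
a = mean_y - b * mean_x

# Residual = actual value minus predicted value
predicted = [b * x + a for x in sat]
residuals = [y - y_hat for y, y_hat in zip(gpa, predicted)]

residual_sum = sum(residuals)
sum_squared = sum(e ** 2 for e in residuals)

for x, y, y_hat, e in zip(sat, gpa, predicted, residuals):
    print(f"{x:4d} {y:4.1f} {y_hat:5.2f} {e:+.2f}")
print("sum of residuals:", round(residual_sum, 10))
print("sum of squared residuals:", round(sum_squared, 3))
```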

## Plotting Residuals and Testing for Linearity

To test for linearity and when determining if we should drop extreme observations (or outliers) from the analysis, it is helpful to plot the residuals. When plotting, we simply plot the $x$-value for each observation on the $x$ axis and then plot the residual score on the $y$-axis. When examining this scatterplot, the data points should appear to have no correlation with approximately half of the points above $0$ and the other half below $0$. In addition, the points should be evenly distributed along the $x$-axis too. Below is an example of what a residual scatterplot should look like if there are no outliers and a linear relationship.

If the plots of the residuals do not form this sort of pattern, we should examine them a bit more closely. For example, if more observations are below $0$, we may have a positive outlying residual score that is skewing the distribution, and vice versa. If the points are clustered close to the $y$-axis, we could have an $x$-value that is an outlier (see below). If this does occur, we may want to consider dropping the observation to see if this would impact the plot of the residuals. If we do decide to drop the observation, we will need to recalculate the original regression line. After this recalculation, we will have a regression line that better fits a majority of the data.
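Even without drawing the plot, two of the checks described above can be done numerically. The sketch below, using the SAT/GPA data and variable names of our own, counts the residuals above and below $0$ and verifies that the residuals have no linear association with $X$, which is always true for a least-squares line. It is only a rough aid; curved patterns still have to be spotted by looking at the plot itself.

```python
sat = [595, 520, 715, 405, 680, 490, 565]
gpa = [3.4, 3.2, 3.9, 2.3, 3.9, 2.5, 3.5]
n = len(sat)

mean_x = sum(sat) / n
mean_y = sum(gpa) / n
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(sat, gpa))
     / sum((x - mean_x) ** 2 for x in sat))
a = mean_y - b * mean_x

residuals = [y - (b * x + a) for x, y in zip(sat, gpa)]

# Roughly half of the residuals should fall on each side of zero
n_above = sum(1 for e in residuals if e > 0)
n_below = sum(1 for e in residuals if e < 0)

# Cross-product of X deviations and residuals; zero means no linear association
cross = sum((x - mean_x) * e for x, e in zip(sat, residuals))

# Pairs to plot: X on the horizontal axis, residual on the vertical axis
for x, e in sorted(zip(sat, residuals)):
    print(f"{x:4d} {e:+.2f}")
print("above zero:", n_above, "below zero:", n_below)
```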

## Lesson Summary

1. Prediction is simply the process of estimating scores on one variable based on the scores of another variable. We use the least-squares (also known as the linear) regression line to predict the value of a variable.
2. Using this regression line, we are able to use the slope (the regression coefficient) and the $y$-intercept (the regression constant) to predict the scores of a variable $(\hat{Y})$.
3. When there is a nonlinear relationship, we are able to transform the data using logarithmic and power transformations. Because exponential and multiplicative relationships become linear in their logarithmic form, these transformations can produce a linear relationship to which we can fit a regression line.
4. The difference between the actual and the predicted values is called the residual value. We can plot these residual values on a scatterplot to examine outliers and test for linearity.

## Review Questions

The school nurse is interested in predicting scores on a memory test from the number of times that a student exercises per week. Below are her observations:

A table of memory test scores compared to the number of times a student exercises per week.

| Student | Exercise Per Week | Memory Test Score |
|---|---|---|
| $1$ | $0$ | $15$ |
| $2$ | $2$ | $3$ |
| $3$ | $2$ | $12$ |
| $4$ | $1$ | $11$ |
| $5$ | $3$ | $5$ |
| $6$ | $1$ | $8$ |
| $7$ | $2$ | $15$ |
| $8$ | $0$ | $13$ |
| $9$ | $3$ | $2$ |
| $10$ | $3$ | $4$ |
| $11$ | $4$ | $2$ |
| $12$ | $1$ | $8$ |
| $13$ | $1$ | $10$ |
| $14$ | $1$ | $12$ |
| $15$ | $2$ | $8$ |

1. Please plot these data on a scatterplot ($X$ axis – Exercise Per Week; $Y$ axis – Memory Test Score).
2. Does this appear to be a linear relationship? Why or why not?
3. What regression equation would you use to construct a linear regression model?
4. What is the regression coefficient in this linear regression model and what does this mean in words?
5. Calculate the regression equation for these data.
6. Draw the regression line on the scatterplot.
7. What is the predicted memory test score of a student that exercises $3 \;\mathrm{times}$ per week?
8. Do you think that a data transformation is necessary in order to build an accurate linear regression model? Why or why not?
9. Please calculate the residuals for each of the observations and plot these residuals on a scatterplot.
10. Examine this scatterplot of the residuals. Is a transformation of the data necessary? Why or why not?

## Review Answers

1. Answer to the discretion of the teacher.
2. Yes. When plotted, the data appear to be negatively correlated and in a linear pattern.
3. $Y = bX + a$
4. $-2.951$. This regression coefficient means that each additional exercise session per week is associated with a decrease of about $2.951$ points in the predicted memory test score.
5. $\hat{Y} = -2.951 X + 13.65$
6. Answer to the discretion of the teacher
7. If a student exercised $3 \;\mathrm{times}$ per week, we would expect that they would have a memory test score of $4.8$.
8. No. A data transformation is not necessary because the relationship between the two variables is linear.
9. See the table below.

| Student | Exercise Per Week | Memory Test Score | Predicted Value | Residual Score |
|---|---|---|---|---|
| $1$ | $0$ | $15$ | $13.6$ | $1.4$ |
| $2$ | $2$ | $3$ | $7.7$ | $-4.7$ |
| $3$ | $2$ | $12$ | $7.7$ | $4.3$ |
| $4$ | $1$ | $11$ | $10.7$ | $0.3$ |
| $5$ | $3$ | $5$ | $4.8$ | $0.2$ |
| $6$ | $1$ | $8$ | $10.7$ | $-2.7$ |
| $7$ | $2$ | $15$ | $7.7$ | $7.3$ |
| $8$ | $0$ | $13$ | $13.6$ | $-0.6$ |
| $9$ | $3$ | $2$ | $4.8$ | $-2.8$ |
| $10$ | $3$ | $4$ | $4.8$ | $-0.8$ |
| $11$ | $4$ | $2$ | $1.8$ | $0.2$ |
| $12$ | $1$ | $8$ | $10.7$ | $-2.7$ |
| $13$ | $1$ | | | |
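Answers 4, 5, and 7 can be checked with a short Python sketch that applies the summation formulas to the nurse's fifteen observations. The variable names are our own; the data are copied from the review question table.

```python
exercise = [0, 2, 2, 1, 3, 1, 2, 0, 3, 3, 4, 1, 1, 1, 2]
memory = [15, 3, 12, 11, 5, 8, 15, 13, 2, 4, 2, 8, 10, 12, 8]
n = len(exercise)

sum_x = sum(exercise)
sum_y = sum(memory)
sum_xy = sum(x * y for x, y in zip(exercise, memory))
sum_x2 = sum(x * x for x in exercise)

# Regression coefficient and regression constant
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = (sum_y - b * sum_x) / n

# Predicted memory test score for a student who exercises 3 times per week
predicted_3 = b * 3 + a

print(round(b, 3), round(a, 2), round(predicted_3, 1))
```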

Feb 23, 2012

Jul 03, 2014