# 6.3: Least-Squares Regression

**At Grade** Created by: Bruce DeWItt

### Learning Objectives

- Construct scatterplots using technology
- Calculate and graph the least-squares regression line using technology
- Calculate the correlation coefficient using technology
- Use the LSRL to make predictions
- Understand interpolation and extrapolation
- Interpret the slope and the y-intercept of the LSRL

### Least-Squares Regression

In the last section we learned about the concept of correlation, which we defined as the measure of the linear relationship between two numerical variables. We saw that when the points of a scatterplot formed a clear linear pattern, then the points were said to have a high correlation. Scatterplots can have a strong correlation in either a positive (increasing to the right) or a negative (decreasing to the right) direction. We have also discussed the idea of drawing a line-of-best-fit through the data. In some scatterplots this is easy to do and all of us would end up with our lines in nearly the same place. However, if everyone were to simply draw a line where they think it fits or to select two of the points to calculate a line through, our lines and equations would certainly vary from person to person. Therefore, we will use a specific formula to calculate the equation for the line-of-best-fit.

Linear regression involves using data to calculate a line that best fits the data and then using that line to predict values. We will use the **Least-Squares Regression Line (LSRL)**: the line that makes the sum of the squares of the vertical distances from the data points to the line as small as possible. This is the standard regression equation that is used most often, and it is the one that your graphing calculator and Excel will calculate for you. The formula and process for calculating it are quite tedious, so we will use technology to find LSRL equations. The regression equation will be in the form \begin{align*}\hat{y}= a + bx\end{align*}, where **a** is the y-intercept and **b** is the slope. Your calculator will calculate the correlation coefficient (**r**) at the same time as it calculates the LSRL equation. Many will also report a value for \begin{align*}r^2\end{align*} (which is exactly what it says: r squared). The \begin{align*}r^2\end{align*} value is called the coefficient of determination. It reports the proportion of the variation in the response variable that is explained by the LSRL. We will not be addressing its importance in this course.

To calculate the LSRL equation and correlation coefficient, use a graphing calculator or computer program. See the appendix at the end of this book for the steps to calculate the LSRL and correlation.
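For readers using a computer rather than a calculator, the following is a minimal Python sketch of the same calculation. It assumes the standard textbook formulas (slope = sum of cross-products divided by the sum of squared x-deviations, intercept = mean of y minus slope times mean of x). The function name `least_squares` and the small data set are made up for illustration and do not come from this book.

```python
from math import sqrt

def least_squares(xs, ys):
    """Return (a, b, r) for the line y-hat = a + b*x fitted to paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sums of squared deviations and cross-products about the means.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx               # slope
    a = y_bar - b * x_bar       # y-intercept
    r = sxy / sqrt(sxx * syy)   # correlation coefficient
    return a, b, r

# Made-up data for illustration only (not from this book).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b, r = least_squares(xs, ys)
print(f"y-hat = {a:.3f} + {b:.3f}x, r = {r:.4f}, r^2 = {r * r:.4f}")
```

A graphing calculator or spreadsheet given the same five points should report the same intercept, slope, and correlation, up to rounding.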

### Interpreting the slope and y-intercept

As with all of our statistics, these data, graphs and equations are not meaningless. They represent the relationship between two numerical values measured on several specific individuals. Thus the slope and the y-intercept of our newly calculated regression equation mean something as well, and we will interpret both in context. The **interpretation of the slope** of the regression equation is the average rate of change in the response variable (y) for each increase of one unit in the explanatory variable (x). You will say something like: *For each increase of one (unit of the explanatory variable), there will be an average (increase or decrease) of (slope value) in the (response variable).*

The **interpretation of the y-intercept** of the regression equation is the predicted value of the response variable (y) when the explanatory variable (x) is zero. You will say something like: *When (explanatory variable) is zero, the (response variable) is predicted to be (y-intercept value).* You will discover that the interpretation of the y-intercept often makes absolutely no sense when put into context. This is because actual data rarely involves x-values of zero.
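Both interpretations can be read directly from the form of the equation. As a short check, using only the symbols a, b, and x from the regression equation (the notation \begin{align*}\hat{y}(x)\end{align*} for "the prediction at x" is introduced here only for convenience), increasing x by one unit changes the prediction by exactly b, and substituting x = 0 leaves only a:

\begin{align*}\hat{y}(x+1) - \hat{y}(x) = [a + b(x+1)] - [a + bx] = b\end{align*}

\begin{align*}\hat{y}(0) = a + b(0) = a\end{align*}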

#### Example 1

Below is data given by a canine expert. It relates a dog's age in years to what they believe the equivalent age in human years to be.

The scatterplot showing this data, using dog age as the explanatory variable, is below.

a) Calculate the Least-Squares Regression Line for the Dog Year Data. Report your equation. Be sure to identify your variables.

b) Calculate the correlation (r). What two things does r tell us about this relationship?

c) Identify and interpret the slope in the context of the problem.

d) Identify and interpret the y-intercept in the context of the problem.

#### Solution

a) This was done using Excel, but a graphing calculator will report the same LSRL.

LSRL is: \begin{align*}\hat{y} = 7.795 + 4.642x\end{align*}

x = Dog age in years

y = equivalent human years (predicted)

b) r will be the square root of \begin{align*}r^2\end{align*}, taking the positive root because the slope is positive. (Graphing calculators report both r and \begin{align*}r^2\end{align*}, so you would not need to do any calculating, but Excel only gave \begin{align*}r^2\end{align*}.)

\begin{align*}r =\sqrt{r^2}= \sqrt{0.9815}= 0.9907\end{align*}

The two things that r tells us are: because r is positive, this relationship is increasing, and because r is very close to one, this relationship is very strong.
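When software reports only the coefficient of determination, r can be recovered by taking the square root and giving it the same sign as the slope. The short Python sketch below illustrates this with the two values reported in this example; the variable names are chosen here for illustration.

```python
from math import copysign, sqrt

r_squared = 0.9815   # coefficient of determination reported in Example 1
slope = 4.642        # slope of the LSRL reported in Example 1

# r always has the same sign as the slope, so attach that sign to the root.
r = copysign(sqrt(r_squared), slope)
print(round(r, 4))
```

Had the slope been negative, the same code would return the negative root instead.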

c) The slope is 4.642. It means that for every increase of one year in dog age, there is an average increase of 4.642 years in the equivalent human age.

d) The y-intercept is 7.795. It means that if a dog were 0 years old, it would be predicted to be 7.795 years in human years. (This is clearly nonsense in this case. It would make sense that both start at zero.)

### Making Predictions

The main use of the regression line is to predict values. After calculating this line, we are able to predict values by simply substituting a value for the explanatory variable (x) and evaluating the equation to get the predicted response value (\begin{align*}\hat{y}\end{align*}). In our example above, we can predict that the human-year equivalent for a dog that is 6 years old is approximately 35.6 human years (see equation below). This prediction is reasonable and it matches our graph. However, this is not always the case.

\begin{align*}\hat{y}= 7.795 + 4.642(6) = 35.647\end{align*}

As you look at the LSRL drawn on the above scatterplot, you can see that the points to the far left do not appear to be very linear. So, using the line to the left of about 1 year will not make much sense. Also, we do not have any idea what will happen to the data beyond the 11 years that we have recorded. An LSRL is very useful in making predictions, but only within the range of the actual data that we have collected and can see. This is called **interpolation**. We can see that this line is a reasonably good fit between 1 and 11 dog years, but we simply do not know what happens beyond 11 years (and we cannot use negative years for obvious reasons). The prediction line that we have calculated will go on forever in both directions (remember geometry?), but it will not be appropriate to use it to predict for all values of x. Using a regression line to predict values that are outside the range of our actual data is called **extrapolation**. Extrapolation will often yield ridiculous answers! Even when the result seems reasonable, we should avoid extrapolating because we simply do not know what happens beyond our actual observations. Making decisions based on extrapolation can be dangerous because we are coming to conclusions that are not backed up by data.
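One way to keep this distinction in view when using software is to label every prediction as an interpolation or an extrapolation. The Python sketch below does this for the dog-year equation. The bounds of 1 and 11 years are an assumption taken from the range over which the text says the line fits reasonably well, and the function name is hypothetical.

```python
A, B = 7.795, 4.642      # intercept and slope reported in Example 1
X_MIN, X_MAX = 1, 11     # assumed range of dog ages where the line fits

def predict_human_years(dog_age):
    """Return (prediction, kind); kind is 'interpolation' or 'extrapolation'."""
    y_hat = A + B * dog_age
    kind = "interpolation" if X_MIN <= dog_age <= X_MAX else "extrapolation"
    return y_hat, kind

for age in (6, 15):
    y_hat, kind = predict_human_years(age)
    print(f"{age} dog years -> {y_hat:.1f} human years ({kind})")
```

The equation happily produces a number for a 15-year-old dog, but the label is a reminder that no observed data backs up that prediction.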

#### Example 2

The following table lists the GPA and Verbal SAT Score for seven students. Analyze how well Verbal SAT Scores can be used to predict students' GPAs based on this data.

a) Construct a scatterplot on your graphing calculator (or computer). Sketch the graph that the calculator shows. Be sure to label your axes.

b) Calculate the Least-Squares Regression Line (LSRL) using your calculator. Report your equation. Be sure to identify your variables.

c) Calculate the correlation coefficient (r). Report it here. What are the two things that this number tells us about this graph?

d) Identify and interpret the slope in the context of the problem.

e) Using your equation, what is the predicted GPA of a student who has a Verbal SAT Score of 500? Of a student with a score of 600?

#### Solution

a) Construct a scatterplot on your graphing calculator (or computer). Sketch the graph that the calculator shows. Be sure to label your axes.

Here is the scatterplot from a TI-84 Plus:

Here are the LSRL, correlation, and the scatterplot with the line added to the graph, from a TI-84 Plus:

b) Calculate the Least-Squares Regression Line (LSRL) using your calculator. Report your equation. Be sure to identify your variables.

LSRL is: \begin{align*}\hat{y} = 0.097 + 0.0055x\end{align*}

x = Verbal SAT Score

y = predicted GPA

c) Calculate the correlation coefficient (r). Report it here. What are the two things that this number tells us about this graph?

The correlation is r = +0.9467. This tells us that the relationship is positive and strong.

d) Identify and interpret the slope in the context of this problem.

The slope is 0.0055. This tells us that for each increase of 1 point on the Verbal SAT Score, there will be an average increase of 0.0055 in a student's GPA.

e) Using your equation, what is the predicted GPA of a student who has a Verbal SAT Score of 500? Of a student with a score of 600?

\begin{align*}\hat{y} = 0.097 + 0.0055(500) = 2.847 \end{align*}

\begin{align*}\hat{y} = 0.097 + 0.0055(600) = 3.397\end{align*}

So, the predicted GPA for a student who scores 500 on the SAT Verbal is approximately 2.8.

And the predicted GPA for a student who scores 600 on the SAT Verbal is approximately 3.4.
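The two predictions also give a quick check on the slope interpretation from part (d): a 100-point increase in score should raise the predicted GPA by 100 times the slope. A minimal Python sketch, using the equation reported in part (b) and a function name chosen here for illustration:

```python
def predict_gpa(sat_verbal):
    """Predicted GPA from the LSRL reported in Example 2."""
    return 0.097 + 0.0055 * sat_verbal

gpa_500 = predict_gpa(500)
gpa_600 = predict_gpa(600)
print(round(gpa_500, 3), round(gpa_600, 3))

# A 100-point increase in score changes the prediction by 100 * slope.
print(round(gpa_600 - gpa_500, 3))
```

The difference between the two predictions is 0.55, which is 100 × 0.0055, as the slope interpretation says it should be.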

### Outliers and Influential Points

An outlier is an extreme observation that does not fit the general pattern of the data (see the example below). Because an outlier is an extreme observation, including it may affect both the correlation and the equation of the least-squares regression line. When examining a scatterplot and calculating the regression equation, it is worth considering whether extreme observations should be included or not.

Let's use our GPA example to illustrate the effect of a single outlier. Suppose that we have a student who has scored very high on the SAT Verbal exam, but has a lower GPA. We will change Corbin's results to be 715 on the SAT and a GPA = 2.2, and see what happens to the LSRL and correlation.

Here are the LSRL equation and the correlation coefficient recalculated with Corbin's GPA changed:

As you can see, this one change turned Corbin into an outlier. This caused the correlation to drop from r = 0.947 all the way down to r = 0.317. This is a huge change: it makes the relationship between the two variables look weak (rather than very strong). It also changed both the slope and the y-intercept of the LSRL equation dramatically. This means that predictions based on this LSRL will have different results than those based on the LSRL with Corbin's old GPA.

There is no set rule when trying to decide how to deal with outliers in regression analysis, but you can now see how an outlier really can change everything when it comes to scatterplots, correlation and least-squares regression. Be sure to mention any potential outliers that you observe in any scatterplot.
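The effect of a single outlier is easy to reproduce on a computer. The Python sketch below uses made-up scores and GPAs (not the table from Example 2) and a hand-written `correlation` function, then replaces one student with a high score and a low GPA to see how much r changes.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation coefficient r for paired data."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up data for illustration (NOT the table from Example 2).
scores = [400, 450, 500, 550, 600, 650, 700]
gpas = [2.3, 2.6, 2.8, 3.1, 3.4, 3.6, 3.9]
r_original = correlation(scores, gpas)

# Replace the last student with a high score but a low GPA.
scores_changed = scores[:-1] + [715]
gpas_changed = gpas[:-1] + [2.2]
r_changed = correlation(scores_changed, gpas_changed)

print(f"r without outlier: {r_original:.3f}")
print(f"r with outlier:    {r_changed:.3f}")
```

With these made-up numbers, changing one point out of seven is enough to take the correlation from nearly 1 to well below 0.5.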

### Problem Set 6.3

#### Section 6.3 Exercises

1) Malia turned the water on in her bathtub full blast. She then measured the depth of the water every two minutes until the bathtub was full. Her findings are listed in the following table. In section 6.1 we constructed a scatterplot and described the plot. We are now going to analyze this data further.

a) Construct a scatterplot on your graphing calculator (or computer). Sketch the graph that the calculator shows. Be sure to label your axes.

b) Calculate the Least-Squares Regression Line (LSRL) using your calculator. Report your equation. Be sure to identify your variables.

c) Calculate the correlation coefficient (r). Report it here. What are the two things that this number tells us about this graph?

d) Identify and interpret the slope in the context of the problem.

e) Using your equation, what is the predicted depth of the water after 17 minutes? After one hour?

f) Are your answers in (e) reasonable? Why or why not?

2) The following table shows the progression of the Federal Minimum Wage in the United States since 1938 (source: http://www.laborlawcenter.com). We are going to analyze the relationship between year and minimum wage to see if there is a predictable relationship between the variables.

a) Using *year only* as the explanatory variable (ignore month & day), construct a scatterplot. Sketch the graph that the calculator shows. Be sure to label your axes.

b) Describe the relationship between the two variables. (S.C.O.F.D.)

c) Calculate the Least-Squares Regression Line (LSRL). Add the line to your graph and report your equation. Be sure to identify your variables.

d) Calculate the correlation (r). Even though r is very high, do you feel that a line is the best model for this data? Why or why not?

e) Based on your model, what would you predict the Federal Minimum Wage to be in 2012? Is this an accurate prediction? Why or why not?

f) Based on your model, what would you predict the minimum wage to have been in 1968? How close is this to the actual minimum wage that year?

3) Suppose that some researchers analyzed the relationship between fathers' and sons' IQ scores for a group of men. Suppose further that they discovered that the relationship was reasonably linear and they calculated a regression line of \begin{align*}\hat{y}= 12 + 0.9x \end{align*}, where x = father's IQ and y = son's IQ.

a) Identify the explanatory and response variables.

b) Identify and interpret the slope in the context of the problem.

c) Identify and interpret the y-intercept in the context of the problem.

d) Do your answers to (b) and (c) seem reasonable? Why or why not?

e) What would you predict a son's IQ to be if his father has an IQ of 120? What if the father had an IQ of 140?

f) If you knew that the original data included fathers with IQs from 108 to 145, explain why it would be inappropriate to use your model to predict a son's IQ if his father's IQ were 170.

#### Review Exercises

(for 4 - 7) Suppose that Marco, the star of the basketball team, makes 79% of the free-throws that he attempts. Assuming that each free-throw is independent, answer the following questions.

4) What is the probability that Marco will make three free-throws in a row?

5) What is the probability that Marco will make exactly two out of three free-throws?

6) What is the probability that Marco will miss at least one of his next four free-throws?

7) If you were going to set up a simulation to estimate this scenario, which of the following would **not** be an appropriate way to assign the digits?

A. 01-79 represents *makes*, 80-99 & 00 represents *misses*

B. 01-21 represents *misses*, 22-99 & 00 represents *makes*

C. 00-79 represents *makes*, 80-99 represents *misses*

D. 00-20 represents *misses*, 21-99 represents *makes*

E. 00-78 represents *makes*, 79-99 represents *misses*