### Inferences and Assumptions about Linear Regression

**Hypothesis Testing for Linear Relationships**

Let’s think for a minute about the relationship between correlation and the linear regression model. As we learned, if there is no correlation between the two variables \begin{align*}X\end{align*} and \begin{align*}Y\end{align*}, then it would be nearly impossible to fit a meaningful regression line to the points on a scatterplot. If there were no correlation, so that our correlation value, or \begin{align*}r\end{align*}-value, was 0, we would always come up with the same predicted value, which would be the mean of the \begin{align*}Y\end{align*} values (and therefore also the mean of \begin{align*}\hat{Y}\end{align*}). The figure below shows an example of what a regression line fit to variables with no correlation \begin{align*}(r=0)\end{align*} would look like. As you can see, for any value of \begin{align*}X\end{align*}, we always get the same predicted value of \begin{align*}Y\end{align*}.

Using this knowledge, we can determine that if there is no relationship between \begin{align*}X\end{align*} and \begin{align*}Y\end{align*}, constructing a regression line doesn’t help us very much, because, again, the predicted score would always be the same. Therefore, when we estimate a linear regression model, we want evidence that the regression coefficient, \begin{align*}\beta\end{align*}, for the population does not equal zero. Furthermore, it is beneficial to test how far from zero the regression coefficient must be to strengthen our prediction of the \begin{align*}Y\end{align*} scores.
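The link between a zero correlation and a zero slope can be made explicit. The sketch below uses the standard least-squares formulas for the sample slope \begin{align*}b\end{align*} and intercept \begin{align*}a\end{align*}, which are brought in here as background rather than taken from this section:

\begin{align*}b &= r \, \frac{s_Y}{s_X}, \qquad a = \bar{y} - b\bar{x}\\ r = 0 \ &\Rightarrow \ b = 0, \quad a = \bar{y}, \quad \hat{Y} = \bar{y} \ \text{ for every value of } X\end{align*}

So a correlation of zero produces a horizontal regression line at the mean of the \begin{align*}Y\end{align*} values, which is why the hypothesis test is stated in terms of the slope.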

In hypothesis testing of linear regression models, the null hypothesis to be tested is that the regression coefficient, \begin{align*}\beta\end{align*}, equals zero. Our alternative hypothesis is that our regression coefficient does not equal zero.

\begin{align*}H_0 : \ \beta & = 0\\ H_a : \ \beta & \neq 0\end{align*}

The test statistic for this hypothesis test is calculated as follows:

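\begin{align*}t &= \frac{b-\beta}{s_b}\\ \text{where} \qquad s_b &= \frac{s}{\sqrt{\sum (x-\bar{x})^2}} = \frac{s}{\sqrt{SS_X}},\\ s &= \sqrt{\frac{SSE}{n-2}}, \text{ and}\\ SSE &= \sum (y-\hat{y})^2, \text{ the sum of the squared residuals}\end{align*}

Here, \begin{align*}b\end{align*} is the slope of the sample regression line, and \begin{align*}\beta\end{align*} is the value of the population slope under the null hypothesis (zero in this test).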
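The formulas above can be traced step by step in code. The following is a minimal Python sketch, not a definitive implementation: the function name `slope_t_statistic` and the five data points are hypothetical and chosen only for illustration.

```python
import math


def slope_t_statistic(xs, ys, beta0=0.0):
    """Return (b, s_b, t) for the least-squares slope of ys on xs.

    beta0 is the slope assumed under the null hypothesis (0 by default).
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n

    # SS_X and the sum of cross-products around the means
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

    b = ss_xy / ss_x          # sample slope
    a = y_bar - b * x_bar     # sample intercept

    # SSE: sum of squared residuals, then s = sqrt(SSE / (n - 2))
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))

    s_b = s / math.sqrt(ss_x)  # standard error of the slope
    t = (b - beta0) / s_b
    return b, s_b, t


# Hypothetical data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b, s_b, t = slope_t_statistic(xs, ys)
print(f"b = {b:.3f}, s_b = {s_b:.3f}, t = {t:.3f}")
```

The resulting \begin{align*}t\end{align*} would then be compared with the critical value from a \begin{align*}t\end{align*}-table at \begin{align*}n-2\end{align*} degrees of freedom.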

#### Testing the Accuracy of a Regression Equation

Let’s say that a football coach is using the results from a short physical fitness test to predict the results of a longer, more comprehensive one. He developed the regression equation \begin{align*}Y=0.635X + 1.22\end{align*}, and the standard error of estimate is 0.56. The summary statistics are as follows:

Summary statistics for two football fitness tests.

\begin{align*}& n = 24 && \sum xy = 591.50\\ &\sum x = 118 && \sum y = 104.3\\ &\bar{x} = 4.92 && \bar{y}=4.35\\ &\sum x^2 = 704 && \sum y^2 = 510.01\\ & SS_X = 123.83 && SS_Y = 56.74\end{align*}

Using \begin{align*}\alpha=0.05\end{align*}, test the null hypothesis that, in the population, the regression coefficient is zero, or \begin{align*}H_0: \ \beta=0\end{align*}.

We use the \begin{align*}t\end{align*}-distribution for this test. With \begin{align*}n-2=22\end{align*} degrees of freedom and a two-tailed \begin{align*}\alpha=0.05\end{align*}, the critical values are 2.074 standard scores above and below the mean. The test statistic can be calculated as follows:

\begin{align*}s_b & = \frac{0.56}{\sqrt{123.83}} = 0.05\\ t & = \frac{0.635-0}{0.05} = 12.70\end{align*}

Since the observed value of the test statistic (12.70) exceeds the critical value (2.074), we reject the null hypothesis. In other words, if the null hypothesis were true, we would observe a regression coefficient at least as far from zero as 0.635 by chance less than 5% of the time.
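As a check on the arithmetic, the sketch below recomputes the slope and the test statistic from the summary statistics in the table. The variable names are illustrative, and the critical value 2.074 is simply copied from the text rather than computed.

```python
import math

# Summary statistics from the football fitness example
n = 24
sum_x, sum_y = 118, 104.3
sum_xy, sum_x2 = 591.50, 704
s = 0.56            # standard error of estimate, as given in the text
t_critical = 2.074  # two-tailed critical value at 22 df, as given in the text

# Sums of squares and cross-products from the raw sums
ss_x = sum_x2 - sum_x ** 2 / n        # about 123.83
ss_xy = sum_xy - sum_x * sum_y / n

b = ss_xy / ss_x                      # sample slope, about 0.635
s_b = s / math.sqrt(ss_x)             # standard error of the slope
t = (b - 0) / s_b                     # test statistic under H0: beta = 0

reject_null = abs(t) > t_critical
print(f"SS_X = {ss_x:.2f}, b = {b:.3f}, s_b = {s_b:.4f}, t = {t:.2f}")
print("Reject H0" if reject_null else "Fail to reject H0")
```

Without intermediate rounding, \begin{align*}t\end{align*} comes out near 12.6 rather than 12.70, because the worked solution rounds \begin{align*}s_b\end{align*} to 0.05 before dividing. The decision to reject the null hypothesis is the same either way.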

**Making Inferences about Predicted Scores**

As we have mentioned, a regression line makes predictions about variables based on the relationship of the existing data. However, it is important to remember that the regression line simply infers, or estimates, what the value will be. These predictions are never accurate 100% of the time, unless there is a perfect correlation. What this means is that for every predicted value, we have a normal distribution (also known as the **conditional distribution**, since it is conditional on the \begin{align*}X\end{align*} value) that describes the likelihood of obtaining other scores that are associated with the value of the predictor variable, \begin{align*}X\end{align*}.

If we assume that these distributions are normal, we are able to make inferences about each of the predicted scores. We can ask questions like, “If the predictor variable, \begin{align*}X\end{align*}, equals 4, what percentage of the distribution of \begin{align*}Y\end{align*} scores will be lower than 3?”

The reason why we would ask questions like this depends on the scenario. Suppose, for example, that we want to know the percentage of students with a 5 on their short physical fitness test that have a predicted score higher than 5 on their long physical fitness test. If the coach is using this predicted score as a cutoff for playing in a varsity match, and this percentage is too low, he may want to consider changing the standards of the test.

To find the percentage of students with scores above or below a certain point, we use the concept of standard scores and the standard normal distribution.

For every value of \begin{align*}X\end{align*}, we have a predicted value, and we assume that the actual \begin{align*}Y\end{align*} values are normally distributed around it. This distribution has a mean (the point on the regression line) and a standard error, which we found to be equal to 0.56. In short, the conditional distribution is used to determine the percentage of \begin{align*}Y\end{align*} values above or below a certain value that are associated with a specific value of \begin{align*}X\end{align*}.

#### Calculating Probability

Using our example above, if a student scored a 5 on the short test, what is the probability that he or she would have a score of 5 or greater on the long physical fitness test?

From the regression equation \begin{align*}Y=0.635X+1.22\end{align*}, we find that the predicted score when the value of \begin{align*}X\end{align*} is 5 is 4.40. Consider the conditional distribution of \begin{align*}Y\end{align*} scores when the value of \begin{align*}X\end{align*} is 5. Under our assumption, this distribution is normally distributed around the predicted value 4.40 and has a standard error of 0.56.

Therefore, to find the percentage of \begin{align*}Y\end{align*} scores of 5 or greater, we use the general formula for a \begin{align*}z\end{align*}-score to calculate the following:

\begin{align*}z = \frac{Y-\hat{Y}}{s} = \frac{5-4.40}{0.56} = 1.07\end{align*}

Using the \begin{align*}z\end{align*}-distribution table, we find that the area to the right of a \begin{align*}z\end{align*}-score of 1.07 is 0.1423. Therefore, we can conclude that the proportion of long-test scores of 5 or greater, given a score of 5 on the short test, is 0.1423, or 14.23%.
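The same probability can be computed without a table. This is a minimal sketch using Python's standard-library `statistics.NormalDist`; the helper name `prob_at_least` is hypothetical, and the calculation rests on the same assumption as the text, namely that the conditional distribution is normal with a standard error of 0.56.

```python
from statistics import NormalDist


def prob_at_least(y_cutoff, x, slope, intercept, s):
    """P(Y >= y_cutoff | X = x), assuming Y is normal around the regression line."""
    y_hat = slope * x + intercept                 # predicted score at this x
    conditional = NormalDist(mu=y_hat, sigma=s)   # conditional distribution of Y
    return 1 - conditional.cdf(y_cutoff)


# Football example: short-test score of 5, cutoff of 5 on the long test
p = prob_at_least(5, x=5, slope=0.635, intercept=1.22, s=0.56)
print(f"P(Y >= 5 | X = 5) = {p:.3f}")
```

The result is about 0.14. It differs slightly from the table value of 0.1423 because the code keeps the unrounded prediction (4.395) and \begin{align*}z\end{align*}-score instead of rounding them to 4.40 and 1.07.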

#### Prediction Intervals

Similar to hypothesis testing for samples and populations, we can also build a confidence interval around our regression results. This helps us ask questions like “If the predictor variable, \begin{align*}X\end{align*}, is equal to a certain value, what are the likely values for \begin{align*}Y\end{align*}?” A confidence interval gives us a range of scores that has a certain percent probability of including the score that we are after. Because the interval here is built around an individual predicted score, it is often called a prediction interval.

We know that the standard error of the predicted score is smallest when \begin{align*}X\end{align*} is equal to its mean, \begin{align*}\bar{x}\end{align*}, and it increases as \begin{align*}X\end{align*} deviates from the mean. It also grows with the standard error of estimate, \begin{align*}s\end{align*}, so the weaker a predictor the regression line is, the larger the standard error of the predicted score will be. The formulas for the standard error of a predicted score and a confidence interval are as follows:

\begin{align*}s_{\hat{Y}} & = s \sqrt{1+\frac{1}{n} + \frac{(x-\bar{x})^2}{\sum (x-\bar{x})^2}}\\ CI & = \hat{Y} \pm ts_{\hat{Y}}\end{align*}

where:

\begin{align*}\hat{Y}\end{align*} is the predicted score.

\begin{align*}t\end{align*} is the critical value for \begin{align*}n-2\end{align*} degrees of freedom.

\begin{align*}s_{\hat{Y}}\end{align*} is the standard error of the predicted score.
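To illustrate how the standard error of the predicted score changes with \begin{align*}X\end{align*}, here is a short Python sketch. The function name `se_predicted` is hypothetical, and the inputs are the summary values from the football example (\begin{align*}s=0.56\end{align*}, \begin{align*}n=24\end{align*}, \begin{align*}\bar{x}=4.92\end{align*}, \begin{align*}SS_X=123.83\end{align*}).

```python
import math


def se_predicted(x, s, n, x_bar, ss_x):
    """Standard error of a predicted score at x, following the formula above."""
    return s * math.sqrt(1 + 1 / n + (x - x_bar) ** 2 / ss_x)


# Summary values from the football fitness example
s, n, x_bar, ss_x = 0.56, 24, 4.92, 123.83

se_at_mean = se_predicted(x_bar, s, n, x_bar, ss_x)
se_at_4 = se_predicted(4, s, n, x_bar, ss_x)
se_at_2 = se_predicted(2, s, n, x_bar, ss_x)

# The standard error is smallest at the mean of X and grows as x moves away
for label, value in (("x = mean", se_at_mean), ("x = 4", se_at_4), ("x = 2", se_at_2)):
    print(f"{label}: {value:.4f}")
```

The values increase as \begin{align*}x\end{align*} moves away from 4.92, so intervals built near the edges of the data are wider than those built near the center.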

#### Developing Confidence Intervals

Develop a 95% confidence interval for the predicted score of a student who scores a 4 on the short physical fitness exam.

We calculate the standard error of the predicted score using the formula as follows:

\begin{align*}s_{\hat{Y}} = s\sqrt{1+ \frac{1}{n}+\frac{(x-\bar{x})^2}{\sum(x-\bar{x})^2}} = 0.56 \sqrt{1+ \frac{1}{24} + \frac{(4-4.92)^2}{123.83}} = 0.57\end{align*}

Using the general formula for a confidence interval, we can calculate the answer as shown:

\begin{align*}CI & = \hat{Y} \pm ts_{\hat{Y}} \\ CI_{0.95} & = 3.76 \pm (2.074)(0.57)\\ CI_{0.95} & = 3.76 \pm 1.18\\ CI_{0.95} & = (2.58, 4.94)\end{align*}

Therefore, we can say that we are 95% confident that given a student's short physical fitness test score, \begin{align*}X\end{align*}, of 4, the interval from 2.58 to 4.94 will contain the student's score for the longer physical fitness test.
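The interval can be reproduced in a few lines of Python. This is a sketch under the same inputs as the worked solution; the function name `prediction_interval` is hypothetical, and the critical value 2.074 is taken from the text rather than computed.

```python
import math


def prediction_interval(x, slope, intercept, s, n, x_bar, ss_x, t_crit):
    """Return (lower, upper) bounds for an individual predicted score at x."""
    y_hat = slope * x + intercept
    se = s * math.sqrt(1 + 1 / n + (x - x_bar) ** 2 / ss_x)
    margin = t_crit * se
    return y_hat - margin, y_hat + margin


# Football example: short-test score of 4, 95% interval, 22 degrees of freedom
lower, upper = prediction_interval(
    x=4, slope=0.635, intercept=1.22,
    s=0.56, n=24, x_bar=4.92, ss_x=123.83, t_crit=2.074,
)
print(f"95% interval: ({lower:.2f}, {upper:.2f})")
```

The bounds come out near 2.57 and 4.95. They differ from (2.58, 4.94) in the second decimal place only because the worked solution rounds the standard error to 0.57 before multiplying.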

**Regression Assumptions**

We make several assumptions under a linear regression model, including:

At each value of \begin{align*}X\end{align*}, there is a distribution of \begin{align*}Y\end{align*}. These distributions have a mean centered at the predicted value and a standard error that is calculated using the sum of squares.

Using a regression model to predict scores only works if the regression line is a good fit to the data. If this relationship is non-linear, we could either transform the data (e.g., with a logarithmic transformation) or try one of the other regression equations that are available with Excel or a graphing calculator.

The standard deviations (and therefore the variances) of these conditional distributions are equal for all of the predicted values. This is called *homoscedasticity*.

Finally, for each given value of \begin{align*}X\end{align*}, the values of \begin{align*}Y\end{align*} are independent of each other.

### Example

The following example uses data from the previous section:

Verbal SAT scores were used to predict the GPA of students. From the data, we found this least squares regression line:

\begin{align*}\hat{Y}=0.0055X+0.097\end{align*}

We also found that \begin{align*}SSE=0.26\end{align*} for the \begin{align*}n=7\end{align*} samples.

#### Example 1

Suppose a student scores a 650 on the verbal SAT. Assuming that GPAs are normally distributed around the predicted value, what is the probability that the student will have a GPA of at least 3.8?

Using the least squares regression line, we will predict the GPA for a verbal SAT score of 650:

\begin{align*}\hat{Y}=0.0055(650)+0.097=3.575+0.097=3.672\end{align*}

This means that a verbal SAT score of 650 predicts that the student will have a GPA of 3.672. The student could have a higher or a lower GPA, though, so now we will look at the probability that a student with a verbal SAT score of 650 has a GPA of at least 3.8. First we have to find the standard error of estimate, and then the \begin{align*}z\end{align*}-score:

\begin{align*}s &= \sqrt{\frac{SSE}{n-2}}=\sqrt{\frac{0.26}{7-2}}=\sqrt{\frac{0.26}{5}}\approx 0.228\end{align*}

\begin{align*}z = \frac{Y-\hat{Y}}{s} = \frac{3.8-3.672}{0.228} \approx 0.56 \end{align*}

Now we simply look at the \begin{align*}z\end{align*}-table to find the probability of getting a \begin{align*}z\end{align*}-score of 0.56 or higher.

\begin{align*}P(z>0.56)=0.288\end{align*}

The probability of having a GPA of at least 3.8 when scoring a 650 on the verbal SAT is 0.288.
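The steps of this example can be checked with a short Python sketch. It uses only the values given above (the regression line, \begin{align*}SSE=0.26\end{align*}, and \begin{align*}n=7\end{align*}) together with the standard-library `statistics.NormalDist`; the variable names are illustrative.

```python
import math
from statistics import NormalDist

# Values given in the example
slope, intercept = 0.0055, 0.097
sse, n = 0.26, 7
sat_score, gpa_cutoff = 650, 3.8

y_hat = slope * sat_score + intercept   # predicted GPA
s = math.sqrt(sse / (n - 2))            # standard error of estimate
z = (gpa_cutoff - y_hat) / s            # z-score of the cutoff
p = 1 - NormalDist().cdf(z)             # area to the right of z

print(f"predicted GPA = {y_hat:.3f}, s = {s:.3f}, z = {z:.2f}, P = {p:.3f}")
```

The computed probability is about 0.287, which agrees with the table value of 0.288 up to the rounding of \begin{align*}z\end{align*} to two decimal places.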

### Review

For 1-10, a college counselor is putting on a presentation about the financial benefits of further education and takes a random sample of 120 parents. Each parent was asked a number of questions, including the number of years of education that he or she has (including college) and his or her yearly income (recorded in the thousands of dollars). The summary data for this survey are as follows:

\begin{align*}n=120 \quad r=0.67\end{align*}

\begin{align*}\sum x = 1,782 \quad \sum y =1,854 \end{align*}

\begin{align*}s_X=3.6 \quad s_Y=4.2 \end{align*}

\begin{align*}s_{XY}=3.12 \quad SS_X = 1542\end{align*}

1. What is the predictor variable? What is your reasoning behind this decision?
2. Do you think that these two variables (income and level of formal education) are correlated? If so, please describe the nature of their relationship.
3. What would be the regression equation for predicting income, \begin{align*}Y\end{align*}, from the level of education, \begin{align*}X\end{align*}?
4. Using this regression equation, predict the income for a person with 2 years of college (13.5 years of formal education).
5. Test the null hypothesis that in the population, the regression coefficient for this scenario is zero.
    - First develop the null and alternative hypotheses.
    - Set the significance level to \begin{align*}\alpha = 0.05\end{align*}.
    - Compute the test statistic.
    - Make a decision regarding the null hypothesis.
6. For those parents with 15 years of formal education, what is the percentage who will have an annual income greater than $18,500?
7. For those parents with 12 years of formal education, what is the percentage who will have an annual income greater than $18,500?
8. Develop a 95% confidence interval for the predicted annual income when a parent indicates that he or she has a college degree (i.e., 16 years of formal education).
9. If you were the college counselor, what would you say in the presentation to the parents and students about the relationship between further education and salary? Would you encourage students to further their education based on these analyses? Why or why not?
10. Using the same null and alternative hypotheses and test statistic as you did in question 5, make a decision at the significance level of \begin{align*}\alpha = 0.01\end{align*}.

### Review (Answers)

To view the Review answers, open this PDF file and look for section 9.3.