### Inferences and Assumptions about Linear Regression

**Hypothesis Testing for Linear Relationships**

Let’s think for a minute about the relationship between correlation and the linear regression model. As we learned, if there is no correlation between the two variables

Using this knowledge, we can determine that if there is no relationship between

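In hypothesis testing of linear regression models, the null hypothesis to be tested is that the regression coefficient, \begin{align*}\beta\end{align*}, equals zero. The alternative hypothesis is that the regression coefficient does not equal zero.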

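The test statistic for this hypothesis test is calculated as follows:

\begin{align*}t=\frac{b-\beta}{s_b}, \quad s_b=\frac{s}{\sqrt{SS_X}}, \quad s=\sqrt{\frac{SSE}{n-2}}\end{align*}

where \begin{align*}b\end{align*} is the sample regression coefficient, \begin{align*}\beta\end{align*} is its hypothesized population value (zero under the null hypothesis), \begin{align*}s\end{align*} is the standard error of estimate, and \begin{align*}SS_X\end{align*} is the sum of squares of \begin{align*}X\end{align*}. The statistic is compared against the \begin{align*}t\end{align*}-distribution with \begin{align*}n-2\end{align*} degrees of freedom.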
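As a rough illustration of the arithmetic behind this test, the sketch below assumes the usual form of the statistic, t = (b - beta) / s_b with s_b = s / sqrt(SS_X) and s = sqrt(SSE / (n - 2)). The function name `slope_t_statistic` and all input values are hypothetical and chosen only to keep the numbers easy to follow; the critical value is read from a t-table rather than computed.

```python
import math

def slope_t_statistic(b, sse, n, ss_x, beta0=0.0):
    """Return (t, df) for testing H0: beta = beta0 in simple linear regression."""
    df = n - 2                      # degrees of freedom for the slope test
    s = math.sqrt(sse / df)         # standard error of estimate
    s_b = s / math.sqrt(ss_x)       # standard error of the regression coefficient
    return (b - beta0) / s_b, df

# Hypothetical inputs, not taken from the lesson's data.
t, df = slope_t_statistic(b=0.5, sse=8.0, n=10, ss_x=25.0)

# Two-tailed critical value at alpha = 0.05 for 8 degrees of freedom (t-table).
t_critical = 2.306
reject_null = abs(t) > t_critical
print(f"t = {t:.2f}, df = {df}, reject H0: {reject_null}")
```

The null hypothesis is rejected when the absolute value of the statistic exceeds the tabled critical value for n - 2 degrees of freedom.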

#### Testing the Accuracy of a Regression Equation

Let’s say that a football coach is using the results from a short physical fitness test to predict the results of a longer, more comprehensive one. He developed the regression equation

Summary statistics for two football fitness tests.

Using

We use the

Since the observed value of the test statistic exceeds the critical value, we reject the null hypothesis. In other words, if the null hypothesis were true, we would observe a regression coefficient as large as 0.635 by chance less than 5% of the time.

**Making Inferences about Predicted Scores**

As we have mentioned, a regression line makes predictions about variables based on the relationship in the existing data. However, it is important to remember that the regression line only estimates what the value will be. These predictions are never exactly right every time, unless there is a perfect correlation. What this means is that for every predicted value, we have a normal distribution (also known as the **conditional distribution**, since it is conditional on the value of \begin{align*}X\end{align*}) of \begin{align*}Y\end{align*} scores around that predicted value.

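If we assume that these distributions are normal, we are able to make inferences about each of the predicted scores. We can ask questions like, “If the predictor variable, \begin{align*}X\end{align*}, equals a given value, what percentage of the distribution of \begin{align*}Y\end{align*} scores will fall above or below a certain score?”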

The reason why we would ask questions like this depends on the scenario. Suppose, for example, that we want to know the percentage of students with a 5 on their short physical fitness test that have a predicted score higher than 5 on their long physical fitness test. If the coach is using this predicted score as a cutoff for playing in a varsity match, and this percentage is too low, he may want to consider changing the standards of the test.

To find the percentage of students with scores above or below a certain point, we use the concept of standard scores and the standard normal distribution.

Since we have a certain predicted value for every value of

#### Calculating Probability

Using our example above, if a student scored a 5 on the short test, what is the probability that he or she would have a score of 5 or greater on the long physical fitness test?

From the regression equation

Therefore, to find the percentage of

Using the

#### Prediction Intervals

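Similar to hypothesis testing for samples and populations, we can also build a confidence interval around our regression results. This helps us ask questions like “If the predictor variable, \begin{align*}X\end{align*}, is equal to a specific value, what are the likely values for \begin{align*}Y\end{align*}?”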

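We know that the standard error of the predicted score is smaller when the predicted value is close to the actual value, and it increases as \begin{align*}X\end{align*} deviates from its mean, \begin{align*}\bar{X}\end{align*}. The standard error of a predicted score and the corresponding interval take the following form:

\begin{align*}s_{\hat{Y}}=s\sqrt{1+\frac{1}{n}+\frac{(X-\bar{X})^2}{SS_X}}\end{align*}

\begin{align*}CI=\hat{Y} \pm t \cdot s_{\hat{Y}}\end{align*}

where:

- \begin{align*}\hat{Y}\end{align*} is the predicted score
- \begin{align*}t\end{align*} is the critical value of the \begin{align*}t\end{align*}-distribution for \begin{align*}n-2\end{align*} degrees of freedom
- \begin{align*}s_{\hat{Y}}\end{align*} is the standard error of the predicted score
- \begin{align*}s\end{align*} is the standard error of estimate and \begin{align*}SS_X\end{align*} is the sum of squares of \begin{align*}X\end{align*}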
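The interval calculation can be sketched in a few lines of Python. This is a minimal sketch that assumes the standard form of the standard error of a predicted score, s * sqrt(1 + 1/n + (X - mean of X)^2 / SS_X), and takes the t critical value as an input read from a table. The function name `prediction_interval` and every number passed to it are hypothetical illustrations, not values from the lesson.

```python
import math

def prediction_interval(x, intercept, slope, s, n, x_mean, ss_x, t_crit):
    """Return (predicted score, its standard error, (lower, upper) interval)."""
    y_hat = intercept + slope * x
    # The standard error grows as x moves away from the mean of X.
    se_pred = s * math.sqrt(1 + 1 / n + (x - x_mean) ** 2 / ss_x)
    margin = t_crit * se_pred
    return y_hat, se_pred, (y_hat - margin, y_hat + margin)

# Hypothetical inputs; t_crit = 2.306 is the two-tailed 5% value for 8 df.
y_hat, se_pred, interval = prediction_interval(
    x=6.0, intercept=1.0, slope=2.0, s=1.0, n=10,
    x_mean=5.0, ss_x=25.0, t_crit=2.306,
)
print(f"predicted = {y_hat:.2f}, SE = {se_pred:.3f}, "
      f"interval = ({interval[0]:.2f}, {interval[1]:.2f})")
```

Calling the function with `x` equal to `x_mean` gives the narrowest interval, which matches the idea that predictions are most precise near the center of the data.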

#### Developing Confidence Intervals

Develop a 95% confidence interval for the predicted score of a student who scores a 4 on the short physical fitness exam.

We calculate the standard error of the predicted score using the formula as follows:

Using the general formula for a confidence interval, we can calculate the answer as shown:

Therefore, we can say that we are 95% confident that given a student's short physical fitness test score,

**Regression Assumptions**

We make several assumptions under a linear regression model, including:

- At each value of \begin{align*}X\end{align*}, the distribution of \begin{align*}Y\end{align*} is normal.
- The relationship between \begin{align*}X\end{align*} and \begin{align*}Y\end{align*} is linear. Using a regression model to predict scores only works if the regression line is a good fit to the data. If the relationship is non-linear, we could either transform the data (e.g., with a logarithmic transformation) or try one of the other regression equations that are available with Excel or a graphing calculator.
- The conditional distributions have equal standard deviations, and therefore equal variances, at every predicted value. This is called *homoscedasticity*.
- Finally, for each given value of \begin{align*}X\end{align*}, the values of \begin{align*}Y\end{align*} are independent of each other.
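One informal way to look at these assumptions is to inspect the residuals (observed minus predicted values). The sketch below is only a rough check under stated assumptions, not a formal test: it fits a least squares line to a small hypothetical data set and compares the spread of the residuals in the lower and upper halves of the \begin{align*}X\end{align*} range. The function names and the data are invented for illustration.

```python
import statistics

def fit_line(xs, ys):
    """Return (slope, intercept) of the least squares line."""
    x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ss_x
    return slope, y_bar - slope * x_bar

def residuals(xs, ys):
    """Return observed minus predicted values for each data point."""
    slope, intercept = fit_line(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Hypothetical data, already sorted by x.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

res = residuals(xs, ys)
half = len(res) // 2
low_spread = statistics.pstdev(res[:half])    # spread for smaller x values
high_spread = statistics.pstdev(res[half:])   # spread for larger x values
print(f"residual spread: low half = {low_spread:.3f}, high half = {high_spread:.3f}")
```

Roughly similar spreads in both halves are consistent with homoscedasticity, while a clear trend or curve in the residuals would suggest that the linear model is not a good fit.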

### Example

The following example uses data from the previous section:

Verbal SAT scores were used to predict the GPA of students. From the data, we found this least squares regression line:

\begin{align*}\hat{Y}=0.0055X+0.097\end{align*}

We also found that \begin{align*}SSE=0.26\end{align*} for the \begin{align*}n=7\end{align*} observations.

#### Example 1

Suppose a student scores a 650 on the verbal SAT. Assuming the data is normally distributed, what is the probability that they will have a GPA of at least 3.8?

Using the least squares regression line, we will predict the GPA for a verbal SAT score of 650:

\begin{align*}\hat{Y}=0.0055(650)+0.097=3.575+0.097=3.672\end{align*}

This means that an SAT score of 650 predicts a GPA of 3.672. The student's actual GPA could be higher or lower, so we now find the probability that a student with an SAT score of 650 has a GPA of at least 3.8. First we find the standard error of estimate, \begin{align*}s\end{align*}, and then the corresponding \begin{align*}z\end{align*}-score:

\begin{align*}s = \sqrt{\frac{SSE}{n-2}}=\sqrt{\frac{0.26}{7-2}}=\sqrt{\frac{0.26}{5}}\approx 0.228\end{align*}

\begin{align*}z = \frac{Y-\hat{Y}}{s} = \frac{3.8-3.672}{0.228} \approx 0.56 \end{align*}

Now we simply look at the \begin{align*}z\end{align*}-table to find the probability of getting a \begin{align*}z\end{align*}-score of 0.56 or higher.

\begin{align*}P(z>0.56)=0.288\end{align*}.

The probability of having a GPA of at least 3.8 when scoring a 650 on the verbal SAT is 0.288.
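The calculation in this example can be reproduced with a short script. The sketch below uses the regression line, \begin{align*}SSE\end{align*}, and \begin{align*}n\end{align*} given above, and computes the upper-tail normal probability with the complementary error function instead of a \begin{align*}z\end{align*}-table. The helper name `upper_tail_normal` is an assumption made for illustration. Because the script does not round \begin{align*}z\end{align*} to two decimals first, its result may differ slightly in the third decimal place from the table value.

```python
import math

def upper_tail_normal(z):
    """Return P(Z > z) for a standard normal variable Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# Values given in the example.
slope, intercept = 0.0055, 0.097
sse, n = 0.26, 7
sat_score, gpa_cutoff = 650, 3.8

y_hat = slope * sat_score + intercept   # predicted GPA
s = math.sqrt(sse / (n - 2))            # standard error of estimate
z = (gpa_cutoff - y_hat) / s            # standard score of the cutoff
p = upper_tail_normal(z)                # probability of a GPA of at least 3.8

print(f"predicted GPA = {y_hat:.3f}, s = {s:.3f}, z = {z:.2f}, P = {p:.3f}")
```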

### Review

For 1-10, a college counselor is putting on a presentation about the financial benefits of further education and takes a random sample of 120 parents. Each parent was asked a number of questions, including the number of years of education that he or she has (including college) and his or her yearly income (recorded in the thousands of dollars). The summary data for this survey are as follows:

\begin{align*}n=120 \quad r=0.67\end{align*}

\begin{align*}\sum x = 1,782 \quad \sum y =1,854 \end{align*}

\begin{align*}s_X=3.6 \quad s_Y=4.2 \end{align*}

\begin{align*}s_{XY}=3.12 \quad SS_X = 1542\end{align*}

1. What is the predictor variable? What is your reasoning behind this decision?
2. Do you think that these two variables (income and level of formal education) are correlated? If so, please describe the nature of their relationship.
3. What would be the regression equation for predicting income, \begin{align*}Y\end{align*}, from the level of education, \begin{align*}X\end{align*}?
4. Using this regression equation, predict the income for a person with 2 years of college (13.5 years of formal education).
5. Test the null hypothesis that in the population, the regression coefficient for this scenario is zero.
    - First develop the null and alternative hypotheses.
    - Set the significance level to \begin{align*}\alpha = 0.05\end{align*}.
    - Compute the test statistic.
    - Make a decision regarding the null hypothesis.
6. For those parents with 15 years of formal education, what is the percentage who will have an annual income greater than $18,500?
7. For those parents with 12 years of formal education, what is the percentage who will have an annual income greater than $18,500?
8. Develop a 95% confidence interval for the predicted annual income when a parent indicates that he or she has a college degree (i.e., 16 years of formal education).
9. If you were the college counselor, what would you say in the presentation to the parents and students about the relationship between further education and salary? Would you encourage students to further their education based on these analyses? Why or why not?
10. Using the same null and alternative hypotheses and the same test statistic as in question 5, make a decision at the significance level of \begin{align*}\alpha = 0.01\end{align*}.

### Review (Answers)

To view the Review answers, open this PDF file and look for section 9.3.