In a previous Concept, we learned about the least-squares model, or the linear regression model. The linear regression model uses the concept of correlation to help us predict the score of a variable based on our knowledge of the score of another variable. In this Concept, we will investigate several inferences and assumptions that we can make about the linear regression model.
For an example on calculating prediction intervals, see MathProfLapuz, Prediction Interval Given Statistics (11:54).
Hypothesis Testing for Linear Relationships
Let’s think for a minute about the relationship between correlation and the linear regression model. As we learned, if there is no correlation between the two variables $X$ and $Y$, then it would be nearly impossible to fit a meaningful regression line to the points on a scatterplot graph. If there was no correlation, and our correlation value, or $r$-value, was 0, we would always come up with the same predicted value, which would be the mean of all the predicted values, or the mean of $Y$. The figure below shows an example of what a regression line fit to variables with no correlation would look like. As you can see, for any value of $X$, we always get the same predicted value of $Y$.
Using this knowledge, we can determine that if there is no relationship between $X$ and $Y$, constructing a regression line doesn’t help us very much, because, again, the predicted score would always be the same. Therefore, when we estimate a linear regression model, we want to ensure that the regression coefficient, $\beta$, for the population does not equal zero. Furthermore, it is useful to test how far from zero the regression coefficient is, since a coefficient further from zero strengthens our prediction of the $Y$ scores.
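In hypothesis testing of linear regression models, the null hypothesis to be tested is that the regression coefficient, $\beta$, equals zero. Our alternative hypothesis is that our regression coefficient does not equal zero.
$$H_0: \beta = 0 \qquad H_a: \beta \neq 0$$
The test statistic for this hypothesis test is calculated as follows:
$$t = \frac{b - \beta}{s_b}, \quad \text{where } s_b = \frac{s_{Y \cdot X}}{\sqrt{SS_X}}$$
Here $b$ is the sample regression coefficient, $\beta$ is its value under the null hypothesis, $s_{Y \cdot X}$ is the standard error of estimate, and $SS_X$ is the sum of squares for the predictor variable $X$. The statistic has $n - 2$ degrees of freedom.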
Let’s say that a football coach is using the results from a short physical fitness test to predict the results of a longer, more comprehensive one. He developed the regression equation $\hat{Y} = 0.635X + 1.22$, and the standard error of estimate is 0.56. The summary statistics are as follows:
Summary statistics for two football fitness tests.
Using $\alpha = 0.05$, test the null hypothesis that, in the population, the regression coefficient is zero, or $H_0: \beta = 0$.
We use the $t$-distribution to calculate the test statistic and find that the critical values in the $t$-distribution at 22 degrees of freedom are 2.074 standard scores above and below the mean. Also, the test statistic can be calculated as follows, where $s_b = s_{Y \cdot X} / \sqrt{SS_X}$ is the standard error of the regression coefficient and $SS_X$ comes from the summary statistics:
$$t = \frac{b - \beta}{s_b} = \frac{0.635 - 0}{0.56 / \sqrt{SS_X}}$$
Since the observed value of the test statistic exceeds the critical value, the null hypothesis would be rejected, and we can conclude that if the null hypothesis were true, we would observe a regression coefficient as far from zero as 0.635 by chance less than 5% of the time.
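The hypothesis test above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the coefficient 0.635, the standard error of estimate 0.56, and the critical value 2.074 come from the example, but the value used for `ss_x` is an assumed placeholder, because the summary table is not reproduced here. The function name `slope_t_statistic` is likewise made up for this sketch.

```python
import math

def slope_t_statistic(b, s_yx, ss_x, beta_null=0.0):
    """t statistic for H0: beta = beta_null, with s_b = s_yx / sqrt(SS_X)."""
    s_b = s_yx / math.sqrt(ss_x)  # standard error of the regression coefficient
    return (b - beta_null) / s_b

b = 0.635       # sample regression coefficient (from the example)
s_yx = 0.56     # standard error of estimate (from the example)
t_crit = 2.074  # two-tailed critical t for alpha = 0.05, df = 22 (from the example)
ss_x = 123.83   # ASSUMPTION: placeholder sum of squares for X; use the table's value

t_stat = slope_t_statistic(b, s_yx, ss_x)
reject_null = abs(t_stat) > t_crit  # two-tailed decision rule
print(f"t = {t_stat:.2f}, reject H0: {reject_null}")
```

With any plausible value of `ss_x` for this sample, the statistic lands far beyond 2.074, which matches the decision to reject the null hypothesis.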
Making Inferences about Predicted Scores
As we have mentioned, a regression line makes predictions about variables based on the relationship of the existing data. However, it is important to remember that the regression line simply infers, or estimates, what the value will be. These predictions are never accurate 100% of the time, unless there is a perfect correlation. What this means is that for every predicted value, we have a normal distribution (also known as the conditional distribution, since it is conditional on the $X$ value) that describes the likelihood of obtaining other scores that are associated with the value of the predictor variable, $X$.
If we assume that these distributions are normal, we are able to make inferences about each of the predicted scores. We can ask questions like, “If the predictor variable, $X$, equals 4, what percentage of the distribution of $Y$ scores will be lower than 3?”
The reason why we would ask questions like this depends on the scenario. Suppose, for example, that we want to know the percentage of students with a 5 on their short physical fitness test that have a predicted score higher than 5 on their long physical fitness test. If the coach is using this predicted score as a cutoff for playing in a varsity match, and this percentage is too low, he may want to consider changing the standards of the test.
To find the percentage of students with scores above or below a certain point, we use the concept of standard scores and the standard normal distribution.
Since we have a certain predicted value for every value of $X$, the $Y$ values take on the shape of a normal distribution. This distribution has a mean (the regression line) and a standard error, which we found to be equal to 0.56. In short, the conditional distribution is used to determine the percentage of $Y$ values above or below a certain value that are associated with a specific value of $X$.
Using our example above, if a student scored a 5 on the short test, what is the probability that he or she would have a score of 5 or greater on the long physical fitness test?
From the regression equation $\hat{Y} = 0.635X + 1.22$, we find that the predicted score when the value of $X$ is 5 is 4.40. Consider the conditional distribution of $Y$ scores when the value of $X$ is 5. Under our assumption, this distribution is normally distributed around the predicted value 4.40 and has a standard error of 0.56.
Therefore, to find the percentage of $Y$ scores of 5 or greater, we use the general formula for a $z$-score to calculate the following:
$$z = \frac{Y - \hat{Y}}{s_{Y \cdot X}} = \frac{5 - 4.40}{0.56} = 1.07$$
Using the $z$-distribution table, we find that the area to the right of a $z$-score of 1.07 is 0.1423. Therefore, we can conclude that the proportion of long-test scores of 5 or greater given a score of 5 on the short test is 0.1423, or 14.23%.
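The same probability can be checked with Python's standard library. The sketch below assumes the regression equation for this example is $\hat{Y} = 0.635X + 1.22$ (coefficients that reproduce the predicted score of about 4.40 at $X = 5$) and treats the conditional distribution as normal with standard error 0.56. The helper name `predict` is made up for illustration.

```python
from statistics import NormalDist

def predict(x, slope=0.635, intercept=1.22):
    """Predicted long-test score; the coefficients are assumed for this example."""
    return slope * x + intercept

s_yx = 0.56         # standard error of estimate (from the example)
y_hat = predict(5)  # predicted score for a short-test score of 5

# Conditional distribution of Y given X = 5, assumed to be normal
conditional = NormalDist(mu=y_hat, sigma=s_yx)
p_at_least_5 = 1 - conditional.cdf(5)  # area to the right of Y = 5
print(f"P(Y >= 5 | X = 5) = {p_at_least_5:.3f}")
```

Because the code does not round the predicted score or the $z$-score to two decimals before looking up the area, its answer is close to, but not exactly, the table-based 0.1423.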
Similar to hypothesis testing for samples and populations, we can also build a confidence interval around our regression results. This helps us ask questions like “If the predictor variable, $X$, is equal to a certain value, what are the likely values for $Y$?” A confidence interval gives us a range of scores that has a certain percent probability of including the score that we are after.
We know that the standard error of the predicted score is smaller when the predicted value is close to the actual value, and it increases as $X$ deviates from the mean. This means that the weaker a predictor the regression line is, the larger the standard error of the predicted score will be. The formulas for the standard error of a predicted score and a confidence interval are as follows:
$$s_{\hat{Y}} = s_{Y \cdot X} \sqrt{1 + \frac{1}{n} + \frac{(X - \bar{X})^2}{SS_X}}$$
$$CI = \hat{Y} \pm (t_{cv})(s_{\hat{Y}})$$
where:
$\hat{Y}$ is the predicted score.
$t_{cv}$ is the critical value of $t$ for $n - 2$ degrees of freedom.
$s_{\hat{Y}}$ is the standard error of the predicted score.
Develop a 95% confidence interval for the predicted score of a student who scores a 4 on the short physical fitness exam.
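We calculate the standard error of the predicted score using the formula as follows, with $n = 24$ and with $\bar{X}$ and $SS_X$ taken from the summary statistics:
$$s_{\hat{Y}} = 0.56 \sqrt{1 + \frac{1}{24} + \frac{(4 - \bar{X})^2}{SS_X}} \approx 0.57$$
Using the general formula for a confidence interval, we can calculate the answer as shown:
$$CI = \hat{Y} \pm (t_{cv})(s_{\hat{Y}}) = 3.76 \pm (2.074)(0.57) = 3.76 \pm 1.18$$
Therefore, we can say that we are 95% confident that given a student's short physical fitness test score, $X$, of 4, the interval from 2.58 to 4.94 will contain the student's score for the longer physical fitness test.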
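The interval calculation can be sketched as a small Python function. The slope 0.635, the standard error of estimate 0.56, $n = 24$, and the critical value 2.074 follow from the example. The intercept 1.22 is the one that reproduces the predicted score of 3.76 at $X = 4$. The values passed for `x_mean` and `ss_x` are assumed placeholders, since the summary table is not reproduced here, and the function name `prediction_interval` is made up for this sketch.

```python
import math

def prediction_interval(x, slope, intercept, s_yx, n, x_mean, ss_x, t_crit):
    """Interval for an individual predicted score at predictor value x."""
    y_hat = slope * x + intercept
    # The standard error of the predicted score grows as x moves away from x_mean
    s_pred = s_yx * math.sqrt(1 + 1 / n + (x - x_mean) ** 2 / ss_x)
    margin = t_crit * s_pred
    return y_hat - margin, y_hat + margin

low, high = prediction_interval(
    x=4, slope=0.635, intercept=1.22, s_yx=0.56, n=24,
    x_mean=4.92,   # ASSUMPTION: placeholder mean of X; use the table's value
    ss_x=123.83,   # ASSUMPTION: placeholder sum of squares for X
    t_crit=2.074,  # two-tailed critical t for 95% confidence, df = 22
)
print(f"95% interval: {low:.2f} to {high:.2f}")
```

Small differences from the hand calculation are expected, because the worked example rounds the standard error to 0.57 before multiplying by the critical value.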
We make several assumptions under a linear regression model, including:
- At each value of $X$, there is a distribution of $Y$. These distributions have a mean centered at the predicted value and a standard error that is calculated using the sum of squares.
- Using a regression model to predict scores only works if the regression line is a good fit to the data. If this relationship is non-linear, we could either transform the data (e.g., a logarithmic transformation) or try one of the other regression equations that are available with Excel or a graphing calculator.
- The standard deviations and the variances of each of these distributions for each of the predicted values are equal. This is called homoscedasticity.
- Finally, for each given value of $X$, the values of $Y$ are independent of each other.
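One rough, informal way to look at the equal-variance assumption listed above is to fit the least-squares line and compare the spread of the residuals for low and high values of $X$. The sketch below is not a formal test, and the data set is made up purely for illustration. The function names `fit_line` and `residual_spread_by_half` are hypothetical.

```python
from statistics import mean, pvariance

def fit_line(xs, ys):
    """Least-squares slope and intercept for paired data."""
    x_bar, y_bar = mean(xs), mean(ys)
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    sp_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sp_xy / ss_x
    return slope, y_bar - slope * x_bar

def residual_spread_by_half(xs, ys):
    """Residual variance for the lower and upper halves of the X values."""
    slope, intercept = fit_line(xs, ys)
    pairs = sorted(zip(xs, ys))  # order the observations by X
    residuals = [y - (slope * x + intercept) for x, y in pairs]
    half = len(residuals) // 2
    return pvariance(residuals[:half]), pvariance(residuals[half:])

# ASSUMPTION: hypothetical data, not taken from the lesson
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9, 8.2, 8.8]

var_low, var_high = residual_spread_by_half(xs, ys)
print(f"residual variance: low X = {var_low:.4f}, high X = {var_high:.4f}")
```

If the two variances were very different, that would suggest the spread of $Y$ changes with $X$, and the equal-variance assumption would deserve a closer look.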
When we estimate a linear regression model, we want to ensure that the regression coefficient for the population, $\beta$, does not equal zero. To do this, we perform a hypothesis test, where we set the regression coefficient equal to zero and test for significance.
For each predicted value, we have a normal distribution (also known as the conditional distribution, since it is conditional on the value of $X$) that describes the likelihood of obtaining other scores that are associated with the value of the predictor variable, $X$. We can use these distributions and the concept of standardized scores to make predictions about probability.
We can also build confidence intervals around the predicted values to give us a better idea about the ranges likely to contain a certain score.
We make several assumptions when dealing with a linear regression model, including:
At each value of $X$, there is a distribution of $Y$.
A regression line is a good fit to the data. There is homoscedasticity, and the observations are independent.
Recall the example in the last Concept, where the verbal SAT scores were used to predict the GPA of students. From the data, we found this least squares regression line:
We also found in the previous Concept that for the samples.
Suppose a student scores a 650 on the verbal SAT. Assuming the data is normally distributed, what is the probability that they will have a GPA of at least 3.8?
Using the least squares regression line, we will predict the GPA for a verbal SAT score of 650:
This means that an SAT score of 650 predicts that the student will have a GPA of 3.672. They could have a higher or lower GPA, though, so now we will look at the probability that a student with an SAT score of 650 has a GPA of at least 3.8. First we have to find the $z$-score:
$$z = \frac{Y - \hat{Y}}{s_{Y \cdot X}} = \frac{3.8 - 3.672}{s_{Y \cdot X}} = 0.56$$
Now we simply look at the $z$-table to find the probability of getting a $z$-score of 0.56 or higher.
The probability of having a GPA of at least 3.8 when scoring a 650 on the verbal SAT is 0.288.
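The table lookup can be reproduced with Python's standard library. The sketch below takes the $z$-score of 0.56, the predicted GPA of 3.672, and the cutoff of 3.8 from the example. The standard error it prints is only what those three numbers imply together, not a value quoted from the previous Concept.

```python
from statistics import NormalDist

z = 0.56           # z-score for a GPA of 3.8 (from the example)
predicted = 3.672  # predicted GPA for a verbal SAT score of 650 (from the example)
cutoff = 3.8

# Area to the right of z under the standard normal curve
p_at_least = 1 - NormalDist().cdf(z)

# Standard error implied by z = (cutoff - predicted) / s; derived, not quoted
implied_se = (cutoff - predicted) / z

print(f"P(GPA >= 3.8) = {p_at_least:.3f}")
print(f"implied standard error = {implied_se:.3f}")
```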
For 1-10, a college counselor is putting on a presentation about the financial benefits of further education and takes a random sample of 120 parents. Each parent was asked a number of questions, including the number of years of education that he or she has (including college) and his or her yearly income (recorded in the thousands of dollars). The summary data for this survey are as follows:
- What is the predictor variable? What is your reasoning behind this decision?
- Do you think that these two variables (income and level of formal education) are correlated? If so, please describe the nature of their relationship.
- What would be the regression equation for predicting income, $Y$, from the level of education, $X$?
- Using this regression equation, predict the income for a person with 2 years of college (13.5 years of formal education).
Test the null hypothesis that in the population, the regression coefficient for this scenario is zero.
- First develop the null and alternative hypotheses.
- Set the critical value to .
- Compute the test statistic.
- Make a decision regarding the null hypothesis.
- For those parents with 15 years of formal education, what is the percentage who will have an annual income greater than $18,500?
- For those parents with 12 years of formal education, what is the percentage who will have an annual income greater than $18,500?
- Develop a 95% confidence interval for the predicted annual income when a parent indicates that he or she has a college degree (i.e., 16 years of formal education).
- If you were the college counselor, what would you say in the presentation to the parents and students about the relationship between further education and salary? Would you encourage students to further their education based on these analyses? Why or why not?
- Using the same null and alternative hypotheses, and test statistics as you did in question 5, make a decision at the significance level of .