9.3: Inferences about Regression
Learning Objectives
- Make inferences about the regression models including hypothesis testing for linear relationships.
- Make inferences about regression and predicted values including the construction of confidence intervals.
- Check regression assumptions.
Introduction
In the previous section, we learned about the least-squares or the linear regression model. The linear regression model uses the concept of correlation to help us predict a variable based on our knowledge of scores on another variable. As we learned in the previous section, this concept is used quite frequently in statistical analysis to predict variables such as IQ, test performance, etc. In this section, we will investigate several inferences and assumptions that we can make about the linear regression model.
Hypothesis Testing for Linear Relationships
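Let’s think for a minute about the relationship between correlation and the linear regression model. As we learned, if there is no correlation between two variables (\begin{align*}X\end{align*} and \begin{align*}Y\end{align*}), the regression line is flat and every prediction is simply the mean of \begin{align*}Y\end{align*}.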
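Using this knowledge, we can determine that if there is no relationship between \begin{align*}Y\end{align*} and \begin{align*}X\end{align*}, a regression model does not improve our predictions. We therefore want to test whether the regression coefficient in the population differs from zero.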
In hypothesis testing of linear regression models, the null hypothesis to be tested is that the regression coefficient \begin{align*}(\beta)\end{align*} equals zero. The alternative hypothesis is that the regression coefficient does not equal zero:
\begin{align*}H_0&:(\beta)=0\\
H_a&:(\beta)\ne 0\end{align*}
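We perform this hypothesis test in the same way as the hypothesis tests we conducted previously, so the next step is to establish the critical values. We use the \begin{align*}t\end{align*}-distribution with \begin{align*}n-2\end{align*} degrees of freedom to set these values. The test statistic is: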
\begin{align*}t = \frac{\text{observed value} - \text{hypothesized or predicted value}} {\text{Standard Error of the statistic}} = \frac{b - \beta} {s_b}\end{align*}
To calculate the test statistic for this regression coefficient, we also need to estimate the sampling distribution of the regression coefficient. The statistic that we will use to describe this distribution is the standard error of the regression coefficient \begin{align*}(s_b)\end{align*}, which is calculated as:
\begin{align*}s_b = \frac{s_{Y * X}} {\sqrt{SS_x}}\end{align*}
where:
\begin{align*}s_{Y * X} =\end{align*} the standard error of estimate
\begin{align*}SS_x =\end{align*} the sum of squares for the predictor variable \begin{align*}(X)\end{align*}
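As a rough sketch of where these quantities come from, the Python below computes the slope, the standard error of estimate, the sum of squares for \begin{align*}X\end{align*}, and the standard error of the slope from raw data. The data set is hypothetical, and the code assumes the usual definition of the standard error of estimate with \begin{align*}n-2\end{align*} in the denominator; the function name is made up for illustration.

```python
import math

def slope_standard_error(xs, ys):
    """Return (b, a, s_yx, ss_x, s_b) for a simple linear regression of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squares for the predictor and sum of cross products
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    sp_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sp_xy / ss_x              # slope
    a = mean_y - b * mean_x       # intercept
    # Standard error of estimate (assumed definition: residual SS over n - 2)
    ss_resid = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    s_yx = math.sqrt(ss_resid / (n - 2))
    # Standard error of the regression coefficient
    s_b = s_yx / math.sqrt(ss_x)
    return b, a, s_yx, ss_x, s_b

# Hypothetical data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b, a, s_yx, ss_x, s_b = slope_standard_error(xs, ys)
print(f"b = {b:.3f}, s_yx = {s_yx:.3f}, SS_x = {ss_x:.2f}, s_b = {s_b:.3f}")
```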
Example:
Let’s say that the football coach is using the results from a short physical fitness test to predict the results of a longer, more comprehensive one. He developed the regression equation \begin{align*}\hat{Y} = 0.635X + 1.22\end{align*}, and the standard error of estimate is \begin{align*}s_{Y * X} = 0.56\end{align*}. The summary statistics are as follows:
Summary statistics for two football fitness tests:
\begin{align*}& n=24 && \sum XY=591.50\\ & \sum X=118 && \sum Y=104.3\\ & \bar{X}=4.92 && \bar{Y}=4.35\\ & \sum X^2 = 704 && \sum Y^2 =510.01\\ & SS_x =123.83 && SS_y =56.74 \end{align*}
Using a \begin{align*}\alpha =.05\end{align*}, test the null hypothesis that, in the population, the regression coefficient is zero \begin{align*}(H_0: \beta = 0)\end{align*}.
Solution:
We use the \begin{align*}t\end{align*}-distribution for this test statistic. With \begin{align*}n-2 = 22\end{align*} degrees of freedom, the critical values in the \begin{align*}t\end{align*}-distribution are \begin{align*}\pm 2.074\end{align*}. The standard error of the regression coefficient and the test statistic are:
\begin{align*}s_b & = \frac{s_{Y * X}} {\sqrt{SS_x}} = \frac{0.56} {\sqrt{123.83}} = 0.05\\ t & = \frac{b - \beta} {s_b} = \frac{0.635 - 0} {0.05} = 12.70\end{align*}
Since the observed value of the test statistic exceeds the critical value, we reject the null hypothesis. If the null hypothesis were true, we would observe a regression coefficient at least as extreme as \begin{align*}0.635\end{align*} by chance less than \begin{align*}5\%\end{align*} of the time.
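The arithmetic in this example can be checked with a short Python sketch. All inputs are the summary statistics given in the text; the critical value is read from a \begin{align*}t\end{align*}-table rather than computed. Because the code does not round \begin{align*}s_b\end{align*} to 0.05 before dividing, its \begin{align*}t\end{align*} value comes out slightly below the hand-calculated 12.70.

```python
import math

# Summary statistics from the football fitness example
n = 24
sum_x = 118
sum_x2 = 704
b = 0.635            # sample regression coefficient
s_yx = 0.56          # standard error of estimate
t_critical = 2.074   # two-tailed, alpha = .05, df = n - 2 = 22 (from a t-table)

# Sum of squares for the predictor: SS_x = sum(X^2) - (sum X)^2 / n
ss_x = sum_x2 - sum_x ** 2 / n

# Standard error of the regression coefficient and the test statistic under H0: beta = 0
s_b = s_yx / math.sqrt(ss_x)
t = (b - 0) / s_b

reject_null = abs(t) > t_critical
print(f"SS_x = {ss_x:.2f}, s_b = {s_b:.4f}, t = {t:.2f}, reject H0: {reject_null}")
```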
Making Inferences about Predicted Scores
As we have mentioned, the regression line makes predictions about variables based on the relationship in the existing data. However, it is important to remember that the regression line only estimates what the value will be. These predictions are never accurate \begin{align*}100\%\end{align*} of the time unless there is a perfect correlation. What this means is that for every predicted value, we have a normal distribution (also known as the conditional distribution since it is conditional on the \begin{align*}X\end{align*} value) that describes the likelihood of obtaining other scores that are associated with that value of the predictor variable \begin{align*}(X)\end{align*}.
If we assume that these distributions are normal, we are able to make inferences about each of the predicted scores. One example of making inferences about the predicted scores is identifying probability levels associated with predicted scores. Using this concept, we are able to ask questions such as “If the predictor variable (\begin{align*}X\end{align*} value) equals \begin{align*}4.0\end{align*}, what percentage of the distribution of \begin{align*}Y\end{align*} scores will be lower than \begin{align*}3\end{align*}?”
The reason that we would ask questions like this depends on the scenario. Say, for example, that we want to know the percentage of students with a \begin{align*}4\end{align*} on their short physical fitness test who would score higher than \begin{align*}5\end{align*} on the long test. If the coach is using this score as a cutoff for playing in a varsity match and this percentage is too low, he may want to consider changing the standards of the test.
To find the percentage of students with scores above or below a certain point, we use the concept of standard scores and the standard normal distribution. Remember the general formula for calculating the standard score:
\begin{align*}\text{Test Statistic} = \frac{\text{Observed Statistic} - \text{Population Mean}} {\text{Standard error}}\end{align*}
Applying this formula to the regression distribution, we find that the corresponding formula would be:
\begin{align*}z = \frac{Y - \hat{Y}} {s_{Y * X}}\end{align*}
For every value of \begin{align*}X\end{align*}, we assume that the \begin{align*}Y\end{align*} values are normally distributed around the predicted value. This distribution has a mean (the point on the regression line) and a standard error, which in our example we found to be equal to \begin{align*}0.56\end{align*}. In short, the conditional distribution is used to determine the percentage of \begin{align*}Y\end{align*} values that are associated with a specific value of \begin{align*}X\end{align*}.
Example:
Using our example above, if a student scored a \begin{align*}5\end{align*} on the short test, what is the probability that they would have a score of \begin{align*}5\end{align*} or greater on the long physical fitness test?
Solution:
From the regression equation \begin{align*}\hat{Y} = 0.635X + 1.22\end{align*}, we find that the predicted score for \begin{align*}X=5\end{align*} is \begin{align*}\hat{Y}=4.40\end{align*}. Consider the conditional distribution of \begin{align*}Y\end{align*} scores for \begin{align*}X=5\end{align*}. Under our assumption, this distribution is normally distributed around the predicted value \begin{align*}(4.40)\end{align*} and has a standard error of \begin{align*}0.56\end{align*}.
Therefore, to find the percentage of \begin{align*}Y\end{align*} scores of \begin{align*}5\end{align*} or greater, we use the general formula and find that:
\begin{align*}z = \frac{Y - \hat{Y}} {s_{Y * X}} = \frac{5 - 4.40} {0.56} = 1.07\end{align*}
Using the \begin{align*}z\end{align*}-distribution table, we find that the area to the right of a \begin{align*}z\end{align*}-score of \begin{align*}1.07\end{align*} is \begin{align*}.1423\end{align*}. Therefore, we can conclude that the proportion of long-test scores of \begin{align*}5\end{align*} or greater, given a short-test score of \begin{align*}5\end{align*}, is \begin{align*}.1423\end{align*} or \begin{align*}14.23\%\end{align*}.
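The same probability can be sketched in Python with the standard library's `statistics.NormalDist` in place of a printed table. The function name and its default arguments are assumptions made for this illustration; the defaults are the football example's regression equation and standard error of estimate. The result is close to, but not exactly, the table value of .1423, because the code does not round the predicted score or the \begin{align*}z\end{align*}-score along the way.

```python
from statistics import NormalDist

def prob_y_at_least(y_cutoff, x, slope=0.635, intercept=1.22, s_yx=0.56):
    """P(Y >= y_cutoff | X = x), assuming a normal conditional distribution."""
    y_hat = slope * x + intercept       # mean of the conditional distribution
    z = (y_cutoff - y_hat) / s_yx       # standard score of the cutoff
    return 1.0 - NormalDist().cdf(z)    # area to the right of z

# Probability of scoring 5 or more on the long test given a 5 on the short test
p = prob_y_at_least(5, 5)
print(f"P(Y >= 5 | X = 5) is about {p:.3f}")
```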
Confidence Intervals
Similar to hypothesis testing for samples and populations, we can also build a confidence interval around our regression results. This helps us ask questions like “If the predictor value was equal to \begin{align*}X\end{align*}, what are the likely values for \begin{align*}Y\end{align*}?” This gives us a range of scores that has a certain percent probability of including the score that we are after.
The standard error of the predicted score is smallest when \begin{align*}X\end{align*} equals its mean, and it increases as \begin{align*}X\end{align*} moves away from the mean. It also grows with the standard error of estimate: the weaker a predictor the regression line is, the larger the standard error of the predicted score will be. The standard error of a predicted score is calculated by using the formula:
\begin{align*}s_{\hat{Y}} = s_{Y * X} \sqrt{1 + \frac{1} {n} + \frac{(X - \bar{X})^2} {SS_x}}\end{align*}
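To see how the last term under the square root behaves, compare the standard error at the mean of \begin{align*}X\end{align*} with the standard error at a value far from the mean. The sketch below reuses the summary statistics from the football example (\begin{align*}n=24\end{align*}, \begin{align*}\bar{X}=4.92\end{align*}, \begin{align*}SS_x=123.83\end{align*}, \begin{align*}s_{Y * X}=0.56\end{align*}); the value \begin{align*}X=10\end{align*} is hypothetical and chosen only for illustration.
\begin{align*}X = \bar{X}: \quad s_{\hat{Y}} & = 0.56 \sqrt{1 + \frac{1} {24} + 0} = 0.57\\ X = 10: \quad s_{\hat{Y}} & = 0.56 \sqrt{1 + \frac{1} {24} + \frac{(10 - 4.92)^2} {123.83}} = 0.63\end{align*}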
The general formula for the confidence interval for predicted scores is:
\begin{align*}CI = \hat{Y} \pm (t_{cv} s_{\hat{Y}})\end{align*}
where:
\begin{align*}\hat{Y} =\end{align*} the predicted score
\begin{align*}t_{cv} =\end{align*} the critical value of \begin{align*}t\end{align*} for \begin{align*}df = n-2\end{align*}
\begin{align*}s_{\hat{Y}} =\end{align*} the standard error of the predicted score
Example:
Develop a \begin{align*}95\%\end{align*} confidence interval for the predicted score of a student who scores a \begin{align*}4\end{align*} on the short physical fitness exam \begin{align*}(X=4)\end{align*}.
Solution:
We calculate the standard error of the predicted value using the formula:
\begin{align*}s_{\hat{Y}} = s_{Y * X} \sqrt{1 + \frac{1} {n} + \frac{(X - \bar{X})^2} {SS_x}} = 0.56 \sqrt{1 + \frac{1} {24} + \frac{(4 - 4.92)^2} {123.83}} = 0.57\end{align*}
Using the general formula for the confidence interval, we find that
\begin{align*}CI & = \hat{Y} \pm (t_{cv} s_{\hat{Y}})\\ CI_{95} & = 3.76 \pm (2.074) (0.57)\\ CI_{95} & = 3.76 \pm 1.18\\ CI_{95} & = (2.58, 4.94)\end{align*}
Therefore, we can say that we are \begin{align*}95\%\end{align*} confident that, given a student’s short physical fitness test score \begin{align*}(X)\end{align*} of \begin{align*}4\end{align*}, the interval from \begin{align*}2.58\end{align*} to \begin{align*}4.94\end{align*} will contain the student’s score on the longer physical fitness test.
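The interval above can be reproduced with a small Python sketch. The function name and parameter names are invented for illustration; the inputs are the example's values, and the critical \begin{align*}t\end{align*} value is taken from a table. The endpoints differ from the hand calculation in the second decimal place because the code does not round the standard error to 0.57 first.

```python
import math

def prediction_interval(x, slope, intercept, s_yx, n, mean_x, ss_x, t_cv):
    """Return (lower, upper) bounds of the interval for a predicted score at X = x."""
    y_hat = slope * x + intercept
    # Standard error of the predicted score
    s_pred = s_yx * math.sqrt(1 + 1 / n + (x - mean_x) ** 2 / ss_x)
    margin = t_cv * s_pred
    return y_hat - margin, y_hat + margin

# Football example: X = 4, 95% interval, df = 22 so t_cv = 2.074 (from a t-table)
low, high = prediction_interval(
    x=4, slope=0.635, intercept=1.22, s_yx=0.56,
    n=24, mean_x=4.92, ss_x=123.83, t_cv=2.074,
)
print(f"95% interval: ({low:.2f}, {high:.2f})")
```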
Regression Assumptions
We make several assumptions under a linear regression model including:
- At each value of \begin{align*}X\end{align*}, there is a distribution of \begin{align*}Y\end{align*}. These distributions have a mean centered around the predicted value and a standard error that is calculated using the sum of squares.
- The best regression model is a straight line. Using a regression model to predict scores only works if the regression line is a straight line. If this relationship is nonlinear, we could either transform the data (e.g., with a logarithmic transformation) or try one of the other regression equations that are available with Excel or a graphing calculator.
- Homoscedasticity. The standard deviations, or the variances, of these distributions are equal across all of the predicted values.
- Independence of observations. For each given value of \begin{align*}X\end{align*}, the values of \begin{align*}Y\end{align*} are independent of each other.
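One informal way to look at the linearity and homoscedasticity assumptions is to examine the residuals \begin{align*}(Y - \hat{Y})\end{align*}. The Python sketch below fits a line to a small hypothetical data set, orders the residuals by \begin{align*}X\end{align*}, and compares their average squared size in the lower and upper halves of the \begin{align*}X\end{align*} range. This is only a crude check under assumed data, not a formal test; a ratio far from 1 would suggest unequal spread, and in practice a residual plot is the more common tool.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    sp_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sp_xy / ss_x
    return b, mean_y - b * mean_x

def residuals_sorted_by_x(xs, ys):
    """Residuals (observed minus predicted), ordered by increasing X."""
    b, a = fit_line(xs, ys)
    return [y - (a + b * x) for x, y in sorted(zip(xs, ys))]

def spread_ratio(resid):
    """Mean squared residual in the upper half of X divided by that in the lower half."""
    half = len(resid) // 2
    lower, upper = resid[:half], resid[-half:]
    ms_lower = sum(r * r for r in lower) / len(lower)
    ms_upper = sum(r * r for r in upper) / len(upper)
    return ms_upper / ms_lower

# Hypothetical data, for illustration only
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9, 8.2, 8.8]

resid = residuals_sorted_by_x(xs, ys)
ratio = spread_ratio(resid)
print(f"sum of residuals = {sum(resid):.6f}, spread ratio = {ratio:.2f}")
```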
Lesson Summary
- When we estimate a linear regression model, we want to ensure that the regression coefficient in the population \begin{align*}(\beta)\end{align*} does not equal zero. To do this, we perform a hypothesis test where we set the regression coefficient equal to zero and test for significance.
- For each predicted value, we have a normal distribution (also known as the conditional distribution since it is conditional on the \begin{align*}X\end{align*} value) that describes the likelihood of obtaining other scores that are associated with that value of the predictor variable \begin{align*}(X)\end{align*}. We can use these distributions and the concept of standardized scores to make predictions about probability.
- We can also build confidence intervals around the predicted values to give us a better idea about the ranges likely to contain a certain score.
- We make several assumptions when dealing with a linear regression model, including:
  - At each value of \begin{align*}X\end{align*}, there is a distribution of \begin{align*}Y\end{align*}
  - The regression model is a straight line
  - Homoscedasticity
  - Independence of observations
Review Questions
The college counselor is putting on a presentation about the financial benefits of further education and takes a random sample of \begin{align*}120\end{align*} parents. Each parent was asked a number of questions including the number of years of education that they have (including college) and their yearly income (recorded in the thousands). The summary data for this survey are as follows:
\begin{align*}n = 120 & & r = 0.67 & & \sum X = 1,782 & & \sum Y = 1,854 & & s_x = 3.6 & & s_Y = 4.2 & & SS_x=1542\end{align*}
- What is the predictor variable? What is your reasoning behind this decision?
- Do you think that these two variables (income and level of formal education) are correlated? If so, please describe the nature of their relationship.
- What would be the regression equation for predicting income \begin{align*}(Y)\end{align*} from the level of education \begin{align*}(X)\end{align*}?
- Using this regression equation, predict the income for a person with \begin{align*}2\;\mathrm{years}\end{align*} of college (\begin{align*}13.5 \;\mathrm{years}\end{align*} of formal education).
- Test the null hypothesis that, in the population, the regression coefficient for this scenario is zero.
  - (a) First develop the null and alternative hypotheses.
  - (b) Set the critical values at \begin{align*}\alpha =.05\end{align*}.
  - (c) Compute the test statistic.
  - (d) Make a decision regarding the null hypothesis.
- For those parents with \begin{align*}15\;\mathrm{years}\end{align*} of formal education, what is the percentage that will have an annual income greater than \begin{align*}\$18,500\end{align*}?
- For those parents with \begin{align*}12\;\mathrm{years}\end{align*} of formal education, what is the percentage that will have an annual income greater than \begin{align*}\$18,500\end{align*}?
- Develop a \begin{align*}95\%\end{align*} confidence interval for a predicted annual income when a parent indicates that they have a college degree (i.e., \begin{align*}16 \;\mathrm{years}\end{align*} of formal education).
- If you were the college counselor, what would you say in the presentation to the parents and students about the relationship between further education and salary? Would you encourage students to further their education based on these analyses? Why or why not?
Review Answers
- The predictor variable is the number of years of formal education. The reasoning behind this decision is that we are trying to determine and predict the financial benefits of further education (as measured by annual salary) by using the number of years of formal education (the predictor, or the \begin{align*}X\end{align*}, variable).
- Yes. With an \begin{align*}r\end{align*}-value of \begin{align*}0.67\end{align*}, these two variables appear to be moderately to strongly correlated. The nature of the relationship is a relatively strong, positive correlation.
- \begin{align*}Y = 0.782X+ 3.842\end{align*}
- For \begin{align*}X=13.5, Y = 14.39\end{align*} or \begin{align*}\$14,390\end{align*}
- (a) \begin{align*}H_0: \beta = 0, H_a: \beta \ne 0\end{align*} (b) The critical values are set at \begin{align*}t = \pm 1.98\end{align*} (c) \begin{align*}s_b = \frac{s_{Y * X}} {\sqrt{SS_x}} = \frac{3.12} {\sqrt{1542}} = 0.08, \quad t = \frac{b - \beta} {s_b} = \frac{0.782 - 0} {0.08} = 9.78\end{align*} (d) Since the calculated test statistic of \begin{align*}9.78\end{align*} exceeds the critical value of \begin{align*}1.98\end{align*}, we reject the null hypothesis and conclude that, if the null hypothesis were true, we would observe a regression coefficient at least as extreme as \begin{align*}0.782\end{align*} by chance less than \begin{align*}5\%\end{align*} of the time.
- For \begin{align*}X=15\end{align*}, \begin{align*}\hat{Y}=15.57\end{align*}. Therefore, \begin{align*}18.50\end{align*} has a \begin{align*}z\end{align*}-value of \begin{align*}0.94\end{align*}: \begin{align*}z = \frac{Y - \hat{Y}} {s_{Y * X}} = \frac{18.5 - 15.57} {3.12} = 0.94\end{align*} The area to the right of a \begin{align*}z\end{align*}-value of \begin{align*}0.94\end{align*} is \begin{align*}.1736\end{align*}. This means that with \begin{align*}15\;\mathrm{years}\end{align*} of formal education, an estimated \begin{align*}17.36\%\end{align*} of the parents will have an income greater than \begin{align*}\$18,500\end{align*}.
- For \begin{align*}X=12\end{align*}, \begin{align*}\hat{Y}=13.23\end{align*}. Therefore, \begin{align*}18.50\end{align*} has a \begin{align*}z\end{align*}-value of \begin{align*}1.69\end{align*}: \begin{align*}z = \frac{Y - \hat{Y}} {s_{Y * X}} = \frac{18.5 - 13.23} {3.12} = 1.69\end{align*} The area to the right of a \begin{align*}z\end{align*}-value of \begin{align*}1.69\end{align*} is \begin{align*}.0455\end{align*}. This means that with \begin{align*}12\;\mathrm{years}\end{align*} of formal education, an estimated \begin{align*}4.55\%\end{align*} of the parents will have an income greater than \begin{align*}\$18,500\end{align*}.
- \begin{align*}s_{\hat{Y}} = s_{Y * X} \sqrt{1 + \frac{1} {n} + \frac{(X - \bar{X})^2} {SS_x}} = 3.12 \sqrt{1 + \frac{1} {120} + \frac{(16 - 14.85)^2} {1542}} = 3.13\end{align*}
Using the general formula for the confidence interval \begin{align*}(CI = \hat{Y} \pm (t_{cv} s_{\hat{Y}}))\end{align*}, we find that
\begin{align*}CI_{95} & = 16.35 \pm (1.98) (3.13) = 16.35 \pm 6.20\\ CI_{95} & = (10.15, 22.55)\end{align*}
- The answer is at the discretion of the teacher.
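The regression equation and test statistic in these answers can be checked from the summary data with a short Python sketch. It assumes the slope is obtained as \begin{align*}b = r \cdot s_Y / s_x\end{align*} and that the standard error of estimate is obtained as \begin{align*}s_Y \sqrt{1 - r^2}\end{align*}, which reproduces the 3.12 used above; both formulas are assumptions, since the answers do not show how these values were derived. Because the code does not round intermediate values, its \begin{align*}t\end{align*} statistic is slightly larger than the hand-calculated one.

```python
import math

# Summary data from the review questions
n = 120
r = 0.67
sum_x, sum_y = 1782, 1854
s_x, s_y = 3.6, 4.2
ss_x = 1542
t_critical = 1.98  # two-tailed, alpha = .05, df = 118 (from a t-table)

mean_x = sum_x / n
mean_y = sum_y / n

# Regression equation: Y-hat = b * X + a (assumed formula b = r * s_y / s_x)
b = r * s_y / s_x
a = mean_y - b * mean_x

# Standard error of estimate (assumed formula, matches the 3.12 in the answers)
s_yx = s_y * math.sqrt(1 - r ** 2)

# Standard error of the slope and the test statistic under H0: beta = 0
s_b = s_yx / math.sqrt(ss_x)
t = b / s_b

print(f"Y-hat = {b:.3f}X + {a:.3f}")
print(f"s_yx = {s_yx:.2f}, s_b = {s_b:.3f}, t = {t:.2f}, reject H0: {abs(t) > t_critical}")
```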