9.4: Multiple Regression
Learning Objectives
- Understand the multiple regression equation and the coefficients of determination for correlation of three or more variables.
- Calculate the multiple regression equation using technological tools.
- Calculate the standard error of a coefficient, test a coefficient for significance to evaluate a hypothesis and calculate the confidence interval for a coefficient using technological tools.
Introduction
In the previous sections, we learned a bit about examining the relationship between two variables by calculating the correlation coefficient and the linear regression line. But, as we all know, we often work with more than two variables. For example, what happens if we want to examine the impact that class size and number of faculty members have on a university ranking? Since we are taking multiple predictor variables into account, the simple linear regression model just won’t work. In multiple linear regression, scores for one variable (in this example, university ranking) are predicted using multiple predictor variables (class size and number of faculty members).
Another common use of the multiple regression model is in the estimation of the selling price of a home. There are a number of variables that go into determining how much a particular house will cost including the square footage, the number of bedrooms, the number of bathrooms, the age of the house, the neighborhood, etc. Analysts use multiple regression to estimate the selling price in relation to all of these different types of variables.
In this section, we will examine the components of the multiple regression equation, calculate the equation using technological tools and use this equation to test for significance to evaluate a hypothesis.
Understanding the Multiple Regression Equation
If we were to try to draw a multiple regression model, it would be a bit more difficult than drawing the model for linear regression. Let’s say that we have two predictor variables (\begin{align*}X_1\end{align*} and \begin{align*}X_2\end{align*}) that are predicting the desired variable \begin{align*}(Y)\end{align*}. The regression equation would be:
\begin{align*}\hat{Y} = b_1 X_1 + b_2 X_2 + a\end{align*}
Since there are three variables, each observation would have three scores, and therefore these scores would be plotted in three dimensions (see figure below). When there are more than two predictor variables, we would continue to plot these in multiple dimensions. Regardless of how many predictor variables we have, we still use the least squares method to minimize the squared distances between the actual and predicted values.
When predicting values using multiple regression, we can also use the standard score form of the formula:
\begin{align*}z_{\hat{Y}} = \beta_1 z_1 + \beta_2 z_2 + \text{etc}\ \ldots\end{align*}
where:
\begin{align*}z_{\hat{Y}} =\end{align*} the predicted or criterion variable, in standard score form
\begin{align*}\beta =\end{align*} the standardized regression coefficient
\begin{align*}z =\end{align*} the predictor variable, in standard score form
To solve for the regression coefficients and the constant, we first need to determine the multiple correlation coefficient \begin{align*}(R)\end{align*} and the coefficient of determination, also known as the proportion of shared variance \begin{align*}(R^2)\end{align*}. In a linear regression model, we measured \begin{align*}R^2\end{align*} by comparing the squared distances between the actual points and the points predicted by the regression line with the total variation in \begin{align*}Y\end{align*}. So what does \begin{align*}R^2\end{align*} look like in a multiple regression model? Let’s take a look at the figure above. Essentially, as in the linear regression model, the theory behind the computation of the multiple regression equation is to minimize the sum of the squared deviations from the observations to the regression plane.
In most situations, we use the computer to calculate the multiple regression equation and determine the coefficients in this equation. We can also do multiple regression on a TI-83/84 calculator (this program can be downloaded from http://www.wku.edu/~david.neal/manual/ti83.html). However, it is helpful to explain the calculations that go into the multiple regression equation so we can get a better understanding of how this formula works.
After we find the correlation values \begin{align*}(r)\end{align*} between the variables, we can use the following formulas to determine the regression coefficients for each of the predictor \begin{align*}(X)\end{align*} variables:
\begin{align*}\beta_1 & = \frac{r_{Y1} - (r_{Y2}) (r_{12})} {1 - r^2_{12}}\\ \beta_2 & = \frac{r_{Y2} - (r_{Y1}) (r_{12})} {1 - r^2_{12}}\end{align*}
where:
\begin{align*}\beta_1, \beta_2 =\end{align*} the standardized regression (beta) coefficients for the first and second predictor variables
\begin{align*}r_{Y1} =\end{align*} correlation between the criterion variable \begin{align*}(Y)\end{align*} and the first predictor variable \begin{align*}(X_1)\end{align*}
\begin{align*}r_{Y2} =\end{align*} correlation between the criterion variable \begin{align*}(Y)\end{align*} and the second predictor variable \begin{align*}(X_2)\end{align*}
\begin{align*}r_{12} =\end{align*} correlation between the two predictor variables
After solving for the beta coefficients, we can compute the \begin{align*}b\end{align*} coefficients using the following formulas:
\begin{align*}b_1 & = \beta_1 \left (\frac{s_Y} {s_1} \right)\\ b_2 & = \beta_2 \left (\frac{s_Y} {s_2} \right)\end{align*}
where:
\begin{align*}s_Y =\end{align*} the standard deviation of the criterion variable \begin{align*}(Y)\end{align*}
\begin{align*}s_1, s_2 =\end{align*} the standard deviation of the particular predictor variable (\begin{align*}s_1\end{align*} for the first predictor variable and so forth)
After solving for the regression coefficients, we can finally solve for the regression constant by using the formula:
\begin{align*}a = \bar{Y} - \sum^k_{i = 1} b_i \bar{X}_i\end{align*}
Again, since these formulas and calculations are extremely tedious to complete by hand, we use the computer or TI-83 calculator to solve for the coefficients in the multiple regression equation.
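To make the hand calculation above more concrete, here is a minimal Python sketch that applies the correlation-based formulas to the water consumption data used in the example later in this lesson. It is only an illustration under stated assumptions: the helper names (`mean`, `sample_sd`, `pearson_r`) are our own and not part of any statistics package, and only the standard library is used.

```python
from math import sqrt

# Water consumption data from the example later in this lesson
temperature = [75, 83, 85, 85, 92, 97, 99]                # X1, degrees F
practice_time = [1.85, 1.25, 1.5, 1.75, 1.15, 1.75, 1.6]  # X2, hours
water = [16, 20, 25, 27, 32, 48, 48]                      # Y, ounces


def mean(values):
    return sum(values) / len(values)


def sample_sd(values):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Step 1: pairwise correlations
r_y1 = pearson_r(water, temperature)
r_y2 = pearson_r(water, practice_time)
r_12 = pearson_r(temperature, practice_time)

# Step 2: standardized (beta) coefficients
beta_1 = (r_y1 - r_y2 * r_12) / (1 - r_12 ** 2)
beta_2 = (r_y2 - r_y1 * r_12) / (1 - r_12 ** 2)

# Step 3: unstandardized b coefficients
b_1 = beta_1 * sample_sd(water) / sample_sd(temperature)
b_2 = beta_2 * sample_sd(water) / sample_sd(practice_time)

# Step 4: regression constant
a = mean(water) - (b_1 * mean(temperature) + b_2 * mean(practice_time))

print(f"b1 = {b_1:.4f}, b2 = {b_2:.4f}, a = {a:.3f}")
```

The printed coefficients should agree, up to rounding, with the Coefficients column of the Excel summary output shown later in this lesson.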
Calculating the Multiple Regression Equation using Technological Tools
As mentioned, there are a variety of technological tools to calculate the coefficients in the multiple regression equation. When using the computer, there are several programs that help us calculate the multiple regression equation, including Microsoft Excel, the Statistical Analysis System (SAS) and the Statistical Package for the Social Sciences (SPSS) software. Each of these programs allows the user to calculate the multiple regression equation and provides summary statistics for each of the models.
For the purposes of this lesson, we will use summary tables produced by Microsoft Excel to solve problems with multiple regression equations. While the summary tables produced by the different technological tools differ slightly in format, they all provide us with the information needed to build a multiple regression model, conduct hypothesis tests and construct confidence intervals. Let’s take a look at an example of a summary statistics table so we get a better idea of how we can use technological tools to build multiple regression models.
Example:
Let’s say that we want to predict the amount of water consumed by football players during summer practices. The football coach notices that the water consumption tends to be influenced by the time that the players are on the field and the temperature. He measures the average water consumption, temperature and practice time for seven practices and records the following data:
Temperature \begin{align*}(F)\end{align*} | Practice Time (Hrs) | \begin{align*}H_2O\end{align*} Consumption (in ounces) |
---|---|---|
\begin{align*}75\end{align*} | \begin{align*}1.85\end{align*} | \begin{align*}16\end{align*} |
\begin{align*}83\end{align*} | \begin{align*}1.25\end{align*} | \begin{align*}20\end{align*} |
\begin{align*}85\end{align*} | \begin{align*}1.5\end{align*} | \begin{align*}25\end{align*} |
\begin{align*}85\end{align*} | \begin{align*}1.75\end{align*} | \begin{align*}27\end{align*} |
\begin{align*}92\end{align*} | \begin{align*}1.15\end{align*} | \begin{align*}32\end{align*} |
\begin{align*}97\end{align*} | \begin{align*}1.75\end{align*} | \begin{align*}48\end{align*} |
\begin{align*}99\end{align*} | \begin{align*}1.6\end{align*} | \begin{align*}48\end{align*} |
Figure: Water consumption by football players compared to practice time and temperature.
Here is the procedure for performing a multiple regression in Excel using this set of data.
- Copy and paste the table into an empty Excel worksheet
- Select Data Analysis from the Tools menu and choose “Regression” from the list that appears
- Place the cursor in the “Input \begin{align*}Y\end{align*} range” field and select the third column.
- Place the cursor in the “Input \begin{align*}X\end{align*} range” field and select the first and second columns
- Place the cursor in the “Output Range” and click somewhere in a blank cell below and to the left of the table.
- Click “Labels” so that the names of the predictor variables will be displayed in the table
- Click OK and the results shown below will be displayed.
SUMMARY OUTPUT
Regression Statistics
\begin{align*}& \text{Multiple R} && 0.996822 \\ & \text{R Square } && 0.993654 \\ & \text{Adjusted R Square} && 0.990481 \\ & \text{Standard Error} && 1.244877\\ & \text{Observations} && 7\end{align*}
Source | \begin{align*}df\end{align*} | \begin{align*}SS\end{align*} | \begin{align*}MS\end{align*} | \begin{align*}F\end{align*} | Significance \begin{align*}F\end{align*} | |
---|---|---|---|---|---|---|
Regression | \begin{align*}2\end{align*} | \begin{align*}970.6583\end{align*} | \begin{align*}485.3291\end{align*} | \begin{align*}313.1723\end{align*} | \begin{align*}4.03E-05\end{align*} | |
Residual | \begin{align*}4\end{align*} | \begin{align*}6.198878\end{align*} | \begin{align*}1.549719\end{align*} | |||
Total | \begin{align*}6\end{align*} | \begin{align*}976.8571\end{align*} |
Term | Coefficients | Standard Error | \begin{align*}t\end{align*} Stat | \begin{align*}P\end{align*}-value | Lower \begin{align*}95\%\end{align*} | Upper \begin{align*}95\%\end{align*} |
---|---|---|---|---|---|---|
Intercept | \begin{align*}-121.655\end{align*} | \begin{align*}6.540348\end{align*} | \begin{align*}-18.6007\end{align*} | \begin{align*}4.92E-05\end{align*} | \begin{align*}-139.814\end{align*} | \begin{align*}-103.496\end{align*} |
Temperature | \begin{align*}1.512364\end{align*} | \begin{align*}0.060771\end{align*} | \begin{align*}24.88626\end{align*} | \begin{align*}1.55E-05\end{align*} | \begin{align*}1.343636\end{align*} | \begin{align*}1.681092\end{align*} |
Practice Time | \begin{align*}12.53168\end{align*} | \begin{align*}1.93302\end{align*} | \begin{align*}6.482954\end{align*} | \begin{align*}0.002918\end{align*} | \begin{align*}7.164746\end{align*} | \begin{align*}17.89862\end{align*} |
Remember, we can also use the TI-83/84 calculator to perform multiple regression analysis. The program for this analysis can be downloaded at http://www.wku.edu/~david.neal/manual/ti83.html.
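Readers without Excel can reproduce the Coefficients column in a general-purpose language. The sketch below is one possible approach in Python and is not a description of what Excel does internally: it builds the least squares normal equations and solves them with a small hand-written elimination routine. The function name `fit_least_squares` is hypothetical, and only the standard library is used.

```python
def fit_least_squares(rows, y):
    """Fit y = a + b1*x1 + ... + bk*xk by solving the normal equations.

    rows: one tuple of predictor values per observation.
    Returns the list [a, b1, ..., bk].
    """
    # Design matrix with a leading 1 in each row for the intercept
    X = [[1.0] + [float(v) for v in row] for row in rows]
    p = len(X[0])

    # Augmented system (X'X | X'y)
    A = [
        [sum(r[i] * r[j] for r in X) for j in range(p)]
        + [sum(r[i] * yi for r, yi in zip(X, y))]
        for i in range(p)
    ]

    # Gauss-Jordan elimination with partial pivoting
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [vr - factor * vc for vr, vc in zip(A[r], A[col])]

    return [A[i][p] / A[i][i] for i in range(p)]


# Data from the table in this example
temperature = [75, 83, 85, 85, 92, 97, 99]
practice_time = [1.85, 1.25, 1.5, 1.75, 1.15, 1.75, 1.6]
water = [16, 20, 25, 27, 32, 48, 48]

intercept, b_temp, b_time = fit_least_squares(
    list(zip(temperature, practice_time)), water
)
print(f"Intercept      {intercept:.3f}")
print(f"Temperature    {b_temp:.6f}")
print(f"Practice Time  {b_time:.5f}")
```

The three printed values should match the Intercept, Temperature and Practice Time entries of the Coefficients column above, up to rounding. For larger problems, a dedicated statistics package would be a more robust choice than this hand-rolled solver.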
In this excerpt, we have a number of summary statistics that give us information about the model. As you can see from the printout above, we have information on the regression coefficient \begin{align*}(b)\end{align*} for each variable, the standard error of each regression coefficient \begin{align*}se(b)\end{align*}, and the \begin{align*}R^2\end{align*} value for the model.
Using this information, we can take all of the regression coefficients and put them together to make our model. In this example, our regression equation would be \begin{align*}\hat{Y} = -121.66 + 1.51X_1 + 12.53X_2\end{align*}, where \begin{align*}X_1\end{align*} is the temperature and \begin{align*}X_2\end{align*} is the practice time. Each of these coefficients tells us something about the relationship between the predictor variable and the predicted outcome. The temperature coefficient of \begin{align*}1.51\end{align*} tells us that for every \begin{align*}1.0 \;\mathrm{degree}\end{align*} increase in temperature, we predict an increase of \begin{align*}1.51 \;\mathrm{ounces}\end{align*} of water consumed if we hold the practice time constant. Similarly, the practice time coefficient of \begin{align*}12.53\end{align*} tells us that for every \begin{align*}1 \;\mathrm{hour}\end{align*} increase in practice time, we predict players to consume an additional \begin{align*}12.53 \;\mathrm{ounces}\end{align*} of water (about \begin{align*}2.1 \;\mathrm{ounces}\end{align*} for every additional \begin{align*}10 \;\mathrm{minutes}\end{align*}) if we hold the temperature constant.
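As a quick illustration of how the fitted equation is used, the following Python sketch plugs values into it. The coefficients are copied from the summary output above; the function name `predict_water` and the inputs (a 90 degree day and a 1.5 hour practice) are hypothetical choices made only for this example.

```python
# Coefficients copied from the Excel summary output above
INTERCEPT = -121.655
B_TEMPERATURE = 1.512364     # ounces per degree F
B_PRACTICE_TIME = 12.53168   # ounces per hour of practice


def predict_water(temp_f, hours):
    """Predicted water consumption (ounces) from the fitted equation."""
    return INTERCEPT + B_TEMPERATURE * temp_f + B_PRACTICE_TIME * hours


base = predict_water(90, 1.5)
hotter = predict_water(91, 1.5)   # one degree warmer, same practice time
longer = predict_water(90, 2.5)   # one hour longer, same temperature

print(round(base, 2))             # predicted ounces at 90 F and 1.5 hours
print(round(hotter - base, 4))    # change equals the temperature coefficient
print(round(longer - base, 4))    # change equals the practice time coefficient
```

Changing one input while holding the other fixed shifts the prediction by exactly that input's coefficient, which is what "holding the other variable constant" means in this model.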
With an \begin{align*}R^2\end{align*} of \begin{align*}0.99\end{align*}, we can conclude that approximately \begin{align*}99\%\end{align*} of the variance in the outcome variable \begin{align*}(Y)\end{align*} can be explained by the combined predictor variables. In other words, almost all of the variance in water consumption is attributable to variation in temperature and practice time. Notice that the adjusted \begin{align*}R^2\end{align*} is only slightly different from the unadjusted \begin{align*}R^2\end{align*}. The adjustment penalizes a model for the number of predictor variables relative to the number of observations; here the penalty is small because there are only two predictor variables and the unadjusted \begin{align*}R^2\end{align*} is already so close to \begin{align*}1\end{align*}.
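The entries in the Regression Statistics block can be recovered from the ANOVA table. The short Python sketch below shows one way to do so. The adjusted \begin{align*}R^2\end{align*} formula used here is the usual textbook one and is our assumption, since the lesson does not state which formula Excel applies; the variable names are our own.

```python
from math import sqrt

# Values read from the ANOVA table in the summary output
ss_regression = 970.6583
ss_residual = 6.198878
n = 7   # observations
k = 2   # predictor variables

ss_total = ss_regression + ss_residual

# Proportion of the total variation explained by the model
r_squared = ss_regression / ss_total

# Usual adjustment for the number of predictors (assumed formula)
adjusted_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

# Standard error of the estimate: square root of the residual mean square
standard_error = sqrt(ss_residual / (n - k - 1))

multiple_r = sqrt(r_squared)

print(f"Multiple R         {multiple_r:.6f}")
print(f"R Square           {r_squared:.6f}")
print(f"Adjusted R Square  {adjusted_r_squared:.6f}")
print(f"Standard Error     {standard_error:.6f}")
```

The results should agree with the Regression Statistics block of the summary output up to rounding.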
Testing for Significance to Evaluate a Hypothesis, the Standard Error of a Coefficient and Constructing Confidence Intervals
When we perform multiple regression analysis, we are essentially trying to determine if our predictor variables explain the variation in the outcome variable \begin{align*}(Y)\end{align*}. When we put together our final model, we are looking at whether or not the variables explain most of the variation \begin{align*}(R^2)\end{align*} and if this \begin{align*}R^2\end{align*} value is statistically significant. We can use technological tools to test the significance of this \begin{align*}R^2\end{align*} value and to construct confidence intervals around these results.
Hypothesis Testing
When we conduct a hypothesis test, we test the null hypothesis that the multiple \begin{align*}R\end{align*} value in the population equals zero \begin{align*}(H_0: R_{\mathrm{pop}} = 0)\end{align*}. Under this scenario, the predicted or fitted values would all be very close to the mean, so the deviations \begin{align*}(\hat{Y} - \bar{Y})\end{align*}, and therefore the regression sum of squares, would be very small (close to \begin{align*}0\end{align*}). Therefore, we want to calculate a test statistic (in this case the \begin{align*}F\end{align*} statistic) that compares the variance explained by the predictor variables with the variance left unexplained. If this test statistic is beyond the critical value and the null hypothesis is rejected, we can conclude that there is a nonzero relationship between the criterion variable \begin{align*}(Y)\end{align*} and the predictor variables. When we reject the null hypothesis we can say something to the effect of “The probability that \begin{align*}R^2=XX\end{align*} would have occurred by chance if the null hypothesis were true is less than \begin{align*}.05\end{align*} (or \begin{align*}.10, .01\end{align*}, etc.).” As mentioned, we can use computer programs to determine the \begin{align*}F\end{align*}-statistic and its significance.
Let’s take a look at the example above and interpret the \begin{align*}F\end{align*} value. We see that we have a very high \begin{align*}R^2\end{align*} value of \begin{align*}0.99\end{align*}, which means that almost all of the variance in the outcome variable (water consumption) can be explained by the predictor variables (practice time and temperature). Our ANOVA (ANalysis Of VAriance) table tells us that we have a calculated \begin{align*}F\end{align*} statistic of \begin{align*}313.17\end{align*}, which has an associated probability value of \begin{align*}4.03E-05 (0.0000403)\end{align*}. This means that, if the null hypothesis were true (i.e., none of the variance is explained in the population), the probability of obtaining an \begin{align*}R^2\end{align*} this large by chance would be \begin{align*}0.0000403\end{align*}. In other words, it is highly unlikely that this large level of explained variance occurred by chance.
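The \begin{align*}F\end{align*} statistic in the ANOVA table is the ratio of the two mean squares, and each mean square is a sum of squares divided by its degrees of freedom. The Python sketch below retraces that arithmetic. For the probability value it relies on a closed form that holds only when the numerator has exactly 2 degrees of freedom, as it does here; this shortcut is our own choice for a standard-library example, and in general a statistics package or an \begin{align*}F\end{align*} table would be used instead.

```python
# Values read from the ANOVA table in the summary output
ss_regression, df_regression = 970.6583, 2
ss_residual, df_residual = 6.198878, 4

# Mean squares: sum of squares divided by degrees of freedom
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

# F statistic: explained variance relative to unexplained variance
f_statistic = ms_regression / ms_residual

# Upper-tail probability of F. This closed form is valid ONLY when the
# numerator degrees of freedom equal 2 (assumption checked below).
assert df_regression == 2
p_value = (1 + 2 * f_statistic / df_residual) ** (-df_residual / 2)

print(f"F = {f_statistic:.4f}")
print(f"Significance F = {p_value:.3g}")
```

Both values should agree with the \begin{align*}F\end{align*} and Significance \begin{align*}F\end{align*} entries in the ANOVA table up to rounding.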
Standard Error of a Coefficient and Testing for Significance
In addition to performing a test to assess the probability of the regression line occurring by chance, we can also test the significance of individual coefficients. This is helpful in determining whether or not the variable significantly contributes to the regression. For example, if we find that a variable does not significantly contribute to the regression we may choose not to include it in the final regression equation. Again, we can use computer programs to determine the standard error, the test statistic and its level of significance.
Looking at our example above, we see that Excel has calculated the standard error and the test statistic (in this case, the \begin{align*}t\end{align*}-statistic) for each of the predictor variables. We see that temperature has a \begin{align*}t\end{align*}-statistic of \begin{align*}24.89\end{align*} and a corresponding \begin{align*}p\end{align*}-value of \begin{align*}1.55E-05\end{align*} and that practice time has a \begin{align*}t\end{align*}-statistic of \begin{align*}6.48\end{align*} and a corresponding \begin{align*}p\end{align*}-value of \begin{align*}0.002918\end{align*}. Depending on the situation, we can set our significance level at \begin{align*}0.10, 0.05\end{align*}, \begin{align*}0.01\end{align*}, etc. For this situation, we will use a significance level of \begin{align*}.05\end{align*}. Since both variables have \begin{align*}p\end{align*}-values below \begin{align*}.05\end{align*} (equivalently, \begin{align*}t\end{align*}-values that exceed the critical value), we can determine that both of these variables significantly contribute to the variance of the outcome variable and should be included in the regression equation.
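Each \begin{align*}t\end{align*}-statistic in the output is simply the coefficient divided by its standard error. The following Python sketch reproduces that division and compares the result with a critical value. The critical value of 2.776 is an assumption taken from a standard \begin{align*}t\end{align*} table (two-tailed, \begin{align*}.05\end{align*} level, \begin{align*}4\end{align*} residual degrees of freedom); the dictionary layout and names are our own.

```python
# (coefficient, standard error) pairs copied from the summary output
coefficients = {
    "Temperature": (1.512364, 0.060771),
    "Practice Time": (12.53168, 1.93302),
}

# Two-tailed critical t at the .05 level with 4 degrees of freedom,
# taken from a standard t table (assumed, not computed here)
T_CRITICAL = 2.776

t_stats = {}
significant = {}
for name, (b, se) in coefficients.items():
    t = b / se                       # test statistic for H0: coefficient = 0
    t_stats[name] = t
    significant[name] = abs(t) > T_CRITICAL
    print(f"{name}: t = {t:.2f}, significant = {significant[name]}")
```

Both predictors exceed the critical value, which is consistent with the small probability values reported in the output.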
Calculating the Confidence Interval for a Coefficient
We can also use technological tools to build a confidence interval around our regression coefficients. Remember earlier in the lesson we calculated confidence intervals around certain values in linear regression models. However, this concept is a bit different when we work with multiple regression models.
For the predictor variables in multiple regression, the confidence interval is based on t-tests and is the range around the observed sample regression coefficient, within which we can be \begin{align*}95\%\end{align*} (or any other predetermined level) confident the real regression coefficient for the population lies. In this example, we can say that we are \begin{align*}95\%\end{align*} confident that the population regression coefficient for temperature is between \begin{align*}1.34\end{align*} (the Lower \begin{align*}95\%\end{align*} entry) and \begin{align*}1.68\end{align*} (the Upper \begin{align*}95\%\end{align*} entry). In addition, we are \begin{align*}95\%\end{align*} confident that the population regression coefficient for practice time is between \begin{align*}7.16\end{align*} and \begin{align*}17.90\end{align*}.
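The Lower \begin{align*}95\%\end{align*} and Upper \begin{align*}95\%\end{align*} entries can be rebuilt as the coefficient plus or minus a critical \begin{align*}t\end{align*} value times the standard error. The Python sketch below does this. The critical value 2.776445 is an assumption taken from a \begin{align*}t\end{align*} table (two-tailed \begin{align*}95\%\end{align*}, \begin{align*}4\end{align*} residual degrees of freedom), and the function name `confidence_interval` is hypothetical.

```python
# Two-tailed 95% critical t with 4 degrees of freedom (from a t table)
T_CRITICAL = 2.776445


def confidence_interval(b, se, t_crit=T_CRITICAL):
    """Return (lower, upper) bounds: b minus/plus t_crit * se."""
    margin = t_crit * se
    return b - margin, b + margin


# (coefficient, standard error) pairs copied from the summary output
terms = {
    "Intercept": (-121.655, 6.540348),
    "Temperature": (1.512364, 0.060771),
    "Practice Time": (12.53168, 1.93302),
}

intervals = {name: confidence_interval(b, se) for name, (b, se) in terms.items()}
for name, (lower, upper) in intervals.items():
    print(f"{name}: {lower:.3f} to {upper:.3f}")
```

The printed bounds should match the Lower \begin{align*}95\%\end{align*} and Upper \begin{align*}95\%\end{align*} columns of the output up to rounding. A different confidence level would only require a different critical value.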
Lesson Summary
1. In multiple linear regression, scores for one variable are predicted using multiple predictor variables. The regression equation we use is
\begin{align*}\hat{Y} = b_1 X_1 + b_2 X_2 + \ldots + a\end{align*}
2. When calculating the different parts of the multiple regression equation we can use a number of computer programs such as Microsoft Excel, SPSS and SAS.
3. These programs calculate the multiple regression coefficients, combined \begin{align*}R^2\end{align*} value and confidence interval for the regression coefficients.
Supplemental Links
- Manuals by a professor at Western Kentucky University for use in statistics, plus TI-83/4 programs for multiple regression that are available for download.
- Texas Instruments website that includes supplemental activities and practice problems using the TI-83 calculator
Review Questions
The lead English teacher is trying to determine the relationship between three tests given throughout the semester and the final exam. She decides to conduct a mini-study on this relationship and collects the test data (scores for Test 1, Test 2, Test 3 and the final exam) for \begin{align*}50\end{align*} students in freshman English. She enters these data into Microsoft Excel and arrives at the following summary statistics:
\begin{align*}& \text{Multiple R} && 0.6859 \\ & \text{R Square} && 0.4707 \\ & \text{Adjusted R Square} && 0.4369 \\ & \text{Standard Error} && 7.5718 \\ & \text{Observations} && 50\end{align*}
Source | \begin{align*}df\end{align*} | \begin{align*}SS\end{align*} | \begin{align*}MS\end{align*} | \begin{align*}F\end{align*} | Significance \begin{align*}F\end{align*} | |
---|---|---|---|---|---|---|
Regression | \begin{align*}3\end{align*} | \begin{align*}2342.7228\end{align*} | \begin{align*}780.9076\end{align*} | \begin{align*}13.621\end{align*} | \begin{align*}.0000\end{align*} | |
Residual | \begin{align*}46\end{align*} | \begin{align*}2637.2772\end{align*} | \begin{align*}57.3321\end{align*} | |||
Total | \begin{align*}49\end{align*} | \begin{align*}4980.0000\end{align*} |
Term | Coefficients | Standard Error | \begin{align*}t\end{align*} Stat | \begin{align*}P\end{align*}-value |
---|---|---|---|---|
Intercept | \begin{align*}10.7592\end{align*} | \begin{align*}7.6268\end{align*} | ||
Test 1 | \begin{align*}0.0506\end{align*} | \begin{align*}.1720\end{align*} | \begin{align*}.2941\end{align*} | \begin{align*}.7700\end{align*} |
Test 2 | \begin{align*}.5560\end{align*} | \begin{align*}.1431\end{align*} | \begin{align*}3.885\end{align*} | \begin{align*}.0003\end{align*} |
Test 3 | \begin{align*}.2128\end{align*} | \begin{align*}.1782\end{align*} | \begin{align*}1.194\end{align*} | \begin{align*}.2387\end{align*} |
- How many predictor variables are there in this scenario? What are the names of these predictor variables?
- What does the regression coefficient for Test 2 tell us?
- What is the regression model for this analysis?
- What is the \begin{align*}R^2\end{align*} value and what does it indicate?
- Determine whether the multiple \begin{align*}R\end{align*} is statistically significant.
- Which of the predictor variables are statistically significant? What is the reasoning behind this decision?
- Given this information, would you include all three predictor variables in the multiple regression model? Why or why not?
Review Answers
- There are 3 predictor variables: Test 1, Test 2 and Test 3.
- The regression coefficient of \begin{align*}0.5560\end{align*} tells us that every \begin{align*}1.000 \;\mathrm{percent}\end{align*} increase in Test 2 is associated with a \begin{align*}0.5560 \;\mathrm{percent}\end{align*} increase in the predicted final exam score when everything else is held constant.
- From the data given, the regression equation is \begin{align*}\hat{Y} = 0.0506 \ \ X_1 + 0.5560 \ \ X_2 +0.2128 \ \ X_3 +10.7592\end{align*}.
- The \begin{align*}R^2\end{align*} value is \begin{align*}0.4707\end{align*} and indicates that \begin{align*}47\%\end{align*} of the variance in the final exam can be attributed to the variance of the combined predictor variables.
- Using the printout, we see that the \begin{align*}F\end{align*} statistic is \begin{align*}13.621\end{align*} and has a corresponding \begin{align*}p\end{align*} value of \begin{align*}0.0000\end{align*}. This means that the probability that the observed \begin{align*}R\end{align*} value would have occurred by chance if the null hypothesis were true is very small (less than \begin{align*}0.00005\end{align*}, which rounds to \begin{align*}0.0000\end{align*}). The multiple \begin{align*}R\end{align*} is therefore statistically significant.
- Test 2. Upon closer examination, we find that only the Test 2 predictor variable is statistically significant, since the \begin{align*}t\end{align*} value of \begin{align*}3.885\end{align*} exceeds the critical value (as evidenced by the low \begin{align*}p\end{align*} value of \begin{align*}.0003\end{align*}).
- No. It is not necessary to include Test 1 and Test 3 in the multiple regression model, since neither of these variables has a test statistic that exceeds the critical value.