9.4: Multiple Regression
Learning Objectives
- Understand a multiple regression equation and the coefficients of determination for correlation of three or more variables.
- Calculate a multiple regression equation using technological tools.
- Calculate the standard error of a coefficient, test a coefficient for significance to evaluate a hypothesis, and calculate the confidence interval for a coefficient using technological tools.
Introduction
In the previous sections, we learned a bit about examining the relationship between two variables by calculating the correlation coefficient and the linear regression line. Often, however, we work with more than two variables. For example, what happens if we want to examine the impact that class size and number of faculty members have on a university's ranking? Since we are taking more than one predictor variable into account, the simple linear regression model is no longer sufficient. In multiple linear regression, scores for one variable (in this example, a university's ranking) are predicted using multiple predictor variables (class size and number of faculty members).
Another common use of multiple regression models is in the estimation of the selling price of a home. There are a number of variables that go into determining how much a particular house will cost, including the square footage, the number of bedrooms, the number of bathrooms, the age of the house, the neighborhood, and so on. Analysts use multiple regression to estimate the selling price in relation to all of these different types of variables.
In this section, we will examine the components of a multiple regression equation, calculate an equation using technological tools, and use this equation to test for significance in order to evaluate a hypothesis.
Understanding a Multiple Regression Equation
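If we were to try to draw a multiple regression model, it would be a bit more difficult than drawing a model for linear regression. Let’s say that we have two predictor variables, \begin{align*}X_1\end{align*} and \begin{align*}X_2\end{align*}, that are used to predict the criterion variable, \begin{align*}Y\end{align*}. The regression equation would be as follows: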
\begin{align*}\hat{Y}=b_1X_1 + b_2X_2 + a\end{align*}
When there are two predictor variables, the scores must be plotted in three dimensions (see figure below). When there are more than two predictor variables, we would continue to plot these in multiple dimensions. Regardless of how many predictor variables there are, we still use the least squares method to try to minimize the distance between the actual and predicted values.
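Written out for two predictor variables, the least squares criterion chooses \begin{align*}a\end{align*}, \begin{align*}b_1\end{align*}, and \begin{align*}b_2\end{align*} so that the following sum is as small as possible. The index \begin{align*}j\end{align*} over the \begin{align*}n\end{align*} observations and the label \begin{align*}SSE\end{align*} are notation introduced here for illustration only:
\begin{align*}SSE = \sum_{j=1}^n \left ( Y_j - \hat{Y}_j \right )^2 = \sum_{j=1}^n \left ( Y_j - b_1X_{1j} - b_2X_{2j} - a \right )^2\end{align*}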
When predicting values using multiple regression, we first use the standard score form of the regression equation, which is shown below:
\begin{align*}\hat{Y} = \beta_1X_1 + \beta_2X_2 + \ldots + \beta_iX_i\end{align*}
where:
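\begin{align*}\hat{Y}\end{align*} is the predicted score for the criterion variable.
\begin{align*}\beta_i\end{align*} is the standardized regression coefficient (beta coefficient) for the \begin{align*}i\end{align*}th predictor variable.
\begin{align*}X_i\end{align*} is the \begin{align*}i\end{align*}th predictor variable, expressed as a standard score.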
To solve for the regression coefficients and the regression constant, we first need to determine the correlation coefficients, \begin{align*}r\end{align*}, among the variables.
In most situations, we use a computer to calculate the multiple regression equation and determine the coefficients in this equation. We can also do multiple regression on a TI-83/84 calculator. (This program can be downloaded.)
Technology Note: Multiple Regression Analysis on the TI-83/84 Calculator
http://www.wku.edu/~david.neal/manual/ti83.html
Download a program for multiple regression analysis on the TI-83/84 calculator by first clicking on the link above.
It is helpful to explain the calculations that go into a multiple regression equation so we can get a better understanding of how this formula works.
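After we find the correlation values, \begin{align*}r\end{align*}, between each pair of variables, we can use the following formulas to determine the beta coefficients when there are two predictor variables: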
\begin{align*}\beta_1 & = \frac{r_{Y1} - (r_{Y2}) (r_{12})}{1-r^2_{12}}\\ \beta_2 & = \frac{r_{Y2}-(r_{Y1}) (r_{12})}{1-r^2_{12}}\end{align*}
where:
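\begin{align*}\beta_1\end{align*} is the beta coefficient (standardized regression coefficient) for the first predictor variable.
\begin{align*}\beta_2\end{align*} is the beta coefficient (standardized regression coefficient) for the second predictor variable.
\begin{align*}r_{Y1}\end{align*} is the correlation between the criterion variable, \begin{align*}Y\end{align*}, and the first predictor variable, \begin{align*}X_1\end{align*}.
\begin{align*}r_{Y2}\end{align*} is the correlation between the criterion variable, \begin{align*}Y\end{align*}, and the second predictor variable, \begin{align*}X_2\end{align*}.
\begin{align*}r_{12}\end{align*} is the correlation between the two predictor variables, \begin{align*}X_1\end{align*} and \begin{align*}X_2\end{align*}.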
After solving for the beta coefficients, we can then compute the \begin{align*}b\end{align*} coefficients (the unstandardized regression coefficients) by using the following formulas:
\begin{align*}b_1 & = \beta_1\left ( \frac{s_Y}{s_1} \right )\\ b_2 & = \beta_2 \left ( \frac{s_Y}{s_2} \right )\end{align*}
where:
\begin{align*}s_Y\end{align*} is the standard deviation of the criterion variable, \begin{align*}Y\end{align*}.
\begin{align*}s_i\end{align*} is the standard deviation of the \begin{align*}i\end{align*}th predictor variable (\begin{align*}s_1\end{align*} for the first predictor variable, \begin{align*}s_2\end{align*} for the second, and so on).
After solving for the regression coefficients, we can finally solve for the regression constant by using the formula shown below, where \begin{align*}k\end{align*} is the number of predictor variables:
\begin{align*}a=\bar{y}-\sum_{i=1}^k b_i\bar{x}_i\end{align*}
Again, since these formulas and calculations are extremely tedious to complete by hand, we usually use a computer or a TI-83/84 calculator to solve for the coefficients in a multiple regression equation.
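To make the sequence of formulas above concrete, here is a minimal sketch in plain Python that applies them in order: correlations, then beta coefficients, then \begin{align*}b\end{align*} coefficients, then the regression constant. It borrows the water-consumption data from the football example later in this section. The helper names (`mean`, `sample_sd`, `pearson_r`) are made up for this illustration and are not part of any particular package.

```python
from math import sqrt

# Data from the water-consumption example in this section.
temperature = [75, 83, 85, 85, 92, 97, 99]                 # X1, degrees F
practice_time = [1.85, 1.25, 1.5, 1.75, 1.15, 1.75, 1.6]   # X2, hours
water = [16, 20, 25, 27, 32, 48, 48]                       # Y, ounces


def mean(values):
    return sum(values) / len(values)


def sample_sd(values):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def pearson_r(u, v):
    """Pearson correlation coefficient between two equal-length lists."""
    mu, mv = mean(u), mean(v)
    cross = sum((p - mu) * (q - mv) for p, q in zip(u, v))
    spread_u = sum((p - mu) ** 2 for p in u)
    spread_v = sum((q - mv) ** 2 for q in v)
    return cross / sqrt(spread_u * spread_v)


# Step 1: pairwise correlations.
r_y1 = pearson_r(water, temperature)
r_y2 = pearson_r(water, practice_time)
r_12 = pearson_r(temperature, practice_time)

# Step 2: beta (standardized) coefficients.
beta_1 = (r_y1 - r_y2 * r_12) / (1 - r_12 ** 2)
beta_2 = (r_y2 - r_y1 * r_12) / (1 - r_12 ** 2)

# Step 3: b (unstandardized) coefficients.
b_1 = beta_1 * sample_sd(water) / sample_sd(temperature)
b_2 = beta_2 * sample_sd(water) / sample_sd(practice_time)

# Step 4: regression constant.
a = mean(water) - b_1 * mean(temperature) - b_2 * mean(practice_time)

print(f"b1 = {b_1:.4f}, b2 = {b_2:.4f}, a = {a:.3f}")
```

The printed values should agree, up to rounding, with the coefficients that Excel reports for the same data in the example later in this section.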
Calculating a Multiple Regression Equation using Technological Tools
As mentioned, there are a variety of technological tools available to calculate the coefficients in a multiple regression equation. When using a computer, there are several programs that help us calculate the multiple regression equation, including Microsoft Excel, the Statistical Analysis System (SAS), and the Statistical Package for the Social Sciences (SPSS). Each of these programs allows the user to calculate the multiple regression equation and provides summary statistics for each of the models.
For the purposes of this lesson, we will work with summary tables produced by Microsoft Excel to solve problems with multiple regression equations. While the summary tables produced by the different technological tools differ slightly in format, they all provide us with the information needed to build a multiple regression equation, conduct hypothesis tests, and construct confidence intervals. Let’s take a look at an example of a summary statistics table so we get a better idea of how we can use technological tools to build multiple regression equations.
Example: Suppose we want to predict the amount of water consumed by football players during summer practices. The football coach notices that the water consumption tends to be influenced by the time that the players are on the field and by the temperature. He measures the average water consumption, temperature, and practice time for seven practices and records the following data:
Temperature (degrees \begin{align*}F\end{align*}) | Practice Time (hrs) | \begin{align*}H_2O\end{align*} Consumption (in ounces) |
---|---|---|
75 | 1.85 | 16 |
83 | 1.25 | 20 |
85 | 1.5 | 25 |
85 | 1.75 | 27 |
92 | 1.15 | 32 |
97 | 1.75 | 48 |
99 | 1.6 | 48 |
Figure: Water consumption by football players compared to practice time and temperature.
Technology Note: Using Excel for Multiple Regression
- Copy and paste the table into an empty Excel worksheet.
- Click the Data choice on the toolbar, then select ’Data Analysis,’ and then choose ’Regression’ from the list that appears. (Note: if Data Analysis does not appear as a choice on your Data page, you need to follow the add-in instructions below.)
- Place the cursor in the ’Input Y range’ field and select the third column.
- Place the cursor in the ’Input X range’ field and select the first and second columns.
- Place the cursor in the ’Output Range’ field and click somewhere in a blank cell below and to the left of the table.
- Click ’Labels’ so that the names of the predictor variables will be displayed in the table.
- Click ’OK’, and the results shown below will be displayed.
Note: In Excel 2007, to add Data Analysis to your Data page, follow these steps. Click the Microsoft Office Button in the upper left, then click on Excel Options. Click on Add-ins, then highlight the Analysis ToolPak, click Go, make sure the Analysis ToolPak box is checked, and then click OK. The Data Analysis choice should now appear on your Excel Data page. Follow the remaining instructions above.
SUMMARY OUTPUT
Regression Statistics
\begin{align*}& \text{Multiple R} && 0.996822 \\ & \text{R Square } && 0.993654 \\ & \text{Adjusted R Square} && 0.990481 \\ & \text{Standard Error} && 1.244877\\ & \text{Observations} && 7\end{align*}
ANOVA
 | \begin{align*}df\end{align*} | \begin{align*}SS\end{align*} | \begin{align*}MS\end{align*} | \begin{align*}F\end{align*} | Significance \begin{align*}F\end{align*} |
---|---|---|---|---|---|
Regression | 2 | 970.6583 | 485.3291 | 313.1723 | 4.03E-05 |
Residual | 4 | 6.198878 | 1.549719 | | |
Total | 6 | 976.8571 | | | |
 | Coefficients | Standard Error | \begin{align*}t\end{align*} Stat | \begin{align*}P\end{align*}-value | Lower 95% | Upper 95% |
---|---|---|---|---|---|---|
Intercept | \begin{align*}-121.655\end{align*} | 6.540348 | \begin{align*}-18.6007\end{align*} | 4.92E-05 | \begin{align*}-139.814\end{align*} | \begin{align*}-103.496\end{align*} |
Temperature | 1.512364 | 0.060771 | 24.88626 | 1.55E-05 | 1.343636 | 1.681092 |
Practice Time | 12.53168 | 1.93302 | 6.482954 | 0.002918 | 7.164746 | 17.89862 |
In this example, we have a number of summary statistics that give us information about the regression equation. As you can see from the results above, we have the regression coefficient and standard error for each variable, as well as the value of \begin{align*}r^2\end{align*}. We can take all of the regression coefficients and put them together to make our equation.
Using the results above, our regression equation would be \begin{align*}\hat{Y} = -121.66 + 1.51 (\text{Temperature}) + 12.53 (\text{Practice Time})\end{align*}.
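As a cross-check on the summary output, the sketch below plugs the coach's data into the fitted equation, using the unrounded coefficients from the Excel table, and recomputes the residual sum of squares and the "Standard Error" entry from the Regression Statistics block. It assumes that this entry is the square root of the residual sum of squares divided by \begin{align*}n-k-1\end{align*}; the variable names are illustrative.

```python
from math import sqrt

# Data recorded by the coach.
temperature = [75, 83, 85, 85, 92, 97, 99]                 # degrees F
practice_time = [1.85, 1.25, 1.5, 1.75, 1.15, 1.75, 1.6]   # hours
water = [16, 20, 25, 27, 32, 48, 48]                       # ounces

# Coefficients copied from the Excel summary output.
intercept = -121.655
b_temperature = 1.512364
b_practice_time = 12.53168

# Fitted values and residuals (actual minus fitted).
fitted = [intercept + b_temperature * t + b_practice_time * p
          for t, p in zip(temperature, practice_time)]
residuals = [y - y_hat for y, y_hat in zip(water, fitted)]

# Residual sum of squares and the standard error of the regression.
sse = sum(e ** 2 for e in residuals)
n = len(water)   # number of observations
k = 2            # number of predictor variables
standard_error = sqrt(sse / (n - k - 1))

for y, y_hat, e in zip(water, fitted, residuals):
    print(f"actual = {y:2d}  fitted = {y_hat:6.2f}  residual = {e:6.2f}")
print(f"SSE = {sse:.4f}, standard error = {standard_error:.4f}")
```

The last line should come out close to the Residual SS (6.198878) and Standard Error (1.244877) entries in the Excel output; small differences come from the rounding of the copied coefficients.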
Each of the regression coefficients tells us something about the relationship between the predictor variable and the predicted outcome. The temperature coefficient of 1.51 tells us that for every 1.0-degree increase in temperature, we predict an increase of 1.51 ounces of water consumed, if we hold the practice time constant. Similarly, we find that with every one-hour increase in practice time, we predict players will consume an additional 12.53 ounces of water, if we hold the temperature constant. That equates to about 2.1 extra ounces of water for every 10-minute increase in practice time.
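As a quick illustration of using the equation to make a prediction, suppose (hypothetically; these values are not part of the coach's data) that a practice lasts 1.5 hours at a temperature of 90 degrees \begin{align*}F\end{align*}. Substituting into the rounded equation gives:
\begin{align*}\hat{Y} = -121.66 + 1.51(90) + 12.53(1.5) = -121.66 + 135.90 + 18.80 \approx 33.0 \text{ ounces}\end{align*}
The figure for a 10-minute increase comes from dividing the practice time coefficient by six, since 10 minutes is one-sixth of an hour:
\begin{align*}\frac{12.53}{6} \approx 2.09 \text{ ounces}\end{align*}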
With a value of 0.99 for \begin{align*}r^2\end{align*}, we can conclude that approximately 99% of the variance in the outcome variable, \begin{align*}Y\end{align*} (water consumption), can be explained by the combined predictor variables, temperature and practice time.
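The value of \begin{align*}r^2\end{align*} can be recovered from the sums of squares in the Excel output above. In the worked check below, \begin{align*}n=7\end{align*} is the number of observations and \begin{align*}k=2\end{align*} is the number of predictor variables. The adjusted formula is the standard textbook one and is assumed here to be what Excel reports as "Adjusted R Square"; it reproduces that entry for this example.
\begin{align*}r^2 & = \frac{SS_{\text{Regression}}}{SS_{\text{Total}}} = \frac{970.6583}{976.8571} \approx 0.9937\\ r^2_{\text{adj}} & = 1-(1-r^2)\left ( \frac{n-1}{n-k-1} \right ) = 1-(0.006346)\left ( \frac{6}{4} \right ) \approx 0.9905\end{align*}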
Testing for Significance to Evaluate a Hypothesis, the Standard Error of a Coefficient, and Constructing Confidence Intervals
When we perform multiple regression analysis, we are essentially trying to determine if our predictor variables explain the variation in the outcome variable, \begin{align*}Y\end{align*}. When we put together our final equation, we are looking at how much of the variation the predictor variables explain, as measured by \begin{align*}r^2\end{align*}, and whether this value of \begin{align*}r^2\end{align*} is statistically significant. We can use technological tools to conduct a hypothesis test of the significance of this value of \begin{align*}r^2\end{align*} and to construct confidence intervals around the regression coefficients.
Hypothesis Testing
When we conduct a hypothesis test, we test the null hypothesis that the multiple \begin{align*}r\end{align*}-value in the population equals zero, or \begin{align*}H_0 : r_{\text{pop}} = 0\end{align*}. Under this scenario, the predicted values, or fitted values, would all be very close to the mean, so the deviations, \begin{align*}\hat{Y}-\bar{Y}\end{align*}, and the sum of their squares would be close to 0. Therefore, we want to calculate a test statistic (in this case, the \begin{align*}F\end{align*}-statistic) that compares the variation explained by the predictor variables with the variation left unexplained. If this test statistic is beyond the critical value and the null hypothesis is rejected, we can conclude that there is a nonzero relationship between the criterion variable, \begin{align*}Y\end{align*}, and the predictor variables. When we reject the null hypothesis, we can say something like, “The probability of obtaining a value of \begin{align*}r^2\end{align*} at least this large by chance, if the null hypothesis were true, is less than 0.05 (or whatever the significance level happens to be).” As mentioned, we can use computer programs to determine the \begin{align*}F\end{align*}-statistic and its significance.
Let’s take a look at the example above and interpret the \begin{align*}F\end{align*}-statistic. We see that we have a very high value of \begin{align*}r^2\end{align*} of 0.99, which means that almost all of the variance in the outcome variable (water consumption) can be explained by the predictor variables (practice time and temperature). Our ANOVA (ANalysis Of VAriance) table tells us that we have a calculated \begin{align*}F\end{align*}-statistic of 313.17, which has an associated probability value of 4.03E-05. This means that, if the null hypothesis were true (i.e., none of the variance was explained in the population), the probability of obtaining an \begin{align*}F\end{align*}-statistic this large by chance would be 0.0000403. In other words, it is highly unlikely that this much explained variance arose by chance. \begin{align*}F\end{align*}-distributions will be discussed in greater detail in a later chapter.
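The \begin{align*}F\end{align*}-statistic in the Excel output can be rebuilt from the sums of squares and degrees of freedom, as the sketch below shows. The probability value is computed with a closed-form expression that holds only when the regression has exactly 2 degrees of freedom, as it does here; in general, a statistics package or an \begin{align*}F\end{align*}-table would be used instead. Variable names are illustrative.

```python
# Values copied from the Regression and Residual rows of the Excel output.
ss_regression = 970.6583
ss_residual = 6.198878
df_regression = 2   # k, the number of predictor variables
df_residual = 4     # n - k - 1 = 7 - 2 - 1

# Mean squares are sums of squares divided by their degrees of freedom.
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

# The F-statistic is the ratio of explained to unexplained mean squares.
f_statistic = ms_regression / ms_residual

# Upper-tail probability of the F-distribution.
# Special case: this shortcut is valid only for 2 numerator degrees of freedom.
assert df_regression == 2
p_value = (1 + 2 * f_statistic / df_residual) ** (-df_residual / 2)

print(f"F = {f_statistic:.4f}")
print(f"Significance F = {p_value:.3g}")
```

Both printed values should agree with the \begin{align*}F\end{align*} and Significance \begin{align*}F\end{align*} entries in the Excel output, up to rounding.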
Standard Error of a Coefficient and Testing for Significance
In addition to performing a test to assess whether the overall regression could have occurred by chance, we can also test the significance of individual coefficients. This is helpful in determining whether or not a given variable significantly contributes to the regression. For example, if we find that a variable does not significantly contribute to the regression, we may choose not to include it in the final regression equation. Again, we can use computer programs to determine the standard error, the test statistic, and its level of significance.
Example: Looking at our example above, we see that Excel has calculated the standard error and the test statistic (in this case, the \begin{align*}t\end{align*}-statistic) for each of the predictor variables. We see that temperature has a \begin{align*}t\end{align*}-statistic of 24.89 and a corresponding \begin{align*}P\end{align*}-value of 1.55E-05. We also see that practice time has a \begin{align*}t\end{align*}-statistic of 6.48 and a corresponding \begin{align*}P\end{align*}-value of 0.002918. For this situation, we will set \begin{align*}\alpha\end{align*} equal to 0.05. Since the \begin{align*}P\end{align*}-values for both variables are less than \begin{align*}\alpha=0.05\end{align*}, we can determine that both of these variables contribute significantly to explaining the variance of the outcome variable and should be included in the regression equation.
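Each \begin{align*}t\end{align*}-statistic in the Excel output is simply the coefficient divided by its standard error. The sketch below recomputes both and compares them with a critical value. It assumes a two-tailed test at \begin{align*}\alpha=0.05\end{align*} with 4 degrees of freedom (\begin{align*}n-k-1\end{align*}), for which a standard \begin{align*}t\end{align*}-table gives approximately 2.776. Comparing with the critical value is an equivalent alternative to comparing the \begin{align*}P\end{align*}-value with \begin{align*}\alpha\end{align*}.

```python
# (coefficient, standard error) pairs copied from the Excel output.
estimates = {
    "Temperature": (1.512364, 0.060771),
    "Practice Time": (12.53168, 1.93302),
}

# Two-tailed critical value of t for alpha = 0.05 and 4 degrees of freedom,
# taken from a standard t-table (an assumption of this sketch).
T_CRITICAL = 2.776

t_stats = {}
for name, (coefficient, standard_error) in estimates.items():
    # The test statistic is the coefficient measured in standard-error units.
    t_stats[name] = coefficient / standard_error
    is_significant = abs(t_stats[name]) > T_CRITICAL
    verdict = "significant" if is_significant else "not significant"
    print(f"{name}: t = {t_stats[name]:.2f} ({verdict} at alpha = 0.05)")
```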
Calculating the Confidence Interval for a Coefficient
We can also use technological tools to build a confidence interval around our regression coefficients. Remember, earlier in the chapter we calculated confidence intervals around certain values in linear regression models. However, this concept is a bit different when we work with multiple regression models.
For a predictor variable in multiple regression, the confidence interval is based on a \begin{align*}t\end{align*}-test and is the range around the observed sample regression coefficient within which we can be 95% (or any other predetermined level) confident that the real regression coefficient for the population lies. In this example, we can say that we are 95% confident that the population regression coefficient for temperature is between 1.34 (the Lower 95% entry) and 1.68 (the Upper 95% entry). In addition, we are 95% confident that the population regression coefficient for practice time is between 7.16 and 17.90.
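The Lower 95% and Upper 95% entries can be reproduced by hand as the coefficient plus or minus a critical \begin{align*}t\end{align*}-value times the standard error. The sketch below does this, assuming the critical value for 95% confidence with 4 degrees of freedom is about 2.776445 (from a \begin{align*}t\end{align*}-table or calculator). The names are illustrative.

```python
# (coefficient, standard error) pairs copied from the Excel output.
estimates = {
    "Temperature": (1.512364, 0.060771),
    "Practice Time": (12.53168, 1.93302),
}

# Critical t-value for a 95% confidence level with n - k - 1 = 4 degrees
# of freedom (an assumption of this sketch, taken from a t-table).
T_CRITICAL = 2.776445

intervals = {}
for name, (coefficient, standard_error) in estimates.items():
    margin = T_CRITICAL * standard_error   # half-width of the interval
    intervals[name] = (coefficient - margin, coefficient + margin)
    lower, upper = intervals[name]
    print(f"{name}: 95% CI from {lower:.2f} to {upper:.2f}")
```

The endpoints should agree with the Lower 95% and Upper 95% columns of the Excel output, up to rounding of the copied values.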
Lesson Summary
In multiple linear regression, scores for the criterion variable are predicted using multiple predictor variables. The regression equation we use for two predictor variables, \begin{align*}X_1\end{align*} and \begin{align*}X_2\end{align*}, is as follows:
\begin{align*}\hat{Y}=b_1X_1 + b_2X_2 + a\end{align*}
When calculating the different parts of the multiple regression equation, we can use a number of computer programs, such as Microsoft Excel, SPSS, and SAS.
These programs calculate the multiple regression coefficients, the combined value of \begin{align*}r^2\end{align*}, and the confidence intervals for the regression coefficients.
On the Web
www.wku.edu/~david.neal/web1.html
Manuals by a professor at Western Kentucky University for use in statistics, plus TI-83/84 programs for multiple regression that are available for download.
http://education.ti.com/educationportal/activityexchange/activity_list.do
Texas Instruments website that includes supplemental activities and practice problems using the TI-83 calculator.
Review Questions
- A lead English teacher is trying to determine the relationship between three tests given throughout the semester and the final exam. She decides to conduct a mini-study on this relationship and collects the test data (scores for Test 1, Test 2, Test 3, and the final exam) for 50 students in freshman English. She enters these data into Microsoft Excel and arrives at the following summary statistics:
\begin{align*}& \text{Multiple R} && 0.6859 \\ & \text{R Square} && 0.4707 \\ & \text{Adjusted R Square} && 0.4369 \\ & \text{Standard Error} && 7.5718 \\ & \text{Observations} && 50\end{align*}
 | \begin{align*}df\end{align*} | \begin{align*}SS\end{align*} | \begin{align*}MS\end{align*} | \begin{align*}F\end{align*} | Significance \begin{align*}F\end{align*} |
---|---|---|---|---|---|
Regression | 3 | 2342.7228 | 780.9076 | 13.621 | 0.0000 |
Residual | 46 | 2637.2772 | 57.3321 | | |
Total | 49 | 4980.0000 | | | |
 | Coefficients | Standard Error | \begin{align*}t\end{align*} Stat | \begin{align*}P\end{align*}-value |
---|---|---|---|---|
Intercept | 10.7592 | 7.6268 | | |
Test 1 | 0.0506 | 0.1720 | 0.2941 | 0.7700 |
Test 2 | 0.5560 | 0.1431 | 3.885 | 0.0003 |
Test 3 | 0.2128 | 0.1782 | 1.194 | 0.2387 |
(a) How many predictor variables are there in this scenario? What are the names of these predictor variables?
(b) What does the regression coefficient for Test 2 tell us?
(c) What is the regression model for this analysis?
(d) What is the value of \begin{align*}r^2\end{align*}, and what does it indicate?
(e) Determine whether the multiple \begin{align*}r\end{align*}-value is statistically significant.
(f) Which of the predictor variables are statistically significant? What is the reasoning behind this decision?
(g) Given this information, would you include all three predictor variables in the multiple regression model? Why or why not?
Keywords
Bivariate data
Coefficient of determination
Conditional distribution
Correlation
Correlation coefficient
Criterion variable
Curvilinear relationship
\begin{align*}e\end{align*}
\begin{align*}F\end{align*}-statistic
Homoscedasticity
Least squares line
Line of best fit
Linear regression
Linear regression line
Magnitude
Method of least squares
Multiple regression
Near-zero correlation
Negative correlation
Outcome variable
Outlier
Pearson product-moment correlation coefficient
Perfect correlation
Positive correlation
Predictor variable
\begin{align*}r\end{align*}
\begin{align*}r^2\end{align*}
Regression coefficient
Regression constant
Residual values
Scatterplots
Transformation
Zero correlation
Date Created: Feb 23, 2012
Last Modified: Aug 11, 2015