9.4: Multiple Regression
Learning Objectives
- Understand a multiple regression equation and the coefficients of determination for correlation of three or more variables.
- Calculate a multiple regression equation using technological tools.
- Calculate the standard error of a coefficient, test a coefficient for significance to evaluate a hypothesis, and calculate the confidence interval for a coefficient using technological tools.
Introduction
In the previous sections, we learned a bit about examining the relationship between two variables by calculating the correlation coefficient and the linear regression line. But, as we all know, we often work with more than two variables. For example, what happens if we want to examine the impact that class size and number of faculty members have on a university's ranking? Since we are taking multiple predictor variables into account, the linear regression model with a single predictor is no longer sufficient. In multiple linear regression, scores for one variable (in this example, a university's ranking) are predicted using multiple predictor variables (class size and number of faculty members).
Another common use of multiple regression models is in the estimation of the selling price of a home. There are a number of variables that go into determining how much a particular house will cost, including the square footage, the number of bedrooms, the number of bathrooms, the age of the house, the neighborhood, and so on. Analysts use multiple regression to estimate the selling price in relation to all of these different types of variables.
In this section, we will examine the components of a multiple regression equation, calculate an equation using technological tools, and use this equation to test for significance in order to evaluate a hypothesis.
Understanding a Multiple Regression Equation
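If we were to try to draw a multiple regression model, it would be a bit more difficult than drawing a model for linear regression. Let’s say that we have two predictor variables, \begin{align*}X_1\end{align*} and \begin{align*}X_2\end{align*}, that are used to predict the variable \begin{align*}Y\end{align*}. The regression equation would be as follows: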
\begin{align*}\hat{Y}=b_1X_1 + b_2X_2 + a\end{align*}
When there are two predictor variables, the scores must be plotted in three dimensions (see figure below). When there are more than two predictor variables, we would continue to plot these in multiple dimensions. Regardless of how many predictor variables there are, we still use the least squares method to try to minimize the distance between the actual and predicted values.
When predicting values using multiple regression, we first use the standard score form of the regression equation, which is shown below:
\begin{align*}\hat{Y} = \beta_1X_1 + \beta_2X_2 + \ldots + \beta_iX_i\end{align*}
where:
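\begin{align*}\hat{Y}\end{align*} is the predicted value of the criterion variable, in standard score form.
\begin{align*}\beta_i\end{align*} is the \begin{align*}i^{\text{th}}\end{align*} beta coefficient (standardized regression coefficient).
\begin{align*}X_i\end{align*} is the \begin{align*}i^{\text{th}}\end{align*} predictor variable, in standard score form.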
To solve for the regression coefficients and the regression constant, we first need to determine the correlation coefficients, \begin{align*}r\end{align*}, between each pair of variables.
In most situations, we use a computer to calculate the multiple regression equation and determine the coefficients in this equation. We can also do multiple regression on a TI-83/84 calculator. (This program can be downloaded.)
Technology Note: Multiple Regression Analysis on the TI-83/84 Calculator
http://www.wku.edu/~david.neal/manual/ti83.html
Download a program for multiple regression analysis on the TI-83/84 calculator by first clicking on the link above.
It is helpful to explain the calculations that go into a multiple regression equation so we can get a better understanding of how this formula works.
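After we find the correlation values, \begin{align*}r\end{align*}, between each pair of variables, we can use the following formulas to determine the beta coefficients for the two predictor variables, \begin{align*}X_1\end{align*} and \begin{align*}X_2\end{align*}: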
\begin{align*}\beta_1 & = \frac{r_{Y1} - (r_{Y2}) (r_{12})}{1-r^2_{12}}\\
\beta_2 & = \frac{r_{Y2}-(r_{Y1}) (r_{12})}{1-r^2_{12}}\end{align*}
where:
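\begin{align*}\beta_1\end{align*} is the beta coefficient for the first predictor variable, \begin{align*}X_1\end{align*}.
\begin{align*}\beta_2\end{align*} is the beta coefficient for the second predictor variable, \begin{align*}X_2\end{align*}.
\begin{align*}r_{Y1}\end{align*} is the correlation between the criterion variable, \begin{align*}Y\end{align*}, and the first predictor variable, \begin{align*}X_1\end{align*}.
\begin{align*}r_{Y2}\end{align*} is the correlation between the criterion variable, \begin{align*}Y\end{align*}, and the second predictor variable, \begin{align*}X_2\end{align*}.
\begin{align*}r_{12}\end{align*} is the correlation between the two predictor variables, \begin{align*}X_1\end{align*} and \begin{align*}X_2\end{align*}.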
After solving for the beta coefficients, we can then compute the \begin{align*}b\end{align*} coefficients by using the following formulas:
\begin{align*}b_1 & = \beta_1\left ( \frac{s_Y}{s_1} \right )\\
b_2 & = \beta_2 \left ( \frac{s_Y}{s_2} \right )\end{align*}
where:
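\begin{align*}s_Y\end{align*} is the standard deviation of the criterion variable, \begin{align*}Y\end{align*}.
\begin{align*}s_1\end{align*} is the standard deviation of the first predictor variable, \begin{align*}X_1\end{align*}.
\begin{align*}s_2\end{align*} is the standard deviation of the second predictor variable, \begin{align*}X_2\end{align*}.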
After solving for the regression coefficients, we can finally solve for the regression constant by using the formula shown below, where \begin{align*}k\end{align*} is the number of predictor variables:
\begin{align*}a=\bar{y}-\sum_{i=1}^k b_i\bar{x}_i\end{align*}
Again, since these formulas and calculations are extremely tedious to complete by hand, we usually use a computer or a TI-83/84 calculator to solve for the coefficients in a multiple regression equation.
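To make the formulas above concrete, here is a minimal Python sketch that applies them to the water-consumption data from the example later in this section. The variable names (`temp`, `hours`, `water`) and the helper functions `mean`, `sd`, and `corr` are illustrative choices, not part of the lesson, and the two-predictor formulas are used exactly as written above. It is one possible way to carry out the hand calculation, not the method any particular program uses.

```python
import math

# Data from the water-consumption example in this section.
temp  = [75, 83, 85, 85, 92, 97, 99]              # X1: temperature (degrees F)
hours = [1.85, 1.25, 1.5, 1.75, 1.15, 1.75, 1.6]  # X2: practice time (hrs)
water = [16, 20, 25, 27, 32, 48, 48]              # Y: water consumed (ounces)

def mean(v):
    return sum(v) / len(v)

def sd(v):
    # Sample standard deviation (n - 1 in the denominator).
    m = mean(v)
    return math.sqrt(sum((x - m) ** 2 for x in v) / (len(v) - 1))

def corr(u, v):
    # Pearson correlation coefficient r between two lists.
    mu, mv = mean(u), mean(v)
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = math.sqrt(sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v))
    return num / den

r_y1, r_y2, r_12 = corr(water, temp), corr(water, hours), corr(temp, hours)

# Beta (standardized) coefficients for two predictors.
beta1 = (r_y1 - r_y2 * r_12) / (1 - r_12 ** 2)
beta2 = (r_y2 - r_y1 * r_12) / (1 - r_12 ** 2)

# Convert to b coefficients, then solve for the regression constant a.
b1 = beta1 * sd(water) / sd(temp)
b2 = beta2 * sd(water) / sd(hours)
a = mean(water) - (b1 * mean(temp) + b2 * mean(hours))

print(round(b1, 4), round(b2, 4), round(a, 3))
```

The printed values should agree closely with the coefficients in the Excel summary output shown later in this section.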
Calculating a Multiple Regression Equation using Technological Tools
As mentioned, there are a variety of technological tools available to calculate the coefficients in a multiple regression equation. Several computer programs can calculate the multiple regression equation, including Microsoft Excel, the Statistical Analysis System (SAS), and the Statistical Package for the Social Sciences (SPSS). Each of these programs allows the user to calculate the multiple regression equation and provides summary statistics for each of the models.
For the purposes of this lesson, we will use summary tables produced by Microsoft Excel to solve problems with multiple regression equations. While the summary tables produced by the different technological tools differ slightly in format, they all provide us with the information needed to build a multiple regression equation, conduct hypothesis tests, and construct confidence intervals. Let’s take a look at an example of a summary statistics table so we get a better idea of how we can use technological tools to build multiple regression equations.
Example: Suppose we want to predict the amount of water consumed by football players during summer practices. The football coach notices that the water consumption tends to be influenced by the time that the players are on the field and by the temperature. He measures the average water consumption, temperature, and practice time for seven practices and records the following data:
Temperature (degrees \begin{align*}F\end{align*}) | Practice Time (hrs) | \begin{align*}H_2O\end{align*} Consumption (in ounces) |
---|---|---|
75 | 1.85 | 16 |
83 | 1.25 | 20 |
85 | 1.5 | 25 |
85 | 1.75 | 27 |
92 | 1.15 | 32 |
97 | 1.75 | 48 |
99 | 1.6 | 48 |
Figure: Water consumption by football players compared to practice time and temperature.
Technology Note: Using Excel for Multiple Regression
- Copy and paste the table into an empty Excel worksheet.
- Select 'Data Analysis' from the Tools menu and choose 'Regression' from the list that appears.
- Place the cursor in the 'Input Y range' field and select the third column.
- Place the cursor in the 'Input X range' field and select the first and second columns.
- Place the cursor in the 'Output Range' field and click somewhere in a blank cell below and to the left of the table.
- Click 'Labels' so that the names of the predictor variables will be displayed in the table.
- Click 'OK', and the results shown below will be displayed.
SUMMARY OUTPUT
Regression Statistics
\begin{align*}& \text{Multiple R} && 0.996822 \\ & \text{R Square } && 0.993654 \\ & \text{Adjusted R Square} && 0.990481 \\ & \text{Standard Error} && 1.244877\\ & \text{Observations} && 7\end{align*}
Source | \begin{align*}Df\end{align*} | \begin{align*}SS\end{align*} | \begin{align*}MS\end{align*} | \begin{align*}F\end{align*} | Significance \begin{align*}F\end{align*} |
---|---|---|---|---|---|
Regression | 2 | 970.6583 | 485.3291 | 313.1723 | 4.03E-05 |
Residual | 4 | 6.198878 | 1.549719 | | |
Total | 6 | 976.8571 | | | |
Term | Coefficients | Standard Error | \begin{align*}t\end{align*} Stat | \begin{align*}P\end{align*}-value | Lower 95% | Upper 95% |
---|---|---|---|---|---|---|
Intercept | \begin{align*}-121.655\end{align*} | 6.540348 | \begin{align*}-18.6007\end{align*} | 4.92E-05 | \begin{align*}-139.814\end{align*} | \begin{align*}-103.496\end{align*} |
Temperature | 1.512364 | 0.060771 | 24.88626 | 1.55E-05 | 1.343636 | 1.681092 |
Practice Time | 12.53168 | 1.93302 | 6.482954 | 0.002918 | 7.164746 | 17.89862 |
In this example, we have a number of summary statistics that give us information about the regression equation. As you can see from the results above, we have the regression coefficient and standard error for each variable, as well as the value of \begin{align*}r^2\end{align*}. We can take all of the regression coefficients and put them together to make our equation.
Using the results above, our regression equation would be \begin{align*}\hat{Y} = -121.66 + 1.51 (\text{Temperature}) + 12.53 (\text{Practice Time})\end{align*}.
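For readers without Excel, the same coefficients can be reproduced in a few lines of code. The following is a minimal sketch in plain Python that solves the least squares problem through the normal equations; the function name `fit_least_squares` is a hypothetical helper written for this illustration, and this is not a description of how Excel computes its results internally.

```python
# Data from the table above.
temp  = [75, 83, 85, 85, 92, 97, 99]
hours = [1.85, 1.25, 1.5, 1.75, 1.15, 1.75, 1.6]
water = [16, 20, 25, 27, 32, 48, 48]

def fit_least_squares(columns, y):
    """Return [a, b1, b2, ...] minimizing the sum of squared residuals."""
    # Each row of the design matrix is [1, x1, x2, ...]; the 1 carries the constant a.
    rows = [[1.0] + [float(col[i]) for col in columns] for i in range(len(y))]
    p = len(rows[0])
    # Normal equations (X'X) coef = X'y, stored as an augmented matrix.
    m = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(m[r][c]))
        m[c], m[pivot] = m[pivot], m[c]
        for r in range(p):
            if r != c:
                factor = m[r][c] / m[c][c]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[c])]
    return [m[i][p] / m[i][i] for i in range(p)]

a, b_temp, b_time = fit_least_squares([temp, hours], water)
print(f"Y-hat = {a:.2f} + {b_temp:.2f}(Temperature) + {b_time:.2f}(Practice Time)")
```

Solving the normal equations directly is fine for a small example like this one; statistical software generally relies on more numerically robust methods.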
Each of the regression coefficients tells us something about the relationship between the predictor variable and the predicted outcome. The temperature coefficient of 1.51 tells us that for every 1-degree increase in temperature, we predict an increase of about 1.5 ounces of water consumed, if we hold the practice time constant. Similarly, the practice time coefficient of 12.53 is expressed per hour, so with every 10-minute increase in practice time (one-sixth of an hour), we predict players will consume an additional \begin{align*}12.53/6 \approx 2.1\end{align*} ounces of water, if we hold the temperature constant.
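The interpretation of the coefficients can be checked numerically. In this short sketch, the function name `predict_water` and the baseline practice (85 degrees, 1.5 hours, one of the rows in the data table) are choices made for illustration; the coefficients are copied from the Excel output.

```python
# Coefficients copied from the Excel summary output above.
A, B_TEMP, B_TIME = -121.655, 1.512364, 12.53168

def predict_water(temp_f, practice_hrs):
    """Predicted water consumption in ounces."""
    return A + B_TEMP * temp_f + B_TIME * practice_hrs

base = predict_water(85, 1.5)

# One degree warmer, practice time held constant.
one_degree_warmer = predict_water(86, 1.5) - base

# Ten minutes (1/6 hour) longer, temperature held constant.
ten_minutes_longer = predict_water(85, 1.5 + 10 / 60) - base

print(round(one_degree_warmer, 2), round(ten_minutes_longer, 2))
```

The two differences come out to about 1.51 and 2.09 ounces, and they do not depend on which baseline practice is chosen, because the model is linear in each predictor.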
With a value of 0.99 for \begin{align*}r^2\end{align*}, we can conclude that approximately 99% of the variance in the outcome variable, \begin{align*}Y\end{align*} (water consumption), can be explained by the combined predictor variables, temperature and practice time. Notice that the adjusted value of \begin{align*}r^2\end{align*} is only slightly different from the unadjusted value of \begin{align*}r^2\end{align*}. The adjustment penalizes a model for using more predictor variables relative to the number of observations. Here there are only two predictor variables and very little unexplained variance, so the penalty is small even with just seven observations.
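Both values of \begin{align*}r^2\end{align*} can be recovered from the ANOVA table. The sketch below assumes the usual adjustment formula, \begin{align*}1 - (1 - r^2)\frac{n-1}{n-k-1}\end{align*}, which the lesson does not state explicitly; the variable names are illustrative.

```python
# Sums of squares from the ANOVA table above.
ss_regression = 970.6583
ss_total = 976.8571
n = 7   # observations
k = 2   # predictor variables

# Proportion of the total variance explained by the regression.
r_squared = ss_regression / ss_total

# Adjusted value (assumed standard formula; see the lead-in).
adjusted_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

print(round(r_squared, 4), round(adjusted_r_squared, 4))
```

Because \begin{align*}1 - r^2\end{align*} is so small here, multiplying it by \begin{align*}6/4\end{align*} changes the result only in the third decimal place.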
Testing for Significance to Evaluate a Hypothesis, the Standard Error of a Coefficient, and Constructing Confidence Intervals
When we perform multiple regression analysis, we are essentially trying to determine if our predictor variables explain the variation in the outcome variable, \begin{align*}Y\end{align*}. When we put together our final equation, we are looking at whether or not the variables explain most of the variation, \begin{align*}r^2\end{align*}, and if this value of \begin{align*}r^2\end{align*} is statistically significant. We can use technological tools to conduct a hypothesis test, testing the significance of this value of \begin{align*}r^2\end{align*}, and construct confidence intervals around these results.
Hypothesis Testing
When we conduct a hypothesis test, we test the null hypothesis that the multiple \begin{align*}r\end{align*}-value in the population equals zero, or \begin{align*}H_0 : r_{\text{pop}} = 0\end{align*}. Under this scenario, the predicted values, or fitted values, would all be very close to the mean, so the deviations, \begin{align*}\hat{Y}-\bar{Y}\end{align*}, and their sum of squares would be close to 0. Therefore, we want to calculate a test statistic (in this case, the \begin{align*}F\end{align*}-statistic) that compares the variance explained by the regression with the variance left unexplained. If this test statistic is beyond the critical value and the null hypothesis is rejected, we can conclude that there is a nonzero relationship between the criterion variable, \begin{align*}Y\end{align*}, and the predictor variables. When we reject the null hypothesis, we can say something like, “The probability that a value of \begin{align*}r^2\end{align*} at least as large as the one obtained would have occurred by chance if the null hypothesis were true is less than 0.05 (or whatever the significance level happens to be).” As mentioned, we can use computer programs to determine the \begin{align*}F\end{align*}-statistic and its significance.
Let’s take a look at the example above and interpret the \begin{align*}F\end{align*}-statistic. We see that we have a very high value of \begin{align*}r^2\end{align*} of 0.99, which means that almost all of the variance in the outcome variable (water consumption) can be explained by the predictor variables (practice time and temperature). Our ANOVA (ANalysis Of VAriance) table tells us that we have a calculated \begin{align*}F\end{align*}-statistic of 313.17, which has an associated probability value of 4.03E-05. This means that the probability of obtaining an \begin{align*}F\end{align*}-statistic this large if the null hypothesis were true (i.e., if none of the variance were explained in the population) is 0.0000403. In other words, it is highly unlikely that this much explained variance arose by chance. \begin{align*}F\end{align*}-distributions will be discussed in greater detail in a later chapter.
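The \begin{align*}F\end{align*}-statistic and its probability value can be rebuilt from the ANOVA table. In this sketch, the closed-form expression for the upper-tail probability is a special case that holds only when the regression has exactly 2 degrees of freedom, as it does in this example; for other cases a statistical library (for instance, `scipy.stats.f.sf`) would be the usual choice. Variable names are illustrative.

```python
# Values from the ANOVA table above.
ss_regression, df_regression = 970.6583, 2
ss_residual, df_residual = 6.198878, 4

# Mean squares: each sum of squares divided by its degrees of freedom.
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

# F compares explained variance with unexplained variance.
f_stat = ms_regression / ms_residual

# Upper-tail probability of the F-distribution. This closed form is valid
# only when the regression has exactly 2 degrees of freedom.
p_value = (1 + 2 * f_stat / df_residual) ** (-df_residual / 2)

print(round(f_stat, 2), f"{p_value:.2e}")
```

The results should match the F and Significance F entries in the table to the precision shown there.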
Standard Error of a Coefficient and Testing for Significance
In addition to performing a test to assess the probability of the regression line occurring by chance, we can also test the significance of individual coefficients. This is helpful in determining whether or not the variable significantly contributes to the regression. For example, if we find that a variable does not significantly contribute to the regression, we may choose not to include it in the final regression equation. Again, we can use computer programs to determine the standard error, the test statistic, and its level of significance.
Example: Looking at our example above, we see that Excel has calculated the standard error and the test statistic (in this case, the \begin{align*}t\end{align*}-statistic) for each of the predictor variables. We see that temperature has a \begin{align*}t\end{align*}-statistic of 24.89 and a corresponding \begin{align*}P\end{align*}-value of 1.55E-05. We also see that practice time has a \begin{align*}t\end{align*}-statistic of 6.48 and a corresponding \begin{align*}P\end{align*}-value of 0.002918. For this situation, we will set \begin{align*}\alpha\end{align*} equal to 0.05. Since the \begin{align*}P\end{align*}-values for both variables are less than \begin{align*}\alpha=0.05\end{align*}, we can determine that both of these variables significantly contribute to explaining the variance in the outcome variable and should be included in the regression equation.
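Each \begin{align*}t\end{align*}-statistic in the Excel output is simply the coefficient divided by its standard error. The sketch below recomputes them and applies the decision rule with \begin{align*}\alpha=0.05\end{align*}; the \begin{align*}P\end{align*}-values are copied from the table rather than computed, and the dictionary layout is just one convenient arrangement.

```python
# (coefficient, standard error, P-value) copied from the Excel output above.
results = {
    "Temperature":   (1.512364, 0.060771, 1.55e-05),
    "Practice Time": (12.53168, 1.93302, 0.002918),
}
alpha = 0.05

t_stats = {}
significant = {}
for name, (coef, std_err, p_value) in results.items():
    t_stats[name] = coef / std_err        # t = coefficient / standard error
    significant[name] = p_value < alpha   # True means we reject H0: coefficient = 0
    print(f"{name}: t = {t_stats[name]:.2f}, significant = {significant[name]}")
```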
Calculating the Confidence Interval for a Coefficient
We can also use technological tools to build a confidence interval around our regression coefficients. Remember, earlier in the chapter we calculated confidence intervals around certain values in linear regression models. However, this concept is a bit different when we work with multiple regression models.
For a predictor variable in multiple regression, the confidence interval is based on a \begin{align*}t\end{align*}-test and is the range around the observed sample regression coefficient within which we can be 95% (or any other predetermined level) confident that the real regression coefficient for the population lies. In this example, we can say that we are 95% confident that the population regression coefficient for temperature is between 1.34 (the Lower 95% entry) and 1.68 (the Upper 95% entry). In addition, we are 95% confident that the population regression coefficient for practice time is between 7.16 and 17.90.
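The Lower 95% and Upper 95% entries can be reproduced as the coefficient plus or minus a critical \begin{align*}t\end{align*} value times the standard error. In this sketch, the critical value for 4 residual degrees of freedom is taken from a \begin{align*}t\end{align*}-table rather than computed, and the function name `confidence_interval` is a hypothetical helper.

```python
# Critical t value for 95% confidence with 4 residual degrees of freedom
# (n - k - 1 = 7 - 2 - 1), taken from a t-table.
T_CRIT = 2.776445

def confidence_interval(coef, std_err, t_crit=T_CRIT):
    """Return (lower, upper) bounds: coef -/+ t_crit * std_err."""
    margin = t_crit * std_err
    return coef - margin, coef + margin

# Coefficients and standard errors copied from the Excel output above.
temp_low, temp_high = confidence_interval(1.512364, 0.060771)
time_low, time_high = confidence_interval(12.53168, 1.93302)

print(round(temp_low, 2), round(temp_high, 2))
print(round(time_low, 2), round(time_high, 2))
```

Neither interval contains zero, which is consistent with both coefficients being significant at the 0.05 level.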
Lesson Summary
In multiple linear regression, scores for the criterion variable are predicted using multiple predictor variables. The regression equation we use for two predictor variables, \begin{align*}X_1\end{align*} and \begin{align*}X_2\end{align*}, is as follows:
\begin{align*}\hat{Y}=b_1X_1 + b_2X_2 + a\end{align*}
When calculating the different parts of the multiple regression equation, we can use a number of computer programs, such as Microsoft Excel, SPSS, and SAS.
These programs calculate the multiple regression coefficients, the combined value of \begin{align*}r^2\end{align*}, and the confidence intervals for the regression coefficients.
On the Web
www.wku.edu/~david.neal/web1.html
Manuals by a professor at Western Kentucky University for use in statistics, plus TI-83/84 programs for multiple regression that are available for download.
http://education.ti.com/educationportal/activityexchange/activity_list.do
Texas Instruments website that includes supplemental activities and practice problems using the TI-83 calculator.
Review Questions
- A lead English teacher is trying to determine the relationship between three tests given throughout the semester and the final exam. She decides to conduct a mini-study on this relationship and collects the test data (scores for Test 1, Test 2, Test 3, and the final exam) for 50 students in freshman English. She enters these data into Microsoft Excel and arrives at the following summary statistics:
\begin{align*}& \text{Multiple R} && 0.6859 \\ & \text{R Square} && 0.4707 \\ & \text{Adjusted R Square} && 0.4369 \\ & \text{Standard Error} && 7.5718 \\ & \text{Observations} && 50\end{align*}
Source | \begin{align*}Df\end{align*} | \begin{align*}SS\end{align*} | \begin{align*}MS\end{align*} | \begin{align*}F\end{align*} | Significance \begin{align*}F\end{align*} |
---|---|---|---|---|---|
Regression | 3 | 2342.7228 | 780.9076 | 13.621 | 0.0000 |
Residual | 46 | 2637.2772 | 57.3321 | | |
Total | 49 | 4980.0000 | | | |
Term | Coefficients | Standard Error | \begin{align*}t\end{align*} Stat | \begin{align*}P\end{align*}-value |
---|---|---|---|---|
Intercept | 10.7592 | 7.6268 | ||
Test 1 | 0.0506 | 0.1720 | 0.2941 | 0.7700 |
Test 2 | 0.5560 | 0.1431 | 3.885 | 0.0003 |
Test 3 | 0.2128 | 0.1782 | 1.194 | 0.2387 |
(a) How many predictor variables are there in this scenario? What are the names of these predictor variables?
(b) What does the regression coefficient for Test 2 tell us?
(c) What is the regression model for this analysis?
(d) What is the value of \begin{align*}r^2\end{align*}, and what does it indicate?
(e) Determine whether the multiple \begin{align*}r\end{align*}-value is statistically significant.
(f) Which of the predictor variables are statistically significant? What is the reasoning behind this decision?
(g) Given this information, would you include all three predictor variables in the multiple regression model? Why or why not?
Keywords
- Bivariate data
- Bivariate data is primarily examined to show some sort of relationship between two variables. Bivariate data usually has an independent variable and a dependent variable.
- Coefficient of determination
- The coefficient of determination, \begin{align*}R^2\end{align*}, is the proportion of the variance in the outcome variable that is explained by the predictor variables. It is used in the context of statistical models whose main purpose is the prediction of outcomes on the basis of other related information.
- Conditional distribution
- The conditional distribution is used to determine the percentage of \begin{align*}Y\end{align*} values above or below a certain value that are associated with a specific value of \begin{align*}X\end{align*}.
- Correlation
- Correlation describes the direction and strength of the relationship between two variables.
- Correlation coefficient \begin{align*}(r)\end{align*}
- The correlation coefficient is a single number that can be used to express correlation.
- Criterion variable
- The criterion variable is the variable whose scores are predicted using one or more predictor variables.
- Curvilinear relationship
- Curvilinear relationships are nonlinear relationships. Just because they are nonlinear does not mean they don’t have a strong correlation.
- \begin{align*}e\end{align*}
- In regression, we try to minimize the distances of all data points to the regression line. These distances are called the error, \begin{align*}e\end{align*}, and are also known as the residual values.
- \begin{align*}F\end{align*}-statistic
- The \begin{align*}F\end{align*}-statistic is the test statistic used to test whether the predictor variables, taken together, explain a significant share of the variance in the outcome variable.
- Homoscedasticity
- The standard deviations and the variances of each of these distributions for each of the predicted values are equal. This is called homoscedasticity.
- Least squares line
- If we could draw a straight line through the data, it would, in theory, represent the change in one variable associated with the change in the other. This line is called the least squares line, or the linear regression line.
- Line of best fit
- A line of best fit is a line drawn through the data to summarize its trend, and there are many ways one could define this best fit. Statisticians define this line to be the one that minimizes the sum of the squared distances from the observed data to the line.
- Linear regression
- Linear regression involves using data to calculate a line that best fits that data and then using that line to predict scores.
- Linear regression line
- If we could draw a straight line through the data, it would, in theory, represent the change in one variable associated with the change in the other. This line is called the least squares line, or the linear regression line.
- Magnitude
- The absolute value of the correlation coefficient indicates the magnitude, or strength, of the relationship.
- Method of least squares
- This method of fitting the data line so that there is minimal difference between the observations and the line is called the method of least squares.
- Multiple regression
- In multiple regression, scores for one variable are predicted using multiple predictor variables. The multiple regression equation is chosen to minimize the sum of the squared deviations from the observations to the regression plane.
- Near-zero correlation
- A scatterplot in which the points do not have a linear trend (either positive or negative) is called a zero correlation or a near-zero correlation
- Negative correlation
- A relationship in which one variable tends to decrease as the other increases. It is indicated by a negative correlation coefficient.
- Outcome variable
- In regression, we use one variable (the predictor variable) to predict the outcome of another (the outcome variable, or criterion variable).
- Outlier
- An outlier is an extreme observation that does not fit the general correlation or regression pattern
- Pearson product-moment correlation coefficient
- The Pearson product-moment correlation coefficient is a statistic that is used to measure the strength and direction of a linear correlation.
- Perfect correlation
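- When all the points on a scatterplot lie on a straight line, the two variables have a perfect correlation.
- Positive correlation
- When the points on a scatterplot produce a lower-left-to-upper-right pattern, the two variables have a positive correlation.
- Predictor variable
- A predictor variable is a variable whose values are substituted into the regression equation to predict values of the outcome variable.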
- \begin{align*}r\end{align*}
- \begin{align*}r\end{align*} is the correlation between the variables \begin{align*}X\end{align*} and \begin{align*}Y\end{align*}.
- \begin{align*}r^2\end{align*}
- \begin{align*}r^2\end{align*} is the coefficient of determination: the proportion of the variance in the outcome variable that is explained by the predictor variables.
- Regression coefficient
- The regression coefficient explains the nature of the relationship between the two variables.
- Regression constant
- The regression constant is also the \begin{align*}y\end{align*}-intercept and is the place where the line crosses the \begin{align*}y\end{align*}-axis.
- Residual values
- In regression, we try to minimize the distances of all data points to the regression line. These distances are called the error, \begin{align*}e\end{align*}, and are also known as the residual values.
- Scatterplots
- A graph where each point represents a pair of measurements (two variables).
- Transformation
- When the relationship between two variables is not linear, we can perform a transformation of the data to achieve a linear relationship.
- Zero correlation
- A scatterplot in which the points do not have a linear trend (either positive or negative) is called a zero correlation or a near-zero correlation