In this Concept, you will learn about the multiple regression equation and the coefficients of determination for correlations among three or more variables. You will also learn to use technological tools to calculate a multiple regression equation, find the standard error of a coefficient, test a coefficient for significance to evaluate a hypothesis, and construct the confidence interval for a coefficient.
For an example of multiple regression, see xeriland, Using Multiple Regression to Make Predictions (12:12).
In the previous Concepts, we learned a bit about examining the relationship between two variables by calculating the correlation coefficient and the linear regression line. But, as we all know, we often work with more than two variables. For example, what happens if we want to examine the impact that class size and number of faculty members have on a university's ranking? Since we are taking multiple predictor variables into account, the simple linear regression model with a single predictor just won’t work. In multiple linear regression, scores for one variable (in this example, a university's ranking) are predicted using multiple predictor variables (class size and number of faculty members).
Another common use of multiple regression models is in the estimation of the selling price of a home. There are a number of variables that go into determining how much a particular house will cost, including the square footage, the number of bedrooms, the number of bathrooms, the age of the house, the neighborhood, and so on. Analysts use multiple regression to estimate the selling price in relation to all of these different types of variables.
In this Concept, we will examine the components of a multiple regression equation, calculate an equation using technological tools, and use this equation to test for significance in order to evaluate a hypothesis.
Understanding a Multiple Regression Equation
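If we were to try to draw a multiple regression model, it would be a bit more difficult than drawing a model for linear regression. Let’s say that we have two predictor variables, $X_1$ and $X_2$, that are predicting the desired variable, $Y$. The regression equation would be as follows:
$$\hat{Y} = b_1X_1 + b_2X_2 + a$$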
When there are two predictor variables, the scores must be plotted in three dimensions (see figure below). When there are more than two predictor variables, we would continue to plot these in multiple dimensions. Regardless of how many predictor variables there are, we still use the least squares method to try to minimize the distance between the actual and predicted values.
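To make the least squares idea concrete, here is a minimal sketch in Python for the special case of exactly two predictor variables. The function name `fit_two_predictors` and the small data set are assumptions made up for illustration; they do not come from this lesson. The data are built to lie exactly on the plane $y = 2 + 3x_1 - x_2$, so the fit should recover those values (up to rounding).

```python
from statistics import mean

def fit_two_predictors(x1, x2, y):
    """Least squares fit of y = a + b1*x1 + b2*x2 (two predictors only)."""
    m1, m2, my = mean(x1), mean(x2), mean(y)
    # Sums of squares and cross-products of the centered data
    s11 = sum((u - m1) ** 2 for u in x1)
    s22 = sum((v - m2) ** 2 for v in x2)
    s12 = sum((u - m1) * (v - m2) for u, v in zip(x1, x2))
    s1y = sum((u - m1) * (w - my) for u, w in zip(x1, y))
    s2y = sum((v - m2) * (w - my) for v, w in zip(x2, y))
    # Solve the 2x2 normal equations with Cramer's rule
    det = s11 * s22 - s12 ** 2
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    # The fitted plane passes through the point of means
    a = my - b1 * m1 - b2 * m2
    return a, b1, b2

# Hypothetical data lying exactly on y = 2 + 3*x1 - x2
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 6]
y = [2 + 3 * u - v for u, v in zip(x1, x2)]

a, b1, b2 = fit_two_predictors(x1, x2, y)
print(a, b1, b2)
```

With real data the points will not lie exactly on a plane, and the same calculation returns the plane that minimizes the sum of squared differences between the actual and predicted values.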
When predicting values using multiple regression, we first use the standard score form of the regression equation, which is shown below:
To solve for the regression and constant coefficients, we need to determine multiple correlation coefficients,
In most situations, we use a computer to calculate the multiple regression equation and determine the coefficients in this equation. We can also do multiple regression on a TI-83/84 calculator. (This program can be downloaded.)
Technology Note: Multiple Regression Analysis on the TI-83/84 Calculator
A program for multiple regression analysis on the TI-83/84 calculator is available for download; see the resources listed under On the Web.
It is helpful to explain the calculations that go into a multiple regression equation so we can get a better understanding of how this formula works.
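After we find the correlation values, $r$, between the variables, we can use the following formulas to determine the standardized (beta) coefficients for two predictor variables, $X_1$ and $X_2$, that predict the variable $Y$:
$$\beta_1 = \frac{r_{Y1} - r_{Y2}\,r_{12}}{1 - r_{12}^2} \qquad \beta_2 = \frac{r_{Y2} - r_{Y1}\,r_{12}}{1 - r_{12}^2}$$
Here $r_{Y1}$ and $r_{Y2}$ are the correlations between $Y$ and each predictor, and $r_{12}$ is the correlation between the two predictors.
After solving for the beta coefficients, we can then compute the $b$ coefficients by using the following formulas, where $s_Y$, $s_1$, and $s_2$ are the standard deviations of $Y$, $X_1$, and $X_2$:
$$b_1 = \beta_1\left(\frac{s_Y}{s_1}\right) \qquad b_2 = \beta_2\left(\frac{s_Y}{s_2}\right)$$
After solving for the regression coefficients, we can finally solve for the regression constant by using the formula shown below, where $k$ is the number of predictor variables:
$$a = \bar{y} - \sum_{i=1}^{k} b_i\bar{x}_i$$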
Again, since these formulas and calculations are extremely tedious to complete by hand, we usually use a computer or a TI-83/84 calculator to solve for the coefficients in a multiple regression equation.
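For readers who want to see the hand calculation carried out, the sketch below follows the standard two-predictor route: correlations first, then beta coefficients $\beta_1 = (r_{Y1} - r_{Y2}r_{12})/(1 - r_{12}^2)$ and $\beta_2 = (r_{Y2} - r_{Y1}r_{12})/(1 - r_{12}^2)$, then $b_i = \beta_i(s_Y/s_i)$, and finally the constant $a = \bar{y} - b_1\bar{x}_1 - b_2\bar{x}_2$. The function names and the data are hypothetical, chosen only so the result is easy to check; the data lie exactly on $y = 2 + 3x_1 - x_2$.

```python
from statistics import mean, stdev

def pearson_r(u, v):
    """Pearson correlation coefficient between two equal-length lists."""
    mu, mv = mean(u), mean(v)
    num = sum((p - mu) * (q - mv) for p, q in zip(u, v))
    den = (sum((p - mu) ** 2 for p in u) * sum((q - mv) ** 2 for q in v)) ** 0.5
    return num / den

def coefficients_from_correlations(x1, x2, y):
    """Two-predictor regression coefficients built from correlations."""
    r_y1, r_y2, r_12 = pearson_r(y, x1), pearson_r(y, x2), pearson_r(x1, x2)
    # Standardized (beta) coefficients
    beta1 = (r_y1 - r_y2 * r_12) / (1 - r_12 ** 2)
    beta2 = (r_y2 - r_y1 * r_12) / (1 - r_12 ** 2)
    # Rescale to the original units to get the b coefficients
    b1 = beta1 * stdev(y) / stdev(x1)
    b2 = beta2 * stdev(y) / stdev(x2)
    # Regression constant
    a = mean(y) - b1 * mean(x1) - b2 * mean(x2)
    return a, b1, b2

# Hypothetical data lying exactly on y = 2 + 3*x1 - x2
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 6]
y = [3, 7, 7, 11, 11]

print(coefficients_from_correlations(x1, x2, y))
```

The output should match the plane used to build the data, up to floating-point rounding. These formulas apply only to the two-predictor case; with more predictors the algebra grows quickly, which is one reason software is normally used.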
Calculating a Multiple Regression Equation using Technological Tools
As mentioned, there are a variety of technological tools available to calculate the coefficients in a multiple regression equation. When using a computer, there are several programs that help us calculate the multiple regression equation, including Microsoft Excel, the Statistical Analysis System (SAS), and the Statistical Package for the Social Sciences (SPSS). Each of these programs allows the user to calculate the multiple regression equation and provides summary statistics for each of the models.
For the purposes of this lesson, we will use summary tables produced by Microsoft Excel to solve problems with multiple regression equations. While the summary tables produced by the different technological tools differ slightly in format, they all provide us with the information needed to build a multiple regression equation, conduct hypothesis tests, and construct confidence intervals. Let’s take a look at an example of a summary statistics table so we get a better idea of how we can use technological tools to build multiple regression equations.
Suppose we want to predict the amount of water consumed by football players during summer practices. The football coach notices that the water consumption tends to be influenced by the time that the players are on the field and by the temperature. He measures the average water consumption, temperature, and practice time for seven practices and records the following data:
||Practice Time (hrs)||
Figure: Water consumption by football players compared to practice time and temperature.
Technology Note: Using Excel for Multiple Regression
- Copy and paste the table into an empty Excel worksheet.
- Click the Data choice on the toolbar, then select ’Data Analysis,’ and then choose ’Regression’ from the list that appears. (Note: if Data Analysis does not appear as a choice on your Data page, you will need to follow the add-in instructions below.)
- Place the cursor in the ’Input Y range’ field and select the third column.
- Place the cursor in the ’Input X range’ field and select the first and second columns.
- Place the cursor in the ’Output Range’ field and click somewhere in a blank cell below and to the left of the table.
- Click ’Labels’ so that the names of the predictor variables will be displayed in the table.
- Click ’OK’, and the results shown below will be displayed.
Note: In Excel 2007, to add Data Analysis to your Data page, perform the following steps. Click the Microsoft Office Button in the upper left, then click on Excel Options. Click on Add-ins, then highlight the Analysis ToolPak, click Go, make sure the Analysis ToolPak box is checked, and then click OK. The Data Analysis choice should now appear on your Excel Data page. Follow the remaining instructions above.
||Lower 95%||Upper 95%|
In this example, we have a number of summary statistics that give us information about the regression equation. As you can see from the results above, we have the regression coefficient and standard error for each variable, as well as the value of
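Using the results above, our regression equation would be $\hat{Y} = a + 1.51(\text{Temperature}) + 12.53(\text{Practice Time})$, where $a$ is the regression constant (intercept) reported in the Excel output.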
Each of the regression coefficients tells us something about the relationship between the predictor variable and the predicted outcome. The temperature coefficient of 1.51 tells us that for every 1.0-degree increase in temperature, we predict there to be an increase of about 1.5 ounces of water consumed, if we hold the practice time constant. Similarly, we find that with every one-hour increase in practice time, we predict players will consume an additional 12.53 ounces of water, if we hold the temperature constant. That equates to about 2.1 extra ounces of water for every 10-minute increase in practice time.
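The arithmetic behind this interpretation can be checked with a short sketch. It uses only the two coefficients quoted above (1.51 and 12.53); the function name `predicted_change` is an assumption made for illustration. Because we are looking at a change in the prediction, the regression constant cancels out and is not needed.

```python
# Coefficients quoted in the text (ounces per degree, ounces per hour)
B_TEMPERATURE = 1.51
B_PRACTICE_TIME = 12.53

def predicted_change(delta_temp_degrees=0.0, delta_practice_hours=0.0):
    """Change in predicted water consumption (ounces) when the predictors
    change by the given amounts; any predictor not listed is held constant."""
    return (B_TEMPERATURE * delta_temp_degrees
            + B_PRACTICE_TIME * delta_practice_hours)

# One extra degree, practice time held constant
print(round(predicted_change(delta_temp_degrees=1), 2))
# One extra hour, temperature held constant
print(round(predicted_change(delta_practice_hours=1), 2))
# Ten extra minutes (1/6 of an hour), temperature held constant
print(round(predicted_change(delta_practice_hours=10 / 60), 2))
```

The last value is 12.53 divided by 6, which is where the figure of roughly 2.1 ounces per 10 minutes comes from.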
With a value of 0.99 for
Testing for Significance to Evaluate a Hypothesis, the Standard Error of a Coefficient, and Constructing Confidence Intervals
When we perform multiple regression analysis, we are essentially trying to determine if our predictor variables explain the variation in the outcome variable,
When we conduct a hypothesis test, we test the null hypothesis that the multiple
Let’s take a look at the example above and interpret the
Standard Error of a Coefficient and Testing for Significance
In addition to performing a test to assess the probability of the regression line occurring by chance, we can also test the significance of individual coefficients. This is helpful in determining whether or not the variable significantly contributes to the regression. For example, if we find that a variable does not significantly contribute to the regression, we may choose not to include it in the final regression equation. Again, we can use computer programs to determine the standard error, the test statistic, and its level of significance.
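As a rough sketch of what the software is doing for each coefficient, the test statistic is the coefficient divided by its standard error, and it is compared with a critical $t$-value with $n - k - 1$ degrees of freedom ($n$ observations, $k$ predictors). The coefficients and standard errors below are hypothetical values invented for illustration, not the values from the Excel output. The critical value 2.776 is the two-tailed 0.05 value for 4 degrees of freedom, which is what 7 observations and 2 predictors would give.

```python
def t_statistic(coefficient, standard_error):
    """Test statistic for H0: the population coefficient equals zero."""
    return coefficient / standard_error

def is_significant(coefficient, standard_error, t_critical):
    """Two-tailed test: reject H0 when |t| exceeds the critical value."""
    return abs(t_statistic(coefficient, standard_error)) > t_critical

# Two-tailed critical t at the 0.05 level with df = 7 - 2 - 1 = 4
T_CRITICAL_DF4 = 2.776

# Hypothetical (coefficient, standard error) pairs, for illustration only
examples = {"predictor A": (2.0, 0.5), "predictor B": (0.6, 0.5)}

for name, (coef, se) in examples.items():
    t = t_statistic(coef, se)
    print(name, round(t, 2), is_significant(coef, se, T_CRITICAL_DF4))
```

In this made-up example, predictor A would be judged a significant contributor to the regression and predictor B would not, so we might consider leaving predictor B out of the final equation.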
Looking at our example above, we see that Excel has calculated the standard error and the test statistic (in this case, the
Calculating the Confidence Interval for a Coefficient
We can also use technological tools to build a confidence interval around our regression coefficients. Remember, earlier in the chapter we calculated confidence intervals around certain values in linear regression models. However, this concept is a bit different when we work with multiple regression models.
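The ’Lower 95%’ and ’Upper 95%’ columns that software reports can be reproduced with a small calculation: the coefficient plus or minus a critical $t$-value times the standard error. The sketch below assumes a hypothetical coefficient of 2.0 with a standard error of 0.5 (not values from this lesson) and uses the two-tailed 0.05 critical value for 4 degrees of freedom, 2.776.

```python
def confidence_interval(coefficient, standard_error, t_critical):
    """Interval: coefficient +/- t_critical * standard_error."""
    margin = t_critical * standard_error
    return coefficient - margin, coefficient + margin

# Hypothetical coefficient and standard error, for illustration only
lower, upper = confidence_interval(2.0, 0.5, 2.776)
print(round(lower, 3), round(upper, 3))

# If the interval does not contain zero, the coefficient is significant
# at the corresponding level (here, 0.05).
print(not (lower <= 0 <= upper))
```

Checking whether zero lies inside the interval gives the same decision as comparing the coefficient's $t$-statistic with the critical value.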
For a predictor variable in multiple regression, the confidence interval is based on a
On the Web
Manuals by a professor at Western Kentucky University for use in statistics, plus TI-83/84 programs for multiple regression that are available for download.
Texas Instruments website that includes supplemental activities and practice problems using the TI-83 calculator.
In multiple linear regression, scores for the criterion variable are predicted using multiple predictor variables. The regression equation we use for two predictor variables, $X_1$ and $X_2$, is as follows:
$$\hat{Y} = b_1X_1 + b_2X_2 + a$$
When calculating the different parts of the multiple regression equation, we can use a number of computer programs, such as Microsoft Excel, SPSS, and SAS.
These programs calculate the multiple regression coefficients, the combined value of $r$, and the value of $R^2$.
For a study of crime in the United States, data for each of the fifty states and the District of Columbia were collected on the violent crime rate per 100,000 citizens, poverty rate as percent of the population, single parent households as percent of all state households, and urbanization as a percent of the population living in urban areas. The multiple regression output is shown below where
a. What is the least squares equation for the violent crime rate?
b. If the poverty rate is increased by 1 percent, with single parent households and urbanization unchanged, how would the violent crime rate change?
a. The equation is:
b. If the poverty rate is increased by 1 percent with the other two explanatory variables held fixed, the predicted violent crime rate changes by the poverty coefficient multiplied by the size of the increase. Students can verify this by substituting specific values for the three explanatory variables, computing the predicted violent crime rate, and then changing only the value of the first variable (for example, to 13.41, an increase of .01 in the poverty rate) and recomputing.
For 1-7, a lead English teacher is trying to determine the relationship between three tests given throughout the semester and the final exam. She decides to conduct a mini-study on this relationship and collects the test data (scores for Test 1, Test 2, Test 3, and the final exam) for 50 students in freshman English. She enters these data into Microsoft Excel and arrives at the following summary statistics:
- How many predictor variables are there in this scenario? What are the names of these predictor variables?
- What does the regression coefficient for Test 2 tell us?
- What is the regression model for this analysis?
- What is the value of $r^2$, and what does it indicate?
- Determine whether the multiple $r$-value is statistically significant.
- Which of the predictor variables are statistically significant? What is the reasoning behind this decision?
- Given this information, would you include all three predictor variables in the multiple regression model? Why or why not?
For all students at a particular university, the regression equation for $y$ = college GPA, $x_1$ = high school GPA, and $x_2$ = college board score is $\hat{y} = 0.20 + 0.50x_1 + 0.002x_2$.
a. Find the predicted college GPA for students having a high school GPA of 4.0 and a college board score of 800.
b. If a student retakes the college board exam and increases his score by 100 points, what will be the change in his predicted college GPA?
- In 1982, when SAT scores were first published on a state-by-state basis in the US, there was a huge variation in the scores. This was positive for some states and a problem for other states. Some researchers wanted to study which variables were associated with the state SAT differences. The variable SAT is the average total SAT (verbal + quantitative) score in the state, and the two explanatory variables they considered were Takers (the percent of total eligible students in a state who took the exam) and Expend (total state expenditure on secondary schools, expressed in hundreds of dollars per student). Following is a piece of computer output from this study:
a. For Pennsylvania, SAT = 885, Takers = 50, and Expend = 27.98. What would you predict Pennsylvania’s average SAT score to be based on knowing its Takers and Expend, but not knowing its SAT? What is the residual for Pennsylvania?
b. Use a test at the 0.05 significance level to test the hypothesis that Expend helps to predict SAT score once Takers are taken into account.
- Below is some computer output of the regression of January Temperature vs. Latitude and Longitude, where January Temperature is the dependent variable:
Number of cases 57
RSquare = 74.1%
RSqAdj = 73.1%
s = 6.935 with 56 – 3 = 53 degrees of freedom
|Variable||Coefficient||s.e. of Coeff||t-ratio||prob|
a. What is the regression equation?
b. What is the intercept and what does it represent?
c. For a fixed longitude, how does a change in latitude affect the January temperature?
d. Is there evidence that the longitude affects the January temperature for a given latitude? Test at the .05 level of significance.
Consider the following regression equation:
$\hat{y} = 116.84 + 0.832x_1 - 0.951x_2 + 2.34x_3 - 1.08x_4$
Using the regression equation, complete the following table for four different sets of specific values for explanatory variables:
- Suppose that a college admissions committee plans to use data for total SAT score, high school grade point average and high school class rank to predict college freshman year grade point average for high school students applying for admission to the college. Write the regression model for this situation. Specify the response variable and the explanatory variables.
- Describe an example of a multiple linear regression model for a topic that is of interest to you. Specify the response variable and the explanatory variables and write the multiple regression model for your example.
Consider the following regression equation for predicting August temperature:
- For San Francisco, the average January temperature is 49 degrees F and the average April temperature is 56 degrees F. Use the regression equation to estimate the average August temperature for San Francisco.
- Find the residual for San Francisco if the actual average August temperature is 64 degrees F.
Suppose that a multiple linear regression model includes three explanatory variables.
- Write the population regression model using appropriate statistical notation.
- Explain the difference between what is represented by the symbol $b_3$ and the symbol $\beta_3$.