# 1.9: Regression and Correlation

## Scatterplots and Linear Correlation

Graphical representations will be the primary focus of bivariate data in a first-year class. Many of the techniques required for statistical analysis of multiple variables require calculus, so a basic treatment is all students are equipped for at this stage. The scatterplot is already a very familiar structure to students. In some ways your task is not teaching the principles of a good scatterplot, but rather un-teaching the bad habits or misconceptions that students have developed over the years. Those bad habits usually center on poor scaling, sloppy labeling, and a general lack of precision. If students are representing data for analytical use, they will need to take extra care to produce a graph that is clear and accurate enough to be useful. Simply using a computer grapher is not sufficient either, as poorly chosen scaling can still lead to poor conclusions.

For many complex measures, like the various correlation coefficients, a table where values are determined step by step, as on page 341, is very useful for keeping all of the variables and values in the correct place. (It's also a great tool for finding standard deviations by hand. If you look carefully, you will see that the standard deviation and the correlation coefficients have plenty in common conceptually.)
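As a rough illustration of the step-by-step table idea, the sketch below builds the columns for Pearson's correlation coefficient one row at a time. The five data points are made up for illustration and are not taken from the text, and the column layout is an assumption that may differ from the table on page 341.

```python
import math

# Hypothetical data, invented for this sketch (not from the text).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# One row per data point: deviations from the means, their product, and their squares.
print(f"{'x':>4} {'y':>4} {'dx':>6} {'dy':>6} {'dx*dy':>7} {'dx^2':>6} {'dy^2':>6}")
sum_dxdy = 0.0
sum_dx2 = 0.0
sum_dy2 = 0.0
for x, y in zip(xs, ys):
    dx = x - x_bar
    dy = y - y_bar
    print(f"{x:>4} {y:>4} {dx:>6.1f} {dy:>6.1f} {dx * dy:>7.2f} {dx ** 2:>6.2f} {dy ** 2:>6.2f}")
    sum_dxdy += dx * dy
    sum_dx2 += dx ** 2
    sum_dy2 += dy ** 2

# Pearson's r from the column totals.
r = sum_dxdy / math.sqrt(sum_dx2 * sum_dy2)

# The dx^2 column is the same one used for the sample standard deviation of x.
sd_x = math.sqrt(sum_dx2 / (n - 1))

print(f"r = {r:.4f}, sd_x = {sd_x:.4f}")
```

The `dx^2` and `dy^2` columns are exactly what a by-hand standard deviation needs, which is one way to show students the conceptual overlap between the two measures.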

## Least Squares Regression

This is a big one, not only for the class or the AP exam, but for understanding all kinds of statistical work in the future. If you have students designing statistical projects, some of them will likely need to use least squares regression for their work. Plan a little extra time to make sure the class understands this section.

There are two ways to go about presenting this chapter. One is to have students work out their own plan for finding the best fit line. With a little guidance, they should be capable of developing a number of different ideas. I start out by having students, or groups of students, use a straightedge to construct the line that they think is the best fit. Inevitably, some students will have different opinions. I ask if anyone can think of an analytical way to test whose line is the closest fit for all of the data. Usually students get very close to the correct answer, if not exactly on it. The drawback to this method is that it takes time, and that it can be confusing for students in the long run. There are many valid methods of finding or confirming a line to fit data. Least squares is the most common in statistics, so students are expected to know it. Some may get confused if a number of ideas are presented.
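One way the "whose line is closest" discussion might be settled is by totaling the squared vertical misses for each proposed line. The sketch below assumes two hypothetical student lines and made-up data. The helper name `sum_squared_errors` is an illustration, not something from the text.

```python
# Hypothetical data, invented for this sketch (not from the text).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]


def sum_squared_errors(slope, intercept, xs, ys):
    """Total of squared vertical distances from each point to the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Two straightedge guesses from hypothetical students.
sse_a = sum_squared_errors(slope=1.0, intercept=1.0, xs=xs, ys=ys)  # y = 1.0x + 1.0
sse_b = sum_squared_errors(slope=0.5, intercept=2.5, xs=xs, ys=ys)  # y = 0.5x + 2.5

print(f"Student A total squared error: {sse_a}")
print(f"Student B total squared error: {sse_b}")
print("Closer fit:", "A" if sse_a < sse_b else "B")
```

Students who propose summing the raw (unsquared) misses can be asked to try it here and see how positive and negative misses cancel, which motivates the squaring.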

The other method is to go straight for the table, as the text does for example 1. This is very quick and clear, and it usually results in a high rate of success for students working these problems. The drawback, as with any strictly algorithmic method, is that students miss the concept behind the method. This is of lesser importance in this section than in others.
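For reference, the table approach can be sketched in a few lines using the standard least squares formulas: the slope is the sum of the deviation products divided by the sum of the squared x deviations, and the intercept follows from the means. The data are made up, and the variable names are assumptions rather than the notation used in the text's example 1.

```python
# Hypothetical data, invented for this sketch (not from the text).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Column totals from the table: deviation products and squared x deviations.
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)

# Least squares slope and intercept.
slope = sxy / sxx
intercept = y_bar - slope * x_bar

print(f"y-hat = {intercept:.2f} + {slope:.2f}x")
```

The fitted line always passes through the point of means, which gives students a quick check on their arithmetic.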

The additional information presented is all good, but the only part that is really important for the AP examination is the part on calculating residuals. Students may also be asked in a free response question to perform a t-test on residuals to determine if the line is a good enough fit.
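A residual is the observed value minus the value the line predicts. The sketch below computes residuals for made-up data and a least squares line fitted to it. The data and names are assumptions for illustration only.

```python
# Hypothetical data, invented for this sketch (not from the text).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least squares line for the data.
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
    (x - x_bar) ** 2 for x in xs
)
intercept = y_bar - slope * x_bar

# Residual = observed y minus predicted y.
predicted = [intercept + slope * x for x in xs]
residuals = [y - y_hat for y, y_hat in zip(ys, predicted)]

for x, y, y_hat, e in zip(xs, ys, predicted, residuals):
    print(f"x={x}  observed={y}  predicted={y_hat:.2f}  residual={e:+.2f}")

# For a least squares line with an intercept, the residuals sum to zero
# (up to floating-point rounding).
print(f"sum of residuals = {sum(residuals):.10f}")
```

A positive residual means the point lies above the line, and a negative residual means it lies below. Students often reverse the subtraction, so it is worth drilling the order.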

## Inferences about Regression

The AP examination will usually ask a question where students are required to make an inference about the correlation between two variables. There are a number of steps to doing so, all of which have been covered somewhere in this text, but not all of them in this section. At this point, students should be reminded of all of the conditions that are required for this inference. Some of this may seem like tedious routine, but in statistics it is easy to apply rules and tests in places where they will not give meaningful results. Furthermore, the results will still look plausible, which takes away the sanity check that students usually rely on.

The sample must be random, as with nearly anything else we will be looking at. This again comes from the design of the experiment, as covered in earlier chapters. The errors must be normally distributed, which can be checked through various plotting methods. This is one condition that can frequently be overlooked without consequence, but it is technically a requirement. The residuals must be centered around zero. This can be checked with a plot, or by finding the mean of the errors. Along with the mean of the errors, and another reason to make a residual plot, the standard deviation of the errors should be the same for all values of x. Independence of the errors can also be judged from this plot, provided no trends appear in it. Once all of these items are checked, the process can proceed as stated in the examples in the text. Writing out these checks as part of the solution is a good choice on the AP free response examination, both for clarity and to show that the student knows the requirements for the test.
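Once the conditions are checked, the arithmetic of the inference can be sketched as follows. This assumes the inference in question is the usual t-test on the slope of the regression line with n - 2 degrees of freedom. The text's own examples may organize the steps differently, and the data here are made up.

```python
import math

# Hypothetical data, invented for this sketch (not from the text).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

sxx = sum((x - x_bar) ** 2 for x in xs)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Residuals and their mean (the "centered around zero" check).
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
mean_residual = sum(residuals) / n

# Standard deviation of the residuals, using n - 2 degrees of freedom.
df = n - 2
s = math.sqrt(sum(e ** 2 for e in residuals) / df)

# Standard error of the slope and the t statistic for H0: slope = 0.
se_slope = s / math.sqrt(sxx)
t_stat = slope / se_slope

print(f"mean residual = {mean_residual:.10f}")
print(f"t = {t_stat:.4f} with {df} degrees of freedom")
```

The t statistic would then be compared against a t-table or calculator value for the chosen significance level. The numbers only support the checks and do not replace the residual plot, since trends and changing spread are much easier to see than to compute.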

## Multiple Regression

Multiple variable regression is tough to visualize. This is compounded by the difficulty of making 3-D graphs. For this reason, and because there is no limit to how many variables can be used, it makes sense to show a single example with a graph, and then move on to making the computations without a visual. There are many instances in mathematics where two variables, or even three, are used to graphically develop a rule that can be extended beyond what can be represented; linear programming is a classic example. Nothing is lost by having students simply follow steps for a solution now that they have experience with regression for two variables. Another good plan is to use technology in this section for ease of solution or for visualizing results. The text mentions SAS and SPSS, but there are many stats packages that will perform regression with multiple variables.
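For teachers without access to a stats package, the computation can also be sketched in plain Python. The example below fits two predictors by solving the normal equations with Gaussian elimination. The data are made up so that the exact relationship is y = 1 + 2·x1 + 3·x2, and the function names `solve_linear_system` and `fit_multiple_regression` are hypothetical helpers, not part of any package the text mentions.

```python
def solve_linear_system(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Swap in the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def fit_multiple_regression(rows, ys):
    """Least squares coefficients [intercept, b1, b2, ...] via normal equations."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 for the intercept
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical data, built so that y = 1 + 2*x1 + 3*x2 exactly.
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
ys = [9, 8, 19, 18, 26]

coefficients = fit_multiple_regression(rows, ys)
print("intercept, b1, b2 =", [round(c, 6) for c in coefficients])
```

Because the made-up data follow the rule exactly, the fit should recover the intercept and both slopes, which makes it a convenient check before moving to real data with scatter. Nothing in the code depends on there being only two predictors, which mirrors the point that the method extends beyond what can be graphed.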