Statistics is largely concerned with how a change in one variable relates to changes in a second variable. **Bivariate data** is two lists of data that are paired up. Is there any relationship between the following data? If there is, does it mean that doctors cause cancer?

| Number of Doctors | 27 | 30 | 36 | 60 | 81 | 90 | 156 | 221 | 347 |
|---|---|---|---|---|---|---|---|---|---|
| Cancer Rate | 0.02 | 0.07 | 0.16 | 0.20 | 0.43 | 0.87 | 1.21 | 2.80 | 3.91 |

### Correlation

A **scatterplot** creates an \begin{align*}(x, y)\end{align*} point from each data pair. When making a scatterplot, you can try to assign the independent variable to \begin{align*}x\end{align*} and the dependent variable to \begin{align*}y\end{align*}; however, it will often not be obvious which variable is the dependent variable, so you will just have to pick one.

Once you plot the data and zoom appropriately you will see the points scattered about. Sometimes there will be a clear linear relationship and sometimes it will appear random. The **correlation coefficient**, \begin{align*}r\end{align*}, is a number that quantifies two aspects of the relationship between the data:

- The sign of the correlation coefficient (negative, zero or positive) tells you whether the data is negatively correlated, uncorrelated or positively correlated.
- The correlation coefficient always satisfies \begin{align*}-1 \le r \le 1\end{align*}, and its distance from zero indicates the strength of the correlation. If \begin{align*}r=1\end{align*} or \begin{align*}r=-1\end{align*} then the data is perfectly linear. Note that a perfectly linear relationship includes lines with slopes other than 1 and -1: \begin{align*}r\end{align*} measures how tightly the points follow a line, not how steep the line is.
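The two points above can be checked with a short Python sketch. It is only an illustration: the function name `pearson_r` and the three small data sets are made up for this purpose, and the function uses the standard formula for the sample correlation coefficient rather than anything specific to a calculator.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Return the correlation coefficient r for the paired lists xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the two means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Illustrative data (not from the lesson)
xs = [1, 2, 3, 4, 5]
rising = [3 * x + 2 for x in xs]       # exactly linear, slope 3
falling = [10 - 0.5 * x for x in xs]   # exactly linear, slope -0.5
scattered = [2, 5, 1, 5, 2]            # no upward or downward trend

print(round(pearson_r(xs, rising), 4))     # 1.0
print(round(pearson_r(xs, falling), 4))    # -1.0
print(round(pearson_r(xs, scattered), 4))  # 0.0
```

The first two lists have slopes of 3 and -0.5, yet both give a perfect correlation, because the slope affects only the sign of the result.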

Consider the examples below to see what different correlation coefficients will look like in data:

In PreCalculus you will not learn how to calculate the correlation coefficient (you will if you take future statistics courses!). For now, the calculator will calculate it for you and your job will be to interpret the result.

If the data is sufficiently linear, then your calculator can perform a regression to produce the equation of a line that attempts to model the trend of the data. The regression line may actually pass through all, some or none of the data points. This **regression line** is represented in statistics by:

\begin{align*}\hat {y}=a+bx\end{align*}

The symbol \begin{align*}\hat{y}\end{align*} is pronounced “\begin{align*}y\end{align*}-hat” and is the predicted \begin{align*}y\end{align*} value based on a given \begin{align*}x\end{align*} value. Occasionally, you may also calculate the predicted \begin{align*}x\end{align*} value given a \begin{align*}y\end{align*} value; however, this is less mathematically sound. Also notice that the linear regression model has the same form as the standard equation of a line, \begin{align*}y=mx+b\end{align*}, with the terms reordered and the letters reassigned: in \begin{align*}\hat{y}=a+bx\end{align*}, \begin{align*}a\end{align*} is the \begin{align*}y\end{align*}-intercept and \begin{align*}b\end{align*} is the slope.
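For reference, the line a calculator reports is typically the least-squares line. A sketch of the standard formulas is given here, where \begin{align*}\bar{x}\end{align*} and \begin{align*}\bar{y}\end{align*} denote the means of the two lists (notation that is not used elsewhere in this lesson):

\begin{align*}b=\frac{\sum (x_i-\bar{x})(y_i-\bar{y})}{\sum (x_i-\bar{x})^2}, \qquad a=\bar{y}-b\bar{x}\end{align*}

The second formula shows that the least-squares line always passes through the point \begin{align*}(\bar{x}, \bar{y})\end{align*}.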

### Examples

#### Example 1

Earlier, you were asked about the relationship between the two sets of data:

| Number of Doctors | 27 | 30 | 36 | 60 | 81 | 90 | 156 | 221 | 347 |
|---|---|---|---|---|---|---|---|---|---|
| Cancer Rate | 0.02 | 0.07 | 0.16 | 0.20 | 0.43 | 0.87 | 1.21 | 2.80 | 3.91 |

Enter the data into lists in your calculator:

Turn on the [STAT PLOT] that compares the two lists of data:

You should note that the data is extremely linear with a positive correlation coefficient:
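If no calculator is at hand, the same check can be sketched in Python. This is an illustration only, using the standard formula for the correlation coefficient; the variable names are chosen here and are not part of the lesson. The result should come out positive and close to 1.

```python
from math import sqrt

doctors = [27, 30, 36, 60, 81, 90, 156, 221, 347]
cancer_rate = [0.02, 0.07, 0.16, 0.20, 0.43, 0.87, 1.21, 2.80, 3.91]

n = len(doctors)
mean_x = sum(doctors) / n
mean_y = sum(cancer_rate) / n

# Sums of products of deviations from the means
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(doctors, cancer_rate))
sxx = sum((x - mean_x) ** 2 for x in doctors)
syy = sum((y - mean_y) ** 2 for y in cancer_rate)

r = sxy / sqrt(sxx * syy)
print(f"r = {r:.3f}")
```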

A naïve conclusion would be to say that doctors cause cancer. One of the most misunderstood concepts in statistics is that correlation does not imply **causation**. Just because there is a correlation between the number of doctors and the cancer rate doesn’t mean that the number of doctors *causes* the cancer. There are many other possible explanations: for instance, places with more doctors may simply diagnose and record more cases of cancer. In general, remember that correlation is not the same as causation. Be careful before making any conclusions about change in one variable *causing* change in another variable.

#### Example 2

Estimate the correlation coefficient for the following scatterplots.

- \begin{align*}r \approx 0\end{align*}. Because the height \begin{align*}(y)\end{align*} does not seem to depend on \begin{align*}x\end{align*}, the data is uncorrelated. Another way to see this is that no single slope, positive or negative, describes the points.
- \begin{align*}r \approx -0.7\end{align*}. If the solo point in the bottom left is an outlier, you could choose to not include it in the data. Then, the \begin{align*}r\end{align*} value would be closer to -1.
- \begin{align*}r \approx +0.8\end{align*}. The clump of data seems to be slightly positively correlated, and the single point in the upper left has a strong effect, indicating a positive slope.
- \begin{align*}r \approx -0.8\end{align*}. The data seems to be fairly strongly negatively correlated.
- \begin{align*}r \approx 1\end{align*}. The data seems to be perfectly linearly correlated.

#### Example 3

Estimate the regression line through the following scatterplots.

Visualize and sketch the “line of best fit” for each set of points.

Note that in part a, the regression line does not touch any point. Instead, it captures the general trend of the data. In part c, the correlation is not high enough in any direction to produce a regression line. The calculator may give a regression line for scatterplots that look like part c, but you need to be very skeptical that there is actually a relationship between the two variables.

#### Example 4

Use your calculator to perform a linear regression on the following data. Then, predict the height of someone who has shoe size 9.

| Shoe Size | Height (in) |
|---|---|
| 11 | 70 |
| 8.5 | 70 |
| 10 | 72 |
| 8 | 65 |
| 7 | 64 |

First enter the data.

Next perform the regression. Notice that the calculator can perform linear regression in two ways that are essentially the same. To keep consistent with \begin{align*}\hat{y}=a+bx\end{align*}, use LinReg(a+bx). This is option 8 in the [STAT], [CALC] menu.

Now you need to tell the calculator to perform the regression on the two lists you want and where to copy the equation. The syntax is:

- \begin{align*}\text{LinReg}(a+bx) L_1, L_2, Y_1\end{align*}

*Note: to find Y1, go to [VARS], [Y-VARS], [FUNCTION], [Y1].*

Notice that the \begin{align*}r\end{align*} value is about 0.8. This indicates that there is a fairly strong positive correlation between shoe size and height. If your calculator does not display the \begin{align*}r\end{align*} and \begin{align*}r^2\end{align*} lines, then you need to go into the catalog and run the command “DiagnosticOn”. This will enable the display of the correlation coefficient.

You can then graph the scatterplot and the regression line:

The regression equation is:

\begin{align*}\hat{y}=52.4069+1.7745 x\end{align*}

Here \begin{align*}x\end{align*} represents shoe size and \begin{align*}\hat{y}\end{align*} represents predicted height in inches. The predicted height for someone with a size 9 shoe is 68.3774 inches, or about 68.4 inches:

\begin{align*}\hat{y}=52.4069+1.7745 \cdot 9=68.3774\end{align*}
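The regression and the prediction can be reproduced with a short Python sketch. It assumes the calculator fits the usual least-squares line (slope equal to the sum of products of deviations divided by the sum of squared deviations in \begin{align*}x\end{align*}); the variable names and the helper `predict_height` are chosen here for illustration.

```python
shoe_size = [11, 8.5, 10, 8, 7]
height_in = [70, 70, 72, 65, 64]

n = len(shoe_size)
mean_x = sum(shoe_size) / n
mean_y = sum(height_in) / n

# Least-squares slope b and intercept a for y-hat = a + b*x
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(shoe_size, height_in))
sxx = sum((x - mean_x) ** 2 for x in shoe_size)
b = sxy / sxx
a = mean_y - b * mean_x

def predict_height(size):
    """Predicted height in inches for a given shoe size."""
    return a + b * size

print(f"y-hat = {a:.4f} + {b:.4f} x")            # y-hat = 52.4069 + 1.7745 x
print(f"size 9 -> {predict_height(9):.2f} in")   # size 9 -> 68.38 in
```

The prediction differs from 68.3774 only in the later decimal places, because the sketch keeps the unrounded coefficients.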

Because the regression equation was stored in Y1, an easy way to evaluate the prediction on the calculator is to use function notation from the home screen, for example by entering Y1(9).

#### Example 5

Shaquille O’Neal has size 23 shoes. What, if anything, can you infer about his vocabulary? Does a larger shoe size cause a larger vocabulary?

Shaquille’s shoe size is significantly beyond the scope of the data that the model is based on. The data relates to elementary school students and a size 23 shoe is beyond the relevant domain. This means it wouldn’t make sense to use this model to predict Shaquille’s vocabulary. A larger shoe size does not cause a larger vocabulary; the two variables are strongly correlated because both tend to grow as children get older.
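The idea of a relevant domain can be sketched as a simple range check before a model is used. The helper name `in_relevant_domain` and the list of shoe sizes below are hypothetical: the elementary school data is not reproduced here, so made-up values stand in for it.

```python
def in_relevant_domain(x, observed_xs):
    """True when x lies within the range of x-values the model was built from."""
    return min(observed_xs) <= x <= max(observed_xs)

# Hypothetical shoe sizes standing in for the elementary school data
observed_shoe_sizes = [1, 2, 3, 4, 5, 6]

print(in_relevant_domain(4, observed_shoe_sizes))   # True
print(in_relevant_domain(23, observed_shoe_sizes))  # False
```

A result of `False` signals extrapolation: the model has no data near that input, so its prediction should not be trusted.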

### Review

For each correlation coefficient, describe what it means for data to have that correlation coefficient and sketch a scatterplot with that correlation coefficient.

1. \begin{align*}r=1\end{align*}

2. \begin{align*}r=-0.5\end{align*}

3. \begin{align*}r=-1\end{align*}

4. \begin{align*}r=0\end{align*}

5. \begin{align*}r=0.8\end{align*}

The data below shows the SAT math score and GPA for 7 different students.

| SAT math score | 595 | 520 | 715 | 405 | 680 | 490 | 565 |
|---|---|---|---|---|---|---|---|
| GPA | 3.4 | 3.2 | 3.9 | 2.3 | 3.9 | 2.5 | 3.5 |

6. Use your calculator to perform a linear regression that models the data. What is the regression equation? What is the correlation coefficient?

7. Use the equation from #6 to predict the GPA for a student with an SAT score of 500. Does this prediction seem reasonable given the data? Why or why not?

8. What is the relevant domain of this data?

9. Does a high SAT math score cause a high GPA?

The data below shows scores from two different quizzes for 10 different students.

| Quiz 1 Score | 15 | 12 | 10 | 14 | 10 | 8 | 6 | 15 | 16 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|
| Quiz 2 Score | 20 | 15 | 12 | 18 | 10 | 13 | 12 | 10 | 18 | 15 |

10. Use your calculator to perform a linear regression that models the data. What is the regression equation? What is the correlation coefficient?

11. Use the equation from #10 to predict the Quiz 2 score for a student with a Quiz 1 score of 19. Does this prediction seem reasonable given the data? Why or why not?

12. What conclusions can you make about this data?

13. Explain in your own words the difference between causation and correlation.

14. Explain in your own words what the correlation coefficient measures.

15. Explain why a larger sample size will tend to give a more accurate correlation coefficient.

### Review (Answers)

To see the Review answers, open this PDF file and look for section 15.7.