
# Regression and Correlation

## Scatterplots, relationship between data, correlation coefficients, and regression.

Linear Correlation

Statistics is largely concerned with how a change in one variable relates to changes in a second variable. Bivariate data is two lists of data that are paired up. Is there any relationship between the following data? If there is, does it mean that doctors cause cancer?

| Number of Doctors | 27 | 30 | 36 | 60 | 81 | 90 | 156 | 221 | 347 |
|---|---|---|---|---|---|---|---|---|---|
| Cancer Rate | 0.02 | 0.07 | 0.16 | 0.2 | 0.43 | 0.87 | 1.21 | 2.8 | 3.91 |

### Correlation

A scatterplot creates an \begin{align*}(x, y)\end{align*} point from each data pair. When making a scatterplot, you can try to assign the independent variable to \begin{align*}x\end{align*} and the dependent variable to \begin{align*}y\end{align*}; however, it will often not be obvious which variable is the dependent variable, so you will just have to pick one.

Once you plot the data and zoom appropriately you will see the points scattered about. Sometimes there will be a clear linear relationship and sometimes it will appear random. The correlation coefficient, \begin{align*}r\end{align*}, is a number that quantifies two aspects of the relationship between the data:

• The correlation coefficient is either negative, zero or positive. This tells you whether the data is negatively correlated, uncorrelated or positively correlated.
• The correlation coefficient always satisfies \begin{align*}-1 \le r \le 1\end{align*}, and its distance from zero indicates the strength of the correlation. If \begin{align*}r=1\end{align*} or \begin{align*}r=-1\end{align*} then the data is perfectly linear. Note that a perfectly linear relationship includes lines with slopes other than 1.
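To illustrate the last point, here is a minimal Python sketch. The helper `correlation` and the sample points are made up for this illustration and are not part of the lesson; the helper uses the standard Pearson formula, which is assumed to be the same quantity the calculator reports. It checks that points lying exactly on a line give a perfect correlation, whatever the slope.

```python
import math

def correlation(xs, ys):
    """Pearson correlation coefficient r for paired lists xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical points that lie exactly on a line
xs = [1, 2, 3, 4, 5]
steep_up = [3 * x + 2 for x in xs]         # slope 3
gentle_down = [-0.5 * x + 10 for x in xs]  # slope -0.5

r_up = correlation(xs, steep_up)
r_down = correlation(xs, gentle_down)
print(round(r_up, 6), round(r_down, 6))
```

Both lines give a perfect correlation (positive for the rising line, negative for the falling one), even though neither slope is 1 or -1. The sign of \begin{align*}r\end{align*} follows the direction of the line, not its steepness.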

Consider the examples below to see what different correlation coefficients will look like in data:

In PreCalculus you will not learn how to calculate the correlation coefficient (you will if you take future statistics courses!). For now, the calculator will calculate it for you and your job will be to interpret the result.

If the data is sufficiently linear, then your calculator can perform a regression to produce the equation of a line that attempts to model the trend of the data. The regression line may actually pass through all, some or none of the data points. This regression line is represented in statistics by:

\begin{align*}\hat {y}=a+bx\end{align*}

The symbol \begin{align*}\hat{y}\end{align*} is pronounced “\begin{align*}y\end{align*}-hat” and is the predicted \begin{align*}y\end{align*} value based on a given \begin{align*}x\end{align*} value. Occasionally, you may also calculate the predicted \begin{align*}x\end{align*} value given a \begin{align*}y\end{align*} value; however, this is less mathematically sound. Also notice that the linear regression model is simply a rearrangement of the standard equation of a line, \begin{align*}y=mx+b\end{align*}. In the regression form, \begin{align*}a\end{align*} is the \begin{align*}y\end{align*}-intercept and \begin{align*}b\end{align*} is the slope, so \begin{align*}b\end{align*} plays a different role in the two forms.
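The lesson leaves the computation to the calculator. For readers who want to see what happens behind the button, the sketch below assumes the calculator uses the usual least-squares fit. The function name `fit_line` and the sample points are invented for this illustration.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # y-intercept
    return a, b

# Hypothetical points that lie exactly on y = 2 + 3x
xs = [0, 1, 2, 3]
ys = [2, 5, 8, 11]

a, b = fit_line(xs, ys)
print(a, b)  # intercept first, then slope
```

Because the points are perfectly linear, the fit recovers the intercept 2 and the slope 3. In \begin{align*}y=mx+b\end{align*} notation the same line has \begin{align*}m=3\end{align*} and \begin{align*}b=2\end{align*}.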

### Examples

#### Example 1

Earlier, you were asked about the relationship between the two sets of data:

| Number of Doctors | 27 | 30 | 36 | 60 | 81 | 90 | 156 | 221 | 347 |
|---|---|---|---|---|---|---|---|---|---|
| Cancer Rate | 0.02 | 0.07 | 0.16 | 0.2 | 0.43 | 0.87 | 1.21 | 2.8 | 3.91 |

Enter the data onto lists in your calculator:



Turn the [STAT PLOT] on that compares the two lists of data:



You should note that the data is extremely linear with a positive correlation coefficient:



A naïve conclusion would be to say that doctors cause cancer. One of the most misunderstood concepts in statistics is that correlation does not imply causation. Just because there is a correlation between the number of doctors and the cancer rate doesn’t mean that the number of doctors causes the cancer. There are dozens of reasons why more doctors might correlate with higher cancer rates. In general, remember that correlation is not the same as causation. Be careful before making any conclusions about change in one variable causing change in another variable.
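As a cross-check on the calculator, the correlation coefficient for this data can also be computed directly. The sketch below uses the standard Pearson formula, which is assumed to match what the calculator reports. The names `correlation`, `doctors` and `cancer_rate` are chosen here for illustration.

```python
import math

def correlation(xs, ys):
    """Pearson correlation coefficient r for paired lists xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Data from the table in this example
doctors = [27, 30, 36, 60, 81, 90, 156, 221, 347]
cancer_rate = [0.02, 0.07, 0.16, 0.2, 0.43, 0.87, 1.21, 2.8, 3.91]

r = correlation(doctors, cancer_rate)
print(round(r, 3))
```

The value comes out close to 1, which confirms the strong positive correlation. Note that nothing in the calculation refers to cause and effect: the formula only measures how closely the paired values follow a line.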

#### Example 2

Estimate the correlation coefficient for the following scatterplots.

1. \begin{align*}r \approx 0\end{align*}. Because the height \begin{align*}(y)\end{align*} does not seem to depend on \begin{align*}x\end{align*}, the data is uncorrelated. Another way to see this is that the slope appears to be undefined.
2. \begin{align*}r \approx -0.7\end{align*}. If the solo point in the bottom left is an outlier, you could choose to not include it in the data. Then, the \begin{align*}r\end{align*} value would be closer to -1.
3. \begin{align*}r \approx +0.8\end{align*}. The clump of data seems to be slightly positively correlated, and the single point in the upper left has a strong effect indicating positive slope.
4. \begin{align*}r \approx -0.8\end{align*}. The data seems to be fairly strongly negatively correlated.
5. \begin{align*}r \approx 1\end{align*}. The data seems to be perfectly linearly correlated.

#### Example 3

Estimate the regression line through the following scatterplots.

Visualize and sketch the “line of best fit” for each set of points.

Note that in part a, the regression line does not touch any point. Instead, it captures the general trend of the data. In part c, the correlation is not high enough in any direction to produce a regression line. The calculator may give a regression line for scatterplots that look like part c, but you need to be very skeptical that there is actually a relationship between the two variables.

#### Example 4

Use your calculator to perform a linear regression on the following data. Then, predict the height of someone who has shoe size 9.

| Shoe Size | Height (in) |
|---|---|
| 11 | 70 |
| 8.5 | 70 |
| 10 | 72 |
| 8 | 65 |
| 7 | 64 |

First enter the data.

Next perform the regression. Notice that the calculator can perform linear regression in two ways that are essentially the same. To keep consistent with \begin{align*}\hat{y}=a+bx\end{align*}, use \begin{align*}\text{LinReg}(a+bx)\end{align*}. This is option 8 in the [STAT], [CALC] menu.

Now you need to tell the calculator to perform the regression on the two lists you want and where to copy the equation. The syntax is:

• \begin{align*}\text{LinReg}(a+bx) L_1, L_2, Y_1\end{align*}

Note: to find Y1, go to [VARS], [Y-VARS], [FUNCTION], [Y1].

Notice that the \begin{align*}r\end{align*} value is about 0.8. This indicates that there is a fairly strong positive correlation between shoe size and height. If your calculator does not display the \begin{align*}r\end{align*} and \begin{align*}r^2\end{align*} lines, then you need to go into the catalog and run the command “DiagnosticOn”. This will enable the display of the correlation coefficient.

You can then graph the scatterplot and the regression line:

The regression equation is:

\begin{align*}\hat{y}=52.4069+1.7745 x\end{align*}

Here \begin{align*}x\end{align*} represents shoe size and \begin{align*}\hat{y}\end{align*} represents predicted height. The predicted height for someone with a size 9 shoe is 68.3774 inches, or about 68.4 inches:

\begin{align*}\hat{y}=52.4069+1.7745 \cdot 9=68.3774\end{align*}
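The calculator's output can be reproduced by hand. The sketch below assumes the standard least-squares and Pearson formulas; the helper name `linear_regression` is chosen for this illustration and is not a calculator or library function.

```python
import math

def linear_regression(xs, ys):
    """Return (a, b, r) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    b = sxy / sxx                   # slope
    a = mean_y - b * mean_x         # y-intercept
    r = sxy / math.sqrt(sxx * syy)  # correlation coefficient
    return a, b, r

# Data from the table in this example
shoe_size = [11, 8.5, 10, 8, 7]
height_in = [70, 70, 72, 65, 64]

a, b, r = linear_regression(shoe_size, height_in)
predicted = a + b * 9  # predicted height for shoe size 9

print(round(a, 4), round(b, 4), round(r, 2), round(predicted, 2))
```

The intercept and slope agree with the regression equation above, and \begin{align*}r\end{align*} is about 0.81. Using the unrounded coefficients gives a prediction of about 68.38 inches, which matches the value above up to rounding.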

An easy way to use the power of the calculator is to evaluate the stored regression equation from the home screen using function notation, for example \begin{align*}Y_1(9)\end{align*}:

#### Example 5

Shaquille O’Neal has size 23 shoes. What, if anything can you infer about his vocabulary? Does a larger shoe size cause a larger vocabulary?

Shaquille’s shoe size is significantly beyond the scope of the data that the model is based on. The data relates to elementary school students and a size 23 shoe is beyond the relevant domain. This means it wouldn’t make sense to use this model to predict Shaquille’s vocabulary. Shoe size does not cause vocabulary, but the two variables are strongly correlated because over time both tend to grow.
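The idea of a relevant domain can be sketched in code. The vocabulary data is not listed here, so the sketch below reuses the shoe sizes from Example 4 instead. The helper `in_relevant_domain` is hypothetical, and treating the observed minimum and maximum as the relevant domain is a simplifying assumption.

```python
def in_relevant_domain(x, observed_xs):
    """True if x lies within the range of x values the model was built on."""
    return min(observed_xs) <= x <= max(observed_xs)

# Shoe sizes from Example 4 (not the vocabulary data, which is not listed)
shoe_size = [11, 8.5, 10, 8, 7]

size_9_ok = in_relevant_domain(9, shoe_size)    # interpolation
size_23_ok = in_relevant_domain(23, shoe_size)  # extrapolation

print(size_9_ok, size_23_ok)
```

A size 9 shoe falls inside the observed range, so a prediction there is an interpolation. A size 23 shoe falls far outside it, so any prediction would be an extrapolation that the data cannot support.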

### Review

For each correlation coefficient, describe what it means for data to have that correlation coefficient and sketch a scatterplot with that correlation coefficient.

1. \begin{align*}r=1\end{align*}

2. \begin{align*}r=-0.5\end{align*}

3. \begin{align*}r=-1\end{align*}

4. \begin{align*}r=0\end{align*}

5. \begin{align*}r=0.8\end{align*}

The data below shows the SAT math score and GPA for 7 different students.

 SAT math score 595 520 715 405 680 490 565 GPA 3.4 3.2 3.9 2.3 3.9 2.5 3.5

6. Use your calculator to perform a linear regression that models the data. What is the regression equation? What is the correlation coefficient?

7. Use the equation from #6 to predict the GPA for a student with an SAT score of 500. Does this prediction seem reasonable given the data? Why or why not?

8. What is the relevant domain of this data?

9. Does a high SAT math score cause a high GPA?

The data below shows scores from two different quizzes for 10 different students.

 Quiz 1 Score 15 12 10 14 10 8 6 15 16 13 Quiz 2 Score 20 15 12 18 10 13 12 10 18 15

10. Use your calculator to perform a linear regression that models the data. What is the regression equation? What is the correlation coefficient?

11. Use the equation from #10 to predict the Quiz 2 score for a student with a Quiz 1 score of 19. Does this prediction seem reasonable given the data? Why or why not?

13. Explain in your own words the difference between causation and correlation.

14. Explain in your own words what the correlation coefficient measures.

15. Explain why a larger sample size will cause a more accurate correlation coefficient.

To see the Review answers, open this PDF file and look for section 15.7.


### Vocabulary

| Term | Definition |
|---|---|
| bivariate data | Bivariate data consists of two paired sets of data. |
| correlation coefficient | The correlation coefficient is a standard quantitative measure of best fit of a line. It has the symbol r and has values from -1 to +1. |
| deterministic | A deterministic relationship indicates that the value of one variable can be reliably and accurately determined by the manipulation of the other variable. |
| explanatory variables | Explanatory variables are another name for independent variables. |
| linear correlation | Linear correlation is a measure of the strength of the linear relationship between two random variables. |
| linear correlation coefficient | A linear correlation coefficient or r-value of a relationship between two variables describes the strength of the linear relationship. |
| response variables | Response variables are another name for dependent variables. |
| scatter plot | A scatter plot is a plot of the dependent variable versus the independent variable and is used to investigate whether or not there is a relationship or connection between 2 sets of data. |
| Scatterplot | A scatterplot is a type of visual display that shows pairs of data for two different variables. |
| Slope | Slope is a measure of the steepness of a line. A line can have positive, negative, zero (horizontal), or undefined (vertical) slope. The slope of a line can be found by calculating “rise over run” or “the change in the $y$ over the change in the $x$.” The symbol for slope is $m$. |
| Slope-Intercept Form | The slope-intercept form of a line is $y = mx + b,$ where $m$ is the slope and $b$ is the $y-$intercept. |