
# Regression and Correlation

## Scatterplots, relationship between data, correlation coefficients, and regression.

Linear Correlation

Statistics is largely concerned with how a change in one variable relates to changes in a second variable. Bivariate data is two lists of data that are paired up. Is there any relationship between the following data? If there is, does it mean that doctors cause cancer?

| Number of Doctors | 27 | 30 | 36 | 60 | 81 | 90 | 156 | 221 | 347 |
|---|---|---|---|---|---|---|---|---|---|
| Cancer Rate | 0.02 | 0.07 | 0.16 | 0.2 | 0.43 | 0.87 | 1.21 | 2.8 | 3.91 |

#### Guidance

A scatterplot creates an $(x, y)$ point from each data pair. When making a scatterplot, you can try to assign the independent variable to $x$ and the dependent variable to $y$; however, it will often not be obvious which variable is the dependent variable, so you will just have to pick one.

Once you plot the data and zoom appropriately you will see the points scattered about. Sometimes there will be a clear linear relationship and sometimes it will appear random. The correlation coefficient, $r$, is a number that quantifies two aspects of the relationship between the data:

• The sign of the correlation coefficient is negative, zero or positive. This tells you whether the data is negatively correlated, uncorrelated or positively correlated.
• The correlation coefficient always satisfies $-1 \le r \le 1$, and its distance from zero indicates the strength of the correlation. If $r=1$ or $r=-1$ then the data is perfectly linear. Note that a perfectly linear relationship includes lines with slopes other than 1.

Consider the examples below to see what different correlation coefficients will look like in data:

In PreCalculus you will not learn how to calculate the correlation coefficient (you will if you take future statistics courses!). For now, the calculator will calculate it for you and your job will be to interpret the result. See Example C.
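For readers curious about what the calculator is doing, here is a minimal sketch of the usual sample correlation formula in plain Python. The function name `pearson_r` and the two small data sets are illustrative assumptions, not part of the lesson. It shows that points lying exactly on a line give $r=1$ or $r=-1$ regardless of how steep the line is.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample correlation coefficient r for paired lists xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: both lists lie exactly on a line.
xs = [1, 2, 3, 4, 5]
steep_up = [3 * x + 2 for x in xs]        # slope 3
gentle_down = [10 - 0.5 * x for x in xs]  # slope -0.5

print(round(pearson_r(xs, steep_up), 4))     # 1.0
print(round(pearson_r(xs, gentle_down), 4))  # -1.0
```

The sign of the result follows the direction of the trend, and the size of the slope has no effect on it.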

If the data is sufficiently linear, then your calculator can perform a regression to produce the equation of a line that attempts to model the trend of the data. The regression line may actually pass through all, some or none of the data points. This regression line is represented in statistics by:

$\hat {y}=a+bx$

The symbol $\hat{y}$ is pronounced “$y$-hat” and is the predicted $y$ value based on a given $x$ value. Occasionally, you may also calculate the predicted $x$ value given a $y$ value; however, this is less mathematically sound. Also notice that the linear regression model is simply a rearrangement of the standard equation of a line, $y=mx+b$: in $\hat{y}=a+bx$ the intercept is $a$ and the slope is $b$, so here $b$ plays the role that $m$ plays in the standard form.
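A minimal sketch of how a regression line of the form $\hat{y}=a+bx$ can be computed, assuming the usual least-squares criterion (the lesson itself does not say which method the calculator uses). The function name `fit_line` and the three data points are made up for illustration.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y_hat = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx              # slope (the m of y = mx + b)
    a = mean_y - b * mean_x    # intercept (the b of y = mx + b)
    return a, b

# Hypothetical data chosen so the arithmetic is easy to follow.
xs = [1, 2, 3]
ys = [2, 2, 5]

a, b = fit_line(xs, ys)
print(a, b)  # 0.0 1.5

predictions = [a + b * x for x in xs]
print(predictions)  # [1.5, 3.0, 4.5]
```

In this small example the fitted line passes through none of the three points, yet it still follows their overall trend.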

Example A

Estimate the correlation coefficient for the following scatterplots.

Solution:

1. $r \approx 0$ . Because the height $(y)$ does not seem to be dependent on $x$, the data is uncorrelated. Another way to see this is that the slope appears to be undefined.
2. $r \approx -0.7$ . If the solo point in the bottom left is an outlier, you could choose to not include it in the data. Then, the $r$ value would be closer to $-1$.
3. $r \approx +0.8$ . The clump of data seems to be slightly positively correlated and the single point in the upper left has a strong effect indicating positive slope.
4. $r \approx -0.8$ . The data seems to be fairly strongly negatively correlated.
5. $r \approx 1$ . The data seems to be perfectly linearly correlated.

Example B

Estimate the regression line through the following scatterplots.

Solution: Visualize and sketch the “line of best fit” for each set of points.

Note that in part a, the regression line does not touch any point. Instead, it captures the general trend of the data. In part c, the correlation is not high enough in any direction to produce a regression line. The calculator may give a regression line for scatterplots that look like part c, but you need to be very skeptical that there is actually a relationship between the two variables.

Example C

Use your calculator to perform a linear regression on the following data. Then, predict the height of someone who has shoe size 9.

| Shoe Size | Height (in) |
|---|---|
| 11 | 70 |
| 8.5 | 70 |
| 10 | 72 |
| 8 | 65 |
| 7 | 64 |

Solution: First enter the data.

Next perform the regression. Notice that the calculator offers two linear regression commands that are essentially the same. To keep consistent with $\hat{y}=a+bx$, use $\text{LinReg}(a+bx)$. This is option 8 in the [STAT], [CALC] menu.

Now you need to tell the calculator to perform the regression on the two lists you want and where to copy the equation. The syntax is:

• $\text{LinReg}(a+bx) L_1, L_2, Y_1$

Note: to find $Y_1$, go to [VARS], [Y-VARS], [FUNCTION], [Y1].

Notice that the $r$ value is about 0.8. This indicates that there is a fairly strong positive correlation between shoe size and height. If your calculator does not display the $r$ and $r^2$ lines, then you need to go into the catalog and run the command “DiagnosticOn”. This will enable the display of the correlation coefficient.

You can then graph the scatterplot and the regression line:

The regression equation is:

$\hat{y}=52.4069+1.7745 x$

Here $x$ represents shoe size and $\hat{y}$ represents predicted height in inches. The predicted height for someone with a size 9 shoe is 68.3774 inches, or about 68.4 inches:

$\hat{y}=52.4069+1.7745 \cdot 9=68.3774$
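The calculator results can be cross-checked with a short sketch in plain Python, assuming the regression is the usual least-squares fit. The variable names below are illustrative and not calculator syntax.

```python
from math import sqrt

shoe_size = [11, 8.5, 10, 8, 7]
height_in = [70, 70, 72, 65, 64]

n = len(shoe_size)
mean_x = sum(shoe_size) / n
mean_y = sum(height_in) / n

# Sums of squared deviations and cross-products.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(shoe_size, height_in))
sxx = sum((x - mean_x) ** 2 for x in shoe_size)
syy = sum((y - mean_y) ** 2 for y in height_in)

b = sxy / sxx              # slope
a = mean_y - b * mean_x    # intercept
r = sxy / sqrt(sxx * syy)  # correlation coefficient

predicted = a + b * 9      # predicted height for shoe size 9

print(f"y_hat = {a:.4f} + {b:.4f} x")  # y_hat = 52.4069 + 1.7745 x
print(f"r = {r:.4f}")                  # r = 0.8113
print(f"{predicted:.2f}")              # 68.38
```

The prediction differs from the hand calculation only in the last decimal places, because the hand calculation uses coefficients already rounded to four places.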

An easy way to use the power of the calculator is to use function notation from the home screen, for example by evaluating $Y_1(9)$:

Concept Problem Revisited

Enter the data onto lists in your calculator:

Turn the [STAT PLOT] on that compares the two lists of data:

You should note that the data is extremely linear with a positive correlation coefficient:
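As a check away from the calculator, the following sketch computes $r$ for the doctors and cancer rate lists in plain Python. The function name `correlation` and the variable names are illustrative assumptions.

```python
from math import sqrt

doctors = [27, 30, 36, 60, 81, 90, 156, 221, 347]
cancer_rate = [0.02, 0.07, 0.16, 0.2, 0.43, 0.87, 1.21, 2.8, 3.91]

def correlation(xs, ys):
    """Sample correlation coefficient of two paired lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

r = correlation(doctors, cancer_rate)
print(round(r, 2))  # positive and close to 1
```

A value this close to 1 describes how tightly the points follow a line. It says nothing about why the two variables move together.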

A naïve conclusion would be to say that doctors cause cancer. One of the most misunderstood concepts in statistics is that correlation does not imply causation. Just because there is a correlation between the number of doctors and the cancer rate doesn’t mean that the number of doctors causes the cancer. There are dozens of reasons why more doctors might correlate with higher cancer rates. In general, remember that correlation is not the same as causation. Be careful before making any conclusions about change in one variable causing change in another variable.

#### Vocabulary

A scatterplot creates an  $(x, y)$ point from each data pair.

Bivariate data is two sets of data that are paired.

The correlation coefficient, $r$, is a number in the interval $[-1, 1]$. Its sign indicates the direction of the correlation between two variables and its distance from zero indicates the strength.

#### Guided Practice

1. The data below represents the average number of working words in an elementary student’s vocabulary as it relates to their shoe size. Perform a linear regression that models the data.

| Shoe Size | 1 | 1.5 | 2 | 2.5 | 3 | 3.5 | 4 | 4.5 |
|---|---|---|---|---|---|---|---|---|
| Vocabulary | 1135 | 1983 | 2501 | 4113 | 5431 | 7891 | 9320 | 11041 |

2. Use the equation from Guided Practice 1 to predict the vocabulary for someone who has a 1.0 shoe size. Does this prediction seem reasonable given the data? Why or why not?

3. Shaquille O’Neal has size 23 shoes. What, if anything, can you infer about his vocabulary? Does a larger shoe size cause a larger vocabulary?

Answers:

1. Let $x$ represent shoe size and $y$ represent vocabulary.

$\hat{y}=-2660.4167+2940.8333 x$

$r=0.9865$

The correlation coefficient is very close to positive one. This is a strong indication that the data can be modeled by a linear relationship.

2. $\hat{y}=-2660.4167+2940.8333 \cdot 1$

$\hat{y}=280.4167$

This number seems remarkably low considering the data. This point is very close to the $x$-intercept, which can be found using algebra:

$0=-2660.4167+2940.8333 x$

$0.9046=x$

The interpretation of the point (0.9046, 0) from the model is that when a person has a shoe size of just under 1.0, then their predicted vocabulary is zero. Shoe sizes below 0.9046 will have a negative vocabulary. Is this reasonable? It certainly does not make sense that someone could have a negative number of words in their vocabulary. Newborn babies are born without knowing any words and this number stays flat at 0 for some length of time. Therefore, this model is not accurate for very low shoe sizes.

3. Shaquille’s shoe size is significantly beyond the scope of the data that the model is based on. The data relates to elementary school students and a size 23 shoe is beyond the relevant domain. This means it wouldn’t make sense to use this model to predict Shaquille’s vocabulary. Shoe size does not cause vocabulary, but the two variables are strongly correlated because over time both tend to grow.
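The reasoning in these answers can be sketched in plain Python, assuming a least-squares fit. The helper names `predict` and `in_data_range` are hypothetical and only illustrate the idea of checking the relevant domain before trusting a prediction.

```python
shoe = [1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5]
vocab = [1135, 1983, 2501, 4113, 5431, 7891, 9320, 11041]

n = len(shoe)
mean_x = sum(shoe) / n
mean_y = sum(vocab) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(shoe, vocab))
sxx = sum((x - mean_x) ** 2 for x in shoe)

b = sxy / sxx              # slope
a = mean_y - b * mean_x    # intercept

def predict(x):
    """Predicted vocabulary for shoe size x."""
    return a + b * x

def in_data_range(x):
    """True when x lies within the shoe sizes the model was fitted on."""
    return min(shoe) <= x <= max(shoe)

x_intercept = -a / b       # shoe size where the predicted vocabulary is zero

print(round(predict(1), 1))    # 280.4
print(round(x_intercept, 4))   # 0.9046
print(in_data_range(23))       # False
```

A size 23 shoe fails the range check, which matches the conclusion that the model should not be used for it.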

#### Practice

For each correlation coefficient, describe what it means for data to have that correlation coefficient and sketch a scatterplot with that correlation coefficient.

1. $r=1$

2. $r=-0.5$

3. $r=-1$

4. $r=0$

5. $r=0.8$

The data below shows the SAT math score and GPA for 7 different students.

| SAT math score | 595 | 520 | 715 | 405 | 680 | 490 | 565 |
|---|---|---|---|---|---|---|---|
| GPA | 3.4 | 3.2 | 3.9 | 2.3 | 3.9 | 2.5 | 3.5 |

6. Use your calculator to perform a linear regression that models the data. What is the regression equation? What is the correlation coefficient?

7. Use the equation from #6 to predict the GPA for a student with an SAT score of 500. Does this prediction seem reasonable given the data? Why or why not?

8. What is the relevant domain of this data?

9. Does a high SAT math score cause a high GPA?

The data below shows scores from two different quizzes for 10 different students.

| Quiz 1 Score | 15 | 12 | 10 | 14 | 10 | 8 | 6 | 15 | 16 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|
| Quiz 2 Score | 20 | 15 | 12 | 18 | 10 | 13 | 12 | 10 | 18 | 15 |

10. Use your calculator to perform a linear regression that models the data. What is the regression equation? What is the correlation coefficient?

11. Use the equation from #10 to predict the Quiz 2 score for a student with a Quiz 1 score of 19. Does this prediction seem reasonable given the data? Why or why not?

13. Explain in your own words the difference between causation and correlation.

14. Explain in your own words what the correlation coefficient measures.

15. Explain why a larger sample size will cause a more accurate correlation coefficient.

### Vocabulary

bivariate data

Bivariate data consists of two paired sets of data.

correlation coefficient

The correlation coefficient is a standard quantitative measure of best fit of a line. It has the symbol $r$ and has values from -1 to +1.

deterministic

A deterministic relationship indicates that the value of one variable can be reliably and accurately determined by the manipulation of the other variable.

explanatory variables

Explanatory variables are another name for independent variables.

linear correlation

Linear correlation is a measure of the strength of the linear relationship between two random variables.

linear correlation coefficient

A linear correlation coefficient or $r$-value of a relationship between two variables describes the strength of the linear relationship.

response variables

Response variables are another name for dependent variables.

scatter plot

A scatter plot is a plot of the dependent variable versus the independent variable and is used to investigate whether or not there is a relationship or connection between 2 sets of data.

Scatterplot

A scatterplot is a type of visual display that shows pairs of data for two different variables.

Slope

Slope is a measure of the steepness of a line. A line can have positive, negative, zero (horizontal), or undefined (vertical) slope. The slope of a line can be found by calculating “rise over run” or “the change in the $y$ over the change in the $x$.” The symbol for slope is $m$.

Slope-Intercept Form

The slope-intercept form of a line is $y = mx + b,$ where $m$ is the slope and $b$ is the $y$-intercept.