Statistics is largely concerned with how a change in one variable relates to changes in a second variable. Bivariate data is two lists of data that are paired up. Is there any relationship between the following data? If there is, does it mean that doctors cause cancer?
Number of Doctors | 27 | 30 | 36 | 60 | 81 | 90 | 156 | 221 | 347 |
Cancer Rate | 0.02 | 0.07 | 0.16 | 0.20 | 0.43 | 0.87 | 1.21 | 2.80 | 3.91 |
Watch This
http://www.youtube.com/watch?v=ROpbdO-gRUo Khan Academy: Correlation vs. Causality
Guidance
A scatterplot creates an $(x, y)$ point from each data pair. When making a scatterplot, you can try to assign the independent variable to $x$ and the dependent variable to $y$; however, it will often not be obvious which variable is the dependent variable, so you will just have to pick one.
Once you plot the data and zoom appropriately you will see the points scattered about. Sometimes there will be a clear linear relationship and sometimes it will appear random. The correlation coefficient, $r$, is a number that quantifies two aspects of the relationship between the data:
- The correlation coefficient is either negative, zero or positive. This tells you whether the data is negatively correlated, uncorrelated or positively correlated.
- The correlation coefficient is a number between $-1$ and $1$ indicating the strength of correlation. If $r = 1$ or $r = -1$ then the data is perfectly linear. Note that a perfectly linear relationship includes lines with slopes other than 1.
Consider the examples below to see what different correlation coefficients will look like in data:
In PreCalculus you will not learn how to calculate the correlation coefficient (you will if you take future statistics courses!). For now, the calculator will calculate it for you and your job will be to interpret the result. See Example C.
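For readers curious about what the calculator does behind the scenes, here is a minimal sketch in Python, assuming the calculator reports the standard Pearson correlation coefficient. The function name `correlation` and the small data sets are illustrative and not part of the lesson. It shows that exact lines give $r = 1$ or $r = -1$ regardless of their slope, while data with no linear trend gives $r$ near 0.

```python
import math

def correlation(xs, ys):
    """Pearson correlation coefficient r for paired data (illustrative helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5]
steep_up = [3 * x + 2 for x in xs]        # exact line with slope 3
gentle_down = [10 - 0.5 * x for x in xs]  # exact line with slope -0.5
zigzag = [2, 1, 3, 1, 2]                  # made-up values with no linear trend

r_up = correlation(xs, steep_up)
r_down = correlation(xs, gentle_down)
r_zigzag = correlation(xs, zigzag)
print(r_up, r_down, r_zigzag)
```

The sign of each result matches the direction of the trend, and the size matches how tightly the points follow a line, not how steep that line is.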
If the data is sufficiently linear, then your calculator can perform a regression to produce the equation of a line that attempts to model the trend of the data. The regression line may actually pass through all, some or none of the data points. This regression line is represented in statistics by:
$$\hat{y} = a + bx$$
The symbol $\hat{y}$ is pronounced “$y$-hat” and is the predicted $y$ value based on a given $x$ value. Occasionally, you may also calculate the predicted $x$ value given a $y$ value; however, this is less mathematically sound. Also notice that the linear regression model is simply a rearrangement of the standard equation of a line, $y = mx + b$. In the regression form, $a$ is the $y$-intercept and $b$ is the slope, so $b$ plays a different role than it does in $y = mx + b$.
Example A
Estimate the correlation coefficient for the following scatterplots.
Solution:
- $r \approx 0$. Because the height does not seem to be dependent on the $x$ value, the data is uncorrelated. Another way to see this is that the slope appears to be undefined.
- The data appears to be negatively correlated. If the solo point in the bottom left is an outlier, you could choose to not include it in the data. Then, the $r$ value would be closer to $-1$.
- The clump of data seems to be slightly positively correlated and the single point in the upper left has a strong effect indicating positive slope, so $r$ is positive.
- The data seems to be fairly strongly negatively correlated, so $r$ is negative and fairly close to $-1$.
- The data seems to be perfectly linearly correlated, so $r$ is $1$ or $-1$ depending on the direction of the line.
Example B
Estimate the regression line through the following scatterplots.
Solution: Visualize and sketch the “line of best fit” for each set of points.
Note that in part a, the regression line does not touch any point. Instead, it captures the general trend of the data. In part c, the correlation is not high enough in any direction to produce a regression line. The calculator may give a regression line for scatterplots that look like part c, but you need to be very skeptical that there is actually a relationship between the two variables.
Example C
Use your calculator to perform a linear regression on the following data. Then, predict the height of someone who has shoe size 9.
Shoe Size | Height (in) |
11 | 70 |
8.5 | 70 |
10 | 72 |
8 | 65 |
7 | 64 |
Solution: First enter the data.
Next perform the regression. Notice that the calculator can perform linear regression in two ways that are essentially the same, LinReg(ax+b) and LinReg(a+bx). To keep consistent with the form $\hat{y} = a + bx$, use the LinReg(a+bx) linear regression. This is option 8 in the [STAT], [CALC] menu.
Now you need to tell the calculator to perform the regression on the two lists you want and where to copy the equation. The syntax is:
Note: to find Y1, go to [VARS], [Y-VARS], [FUNCTION], [Y1].
Notice that the $r$ value is about 0.8. This indicates that there is a fairly strong positive correlation between shoe size and height. If your calculator does not display the $r^2$ and $r$ lines, then you need to go into the catalog and run the command “DiagnosticOn”. This will enable the display of the correlation coefficient.
You can then graph the scatterplot and the regression line:
The regression equation is:
$$\hat{y} = 52.4069 + 1.7745x$$
where $x$ represents shoe size and $\hat{y}$ represents predicted height. The predicted height for someone with a size 9 shoe is 68.3774 inches:
$$\hat{y} = 52.4069 + 1.7745(9) = 68.3774$$
An easy way to use the power of the calculator is to use function notation from the home screen:
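The same regression can be checked away from the calculator. The following is a minimal sketch in Python, assuming the calculator fits an ordinary least-squares line of the form $\hat{y} = a + bx$. The function name `linear_regression` is illustrative and not a calculator command.

```python
import math

shoe_sizes = [11, 8.5, 10, 8, 7]
heights = [70, 70, 72, 65, 64]  # inches

def linear_regression(xs, ys):
    """Return (a, b, r) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    b = sxy / sxx              # slope
    a = mean_y - b * mean_x    # y-intercept
    r = sxy / math.sqrt(sxx * syy)
    return a, b, r

a, b, r = linear_regression(shoe_sizes, heights)
predicted = a + b * 9  # predicted height for shoe size 9

print(f"y-hat = {a:.4f} + {b:.4f}x, r = {r:.4f}")
print(f"predicted height for size 9: {predicted:.4f}")
```

The prediction may differ from 68.3774 in the last decimal place because the code keeps full precision in $a$ and $b$ instead of rounding them to four decimal places first.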
Concept Problem Revisited
Enter the data onto lists in your calculator:
Turn the [STAT PLOT] on that compares the two lists of data:
You should note that the data is extremely linear with a positive correlation coefficient:
A naïve conclusion would be to say that doctors cause cancer. One of the most misunderstood concepts in statistics is that correlation does not imply causation. Just because there is a correlation between the number of doctors and the cancer rate doesn’t mean that the number of doctors causes the cancer. There are dozens of reasons why more doctors might correlate with higher cancer rates. In general, remember that correlation is not the same as causation. Be careful before making any conclusions about change in one variable causing change in another variable.
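The strength of the correlation in the doctor data can be confirmed with a short computation. This is a minimal sketch in Python, assuming the standard Pearson formula for $r$. The variable names are illustrative.

```python
import math

doctors = [27, 30, 36, 60, 81, 90, 156, 221, 347]
cancer_rate = [0.02, 0.07, 0.16, 0.20, 0.43, 0.87, 1.21, 2.80, 3.91]

n = len(doctors)
mean_x = sum(doctors) / n
mean_y = sum(cancer_rate) / n

# Sums of products of deviations from the means
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(doctors, cancer_rate))
sxx = sum((x - mean_x) ** 2 for x in doctors)
syy = sum((y - mean_y) ** 2 for y in cancer_rate)

r = sxy / math.sqrt(sxx * syy)
print(f"r = {r:.3f}")
```

A value of $r$ this close to 1 describes only how closely the points follow a line. Nothing in the computation uses any information about why the two variables move together, which is why it cannot establish causation.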
Vocabulary
A scatterplot creates an $(x, y)$ point from each data pair.
Bivariate data is two sets of data that are paired.
The correlation coefficient, $r$, is a number in the interval [-1, 1]. It indicates the strength of the correlation between two variables.
Guided Practice
1. The data below represents the average number of working words in an elementary student’s vocabulary as it relates to their shoe size. Perform a linear regression that models the data.
Shoe Size | 1.0 | 1.5 | 2.0 | 2.5 | 3.0 | 3.5 | 4.0 | 4.5 |
Vocabulary | 1135 | 1983 | 2501 | 4113 | 5431 | 7891 | 9320 | 11041 |
2. Use the equation from Guided Practice 1 to predict the vocabulary for someone who has a 1.0 shoe size. Does this prediction seem reasonable given the data? Why or why not?
3. Shaquille O’Neal has size 23 shoes. What, if anything, can you infer about his vocabulary? Does a larger shoe size cause a larger vocabulary?
Answers:
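1. Let $x$ represent shoe size and $y$ represent vocabulary. The regression equation is approximately:
$$\hat{y} = -2660.4167 + 2940.8333x$$
The correlation coefficient is very close to positive one. This is a strong indication that the data can be modeled by a linear relationship.
2. $\hat{y} = -2660.4167 + 2940.8333(1.0) \approx 280.4$
This number seems remarkably low considering the data, since the observed vocabulary at shoe size 1.0 is 1135. This point is very close to the $x$-intercept, which can be found using algebra:
$$0 = -2660.4167 + 2940.8333x \quad \Rightarrow \quad x \approx 0.9046$$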
The interpretation of the point (0.9046, 0) from the model is that when a person has a shoe size of just under 1.0, then their predicted vocabulary is zero. Shoe sizes below 0.9046 will have a negative vocabulary. Is this reasonable? It certainly does not make sense that someone could have a negative number of words in their vocabulary. Newborn babies are born without knowing any words and this number stays flat at 0 for some length of time. Therefore, this model is not accurate for very low shoe sizes.
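The numbers in answers 1 and 2 can be reproduced with a minimal sketch in Python, assuming an ordinary least-squares fit of the form $\hat{y} = a + bx$. The variable names are illustrative.

```python
shoe_sizes = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
vocabulary = [1135, 1983, 2501, 4113, 5431, 7891, 9320, 11041]

n = len(shoe_sizes)
mean_x = sum(shoe_sizes) / n
mean_y = sum(vocabulary) / n

sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(shoe_sizes, vocabulary))
sxx = sum((x - mean_x) ** 2 for x in shoe_sizes)

b = sxy / sxx            # slope: predicted words gained per shoe size
a = mean_y - b * mean_x  # y-intercept

predicted_at_1 = a + b * 1.0  # model's prediction at shoe size 1.0
x_intercept = -a / b          # shoe size where the predicted vocabulary is 0

print(f"y-hat = {a:.4f} + {b:.4f}x")
print(f"prediction at 1.0: {predicted_at_1:.1f} (observed: {vocabulary[0]})")
print(f"x-intercept: {x_intercept:.4f}")
```

Comparing the prediction at shoe size 1.0 with the observed value of 1135 shows how far the line falls below the data at the low end of the domain.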
3. Shaquille’s shoe size is significantly beyond the scope of the data that the model is based on. The data relates to elementary school students and a size 23 shoe is beyond the relevant domain. This means it wouldn’t make sense to use this model to predict Shaquille’s vocabulary. Shoe size does not cause vocabulary, but the two variables are strongly correlated because over time both tend to grow.
Practice
For each correlation coefficient, describe what it means for data to have that correlation coefficient and sketch a scatterplot with that correlation coefficient.
1.
2.
3.
4.
5.
The data below shows the SAT math score and GPA for 7 different students.
SAT math score | 595 | 520 | 715 | 405 | 680 | 490 | 565 |
GPA | 3.4 | 3.2 | 3.9 | 2.3 | 3.9 | 2.5 | 3.5 |
6. Use your calculator to perform a linear regression that models the data. What is the regression equation? What is the correlation coefficient?
7. Use the equation from #6 to predict the GPA for a student with an SAT score of 500. Does this prediction seem reasonable given the data? Why or why not?
8. What is the relevant domain of this data?
9. Does a high SAT math score cause a high GPA?
The data below shows scores from two different quizzes for 10 different students.
Quiz 1 Score | 15 | 12 | 10 | 14 | 10 | 8 | 6 | 15 | 16 | 13 |
Quiz 2 Score | 20 | 15 | 12 | 18 | 10 | 13 | 12 | 10 | 18 | 15 |
10. Use your calculator to perform a linear regression that models the data. What is the regression equation? What is the correlation coefficient?
11. Use the equation from #10 to predict the Quiz 2 score for a student with a Quiz 1 score of 19. Does this prediction seem reasonable given the data? Why or why not?
12. What conclusions can you make about this data?
13. Explain in your own words the difference between causation and correlation.
14. Explain in your own words what the correlation coefficient measures.
15. Explain why a larger sample size will cause a more accurate correlation coefficient.