- Understand the properties of the linear correlation coefficient
- Estimate and interpret linear correlation coefficients
- Understand the difference between correlation and causation
- Identify possible lurking variables in bivariate data
- Understand the effects outliers and influential points can have on correlation
The Correlation Coefficient
The correlation coefficient is a statistic that measures the strength and direction of a linear relationship between two numeric variables. The symbol for correlation is r, and r can take any value from -1.0 to +1.0. The correlation coefficient (r) tells us two things about the linear relationship between the two variables: its strength and its direction. The direction of the relationship, positive or negative, is given by the sign of the r value. A positive value for r indicates that the relationship is positive (increasing to the right), and a negative r value indicates a negative relationship between the two variables (decreasing to the right). Bivariate data with a positive correlation tells us that as the explanatory variable increases, the response variable tends to increase as well. Bivariate data with a negative correlation tells us that as the explanatory variable increases, the response variable tends to decrease. A correlation of zero indicates neither of these trends.
The second thing that the correlation coefficient tells us is the strength of the linear relationship: how close the points are to forming a perfect line. An r of exactly 1 or -1 indicates a perfect correlation, meaning the points fall exactly on a line. An r value of exactly +1 means that the relationship forms a perfect line with a positive slope, and an r value of exactly -1 means that the scatterplot will show a perfect line with a negative slope. The closer the correlation value is to either +1 or -1, the stronger the linear relationship is. As r gets closer to zero (from either the positive or the negative side), the linear relationship gets weaker. It is important to note that r measures only the linear relationship between the two variables. If the relationship shows a clear curved pattern, for example, the correlation may badly understate the strength of the relationship.
Here are some sample scatterplots with their correlation coefficients given:
We will use either a calculator or a computer to calculate the correlation coefficient, because the formula is quite tedious to work through by hand. It involves calculating the mean and standard deviation of all of the x-values and the mean and standard deviation of all of the y-values. It then compares the x-value of each ordered pair to the mean of x and each y-value to the mean of y (by subtracting and then dividing by the standard deviation), multiplies each pair of these newly calculated values, adds all of the products, and divides by one less than the sample size. The correlation formula is shown below, but we will be using technology rather than calculating by hand. See the appendix for calculator instructions.
$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right)$$
Here n is the sample size, $\bar{x}$ and $\bar{y}$ are the means of the x-values and y-values, and $s_x$ and $s_y$ are their standard deviations.
Estimate the correlation coefficient for each of the following scatterplots.
Nevada: The correlation will be negative and fairly strong, so my estimate is r ≈ -0.85.
Height & IQ: There seems to be no pattern to the graph, so my estimate is r ≈ 0.
Properties of Correlation
When considering correlation as a measure of the strength of the relationship between two variables, you should construct and examine a scatterplot first. It is important to check for outliers, be sure that the relationship appears to be linear, be sure that your sample size is sufficient, and consider whether the individuals being examined were too much alike in some way to begin with. Thus, when examining correlation, there are four things that could affect our results: outliers, linearity, size of the sample, and homogeneity of the group.
An outlier, or a data point that lies outside of our overall pattern, can have a great effect on correlation. How great an effect is determined by the sample size of the data and by how far the outlier lies outside of the pattern. The three plots below show scatterplots with their correlation coefficients (r). The first plot shows a positive and reasonably linear graph. Its correlation is r = .897, which is positive and fairly strong. The second plot shows the same data as plot one, with one outlier (upper left) added. Its correlation has dropped to r = .374, which is still positive, but much weaker. This demonstrates how outliers can bring the correlation closer to zero. However, some outliers can actually strengthen the correlation. This is demonstrated in the third plot, which shows the same data as the first with one outlier (upper right) added. With this outlier, the linear relationship becomes even stronger than in the first plot, at r = .973.
If the relationship is not linear, calculating the correlation coefficient is meaningless. It is only testing the linear relationship between the two variables. Imagine a scatterplot that shows a perfect parabolic relationship. We would know that there is a strong relationship between these two variables, but if we calculated the correlation coefficient, we would arrive at a figure around zero. Therefore, the correlation coefficient is not always the best statistic to use to understand the relationship between variables.
As we discussed in experimental design, a small sample size can be misleading. A small sample can appear to show a stronger or a weaker relationship than actually exists. The larger the sample, the more accurately the correlation coefficient reflects the linear relationship between the two variables.
When a group is too much alike in regard to some characteristics (homogeneous), the range of scores on either or both variables is restricted. For example, suppose we are interested in finding out the correlation between IQ and salary. If only members of Mensa (a society for people who score in the top 2% on an IQ test) are sampled, we will most likely find a very low correlation between IQ and salary, since all members will have a consistently high IQ but their salaries will vary. This does not mean that there is no relationship. It simply means that the restriction of the sample limited the magnitude of the correlation coefficient.
Correlation is just a number; it has no units. Also, a change in units of measurement will not affect the correlation. For example, suppose that you had measured several people's heights to the nearest inch and weights to the nearest pound and calculated the correlation coefficient. If you were to then convert the heights to centimeters or the weights to kilograms, or both, and then calculate the correlation again, it would be the same value.
It is very important to know that a high correlation does not mean causation! Oftentimes, studies showing a high correlation between two variables will lead readers to think that one variable is causing the other. This is not always true! A high correlation simply does not prove that one variable is causing the other. In some situations we would agree that one variable is in fact causing the response in another. The best way to prove such a direct cause-and-effect relationship is by carrying out a well-designed experiment. For example, smoking is strongly correlated with lung disease, and, based on much scientific evidence, we can now say that cigarette smoking causes lung disease. However, this topic was highly debated for many years before the Surgeon General announced that it was accepted that cigarette smoking causes lung cancer and emphysema. Many people refused to accept this for many years. People who stood to lose money if smoking was proven to be unsafe suggested every other possible explanation that they could think of. They suggested that it was simply a coincidence, or that all people who choose to smoke might have something else in common that was actually the cause of the lung disease, not the cigarettes. Because it was not ethical to experiment on humans in order to prove the direct cause-and-effect relationship, the debates went on for a long time.
Sometimes the relationship between variables is a cause-and-effect one, but many times it can be simply a coincidence that the two variables are highly correlated. It is also possible that some other outside factor, a lurking variable, is causing both variables to change. A situation where we have two variables that are both being affected by some other, outside, lurking variable is called common response. For example, we can show a high correlation between the number of TVs per household and the life expectancy per person among many countries. However, it makes no sense that TVs cause people to live longer. Some lurking variable is having an effect here. It is likely that the economic status of the countries is causing both variables to change: more money means more TVs and more money means better health care. If a country is wealthy, it is much more likely to have citizens who own TVs. Also, if a country is wealthy, it is much more likely to have good hospitals, roads, health education, and access to clean water and food, all things that contribute to longer life.
In some situations we will have two variables that are highly correlated, but we are unsure of the exact cause of the relationship. We may be unclear as to whether one is causing the other, whether there is a lurking variable causing a common response, or whether there is some unknown lurking variable that is related in some other unknown way (lurking variables are not always obvious to the researchers). Such a situation is called confounding, because it is confusing to determine how the variables are related (if at all), whether there may be some lurking variable, and whether it is related to the variables in question. The variables seem all mixed up and the relationship is unclear, even if highly correlated. An example of confounding is global warming. This is a highly debated topic on social media and blogs. Some people argue that human pollution is a major cause of the increase in CO2 and other greenhouse gases in the atmosphere, while others argue that it is part of a natural cycle that has occurred throughout Earth's history. Still others may think both explanations are at work. This is an example of confounding because there is confusion about the cause of global warming.
And don't forget that some relationships are occurring completely by chance, and their high correlation is then just a coincidence. For example, if you researched divorce rates and gas prices over the past 50 years you may note that both have gone up. A scatterplot comparing divorce rates and gas prices would show a strong positive relationship. The correlation would likely be a high, positive value. However, it makes no sense that divorce rates are causing high gas prices. It also is unlikely that there exists a common response or some form of confounding. So in this case, we would say that this is a relationship that is best explained by sheer coincidence.
Suggest possible lurking variables to explain the high correlations between the following variables. Explain your reasoning. Consider whether common response, confounding, or coincidence may be involved.
a) It has been shown that cities with more police officers also have higher numbers of violent crimes. Does this mean that more police officers are causing more violent crimes to occur?
b) Over the past 25 years, the percent of parents using car-seats has increased significantly. During this same time period, the rate of DUI arrests has also increased significantly. These two variables, when graphed, show a very high, positive correlation. Does this mean that car-seat use is causing DUIs to increase?
c) A study published in USA Today claimed that, "Teens who text a lot [are] more likely to try sex, drugs, alcohol." Does this mean that texting causes teens to try sex, drugs and alcohol? Could we then limit teen behaviors such as these by canceling their texting plans?
a) It makes no sense that the number of police officers would be causing the violent crime to occur. It is much more likely that it is the reverse: communities with high numbers of violent crimes need higher numbers of police officers. It is also probable that both variables increase in cities with higher populations. Because we can think of more than one possible lurking variable and it is difficult to know how all of these variables actually relate, we would say that this is an example of confounding (the variables in question and the lurking variables are all mixed up).
b) It is clearly ridiculous to think that car-seat use is causing an increase in the rate of DUIs. It also makes no sense that DUIs cause car-seats to be used. It may be simply a coincidence that these are both increasing. Or, perhaps there has been an increase in law enforcement for both over this time period. Awareness of the dangers of both has increased over the past 25 years, so maybe this is an example of common response. Or, maybe many factors contribute to the increase of both, so perhaps this is an example of confounding. But, no matter what, this is not cause-and-effect.
c) It is unlikely that texting is actually the cause of these behaviors. There are most likely one or more other, lurking variables that are the cause. One probable lurking variable, when it comes to teenagers, is the parents. Perhaps this is an example of a common response to parents who are not very involved in their teens' lives. Parents who are not very involved would not be aware that their teen is texting too much and would also not be aware of what choices their teen is making during his or her free time. Perhaps teens who spend a lot of time unsupervised would be more likely to text and would also be more likely to try sex, drugs, and alcohol. All of these behaviors might be a common response to not having parents who prohibit or limit teens from doing these things. Canceling texting plans would likely have little to no effect on other teen behaviors.
See the link for more information on this report at: http://www.usatoday.com/yourlife/sex-relationships/2010-11-10-texting-teens_N.htm
Calculating Correlation on the Internet
There are several websites where you can enter data points and find their correlation. One of them is below.
If this site no longer works, try googling "finding correlation applet" and see what you get for results.
For an explanation of the correlation coefficient, see kbower50, The Correlation Coefficient (3:59).
Another, more lighthearted example of Correlation ≠ Causation can be found at the following website, which discusses the evil of the pickle.
For a better understanding of correlation, try these fun links below:
http://www.istics.net/stat/Correlations Match the graph to its correlation.
http://www.rossmanchance.com/applets/guesscorrelation/GuessCorrelation.html Guess the correlation.
Problem Set 6.2
Section 6.2 Exercises
1) What are the two things that the correlation coefficient measures?
2) The program used to create this scatterplot found the line-of-best-fit and reported the r-squared value as r² = 0.805 for the relationship between arm-span and height for several individuals. What is the correlation coefficient? Is it positive or negative? Explain how you know.
3) During the summer Ms. Statsteacher lets her two daughters stay up later than during the school year. Their bedtimes during the summer range from 8:30 p.m. to 12:30 a.m. She has discovered that her older daughter Reily will wake up between 8:00 and 9:00 a.m. no matter what time she goes to bed. However, her younger daughter Neila tends to wake up later after she gets to stay up later, and earlier when she goes to bed earlier. Neila has been known to wake up anytime between 8:00 and 11:45 a.m.
a) Sketch a separate (approximate) scatterplot for each daughter, that compares time going to sleep and time waking up. Which will be explanatory and which will be response?
b) Which of these do you think will best approximate the correlation for Reily?
A. close to r = +1
B. close to r = +.75
C. close to r = 0
D. close to r = -.75
E. close to r = -1
c) Which of these do you think will best approximate the correlation for Neila?
A. close to r = +1
B. close to r = +.75
C. close to r = 0
D. close to r = -.75
E. close to r = -1
4) Suggest possible lurking variables to explain the high correlations between the following variables. Explain your reasoning. Consider whether common response, confounding, or coincidence may be involved.
a) As ice cream sales increase, the rate of drowning deaths increases sharply. Does this mean that ice cream causes drowning?
b) With a decrease in the number of pirates, there has been an increase in global warming over the same time period. Does this mean global warming is caused by a lack of pirates?
c) The higher the number of fire-fighters fighting a fire, the more damage done by the fire. Does this mean that we can limit damage by sending fewer fire-fighters to fires?
d) Suppose that each of the hockey players on the high school team supplies his or her own hockey stick, with varying degrees of flex. The assistant coach has been keeping a record of the degree of flex for each player's stick and their respective point totals (goals and assists). He has noted that there is a strong, negative correlation between these two variables. In other words, the players with less flex in their sticks are scoring more points and those with more flex are scoring fewer points. Does this prove that the amount of flex in a stick determines the point totals for the players? Can we then give players less flexible sticks and expect to increase scoring?
5) In a recent study in Resource Manual, it was noted that divorced men were twice as likely to abuse alcohol as married men. The authors concluded that getting divorced caused alcohol abuse. Do you agree? Explain your reasoning.
6) A commercial for a new diet pill claims, "You will lose weight while you sleep! No exercise needed!" The commercial then shows several before-and-after photos of people who have lost weight. People who were obese are now very buff. It then gives the information for you to order the pills ("for three payments of just $19.95 each, plus shipping and handling"). Is this proof that these diet pills caused these people to lose weight? Suggest possible lurking variables. Explain your reasoning.
7) Match each graph with its correlation coefficient:
8) A correlation of r = 0 indicates no linear relationship between the two given variables. But, this does not mean that there is no relationship between the two variables. Sketch a scatterplot in which there is a strong relationship between the variables, but the correlation would be near r = 0.
9) Use the "Beach Visitors" scatterplot to answer the questions that follow.
a) Identify the explanatory and response variables.
b) Estimate the correlation coefficient for the graph.
c) Describe what the scatterplot shows. (remember S.C.O.F.D)
10) Zeke flips a coin 93 times and tails shows up 34 of those times. Based on these results, what is the experimental probability of getting tails?
11) If Stephanie's batting average is 0.258, how many hits would you expect her to get out of her next 20 times at bat?
12) You have been playing the game Yahtzee with some friends and you have been keeping track of how often someone gets a Yahtzee (all 5 dice showing the same number) when they roll all 5 dice at once. The results today have been 3 Yahtzees, each on a single roll, out of 79 trials. Based on these results, what is the experimental probability of getting a Yahtzee in one roll?
13) What is the theoretical probability of getting a Yahtzee in one roll?