9.1: Scatterplots and Linear Correlation
Learning Objectives
- Understand the concepts of bivariate data and correlation, and the use of scatterplots to display bivariate data.
- Understand when the terms 'positive', 'negative', 'strong', and 'perfect' apply to the correlation between two variables in a scatterplot graph.
- Calculate the linear correlation coefficient and coefficient of determination of bivariate data, using technology tools to assist in the calculations.
- Understand properties and common errors of correlation.
Introduction
So far we have learned how to describe distributions of a single variable and how to perform hypothesis tests concerning parameters of these distributions. But what if we notice that two variables seem to be related? We may notice that the values of two variables, such as verbal SAT score and GPA, behave in the same way and that students who have a high verbal SAT score also tend to have a high GPA (see table below). In this case, we would want to study the nature of the connection between the two variables.
Student | SAT Score | GPA |
---|---|---|
1 | 595 | 3.4 |
2 | 520 | 3.2 |
3 | 715 | 3.9 |
4 | 405 | 2.3 |
5 | 680 | 3.9 |
6 | 490 | 2.5 |
7 | 565 | 3.5 |
These types of studies are quite common, and we can use the concept of correlation to describe the relationship between the two variables.
Bivariate Data, Correlation Between Values, and the Use of Scatterplots
Correlation measures the relationship between bivariate data. Bivariate data are data sets in which each subject has two observations associated with it. In our example above, we notice that there are two observations (verbal SAT score and GPA) for each subject (in this case, a student). Can you think of other scenarios when we would use bivariate data?
If we carefully examine the data in the example above, we notice that those students with high SAT scores tend to have high GPAs, and those with low SAT scores tend to have low GPAs. In this case, there is a tendency for students to score similarly on both variables, and performance on the two variables appears to be related.
Scatterplots display these bivariate data sets and provide a visual representation of the relationship between variables. In a scatterplot, each point represents a paired measurement of two variables for a specific subject, and each subject is represented by one point on the scatterplot.
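As a rough illustration of how each subject becomes one point, the sketch below places the seven students from the table in the introduction on a character grid. The function name `text_scatterplot` and the grid size are assumptions made for this example, and the sketch assumes the values of each variable are not all identical. A graphing calculator or plotting tool would give a far better picture.

```python
# SAT score (x) and GPA (y) for the seven students in the introduction table.
sat = [595, 520, 715, 405, 680, 490, 565]
gpa = [3.4, 3.2, 3.9, 2.3, 3.9, 2.5, 3.5]

def text_scatterplot(xs, ys, width=40, height=10):
    """Return a rough character-grid scatterplot, one '*' per (x, y) pair."""
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        # Scale each value to a column or row index on the grid.
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"  # row 0 is the top of the plot
    return "\n".join("".join(line) for line in grid)

plot = text_scatterplot(sat, gpa)
print(plot)
```

The stars run from the lower left (student 4) to the upper right (students 3 and 5), which is the pattern described in the next section as a positive correlation.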
Correlation Patterns in Scatterplot Graphs
Examining a scatterplot graph allows us to obtain some idea about the relationship between two variables.
When the points on a scatterplot graph produce a lower-left-to-upper-right pattern (see below), we say that there is a positive correlation between the two variables. This pattern means that when the score of one observation is high, we expect the score of the other observation to be high as well, and vice versa.
When the points on a scatterplot graph produce an upper-left-to-lower-right pattern (see below), we say that there is a negative correlation between the two variables. This pattern means that when the score of one observation is high, we expect the score of the other observation to be low, and vice versa.
When all the points on a scatterplot lie on a straight line, you have what is called a perfect correlation between the two variables (see below).
A scatterplot in which the points do not have a linear trend (either positive or negative) is said to show a zero correlation or a near-zero correlation (see below).
When examining scatterplots, we want to look not only at the direction of the relationship (positive, negative, or zero), but also at the magnitude of the relationship. If we drew an imaginary oval around all of the points on the scatterplot, we would be able to see the extent, or the magnitude, of the relationship. If the points are close to one another and the width of the imaginary oval is small, this means that there is a strong correlation between the variables (see below).
However, if the points are far away from one another, and the imaginary oval is very wide, this means that there is a weak correlation between the variables (see below).
Correlation Coefficients
While examining scatterplots gives us some idea about the relationship between two variables, we use a statistic called the correlation coefficient to give us a more precise measurement of the relationship between the two variables. The correlation coefficient is an index that describes the relationship and can take on values between \begin{align*}-1.0\end{align*} and +1.0.
The absolute value of the coefficient indicates the magnitude, or the strength, of the relationship. The closer the absolute value of the coefficient is to 1, the stronger the relationship. For example, a correlation coefficient of 0.20 indicates that there is a weak linear relationship between the variables, while a coefficient of \begin{align*}-0.90\end{align*} indicates that there is a strong linear relationship.
The value of a perfect positive correlation is 1.0, while the value of a perfect negative correlation is \begin{align*}-1.0\end{align*}.
When there is no linear relationship between two variables, the correlation coefficient is 0. It is important to remember that a correlation coefficient of 0 indicates that there is no linear relationship, but there may still be a strong relationship between the two variables. For example, there could be a quadratic relationship between them.
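These three special cases can be checked numerically. The sketch below uses small made-up data sets (they are not from the lesson) and a helper named `pearson_r`, an assumption for this example, that applies the raw score formula presented later in this lesson.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient computed with the raw score formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Hypothetical x values, symmetric around zero.
x = [-3, -2, -1, 0, 1, 2, 3]
rising = [2 * v + 1 for v in x]    # every point lies on a rising line
falling = [-2 * v + 1 for v in x]  # every point lies on a falling line
curved = [v * v for v in x]        # every point lies on a parabola

r_rising = pearson_r(x, rising)
r_falling = pearson_r(x, falling)
r_curved = pearson_r(x, curved)
print(f"{r_rising:.2f} {r_falling:.2f} {r_curved:.2f}")
```

In the third data set, y is completely determined by x, yet the coefficient is zero because the relationship has no linear component.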
On the Web
http://tinyurl.com/ylcyh88 Match the graph to its correlation.
http://tinyurl.com/y8vcm5y Guess the correlation.
http://onlinestatbook.com/stat_sim/reg_by_eye/index.html Regression by eye.
The Pearson product-moment correlation coefficient is a statistic that is used to measure the strength and direction of a linear correlation. It is symbolized by the letter \begin{align*}r\end{align*}.
Pearson used standard scores (\begin{align*}z\end{align*}-scores) when determining the coefficient.
Therefore, the formula for this coefficient is as follows:
\begin{align*}r_{XY} = \frac{\sum z_X z_Y}{n-1}\end{align*}
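In other words, the coefficient is expressed as the sum of the cross products of the standard \begin{align*}z\end{align*}-scores divided by \begin{align*}n - 1\end{align*}, where \begin{align*}n\end{align*} is the number of paired observations.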
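The standard-score formula above can be sketched in a few lines, using the SAT and GPA data from the introduction. The function name `pearson_r_from_z` is an assumption for this example. The sketch also assumes that each z-score is computed with the sample standard deviation (the one that divides by n - 1), which is what makes dividing the sum of cross products by n - 1 give the correlation coefficient.

```python
import statistics

sat = [595, 520, 715, 405, 680, 490, 565]
gpa = [3.4, 3.2, 3.9, 2.3, 3.9, 2.5, 3.5]

def pearson_r_from_z(xs, ys):
    """Sum of the cross products of z-scores, divided by n - 1."""
    n = len(xs)
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    # statistics.stdev is the sample standard deviation (divides by n - 1).
    sd_x, sd_y = statistics.stdev(xs), statistics.stdev(ys)
    z_x = [(x - mean_x) / sd_x for x in xs]
    z_y = [(y - mean_y) / sd_y for y in ys]
    return sum(zx * zy for zx, zy in zip(z_x, z_y)) / (n - 1)

r = pearson_r_from_z(sat, gpa)
print(round(r, 2))
```

For these seven students the result rounds to 0.95.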
An equivalent formula that uses the raw scores rather than the standard scores is called the raw score formula and is written as follows:
\begin{align*}r_{XY}=\frac{n\sum xy-\sum x \sum y}{\sqrt{\left [ n\sum x^2- \left ( \sum x \right )^2 \right ]} \sqrt{\left [ n \sum y^2-\left ( \sum y \right )^2 \right ]}}\end{align*}
Again, this formula is most often used when calculating correlation coefficients from original data. Note that \begin{align*}n\end{align*} is used instead of \begin{align*}n - 1\end{align*}, because we are using actual data and not \begin{align*}z\end{align*}-scores. Let’s use our example from the introduction to demonstrate how to calculate the correlation coefficient using the raw score formula.
Example: What is the Pearson product-moment correlation coefficient for the two variables represented in the table below?
Student | SAT Score | GPA |
---|---|---|
1 | 595 | 3.4 |
2 | 520 | 3.2 |
3 | 715 | 3.9 |
4 | 405 | 2.3 |
5 | 680 | 3.9 |
6 | 490 | 2.5 |
7 | 565 | 3.5 |
In order to calculate the correlation coefficient, we need to calculate several pieces of information, including \begin{align*}xy, \ x^2\end{align*}, and \begin{align*}y^2\end{align*}. Therefore, the values of \begin{align*}xy, \ x^2\end{align*}, and \begin{align*}y^2\end{align*} have been added to the table.
Student | SAT Score \begin{align*}(X)\end{align*} | GPA \begin{align*}(Y)\end{align*} | \begin{align*}xy\end{align*} | \begin{align*}x^2\end{align*} | \begin{align*}y^2\end{align*} |
---|---|---|---|---|---|
1 | 595 | 3.4 | 2023 | 354025 | 11.56 |
2 | 520 | 3.2 | 1664 | 270400 | 10.24 |
3 | 715 | 3.9 | 2788.5 | 511225 | 15.21 |
4 | 405 | 2.3 | 931.5 | 164025 | 5.29 |
5 | 680 | 3.9 | 2652 | 462400 | 15.21 |
6 | 490 | 2.5 | 1225 | 240100 | 6.25 |
7 | 565 | 3.5 | 1977.5 | 319225 | 12.25 |
Sum | 3970 | 22.7 | 13261.5 | 2321400 | 76.01 |
Applying the formula to these data, we find the following:
\begin{align*}r_{XY} & = \frac{n\sum xy-\sum x \sum y}{\sqrt{\left [ n\sum x^2- \left ( \sum x \right )^2 \right ]} \sqrt{\left [ n \sum y^2-\left ( \sum y \right )^2 \right ]}} = \frac{(7)(13261.5)-(3970)(22.7)}{\sqrt{[(7)(2321400)-3970^2][(7)(76.01)-22.7^2]}}\\ & = \frac{2711.5}{2864.22} \approx 0.95\end{align*}
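The same calculation can be checked with technology. The sketch below follows the raw score formula step by step; the variable names are assumptions chosen to mirror the column headings of the table. The cross products are not rounded here, so intermediate values can differ slightly from a hand calculation that rounds each product to a whole number.

```python
import math

sat = [595, 520, 715, 405, 680, 490, 565]  # x
gpa = [3.4, 3.2, 3.9, 2.3, 3.9, 2.5, 3.5]  # y

n = len(sat)
sum_x = sum(sat)
sum_y = sum(gpa)
sum_xy = sum(x * y for x, y in zip(sat, gpa))  # column xy
sum_x2 = sum(x ** 2 for x in sat)              # column x^2
sum_y2 = sum(y ** 2 for y in gpa)              # column y^2

numerator = n * sum_xy - sum_x * sum_y
denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
               * math.sqrt(n * sum_y2 - sum_y ** 2))
r = numerator / denominator

print(f"numerator = {numerator:.1f}, denominator = {denominator:.2f}")
print(f"r = {r:.2f}")
```

The printed coefficient rounds to 0.95, in agreement with the hand calculation.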
The correlation coefficient not only provides a measure of the relationship between the variables, but it also gives us an idea about how much of the total variance of one variable can be associated with the variance of the other. For example, the correlation coefficient of 0.95 that we calculated above tells us that, to a high degree, the variance in the scores on the verbal SAT is associated with the variance in the GPA, and vice versa. We could say that factors that influence the verbal SAT, such as health and parent college level, would also contribute to individual differences in the GPA. The higher the correlation between two variables, the larger the portion of the variance in one variable that can be associated with the variance in the other.
The measure of this shared variance is called the coefficient of determination, and it is calculated by squaring the correlation coefficient. Therefore, the coefficient of determination is written as \begin{align*}r^2\end{align*}. The result of this calculation indicates the proportion of the variance in one variable that can be associated with the variance in the other variable.
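As a quick worked illustration, using the rounded coefficient of 0.95 from the SAT and GPA example:
\begin{align*}r^2 \approx (0.95)^2 = 0.9025 \approx 0.90\end{align*}
In other words, in this small sample roughly 90 percent of the variance in GPA is associated with the variance in verbal SAT score, and vice versa.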
The Properties and Common Errors of Correlation
Correlation is a measure of the linear relationship between two variables; it does not necessarily state that one variable is caused by another. For example, a third variable or a combination of other things may be causing the two correlated variables to relate as they do. Therefore, it is important to remember that we are interpreting the variables and the variance not as causal, but instead as relational.
When examining correlation, there are three things that could affect our results: linearity, homogeneity of the group, and sample size.
Linearity
As mentioned, the correlation coefficient is a measure of the linear relationship between two variables. However, while many pairs of variables have a linear relationship, some do not. For example, let’s consider performance anxiety. As a person’s anxiety about performing increases, so does his or her performance, up to a point. (We sometimes call this good stress.) However, at some point, the increase in anxiety may cause a person’s performance to go down. We call these non-linear relationships curvilinear relationships. We can identify curvilinear relationships by examining scatterplots (see below). One may ask why curvilinear relationships pose a problem when calculating the correlation coefficient. The answer is that the Pearson formula measures only the linear component of a relationship, so for curvilinear data it is not an accurate index, and we will underestimate the relationship between the variables. If we graphed performance against anxiety, we would see that anxiety has a strong effect on performance. However, if we calculated the correlation coefficient, we would arrive at a figure around zero. Therefore, the correlation coefficient is not always the best statistic to use to understand the relationship between variables.
Homogeneity of the Group
Another problem we could encounter when calculating the correlation coefficient is homogeneity of the group. When a group is homogeneous, or possesses similar characteristics, the range of scores on either or both of the variables is restricted. For example, suppose we are interested in finding out the correlation between IQ and salary. If only members of Mensa (a society for people who score in the top 2 percent on a standardized intelligence test) are sampled, we will most likely find a very low correlation between IQ and salary, since most members will have a consistently high IQ, but their salaries will still vary. This does not mean that there is not a relationship; it simply means that the restriction of the sample limited the magnitude of the correlation coefficient.
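The effect of a restricted range can be sketched with made-up numbers. The data below are hypothetical (they are not IQ or salary measurements): x runs from 1 to 20, and y follows x with a fixed, repeating scatter. The helper `pearson_r` is an assumption for this example and applies the raw score formula from this lesson.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient computed with the raw score formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Hypothetical data: y equals x plus a repeating scatter of +3, -3, -3, +3.
scatter = [3, -3, -3, 3]
x_all = list(range(1, 21))
y_all = [x + scatter[i % 4] for i, x in enumerate(x_all)]

# Homogeneous group: keep only the subjects whose x value is 13 or higher.
restricted = [(x, y) for x, y in zip(x_all, y_all) if x >= 13]
x_high = [x for x, _ in restricted]
y_high = [y for _, y in restricted]

r_all = pearson_r(x_all, y_all)
r_high = pearson_r(x_high, y_high)
print(f"all 20 subjects:    r = {r_all:.2f}")
print(f"x >= 13 (8 subjects): r = {r_high:.2f}")
```

With these numbers the coefficient is about 0.89 for the full group and about 0.61 for the restricted group, even though the scatter around the line is the same in both. Only the range of x has changed.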
Sample Size
Finally, we should consider sample size. One may assume that the number of observations used in the calculation of the correlation coefficient influences the magnitude of the coefficient itself. However, this is not the case. While the sample size does not systematically affect the magnitude of the correlation coefficient, it does affect how accurately the coefficient estimates the relationship. The larger the sample, the more accurate a predictor the correlation coefficient will be of the relationship between the two variables.
Lesson Summary
Bivariate data are data sets with two observations that are assigned to the same subject. Correlation measures the direction and magnitude of the linear relationship between bivariate data. When examining scatterplot graphs, we can determine if correlations are positive, negative, perfect, or zero. A correlation is strong when the points in the scatterplot are close together.
The correlation coefficient is a precise measurement of the relationship between the two variables. This index can take on values between and including \begin{align*}-1.0\end{align*} and +1.0.
To calculate the correlation coefficient, we most often use the raw score formula, which allows us to calculate the coefficient by hand.
This formula is as follows: \begin{align*}r_{XY} = \frac{n \sum xy - \sum x \sum y}{\sqrt{ \left [ n\sum x^2-\left ( \sum x \right )^2 \right ]} \sqrt{ \left [ n \sum y^2 - \left ( \sum y \right )^2 \right ]}}\end{align*}.
When calculating the correlation coefficient, there are several things that could affect our computation, including curvilinear relationships, homogeneity of the group, and the size of the group.
Multimedia Links
For an explanation of the correlation coefficient (13.0), see kbower50, The Correlation Coefficient (3:59).
Review Questions
- Give 2 scenarios or research questions where you would use bivariate data sets.
- In the space below, draw and label four scatterplot graphs. Each should show one of the following:
  - a positive correlation
  - a negative correlation
  - a perfect correlation
  - a zero correlation
- In the space below, draw and label two scatterplot graphs. Each should show one of the following:
  - a weak correlation
  - a strong correlation
- What does the correlation coefficient measure?
- The following observations were taken for five students measuring grade and reading level.
Student Number | Grade | Reading Level |
---|---|---|
1 | 2 | 6 |
2 | 6 | 14 |
3 | 5 | 12 |
4 | 4 | 10 |
5 | 1 | 4 |
(a) Draw a scatterplot for these data. What type of relationship do these data show?
(b) Use the raw score formula to compute the Pearson correlation coefficient.
- A teacher gives two quizzes to his class of 10 students. The following are the scores of the 10 students.
Student | Quiz 1 | Quiz 2 |
---|---|---|
1 | 15 | 20 |
2 | 12 | 15 |
3 | 10 | 12 |
4 | 14 | 18 |
5 | 10 | 10 |
6 | 8 | 13 |
7 | 6 | 12 |
8 | 15 | 10 |
9 | 16 | 18 |
10 | 13 | 15 |
(a) Compute the Pearson correlation coefficient, \begin{align*}r\end{align*}, between the scores on the two quizzes.
(b) Find the percentage of the variance, \begin{align*}r^2\end{align*}, in the scores of Quiz 2 associated with the variance in the scores of Quiz 1.
(c) Interpret both \begin{align*}r\end{align*} and \begin{align*}r^2\end{align*} in words.
- What are the three factors that we should be aware of that affect the magnitude and accuracy of the Pearson correlation coefficient?