9.1: Scatterplots and Linear Correlation
Learning Objectives
- Understand the concept of bivariate data, correlation and the use of scatterplots to display bivariate data.
- Understand when the terms “positive,” “negative,” “strong,” and “perfect” apply to correlation between two variables in a scatterplot graph.
- Calculate the linear correlation coefficient and coefficient of determination using technology tools to assist in the calculations.
- Understand properties and common errors of correlation.
Introduction
So far we have learned how to describe the distribution of a single variable and how to perform hypothesis tests that determine if samples are representative of a population. But what if we notice that two variables seem to be related to one another and we want to determine the nature of the relationship? For example, we may notice that scores on two variables, such as verbal SAT score and GPA, are related, and that students who have high scores on one appear to have high scores on the other (see table below).
Student | SAT Score | GPA |
---|---|---|
1 | ||
2 | ||
3 | ||
4 | ||
5 | ||
6 | ||
7 |
These types of studies are quite common and we can use the concept of correlation to describe the relationship between variables.
Bivariate Data, Correlation Between Values and the Use of Scatterplots
Correlation measures the relationship between bivariate data. In general, bivariate data are data sets with two observations that are assigned to the same subject. In our example above, we notice that there are two observations (verbal SAT score and GPA) for each ‘subject’ (in this case, a student). Can you think of other scenarios when we would use bivariate data?
As mentioned, correlation measures the relationship between two variables. If we carefully examine the data in the example above we notice that those students with high SAT scores tend to have high GPAs and those with low SAT scores tend to have low GPAs. In this case, there is a tendency for students to ‘score’ similarly on both variables and the performance between variables appears to be related.
Scatterplots display these bivariate data sets and provide a visual representation of the relationship between variables. In a scatterplot, each subject is represented by one point, which marks a paired measurement of the two variables. The point lies at the intersection of imaginary lines drawn through the subject’s two observations, one on each axis (see below).
Correlation Patterns in Scatterplot Graphs
Simply examining a scatterplot graph allows us to obtain some idea about the relationship between two variables. Typical patterns include:
- A positive correlation - When the points on a scatterplot graph produce a lower-left-to-upper-right pattern (see below), we say that there is a positive correlation between the two variables. This pattern means that when the score of one observation is high, we expect the score of the other observation to be high as well and vice versa.
- A negative correlation – When the points on a scatterplot graph produce an upper-left-to-lower-right pattern (see below), we say that there is a negative correlation between the two variables. This pattern means that when the score of one observation is high, we expect the score of the other observation to be low and vice versa.
- A perfect correlation – If there is a perfect correlation between the two variables, all of the points in the scatterplot will lie on a straight line (see below).
- Zero correlation – When the points on a scatterplot graph do not show a linear trend (either positive or negative), we say that there is zero or near-zero correlation between the two variables (see below).
When examining scatterplots, we also want to look at the magnitude of the relationship. If we drew an imaginary oval around all of the points of the scatterplot, we would be able to see the extent or the magnitude of the relationship. If the points are close to one another and the width of the imaginary oval is small, this means that there is a strong correlation between the variables (see below).
However, if the points are far away from one another and the imaginary oval is very wide, this means that there is a weak correlation between the variables (see below).
Correlation Coefficients
While examining scatterplots gives us some idea about the relationship between two variables, we use a statistic called the correlation coefficient to give us a more precise measurement of that relationship. The correlation coefficient is an index that describes the relationship between two variables and can take on values between $-1$ and $+1$. We can tell a lot from a correlation coefficient, including:
- A positive correlation coefficient (a value greater than $0$) indicates a positive correlation.
- A negative correlation coefficient (a value less than $0$) indicates a negative correlation.
- The absolute value of the coefficient indicates the magnitude or the strength of the relationship. The closer the absolute value of the coefficient is to $1$, the stronger the relationship. For example, a correlation coefficient close to $0$ indicates that there is not much of a linear relationship between the variables, while a coefficient close to $+1$ or $-1$ indicates that there is a strong linear relationship.
- The value of a perfect positive correlation is $+1$, while the value of a perfect negative correlation is $-1$.
- When there is no linear relationship between two variables, the correlation coefficient is $0$.
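The reading rules in the list above can be sketched as a small Python function. This is only an illustration: the function name `describe_correlation` and the `strong_cutoff` value of 0.7 are assumptions made for the example, since the lesson does not fix a numeric boundary between "weak" and "strong."

```python
def describe_correlation(r, strong_cutoff=0.7):
    """Describe the direction and strength of a correlation coefficient.

    strong_cutoff is an illustrative assumption, not a value from the lesson.
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie between -1 and +1")
    if r == 0:
        return "no linear relationship"
    # The sign gives the direction of the relationship.
    direction = "positive" if r > 0 else "negative"
    # The absolute value gives the strength of the relationship.
    if abs(r) == 1:
        strength = "perfect"
    elif abs(r) >= strong_cutoff:
        strength = "strong"
    else:
        strength = "weak"
    return f"{strength} {direction} correlation"


print(describe_correlation(0.9))   # strong positive correlation
print(describe_correlation(-0.2))  # weak negative correlation
print(describe_correlation(-1.0))  # perfect negative correlation
```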
The most often used correlation coefficient is the Pearson product-moment correlation coefficient, or the linear correlation, which is symbolized by the letter $r$. To understand how this coefficient is calculated, let’s suppose that there is a positive relationship between two variables ($X$ and $Y$). If a subject has a score on $X$ that is above the mean, we expect them to have a score on $Y$ that is above the mean as well. Pearson developed his correlation coefficient by computing the sum of cross products, that is, multiplying the two scores ($X$ and $Y$) for each subject and then adding these cross products across the individuals. Then, he divided this sum by the number of subjects minus one. In short, this coefficient is essentially the mean of the cross products of scores.
Because the two variables may be measured on different scales, Pearson used standard scores (such as $z$-scores) when determining the coefficient. Therefore, the formula for this coefficient is:

$$r_{XY} = \frac{\sum z_X z_Y}{n - 1}$$

In other words, the coefficient is expressed as the sum of the cross products of the standard $z$-scores divided by the number of degrees of freedom, $n - 1$.
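The standard-score calculation just described can be sketched in Python using only the standard library. The function name and the two short lists of scores are hypothetical; they are not the SAT and GPA data from the lesson's table.

```python
from statistics import mean, stdev


def pearson_r_from_z_scores(xs, ys):
    """Pearson r as the sum of z-score cross products divided by n - 1."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long lists with at least two values")
    mean_x, mean_y = mean(xs), mean(ys)
    sd_x, sd_y = stdev(xs), stdev(ys)  # sample standard deviations (divide by n - 1)
    # Convert each pair of scores to z-scores and multiply them together.
    cross_products = [
        ((x - mean_x) / sd_x) * ((y - mean_y) / sd_y) for x, y in zip(xs, ys)
    ]
    # Divide the summed cross products by the degrees of freedom.
    return sum(cross_products) / (n - 1)


# Hypothetical scores for five subjects (not the data from the lesson).
hours_studied = [1, 2, 3, 4, 5]
quiz_score = [2, 4, 5, 4, 5]
r_z = pearson_r_from_z_scores(hours_studied, quiz_score)
print(round(r_z, 3))  # 0.775
```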
The equivalent formula that uses the raw scores rather than the standard scores is called the raw score formula, which is:

$$r_{XY} = \frac{n\sum XY - \sum X \sum Y}{\sqrt{n\sum X^2 - \left(\sum X\right)^2}\,\sqrt{n\sum Y^2 - \left(\sum Y\right)^2}}$$

This formula is most often used when calculating correlation coefficients from original data. Note that $n$ is used instead of $n - 1$ because we are using actual data and not $z$-scores. Let’s use our example from the introduction to demonstrate how to calculate the correlation coefficient using the raw score formula.
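As a rough sketch, the raw score formula can also be written as a short Python function that builds the same sums one would tabulate by hand (the sums of $X$, $Y$, $XY$, $X^2$ and $Y^2$). The function name and the scores below are hypothetical and are not the data from the lesson's table.

```python
from math import sqrt


def pearson_r_raw_score(xs, ys):
    """Pearson r from raw scores:
    (n*SUM(XY) - SUM(X)*SUM(Y)) /
    (sqrt(n*SUM(X^2) - SUM(X)^2) * sqrt(n*SUM(Y^2) - SUM(Y)^2))
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long lists with at least two values")
    # The column totals of a hand-calculation table.
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator


# Hypothetical scores for five subjects (not the data from the lesson).
hours_studied = [1, 2, 3, 4, 5]
quiz_score = [2, 4, 5, 4, 5]
r_raw = pearson_r_raw_score(hours_studied, quiz_score)
print(round(r_raw, 3))  # 0.775
```

For these scores the sums are 15, 20, 66, 55 and 86, so the numerator is $5(66) - (15)(20) = 30$ and the denominator is $\sqrt{50}\,\sqrt{30} \approx 38.73$, which gives $r \approx 0.775$.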
Example:
What is the Pearson product-moment correlation coefficient for these two variables?
Student | SAT Score | GPA |
---|---|---|
1 | ||
2 | ||
3 | ||
4 | ||
5 | ||
6 | ||
7 |
In order to calculate the correlation coefficient, we need to calculate several pieces of information, including $XY$, $X^2$ and $Y^2$ for each student. Therefore:
Student | SAT Score ($X$) | GPA ($Y$) | $XY$ | $X^2$ | $Y^2$ |
---|---|---|---|---|---|
Sum |
Applying the formula to these data we find:
The correlation coefficient not only provides a measure of the relationship between the variables, but also gives us an idea about how much of the total variance of one variable can be associated with the variance of another. For example, the correlation coefficient that we calculated above tells us that, to a high degree, the variance in the scores on the verbal SAT is associated with the variance in the GPA, and vice versa. We could say that factors that influence the verbal SAT score, such as health and parents’ college level, would also contribute to individual differences in the GPA. The higher the correlation we have between two variables, the larger the portion of the variance that can be explained.
The measure of this shared variance is called the coefficient of determination and is calculated by squaring the correlation coefficient ($r^2$). The result of this calculation indicates the proportion of the variance in one variable that can be associated with the variance in the other variable. We can think about this concept by examining a series of overlapping circles. The varying degrees of overlap in the circles reflect the proportion of the variance in $X$ that can be associated with the variance in $Y$. We will study this concept more in depth in later sections.
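A minimal Python sketch of this squaring step is shown here. The three values of the correlation coefficient are hypothetical and chosen only to show how quickly the shared variance shrinks as the correlation weakens.

```python
def coefficient_of_determination(r):
    """Return r squared, the proportion of variance the two variables share."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie between -1 and +1")
    return r ** 2


# Hypothetical correlation coefficients, for illustration only.
for r in (0.3, -0.5, 0.9):
    shared = coefficient_of_determination(r)
    print(f"r = {r:+.1f} -> r^2 = {shared:.2f}")
```

Note that the sign of the correlation disappears when it is squared: a coefficient of $-0.5$ and a coefficient of $+0.5$ both correspond to 25% shared variance.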
The Properties and Common Errors of Correlation
Again, correlation indicates the linear relationship between two variables – it does not necessarily state that one variable is caused by another. For example, a third variable or a combination of other things may be causing the two correlated variables to relate as they do. Therefore, it is important to remember that we are interpreting the variables and the variance as not causal, but instead as relational.
When examining correlation, there are three things that could affect our results:
- Linearity
- Homogeneity of the group
- Sample size
As mentioned, the correlation coefficient is the measure of the linear relationship between two variables. However, while many pairs of variables have a linear relationship, some do not. For example, let’s consider performance anxiety. As a person’s anxiety about performing increases, so does their performance up to a point (we sometimes call this ‘good stress’). However, at that point the increase in the anxiety may cause their performance to go down. We call these non-linear relationships curvilinear relationships.
We can identify curvilinear relationships by examining scatterplots (see below). One may ask why curvilinear relationships pose a problem when calculating the correlation coefficient. The answer is that if we use the traditional formula on such data, the result will not be an accurate index and we will underestimate the relationship between the variables. If we graphed performance against anxiety, we would see that anxiety has a strong effect on performance. However, if we calculated the correlation coefficient, we would arrive at a figure around zero. Therefore, the correlation coefficient is not always the best statistic to use.
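To see this effect numerically, the sketch below applies the raw score formula to made-up anxiety and performance scores that follow an inverted-U shape. The data and the function name are assumptions for illustration only: performance is completely determined by anxiety here, yet the linear correlation comes out as zero.

```python
from math import sqrt


def pearson_r_raw_score(xs, ys):
    """Pearson r computed with the raw score formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator


# Hypothetical data: performance rises with anxiety up to a point, then falls.
anxiety = list(range(11))                            # 0, 1, ..., 10
performance = [25 - (a - 5) ** 2 for a in anxiety]   # inverted-U pattern

r_curved = pearson_r_raw_score(anxiety, performance)
print(r_curved)  # 0.0
```

The rising half of the curve and the falling half cancel each other exactly, which is why a scatterplot should be checked before relying on the coefficient alone.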
Another problem we could encounter when calculating the correlation coefficient is homogeneity of the group. When a group is homogeneous, or possesses similar characteristics, the range of scores on either or both of the variables is restricted. For example, suppose we are interested in finding out the correlation between IQ and salary. If only members of the Mensa Club (a club for people with high IQs) are sampled, we will most likely find a very low correlation between IQ and salary, since most members will have a consistently high IQ but their salaries will vary. This does not mean that there is not a relationship. It simply means that the restriction of the sample limited the magnitude of the correlation coefficient.
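The effect of a restricted range can be illustrated with a small Python sketch. The IQ and salary figures below are invented for the example (they are not real data), and the cutoff of 115 is an arbitrary assumption used only to pick out a homogeneous high-IQ subgroup.

```python
from math import sqrt


def pearson_r_raw_score(xs, ys):
    """Pearson r computed with the raw score formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator


# Hypothetical data: IQ scores and salaries (in thousands) for ten people.
iq = [85, 90, 95, 100, 105, 110, 115, 120, 125, 130]
salary = [30, 35, 40, 45, 50, 55, 60, 65, 65, 60]

r_full = pearson_r_raw_score(iq, salary)

# Keep only the high-IQ subgroup (cutoff chosen arbitrarily for illustration).
high_iq_pairs = [(x, y) for x, y in zip(iq, salary) if x >= 115]
iq_high = [x for x, _ in high_iq_pairs]
salary_high = [y for _, y in high_iq_pairs]
r_restricted = pearson_r_raw_score(iq_high, salary_high)

print(round(r_full, 2))        # 0.96
print(round(r_restricted, 2))  # 0.0
```

In the full group the correlation is strong, but within the restricted subgroup it drops to zero even though the same people and the same scores are involved.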
Finally, we should consider sample size. One may assume that the number of observations used in the calculation of the coefficient influences the magnitude of the coefficient itself. However, this is not the case. While the sample size does not determine the size of the coefficient, it does affect how accurately the coefficient reflects the relationship. The larger the sample, the more accurately the correlation coefficient will describe the relationship between the two variables.
Lesson Summary
- Bivariate data are data sets with two observations that are assigned to the same subject. Correlation measures the direction and magnitude of the linear relationship between bivariate data.
- When examining scatterplot graphs, we can determine if correlations are positive, negative, perfect or zero. A correlation is strong when the points in the scatterplot are close together.
- The correlation coefficient is a precise measurement of the relationship between the two variables. This index can take on values between and including $-1$ and $+1$.
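- To calculate the correlation coefficient, we most often use the raw score formula, which allows us to calculate the coefficient by hand. This formula is: $r_{XY} = \frac{n\sum XY - \sum X \sum Y}{\sqrt{n\sum X^2 - \left(\sum X\right)^2}\,\sqrt{n\sum Y^2 - \left(\sum Y\right)^2}}$.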
- When calculating correlation, there are several things that could affect our computation including curvilinear relationships, homogeneity of the group and the size of the group.
Review Questions
- Please give scenarios or research questions where you would use bivariate data sets.
- In the space below, please draw and label four scatterplot graphs showing (a) a positive correlation, (b) a negative correlation, (c) a perfect correlation and (d) zero correlation.
- In the space below, please draw and label two scatterplot graphs showing (a) a weak correlation and (b) a strong correlation.
- What does the correlation coefficient measure?
The following observations were taken for five students measuring grade and reading level.
Student Number | Grade | Reading Level |
---|---|---|
- Draw a scatterplot for these data. What type of relationship do these data show?
- Use the raw score formula to compute the Pearson correlation coefficient.
A teacher gives two quizzes to his class of students. The following are the scores of the students.
Student | Quiz 1 | Quiz 2 |
---|---|---|
- Compute the Pearson correlation coefficient between the scores on the two quizzes.
- Find the percentage of the variance in the scores of Quiz 2 associated with the variance in the scores of Quiz 1.
- Interpret both $r$ and $r^2$ in words.
- What are the three factors that we should be aware of that affect the size and accuracy of the Pearson correlation coefficient?
Review Answers
- Various answers are possible. Answers could include scores between two tests, effectiveness of two medications, behavior patterns, etc.
- Various answers are possible.
- Various answers are possible.
- The correlation coefficient measures the nature and the magnitude of the linear relationship between two variables.
- The scatterplot should show the points plotted in a line. This is a perfect correlation.
- The correlation between the two quizzes is positive and is moderately strong. Only a small proportion of the variance is shared by the two variables.
- Curvilinear relationships, homogeneity of the group and small group size.