Correlation
Correlation measures the relationship between bivariate data, that is, paired observations of two variables. Scatterplots display bivariate data sets and provide a visual representation of the relationship between the variables.
Examining a scatterplot allows us to form a rough idea of the relationship between two variables:
- lower-left-to-upper-right pattern --> positive correlation
- upper-left-to-lower-right pattern --> negative correlation
- straight line --> perfect correlation
- no linear trend --> zero correlation or a near-zero correlation
Correlation Coefficients
While examining scatterplots gives us some idea about the relationship between two variables, we use a statistic called the correlation coefficient to give us a more precise measurement of the relationship between the two variables. The correlation coefficient is an index that describes the relationship and can take on values between \begin{align*}-1.0\end{align*} and \begin{align*}+1.0\end{align*}, with a positive correlation coefficient indicating a positive correlation and a negative correlation coefficient indicating a negative correlation.
The absolute value of the coefficient indicates the magnitude, or the strength, of the relationship. The closer the absolute value of the coefficient is to 1, the stronger the relationship. For example, a correlation coefficient of 0.20 indicates that there is a weak linear relationship between the variables, while a coefficient of \begin{align*}-0.90\end{align*} indicates that there is a strong linear relationship.
The value of a perfect positive correlation is \begin{align*}+1.0\end{align*}, while the value of a perfect negative correlation is \begin{align*}-1.0\end{align*}.
When there is no linear relationship between two variables, the correlation coefficient is 0. Note: It is important to remember that a correlation coefficient of 0 indicates that there is no linear relationship, but there may still be a strong relationship between the two variables. For example, there could be a quadratic relationship between them.
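The difference between zero linear correlation and no relationship at all can be seen in a short computation. The sketch below, using small hypothetical data sets, computes the Pearson correlation coefficient directly from its definition: a perfectly linear data set gives \begin{align*}r = 1\end{align*}, while a symmetric quadratic data set gives \begin{align*}r = 0\end{align*} even though \begin{align*}Y\end{align*} is completely determined by \begin{align*}X\end{align*}.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

x = [-2, -1, 0, 1, 2]
print(pearson_r(x, [2 * a + 1 for a in x]))  # perfect positive linear: r = 1
print(pearson_r(x, [a ** 2 for a in x]))     # symmetric quadratic: r = 0
```

The quadratic case has zero *linear* correlation because the positive and negative deviations cancel exactly, even though the two variables are strongly related.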
Calculating the Regression Line
Linear regression involves using data to calculate a line that best fits that data and then using that line to predict scores. In linear regression, we use one variable (the predictor variable) to predict the outcome of another (the outcome variable, or criterion variable).
Least squares regression is a method of fitting a line to the data so that the sum of the squared differences between the observations and the line is as small as possible. The vertical distances from the observations to the regression line are called residuals.
The regression line is a straight line that expresses the relationship between two variables. When predicting one score from another, we use an equation such as the following, which is equivalent to the slope-intercept form of the equation for a straight line:
\begin{align*}Y = bX + a\end{align*}
where:
\begin{align*}Y\end{align*} is the score that we are trying to predict.
\begin{align*}X\end{align*} is the score on the predictor variable.
\begin{align*}b\end{align*} is the slope of the line.
\begin{align*}a\end{align*} is the \begin{align*}Y\end{align*}-intercept, or the value of \begin{align*}Y\end{align*} when the value of \begin{align*}X\end{align*} is 0.
To calculate the line itself, we need to find the values for \begin{align*}b\end{align*} (the regression coefficient) and \begin{align*}a\end{align*} (the regression constant).
We use the following formula to calculate the regression coefficient:
\begin{align*}b & = \frac{n\sum xy-\sum x \sum y}{n \sum x^2-\left ( \sum x \right )^2}\\ \text{or}\\ b & = (r) \frac{s_Y}{s_X}\end{align*}
where:
\begin{align*}r\end{align*} is the correlation between the variables \begin{align*}X\end{align*} and \begin{align*}Y\end{align*}.
\begin{align*}s_Y\end{align*} is the standard deviation of the \begin{align*}Y\end{align*} scores.
\begin{align*}s_X\end{align*} is the standard deviation of the \begin{align*}X\end{align*} scores.
We use the following formula to calculate the regression constant:
\begin{align*}a = \frac{\sum y - b \sum x}{n} = \bar{y}-b\bar{x}\end{align*}
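As a concrete check on these formulas, the sketch below uses a small hypothetical data set to compute \begin{align*}b\end{align*} from the summation formula and \begin{align*}a\end{align*} from the regression constant formula, then verifies that the equivalent form \begin{align*}b = r \frac{s_Y}{s_X}\end{align*} gives the same slope.

```python
from math import sqrt

# Hypothetical data set used only to illustrate the formulas
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

# Regression coefficient b from the summation formula
sx, sy = sum(x), sum(y)
sxy = sum(xi * yi for xi, yi in zip(x, y))
sxx = sum(xi * xi for xi in x)
b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)

# Regression constant a (the Y-intercept)
a = (sy - b * sx) / n

print(b, a)  # slope 0.6, intercept 2.2

# Equivalent slope formula: b = r * (s_Y / s_X)
xbar, ybar = sx / n, sy / n
s_x = sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))
s_y = sqrt(sum((yi - ybar) ** 2 for yi in y) / (n - 1))
r = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / ((n - 1) * s_x * s_y)
print(r * s_y / s_x)  # same slope, 0.6
```

The second form makes the link between correlation and regression explicit: the slope is the correlation rescaled by the ratio of the standard deviations.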
Hypothesis Testing for Linear Relationships
In hypothesis testing of linear regression models, the null hypothesis to be tested is that the regression coefficient, \begin{align*}\beta\end{align*}, equals zero. Our alternative hypothesis is that our regression coefficient does not equal zero.
\begin{align*}H_0 : \ \beta & = 0\\ H_a : \ \beta & \neq 0\end{align*}
The test statistic for this hypothesis test is calculated as follows:
\begin{align*}t &= \frac{b-\beta}{s_b}\\ \text{where} \qquad s_b &= \frac{s}{\sqrt{\sum (x-\bar{x})^2}} = \frac{s}{\sqrt{SS_X}},\\ s &= \sqrt{\frac{SSE}{n-2}}, \text{ and}\\ SSE &= \text{the sum of the squared residuals}\end{align*}
This statistic follows a \begin{align*}t\end{align*}-distribution with \begin{align*}n-2\end{align*} degrees of freedom.
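Putting the pieces together, the following sketch (again with a small hypothetical data set) fits the regression line, computes \begin{align*}SSE\end{align*}, \begin{align*}s\end{align*}, and \begin{align*}s_b\end{align*}, and evaluates the test statistic for \begin{align*}H_0 : \beta = 0\end{align*}. The resulting \begin{align*}t\end{align*} would then be compared with a critical value from the \begin{align*}t\end{align*}-distribution with \begin{align*}n-2\end{align*} degrees of freedom.

```python
from math import sqrt

# Hypothetical data set
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

# Fit the least squares line: slope b, intercept a
xbar, ybar = sum(x) / n, sum(y) / n
ss_x = sum((xi - xbar) ** 2 for xi in x)
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / ss_x
a = ybar - b * xbar

# Test statistic for H0: beta = 0
sse = sum((yi - (b * xi + a)) ** 2 for xi, yi in zip(x, y))  # sum of squared residuals
s = sqrt(sse / (n - 2))   # standard error of estimate
s_b = s / sqrt(ss_x)      # standard error of the slope
t = (b - 0) / s_b
print(round(t, 4))  # about 2.1213, with n - 2 = 3 degrees of freedom
```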
When predicting values using multiple regression, we first use the standard score form of the regression equation, in which all variables have been converted to standard (\begin{align*}z\end{align*}) scores:
\begin{align*}\hat{Y} = \beta_1X_1 + \beta_2X_2 + \ldots + \beta_iX_i\end{align*}
where:
\begin{align*}\hat{Y}\end{align*} is the predicted value of the outcome, or criterion, variable.
\begin{align*}\beta_i\end{align*} is the \begin{align*}i^{\text{th}}\end{align*} regression coefficient.
\begin{align*}X_i\end{align*} is the \begin{align*}i^{\text{th}}\end{align*} predictor variable.
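In the standard score form, the predicted criterion score is simply a weighted sum of the standardized predictor scores. A minimal sketch, with assumed beta weights and \begin{align*}z\end{align*}-scores (all values hypothetical):

```python
# Hypothetical beta weights and standardized (z) predictor scores
betas = [0.5, 0.3]      # beta_1, beta_2 (assumed values)
z_scores = [1.2, -0.4]  # z-scores on X1, X2 for one subject

# Predicted standardized criterion score: sum of beta_i * X_i
z_y_hat = sum(beta * z for beta, z in zip(betas, z_scores))
print(z_y_hat)  # 0.5*1.2 + 0.3*(-0.4) = 0.48
```

Each beta weight gives the change in the predicted criterion score, in standard deviation units, for a one standard deviation change in that predictor, holding the other predictors constant.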