2.14: Plotting a Scatterplot and Finding the Equation of Best Fit
The calorie requirements for males are shown in the table below. What type of correlation is exhibited by the data?
Calorie Requirements (Male), 1-59 years
Age Range, \begin{align*}x\end{align*} | 1-3 | 4-6 | 7-10 | 11-14 | 15-18 | 19-59 |
---|---|---|---|---|---|---|
Calorie Needs, \begin{align*}y\end{align*} | 1230 | 1715 | 1970 | 2220 | 2755 | 2550 |
The age is measured in years. Source: www.fatfreekitchen.com
Guidance
A scatterplot is a set of points that represent data. We plot these points and try to find equations that best approximate their relationship. Because real data is rarely perfect, not every point will lie on the line of best fit. The line of best fit is the line that comes closest to all of the data points. It can be used to estimate values within the range of the data or beyond it. Scatterplots almost always represent a real-life situation.
Scatterplots can have positive correlation if the \begin{align*}x\end{align*} and \begin{align*}y\end{align*} values tend to increase together. They can have negative correlation if \begin{align*}y\end{align*} tends to decrease as \begin{align*}x\end{align*} increases. And, if the points show no linear pattern, then the data have little or no correlation. Think of the type of correlation as describing the slope of the line that would best fit the data.
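One way to make these three cases concrete is to compute the Pearson correlation coefficient \begin{align*}r\end{align*}, a statistic this lesson does not use but whose sign matches the slope of the line of best fit. The sketch below is only an illustration: the sample data sets are made up, and the cutoff of 0.3 for calling a pattern "no correlation" is an assumption, not a standard rule.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def correlation_type(xs, ys, cutoff=0.3):
    """Label the data using the sign of r; `cutoff` is an assumed threshold."""
    r = pearson_r(xs, ys)
    if r > cutoff:
        return "positive"
    if r < -cutoff:
        return "negative"
    return "none"

# Made-up data sets, one for each case described above.
rising = ([1, 2, 3, 4, 5], [2, 3, 5, 6, 8])
falling = ([1, 2, 3, 4, 5], [9, 7, 6, 4, 2])
scattered = ([1, 2, 3, 4, 5], [4, 1, 5, 1, 4])

for name, (xs, ys) in [("rising", rising), ("falling", falling), ("scattered", scattered)]:
    print(name, correlation_type(xs, ys))
```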
Example A
Describe the type of correlation shown in the scatterplot. Explain your answer.
Source: money.cnn.com
Solution: This is a negative correlation. As the years get larger, the sales go down. This could be because of the boom in online/digital and pirated music.
Example B
Find the linear equation of best fit for the data set above.
Solution: First, it can be very difficult to determine the “best” equation for a set of points. In general, you can use these steps to help you.
1. Draw the scatterplot on a graph.
2. Sketch the line that appears to most closely follow the data. Try to have the same number of points above and below the line.
3. Choose two points on the line and estimate their coordinates. These points do not have to be part of the original data set.
4. Find the equation of the line that passes through the two points from Step 3.
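The last step is plain arithmetic, so it can be sketched in a few lines of Python. This is a minimal illustration and not part of the original lesson: the function name `line_through` and the two sample points are made up for the example.

```python
def line_through(p1, p2):
    """Return (slope, y-intercept) of the line through two points."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("a vertical line has no slope-intercept form")
    m = (y2 - y1) / (x2 - x1)   # slope: change in y over change in x
    b = y1 - m * x1             # solve y1 = m*x1 + b for b
    return m, b

# Hypothetical points estimated from a sketched line.
m, b = line_through((2, 3), (6, 11))
print(f"slope = {m}, y-intercept = {b}")
```

For these two points the slope is 2 and the \begin{align*}y-\end{align*}intercept is -1, so the line is \begin{align*}y = 2x - 1\end{align*}.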
Let’s use these steps on the graph above. We already have the scatterplot drawn, so let’s sketch a couple of lines to find the one that best fits the data.
From the lines in the graph, it looks like the purple line might be the best choice. The red line looks good from 2006-2009, but in the beginning, all the data is above it. The green line is well below all the early data as well. Only the purple line cuts through the first few data points, and then splits the last few years. Remember, it is very important to have the same number of points above and below the line.
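The "same number of points above and below" check can also be done by counting. The sketch below assumes made-up data points and two made-up candidate lines, because the values behind the graph are not listed in this lesson. The helper name `count_sides` is hypothetical.

```python
def count_sides(points, m, b, tol=1e-9):
    """Count points above, below, and on the line y = m*x + b."""
    above = below = on = 0
    for x, y in points:
        residual = y - (m * x + b)   # positive when the point is above the line
        if residual > tol:
            above += 1
        elif residual < -tol:
            below += 1
        else:
            on += 1
    return above, below, on

# Made-up data and two candidate lines.
points = [(1, 2), (2, 3.5), (3, 3), (4, 5.5), (5, 5)]
print(count_sides(points, m=1, b=1))   # balanced: (2, 2, 1)
print(count_sides(points, m=1, b=0))   # no points below: (3, 0, 2)
```

A balanced count is only a rough check. It says nothing about how far the points are from the line.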
Using the purple line, we need to find two points on it. The line crosses the grid exactly at (2000, 14). Be careful! Our graph starts at 1999, so that year is treated as \begin{align*}x = 0\end{align*}. Therefore, (2000, 14) is actually (1, 14). The line also crosses the grid exactly at (2007, 10), or (8, 10). Now, let’s find the slope and \begin{align*}y-\end{align*}intercept.
\begin{align*}m = \frac{14-10}{1-8} = - \frac{4}{7}\end{align*}
\begin{align*}y &= - \frac{4}{7}x+b\\ 14 &= - \frac{4}{7}(1)+b \\ 14 &= -0.57+b\\ 14.57 &= b\end{align*}
The equation of best fit is \begin{align*}y = - \frac{4}{7}x+14.57\end{align*}.
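However, the equation above assumes that \begin{align*}x = 0\end{align*} corresponds to 1999. To use the actual year as the \begin{align*}x-\end{align*}value, replace \begin{align*}x\end{align*} with \begin{align*}x-1999\end{align*}, so our final equation is \begin{align*}y = - \frac{4}{7}(x-1999)+14.57\end{align*}.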
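As a quick check of the arithmetic in this example, the sketch below recomputes the slope and \begin{align*}y-\end{align*}intercept from the two points read off the graph and confirms that the year-shifted equation passes through them. The names `BASE_YEAR` and `sales` are chosen for illustration only.

```python
BASE_YEAR = 1999  # the graph starts here, so 1999 is treated as x = 0

# The two points read from the line, converted to years since 1999.
x1, y1 = 2000 - BASE_YEAR, 14
x2, y2 = 2007 - BASE_YEAR, 10

m = (y2 - y1) / (x2 - x1)   # -4/7
b = y1 - m * x1             # about 14.57

def sales(year):
    """Estimated sales (in billions of dollars) for a calendar year."""
    return m * (year - BASE_YEAR) + b

print(round(m, 4), round(b, 2))
print(round(sales(2000), 2), round(sales(2007), 2))
```

The function returns 14 for 2000 and 10 for 2007, up to floating-point rounding, which matches the two points chosen on the line.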
Example C
Using the line of best fit above, what would you expect music sales to be in 2010?
Solution: In this example, we are using the line of best fit to predict data. Plug in 2010 for \begin{align*}x\end{align*} and solve for \begin{align*}y\end{align*}.
\begin{align*}y &= - \frac{4}{7}(2010-1999)+14.57\\ y &= - \frac{4}{7}(11)+14.57\\ y &=8.3\end{align*}
It is estimated that the music industry will make about $8.3 billion in music sales in 2010.
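The same substitution can be written as a small function. This is a sketch using the rounded equation from Example B. The function name `predicted_sales` and its parameter names are illustrative.

```python
def predicted_sales(year, slope=-4 / 7, intercept=14.57, base_year=1999):
    """Evaluate y = slope * (year - base_year) + intercept, in billions of dollars."""
    return slope * (year - base_year) + intercept

estimate = predicted_sales(2010)
print(round(estimate, 1))   # 8.3
```

Because 2010 lies beyond the years shown on the graph, this value is an extrapolation and should be read as a rough estimate.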
Intro Problem Revisit
If you draw a scatterplot of the data, you see that the \begin{align*}x\end{align*} and \begin{align*}y\end{align*} values tend to increase together. Therefore the data exhibits positive correlation. That is, as age increases so do calorie requirements.
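To check this numerically, each age range needs a single \begin{align*}x-\end{align*}value. The sketch below assumes the midpoint of each range (for example, 2 for ages 1-3), which is a choice made for this illustration and is not given in the table. It then looks at the sign of the covariance, which matches the sign of the slope of the line of best fit.

```python
# Assumed x-values: the midpoint of each age range in the table.
ages = [2, 5, 8.5, 12.5, 16.5, 39]
calories = [1230, 1715, 1970, 2220, 2755, 2550]

def covariance(xs, ys):
    """Average product of deviations from the means (population covariance)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

cov = covariance(ages, calories)
if cov > 0:
    print("positive correlation")
elif cov < 0:
    print("negative correlation")
else:
    print("no linear correlation")
```

The covariance is positive even though the last value dips from 2755 to 2550, which is why the text says the values "tend to" increase together.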
Vocabulary
- Scatterplot
- A set of data that is plotted on a graph to see if a relationship exists between the points.
- Line of Best Fit
- The linear equation that best approximates a scatterplot.
Guided Practice
Use the table below to answer the following questions.
Sleep Requirements, 0-3 years
Age, \begin{align*}x\end{align*} | 1 | 3 | 6 | 9 | 12 | 18 | 24 | 36 |
---|---|---|---|---|---|---|---|---|
Sleep, \begin{align*}y\end{align*} | 16 | 15 | 14.25 | 14 | 13.75 | 13.5 | 13 | 12 |
The age is measured in months and sleep is measured in hours. Source: www.babycenter.com
1. Draw a scatterplot with age across the \begin{align*}x-\end{align*}axis and sleep along the \begin{align*}y-\end{align*}axis. Count by 3’s for the \begin{align*}x-\end{align*}values and by 2’s for the \begin{align*}y-\end{align*}values.
2. Using the steps from Example B, find the line of best fit.
3. Determine the amount of sleep needed for a \begin{align*}2 \frac{1}{2}\end{align*} year old and a 5 year old.
Answers
1. Here is the scatterplot.
2. Two points that seem to be on the red line are (3, 15) and (24, 13).
\begin{align*}m &= \frac{15-13}{3-24} = - \frac{2}{21}\\ 15 &= - \frac{2}{21}(3)+b\\ 15 &= -0.29+b\\ 15.29 &= b\end{align*}
The equation of the line is \begin{align*} y = - \frac{2}{21}x+15.29\end{align*}.
3. First, you need to change the age to months so that it corresponds with the units used in the graph. For a 2.5 year-old, 30 months, s/he should sleep \begin{align*}y = - \frac{2}{21}(30)+15.29 \approx 12.4\end{align*} hours. For a 5-year-old, 60 months, s/he should sleep \begin{align*}y = - \frac{2}{21}(60)+15.29 \approx 9.6\end{align*} hours.
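The answers above can be checked with a short script. The sketch below rebuilds the line through (3, 15) and (24, 13) and evaluates it at 30 and 60 months. For comparison it also computes a least-squares line, which is a different technique from the sketch-by-eye method used in this lesson. The function names are illustrative.

```python
ages_months = [1, 3, 6, 9, 12, 18, 24, 36]
sleep_hours = [16, 15, 14.25, 14, 13.75, 13.5, 13, 12]

# Line through the two points chosen by eye: (3, 15) and (24, 13).
m_eye = (13 - 15) / (24 - 3)    # -2/21
b_eye = 15 - m_eye * 3          # about 15.29

def sleep_eye(months):
    """Sleep estimate (hours) from the line chosen by eye."""
    return m_eye * months + b_eye

def least_squares(xs, ys):
    """Slope and intercept minimizing the sum of squared vertical distances."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    return m, mean_y - m * mean_x

m_ls, b_ls = least_squares(ages_months, sleep_hours)

for years in (2.5, 5):
    months = years * 12   # convert years to months to match the table
    print(years, round(sleep_eye(months), 1), round(m_ls * months + b_ls, 1))
```

The two lines have nearly the same slope and intercept, so the line chosen by eye is a reasonable fit. Note that 60 months is outside the 0-36 month range of the table, so the estimate for a 5-year-old is an extrapolation.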
Practice
Determine if the scatterplots below have positive, negative, or no correlation.
Plot each scatterplot and then determine the line of best fit.
\begin{align*}x\end{align*} | 1 | 2 | 3 | 5 | 7 | 8 |
---|---|---|---|---|---|---|
\begin{align*}y\end{align*} | 1 | 3 | 4 | 3 | 6 | 7 |
\begin{align*}x\end{align*} | 10 | 9 | 7 | 6 | 5 | 2 |
---|---|---|---|---|---|---|
\begin{align*}y\end{align*} | 5 | 6 | 4 | 3 | 3 | 2 |
Use the data below to answer questions 6-8.
The price of Apple stock from Oct 2009 - Sept 2011 source: Yahoo! Finance
10/09 | 11/09 | 12/09 | 1/10 | 2/10 | 3/10 | 4/10 | 5/10 | 6/10 | 7/10 | 8/10 | 9/10 |
---|---|---|---|---|---|---|---|---|---|---|---|
$181 | $189 | $198 | $214 | $195 | $208 | $236 | $249 | $266 | $248 | $261 | $258 |
10/10 | 11/10 | 12/10 | 1/11 | 2/11 | 3/11 | 4/11 | 5/11 | 6/11 | 7/11 | 8/11 | 9/11 |
---|---|---|---|---|---|---|---|---|---|---|---|
$282 | $309 | $316 | $331 | $345 | $352 | $344 | $349 | $346 | $349 | $389 | $379 |
- Draw the scatterplot for the table above. Make the \begin{align*}x-\end{align*}axis the month and the \begin{align*}y-\end{align*}axis the price.
- Find the linear equation of best fit.
- According to your equation, what would be the predicted price of the stock in January 2012?
Use the data below to answer questions 9-11.
Total Number of Home Runs Hit in Major League Baseball, 2000-2010 source: www.baseball-almanac.com
2000 | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 |
---|---|---|---|---|---|---|---|---|---|---|
5693 | 5458 | 5059 | 5207 | 5451 | 5017 | 5386 | 4957 | 4878 | 4655 | 4613 |
- Draw the scatterplot for the table above. Make the \begin{align*}x-\end{align*}axis the year and the \begin{align*}y-\end{align*}axis the number of home runs.
- Find the linear equation of best fit.
- According to your equation, how many total home runs should be hit in 2011?
bivariate
Bivariate data has two variables.
correlation
Correlation is a statistical method used to determine if there is a connection or a relationship between two sets of data.
curvilinear relationships
Non-linear relationships are called curvilinear relationships.
direct relationship
If the line on a line graph rises to the right, it indicates a direct relationship.
homogeneity
When a group is homogeneous, or possesses similar characteristics, the range of scores on either or both of the variables is restricted.
indirect relationship
If the line on a line graph falls to the right, it indicates an indirect relationship.
linear relationship
A linear relationship appears as a straight line either rising or falling as the independent variable values increase.
negative correlation
A negative correlation appears as a recognizable line with a negative slope.
non-linear relationship
A non-linear relationship may take the form of any number of curved lines but is not a straight line.
positive correlation
A positive correlation appears as a recognizable line with a positive slope.
scatter plot
A scatter plot is a plot of the dependent variable versus the independent variable and is used to investigate whether or not there is a relationship or connection between 2 sets of data.
Scatterplot
A scatterplot is a type of visual display that shows pairs of data for two different variables.
Slope
Slope is a measure of the steepness of a line. A line can have positive, negative, zero (horizontal), or undefined (vertical) slope. The slope of a line can be found by calculating “rise over run” or “the change in \begin{align*}y\end{align*} over the change in \begin{align*}x\end{align*}.” The symbol for slope is \begin{align*}m\end{align*}.
strong correlation
Two variables with a strong correlation will appear as a number of points occurring in a clear and recognizable linear pattern.
trends
Trends in data sets or samples are indicators found by reviewing the data from a general or overall standpoint.
weak correlation
Two variables with a weak correlation will appear as a much more scattered field of points, with only a little indication of points falling into a line of any sort.