Modeling Linear Growth: Learn It 3

Linear Regression

Sometimes writing a formula to represent a particular situation is not an easy task. The situation may be complicated, or you may not have enough information to determine whether a linear relationship is appropriate for the data. In this case, it is helpful to store the data in a spreadsheet, create a scatterplot as you did in the previous section, and then let the computer find the appropriate model.

linear regression

Linear regression is the process of finding the equation of the line that best “fits” the data. It can be performed informally by “eyeballing” a line that keeps the distances between the data points and the line as small as possible.

We can also use software or a graphing calculator to find this equation. Technology uses a method called Least Squares Regression, which chooses the line that minimizes the sum of the squared vertical distances between the data points and the line. This term is not entirely interchangeable with the term linear regression but is often used to mean the same thing: finding the line of best fit.
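To make the idea concrete, here is a minimal Python sketch of the arithmetic a least squares routine might perform. The function name `least_squares_line` and the data values are made up for illustration; spreadsheets and graphing calculators may organize the computation differently.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least squares line for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (numerator of the slope)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sum of squared deviations of the inputs (denominator of the slope)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The least squares line always passes through (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data, invented for this example only
inputs = [1, 2, 3, 4, 5]
outputs = [2, 4, 5, 4, 5]

slope, intercept = least_squares_line(inputs, outputs)
print(f"y = {slope:.2f}x + {intercept:.2f}")
```

For this small data set the sketch gives a slope of about 0.6 and an intercept of about 2.2.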

correlation coefficient

The Correlation Coefficient, [latex]r[/latex], is a value between [latex]-1[/latex] and [latex]1[/latex] that is returned by the least squares method and measures the strength and direction of the linear relationship between the input and output variables of a model. In our linear models, a positive correlation coefficient, [latex]r \gt 0[/latex], would indicate a positive slope, while a negative correlation coefficient, [latex]r \lt 0[/latex], would indicate a negative slope.
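As a rough illustration of how the sign of [latex]r[/latex] tracks the direction of the data, the following Python sketch computes the correlation coefficient for one rising and one falling data set. The function name `correlation` and both data sets are invented for this example.

```python
import math


def correlation(xs, ys):
    """Return the correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data, invented for this example only
inputs = [1, 2, 3, 4, 5]
rising = [2, 4, 5, 4, 5]    # outputs tend to increase with the input
falling = [5, 4, 5, 4, 2]   # outputs tend to decrease with the input

r_rising = correlation(inputs, rising)
r_falling = correlation(inputs, falling)
print(f"rising data:  r = {r_rising:.2f}")
print(f"falling data: r = {r_falling:.2f}")
```

The rising data gives a positive [latex]r[/latex] and the falling data gives a negative [latex]r[/latex], matching the sign of the slope of each best-fit line.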

Another measure, the square of the correlation coefficient, [latex]r^2[/latex], is called the Coefficient of Determination.

coefficient of determination

The coefficient of determination, denoted [latex]R^2[/latex] or [latex]r^2[/latex] (pronounced “[latex]R[/latex] squared”), is an indicator of how appropriate the regression line is as a model for the situation presented by the data. It ranges from 0 (not at all appropriate) to 1 (perfectly appropriate) and measures the proportion of output variability that can be explained by the input in the model. The coefficient of determination is often expressed as a percentage: [latex]r^2 \times 100\%[/latex].

The reason that we use the symbol [latex]R^{2}[/latex] is that the coefficient of determination is equal to the square of the correlation coefficient [latex]r[/latex]. Because of this, [latex]R^2[/latex] is more sensitive to differences in the strength of the linear relationship between the two variables than [latex]r[/latex] is. This increased sensitivity can be seen in the graphic below; the difference between [latex]R^2[/latex] values is greater than the difference between corresponding [latex]r[/latex] values.

Several graphs showing different correlations. They are, in order, Perfect Positive Correlation, Strong Positive Correlation, Weak Positive Correlation, No Correlation, Weak Negative Correlation, Strong Negative Correlation, and Perfect Negative Correlation. The graph in the middle has loosely scattered dots, and to either side, the dots become more closely clustered along a line. The graphs to the left slope upward and the graphs to the right slope downward. Beneath the graphs, in order from left to right, it reads r = 1, r = 0.91, r = 0.48, r = 0, r = -0.48, r = -0.91, r = -1. Beneath that, in order, it reads R squared equals 1, R squared is approximately 0.83, R squared is approximately 0.23, R squared equals 0, R squared is approximately 0.23, R squared is approximately 0.83, R squared equals 1.
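The comparison can be checked with a few lines of arithmetic. This Python sketch squares the magnitudes 0.91 and 0.48, taken from the strong and weak correlations in the figure description. The variable names are made up for illustration.

```python
# Magnitudes of r for the strong and weak correlations in the figure
r_strong = 0.91
r_weak = 0.48

# The coefficient of determination is the square of r
r2_strong = r_strong ** 2
r2_weak = r_weak ** 2

# Compare how far apart the two relationships look under each measure
gap_r = r_strong - r_weak
gap_r2 = r2_strong - r2_weak

print(f"R^2 (strong) = {r2_strong:.2f}, R^2 (weak) = {r2_weak:.2f}")
print(f"gap in r = {gap_r:.2f}, gap in R^2 = {gap_r2:.2f}")
```

The gap widens here because [latex]r_1^2 - r_2^2 = (r_1 - r_2)(r_1 + r_2)[/latex] and the two magnitudes add up to more than 1. For two small values of [latex]r[/latex], the gap in [latex]R^2[/latex] would instead be smaller than the gap in [latex]r[/latex].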

 

We will not go into more detail here about how [latex]R^2[/latex] is calculated; instead, you will practice finding (using technology) and interpreting this value. If you are curious about how this quantity is computed, see this video on calculating [latex]R^2[/latex].

A transcript is available for the video “R-squared or coefficient of determination | Regression | Probability and Statistics | Khan Academy.”
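For readers who want to see the arithmetic, here is a minimal Python sketch of one common way to compute [latex]R^2[/latex]: compare the variation left over after fitting the line with the total variation in the outputs. The function names and the data are invented for illustration, and technology may carry out the calculation differently.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(xs, ys):
    """Return the proportion of output variation explained by the line."""
    slope, intercept = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    # Variation the line fails to explain (squared vertical misses)
    ss_residual = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    # Total variation of the outputs around their mean
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_residual / ss_total


# Hypothetical data, invented for this example only
inputs = [1, 2, 3, 4, 5]
outputs = [2, 4, 5, 4, 5]

r2 = r_squared(inputs, outputs)
print(f"R^2 = {r2:.1%}")
```

For this data the sketch reports about 60%, meaning roughly 60% of the variation in the outputs is explained by the linear relationship with the inputs.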

[latex]R^{2}[/latex] and Scatterplot Shape

The coefficient of determination, [latex]R^2[/latex], is a measure of the proportion of the variation of a response variable in linearly related bivariate data that can be explained by its relationship with the explanatory variable. You should understand that:

  • [latex]R^2[/latex] is equal to the square of the correlation coefficient [latex]r[/latex] and will always be a value between [latex]0\%[/latex] and [latex]100\%[/latex], inclusive.
  • [latex]R^2[/latex] should be interpreted and written as a percentage.

Consider what you already understand about the shape and spread of a scatterplot.

  • The strongest linear relationships appear in plots as data that is roughly linear in shape with data points that lie very close to some line.
  • Weaker relationships may be very roughly linear in shape and more spread out, with data points that lie farther from some line.
  • Non-linear relationships have data points that form other shapes, such as curves, while data with no relationship are randomly scattered across the plot.
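The three descriptions above can be compared numerically. The following Python sketch computes [latex]R^2[/latex] as the square of [latex]r[/latex] for three small data sets: one tightly linear, one loosely linear, and one curved. All names and values are invented for this illustration.

```python
import math


def r_squared(xs, ys):
    """Return R^2 for paired data, computed as the square of r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    r = sxy / math.sqrt(sxx * syy)
    return r ** 2


# Hypothetical data, invented for this example only
inputs = [-3, -2, -1, 0, 1, 2, 3]
tight = [-6.1, -3.9, -2.0, 0.1, 1.9, 4.1, 5.9]   # points very close to a line
loose = [-2, -6, 3, -3, 4, -1, 5]                # upward trend, widely spread
curved = [x ** 2 for x in inputs]                # U-shaped, not linear

r2_tight = r_squared(inputs, tight)
r2_loose = r_squared(inputs, loose)
r2_curved = r_squared(inputs, curved)

for label, value in [("tight", r2_tight), ("loose", r2_loose), ("curved", r2_curved)]:
    print(f"{label:>6}: R^2 = {value:.1%}")
```

The tightly linear data gives the largest [latex]R^2[/latex] and the loosely linear data a much smaller one. The curved data gives an [latex]R^2[/latex] of essentially zero even though the points follow a clear pattern, which is a reminder that [latex]R^2[/latex] only measures how well a line fits.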