Module 5: Cheat Sheet

Essential Concepts

  • Bivariate statistics are statistics that measure the relationship between two variables.
  • Scatterplots visually show the relationship (or lack of a relationship) between two quantitative variables.
  • The explanatory variable goes on the x-axis and the response variable goes on the y-axis when creating a scatterplot.
  • Trends:
    • A scatterplot shows a positive trend if the response variable (represented on the vertical axis) tends to increase as the explanatory variable (represented on the horizontal axis) increases.
    • If the response variable tends to decrease as the explanatory variable increases, then the scatterplot shows a negative trend.
    • Two variables are not associated if knowing the value of one variable does not give you any information about the other variable.
  • The correlation is a measure of the strength of the linear relationship between two quantitative variables. A scatterplot with low correlation may still show a strong nonlinear relationship.
  • The relationship between two variables is said to be linear when the points on the scatterplot resemble a straight line.
  • A statistic for measuring the strength and direction of the linear relationship between two quantitative variables is the Pearson Correlation Coefficient, [latex]r[/latex].
    • The correlation coefficient, [latex]r[/latex], is always between [latex]-1[/latex] and [latex]1[/latex].
    • The correlation coefficient has no unit and remains the same when the response and explanatory variables are reversed or when the units of measurement are changed.
    • The sign of the correlation coefficient describes the direction, either positive (increasing) or negative (decreasing), of the association.
    • The magnitude (absolute value) of the correlation coefficient describes the strength of the linear association between the response and explanatory variables.
  • The following table illustrates the strength of linear relationships.
Correlation Coefficient, [latex]r[/latex] | General Interpretation
[latex]-1[/latex] to [latex]-0.7[/latex] | Strong negative linear relationship
[latex]-0.7[/latex] to [latex]-0.3[/latex] | Moderate negative linear relationship
[latex]-0.3[/latex] to [latex]-0.1[/latex] | Weak negative linear relationship
[latex]-0.1[/latex] to [latex]0.1[/latex] | Negligible or no linear relationship
[latex]0.1[/latex] to [latex]0.3[/latex] | Weak positive linear relationship
[latex]0.3[/latex] to [latex]0.7[/latex] | Moderate positive linear relationship
[latex]0.7[/latex] to [latex]1[/latex] | Strong positive linear relationship
  • Association does not imply causation. Do not interpret a high correlation between explanatory and response variables as a cause-and-effect relationship.
  • Outliers appear as departures from the general trend, that is, as extreme observations in bivariate data.
  • The explanatory variable ([latex]x[/latex]) is the variable that is thought to explain or predict the response variable of a study.
  • The response variable ([latex]y[/latex]) measures the outcome of interest in the study. This variable is thought to depend in some way on the explanatory variable. It is often referred to as the “variable of interest” for the researcher. The explanatory variable is used to predict the response variable.
  • Least Squares Regression (LSR) analysis is a statistical method used to make predictions about missing observations in bivariate data. It can also be described as linear modeling.
  • The line of best fit, [latex]\hat{y} = a+bx[/latex], is the line that minimizes the sum of the squared vertical distances (the squared residuals) between the data points and the line. The line of best fit is also called the Least Squares Regression Line (LSRL).
  • The estimated slope [latex]b[/latex] tells us the predicted change in [latex]\hat{y}[/latex] given a one-unit increase in the value of the explanatory variable [latex]x[/latex].
    • Interpretation: For every one (unit) increase in (explanatory variable units), we predict an average increase/decrease of ___ (response variable units) in (response variable).
  • Within the range of the explanatory variable, we can use the line of best fit to make predictions.
  • Extrapolation is the prediction of a response value using an explanatory variable value that is outside the range of the original data.
  • The Coefficient of Determination, [latex]R^2[/latex] or [latex]r^2[/latex], is the proportion of the variation in the response variable that can be explained by its linear relationship with the explanatory variable. The coefficient of determination is the square of the correlation coefficient. [latex]R^2[/latex] is usually interpreted as a percentage.
  • The residual for a data point is the difference between the observed and predicted values. It can be seen as the vertical distance between the point and the line of best fit on the scatterplot.
  • A residual plot is the graph of the residuals for each data point using the same x-axis scale as the original scatterplot. A residual plot that shows a random scattering of points indicates that a linear function is an appropriate regression model. If the residual plot shows a curve or trend, a linear regression model may not be the best fit for the data.
  • In fitting a regression line, an outlier can be an observation that does or does not fit the linear pattern. An influential point is an outlier that has a significant impact on the parameters of the regression model (namely the slope, y-intercept, correlation coefficient, and coefficient of determination).
  • The residual standard error, [latex]s_e[/latex], is a measure of the variability in the residuals, which quantifies the spread of the points around the line of best fit on the scatterplot.
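
The strength table and the relationship between [latex]r[/latex] and [latex]R^2[/latex] can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and because the table's ranges share their endpoints, the sketch assumes that a boundary value such as [latex]0.7[/latex] falls into the stronger category.

```python
def describe_strength(r: float) -> str:
    """Map a correlation coefficient to the table's general interpretation."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must be between -1 and 1")
    size = abs(r)  # the magnitude gives the strength
    if size < 0.1:
        return "negligible or no linear relationship"
    if size < 0.3:
        strength = "weak"
    elif size < 0.7:
        strength = "moderate"
    else:
        # Assumption: boundary values (0.1, 0.3, 0.7) go to the stronger label.
        strength = "strong"
    direction = "positive" if r > 0 else "negative"  # the sign gives the direction
    return f"{strength} {direction} linear relationship"


def coefficient_of_determination(r: float) -> float:
    """R^2 is the square of the correlation coefficient."""
    return r * r
```

For example, `describe_strength(-0.8)` gives "strong negative linear relationship", and `coefficient_of_determination(-0.8)` is about 0.64, meaning roughly 64% of the variation in the response variable is explained by the linear relationship.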

Key Equations

Pearson Correlation Coefficient, [latex]r[/latex]

[latex]r=\frac{\sum\left(\frac{x-\bar{x}}{s_{x}}\right)\left(\frac{y-\bar{y}}{s_{y}}\right)}{n-1}[/latex]
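
As a rough sketch of how this formula translates into a calculation, the Python below standardizes each value, multiplies the paired z-scores, sums the products, and divides by [latex]n-1[/latex]. The function name and the small data set (hours studied and quiz scores) are invented for illustration only.

```python
from statistics import mean, stdev


def pearson_r(xs, ys):
    """Pearson correlation coefficient from paired z-scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 points")
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (n - 1)
    # Note: if either variable is constant, s_x or s_y is 0 and r is undefined.
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (n - 1)


# Made-up example data: hours studied (x) and quiz score (y).
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 5, 4, 5]
r = pearson_r(hours, scores)
print(round(r, 4))
```

For this made-up data the result is approximately 0.77. Swapping the two lists gives the same value, which matches the property that [latex]r[/latex] does not change when the explanatory and response variables are reversed.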

Line of best fit

[latex]\hat{y} = a+bx[/latex]

where [latex]\hat{y}[/latex] is the general predicted value of the response variable (pronounced y-hat), [latex]a[/latex] is the estimated value of the y-intercept, and [latex]b[/latex] is the estimated slope.
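
A minimal sketch of using the line of best fit for prediction is shown here. It also flags extrapolation, that is, a prediction for an [latex]x[/latex] value outside the range of the original data. The function name, the intercept and slope values, and the data range are all hypothetical.

```python
def predict(x, a, b, x_min, x_max):
    """Return (y_hat, is_extrapolation) for the line y_hat = a + b*x."""
    y_hat = a + b * x
    # A prediction is an extrapolation when x lies outside the observed range.
    is_extrapolation = x < x_min or x > x_max
    return y_hat, is_extrapolation


# Hypothetical line y_hat = 2.2 + 0.6x, fitted to x values between 1 and 5.
y_hat_inside, flag_inside = predict(4, a=2.2, b=0.6, x_min=1, x_max=5)
y_hat_outside, flag_outside = predict(10, a=2.2, b=0.6, x_min=1, x_max=5)
```

Here the prediction at [latex]x=4[/latex] is about 4.6 and lies within the data range, while the prediction at [latex]x=10[/latex] is about 8.2 but is flagged as an extrapolation and should be treated with caution.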

The estimated slope

[latex]b=r \frac{s_y}{s_x}[/latex]

where [latex]s_y[/latex] and [latex]s_x[/latex] are the sample standard deviations for the response and explanatory variables and [latex]r[/latex] is the correlation coefficient for the data set.

The estimated y-intercept

[latex]a= \bar{y} -b \bar{x}[/latex]

where [latex]\bar{y}[/latex] and [latex]\bar{x}[/latex] are the sample means for the response and explanatory variables.
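
The slope and intercept formulas can be combined into one short calculation. The following Python is a sketch under the assumption that the data are two equal-length lists of numbers; the function name and the data are made up for illustration.

```python
from statistics import mean, stdev


def least_squares_line(xs, ys):
    """Return (a, b) for the line y_hat = a + b*x."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    # Correlation coefficient r (algebraically the same as the z-score formula).
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    b = r * s_y / s_x        # estimated slope
    a = y_bar - b * x_bar    # estimated y-intercept
    return a, b


# Made-up example data.
a, b = least_squares_line([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(a, 2), round(b, 2))
```

For this data the slope is about 0.6 and the intercept about 2.2, so for every one-unit increase in [latex]x[/latex] we predict an average increase of about 0.6 units in the response variable.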

Residual

Residual = [latex]y-\hat{y}[/latex]

Residual Standard Error

[latex]s_e = \sqrt{\dfrac{1}{n-2}\sum\left(y_i-\hat{y}_i\right)^{2}}[/latex] where [latex]y_i-\hat{y}_i[/latex] denotes the residual of each data point and the sum is taken over all [latex]n[/latex] data points
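
As an illustration, the residuals and the residual standard error can be computed by summing the squared residuals over all data points, dividing by [latex]n-2[/latex], and taking the square root. The function names, the data, and the line [latex]\hat{y} = 2.2+0.6x[/latex] used below are hypothetical.

```python
from math import sqrt


def residuals(xs, ys, a, b):
    """Observed minus predicted value for each data point."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]


def residual_standard_error(xs, ys, a, b):
    """s_e = sqrt( sum of squared residuals / (n - 2) )."""
    res = residuals(xs, ys, a, b)
    n = len(res)
    if n <= 2:
        raise ValueError("need more than 2 data points")
    return sqrt(sum(e * e for e in res) / (n - 2))


# Made-up data and a hypothetical fitted line y_hat = 2.2 + 0.6x.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
res = residuals(xs, ys, a=2.2, b=0.6)
s_e = residual_standard_error(xs, ys, a=2.2, b=0.6)
print(round(s_e, 3))
```

For this example [latex]s_e[/latex] is approximately 0.89, which describes the typical vertical distance of the points from the line. Plotting `res` against `xs` would give the residual plot.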

Glossary

[latex]a[/latex]

the estimated value of the y-intercept of the line of best fit

[latex]b[/latex]

the estimated slope or the constant rate of change of the line of best fit

bivariate data

two variables linked because both observations are measured from the same individual or unit

coefficient of determination, [latex]R^2[/latex], [latex]r^2[/latex]

proportion of the variation in the response variable that can be explained by its linear relationship with the explanatory variable

explanatory variable

the variable that is thought to explain or predict the response variable of a study

extrapolation

prediction for values of the explanatory variable that fall outside the range of the data

influential point

an outlier that drastically changes the equation of the line of best fit (its slope and y-intercept), and consequently the residuals of the other points

linear

points on the scatterplot resemble a straight line

negative trend 

if the response variable (represented on the vertical axis) tends to increase as the explanatory variable (represented on the horizontal axis) decreases

non-linear

can appear scattered about a smooth curve or have no patterns at all

outlier

data points that appear as departures from the general trend

positive trend 

if the response variable (represented on the vertical axis) tends to increase as the explanatory variable (represented on the horizontal axis) increases

[latex]r[/latex]

Pearson Correlation Coefficient of two quantitative variables

residual

the vertical error associated with each data point from the line of best fit

residual standard error

the measure of the variability of the residuals

response variable

measures the outcome of interest in the study

scatterplots

show the relationship (or lack of a relationship) between two quantitative variables

[latex]x[/latex]

the explanatory variable of the bivariate data

[latex]\hat{y}[/latex]

the predicted value of the response variable