Module 4: Cheat Sheet

Essential Concepts

  • The median of a data set can be computed by ordering the data values and identifying the value in the middle.
  • The mean of a data set can be computed by finding the sum of the data values and dividing it by the number of data values in the data set.
  • The mean represents the balance point of the data, and the median represents the 50th percentile, or the value that splits the data in half.
  • When a distribution is symmetric, the mean and median are approximately equal. Under a skew, the mean is “pulled” in the direction of the tail, where the extreme values lie:
    • Right-skewed: the mean is greater than the median
    • Left-skewed: the mean is less than the median
  • If one value in a data set changes by a large amount, the median stays relatively fixed but the mean does not. This indicates that the mean is sensitive to the presence of extreme values in the data set and can be a misleading indicator of a “typical” observation.
  • A boxplot is a data display specifically designed to show the “five-number summary,” which divides a data set into four sections that each contain about one quarter of the observations. The boxplot has distinct points at the median, quartiles, and minimum and maximum of the data set.
  • Q1, the lower quartile, represents the boundary of the first quarter of the data and Q3, the upper quartile, represents the boundary for the last quarter of the data. The IQR is calculated as [latex]Q3-Q1[/latex] and describes the spread of the middle half of the data (the width of the box in a boxplot).
  • The interquartile range [latex](IQR)[/latex] is commonly used to determine whether an observation is an outlier in the distribution. The fences, or boundaries, for lower and upper outliers lie a distance of [latex]1.5(IQR)[/latex] below [latex]Q1[/latex] and above [latex]Q3[/latex]: an observation less than [latex]Q1-1.5(IQR)[/latex] or greater than [latex]Q3+1.5(IQR)[/latex] is flagged as an outlier.
  • Variability refers to a measure of how spread out the data in a data set is. Variability is measured using standard deviation, variance, IQR, and range.
  • In a histogram, variability can be judged by examining the distance of the bars from the statistical center (mean or median) of the graph. If the variability is high, equally sized or taller bars will appear away from the center of the graph. If the variability is low, the data will appear clustered around the center.
  • Larger values of range indicate more variability in the data, but the range uses only two observations in the entire data set (the minimum and the maximum) to measure variability. On its own it is not an ideal measure of spread, but when used in combination with other measures of spread, it can help you gain a clearer understanding of the spread of a distribution.
  • The standard deviation is the “typical” or average distance of each data point to the mean of the data set.
  • Standard deviation is generally calculated with technology, but the following steps can be applied to calculate a standard deviation by hand:
    1. Calculate the mean of the population or sample
    2. Take the difference between each data value and the mean, then square each difference
    3. Add up all the squared differences
    4. Divide by either the total number of observations in the case of a population, or by 1 fewer than the total in the case of a sample
    5. Take the square root of the result from step 4
  • Standard deviation is the square root of the variance of a data set.
  • Similar to the median, the IQR is considered a more accurate measure of spread for data that is skewed or contains outliers. Alternatively, the mean and standard deviation are considered more accurate measures when the data is symmetric because they utilize all data points as opposed to just one or two values.
  • Standardizing a value involves finding the difference between the given value and the mean, and dividing that difference by the standard deviation. The resulting value is a number of standard deviations, and has no units associated with it.
  • Standardized scores, also called z-scores, can be positive or negative. A negative z-score indicates a value that is less than the mean, and a positive z-score indicates a value that is greater than the mean.
  • If a distribution is bell shaped, unimodal, and symmetric, the Empirical Rule states that:
    • about 68% of observations in a data set will be within one standard deviation of the mean
    • about 95% of observations in a data set will be within two standard deviations of the mean
    • about 99.7% of the observations in a data set will be within three standard deviations of the mean
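
The Empirical Rule percentages above can be checked numerically. The sketch below assumes the data follow an exactly normal distribution, which real data sets only approximate; the function name `share_within` is made up for illustration.

```python
import math


def share_within(k: float) -> float:
    """Proportion of a normal distribution within k standard deviations of the mean."""
    # For a normal distribution, P(|Z| < k) = erf(k / sqrt(2)).
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    # Prints roughly 68.3%, 95.4%, and 99.7%.
    print(f"within {k} standard deviation(s): {share_within(k):.1%}")
```

The rounded results match the 68-95-99.7 figures, which is why the rule says "about" for each percentage.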

Key Equations

Converting values into standardized scores

[latex]z=\dfrac{x-\mu}{\sigma}[/latex], where [latex]x[/latex] represents the value of the observation, [latex]\mu[/latex] represents the population mean, [latex]\sigma[/latex] represents the population standard deviation, and [latex]z[/latex] represents the standardized value, or z-score.
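
As a minimal sketch, the z-score formula can be written as a short function. The exam scores, mean of 70, and standard deviation of 10 are hypothetical numbers chosen only to make the arithmetic easy to follow.

```python
def z_score(x: float, mu: float, sigma: float) -> float:
    """Return how many standard deviations x lies from the mean mu."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return (x - mu) / sigma


# Hypothetical exam with mean 70 and standard deviation 10.
print(z_score(85, 70, 10))  # 1.5  (above the mean)
print(z_score(60, 70, 10))  # -1.0 (below the mean)
```

The sign of the result shows which side of the mean the value falls on, and the result itself has no units.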

Deviation from the mean

[latex]\left(x-\bar{x}\right)[/latex], where [latex]\left(x\right)[/latex] is the observation in the data set, and [latex]\left(\bar{x}\right)[/latex] is the sample mean.

Interquartile range ([latex]IQR[/latex])

[latex]Q3-Q1[/latex]

Lower outlier “fence” or boundary

[latex]Q1-1.5(IQR)[/latex]. Remember to multiply [latex]1.5[/latex] by [latex]IQR[/latex] first, then subtract the result from [latex]Q1[/latex].

Mean

[latex]\dfrac{\text{sum of data values}}{\text{total number of data values}}[/latex] or [latex]\bar{x}=\dfrac{\sum{x}}{n}[/latex], where [latex]\bar{x}[/latex] is the mean, [latex]\sum[/latex] is the symbol for “sum of,” [latex]{x}[/latex] represents the data values, and [latex]{n}[/latex] is the total number of data values.
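
The mean formula translates directly into code. The sketch below uses two small made-up data sets that differ in a single value, to illustrate how one extreme observation moves the mean while the median stays put.

```python
import statistics


def mean(values):
    """Sum of the data values divided by the number of data values."""
    return sum(values) / len(values)


typical = [2, 3, 3, 4, 5]        # hypothetical data
with_extreme = [2, 3, 3, 4, 50]  # same data with one extreme value

print(mean(typical), statistics.median(typical))            # 3.4 3
print(mean(with_extreme), statistics.median(with_extreme))  # 12.4 3
```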

Standard deviation of a population

[latex]\sigma = \sqrt{\dfrac{\sum \left(x-\mu\right)^2}{n}}[/latex], where [latex]\sum[/latex] is the summation of [latex]{\left(x-\mu\right)^2}[/latex] for each observation, [latex]\left(x\right)[/latex] is the observation in the data set, [latex]\left(\mu\right)[/latex] is the mean, and [latex]\left({n}\right)[/latex] is the number of observations.

Standard deviation of a sample

[latex]s=\sqrt{\dfrac{\sum \left(x-\bar{x}\right)^2}{n-1}}[/latex], where [latex]\sum[/latex] is the summation of [latex]{\left(x-\bar{x}\right)^2}[/latex] for each observation, [latex]\left(x\right)[/latex] is the observation in the data set, [latex]\left(\bar{x}\right)[/latex] is the mean, and [latex]\left({n}\right)[/latex] is the number of observations.
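
The population and sample formulas differ only in the divisor, which the following sketch makes explicit by working through the by-hand steps. The data set is hypothetical, and the function name `standard_deviation` is chosen here for illustration; in practice Python's built-in `statistics.pstdev` and `statistics.stdev` do the same job.

```python
import math
import statistics


def standard_deviation(data, sample=True):
    """Standard deviation computed step by step."""
    n = len(data)
    if n < 2 and sample:
        raise ValueError("a sample needs at least two observations")
    mean = sum(data) / n                             # step 1: the mean
    squared_diffs = [(x - mean) ** 2 for x in data]  # step 2: squared differences
    total = sum(squared_diffs)                       # step 3: add them up
    variance = total / (n - 1 if sample else n)      # step 4: divide by n - 1 or n
    return math.sqrt(variance)                       # step 5: square root


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations

print(standard_deviation(data, sample=False))  # population: 2.0
print(standard_deviation(data, sample=True))   # sample: slightly larger
print(statistics.pstdev(data), statistics.stdev(data))  # library versions for comparison
```

The value computed in step 4 is the variance, so the same function also shows that the standard deviation is the square root of the variance.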

Upper outlier “fence” or boundary

[latex]Q3+1.5(IQR)[/latex]. Remember to multiply [latex]1.5[/latex] by [latex]IQR[/latex] first, then add the result to [latex]Q3[/latex].
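
The lower and upper fences can be combined into one outlier check, sketched below with hypothetical data. One assumption to note: software packages and textbooks use slightly different conventions for locating quartiles, so the `method="inclusive"` option of Python's `statistics.quantiles` used here may give values of Q1 and Q3 that differ a little from a by-hand method.

```python
import statistics


def outlier_fences(data):
    """Return (lower fence, upper fence) using the 1.5(IQR) rule."""
    # Quartile conventions vary; "inclusive" is one common choice.
    q1, _median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    step = 1.5 * iqr  # multiply first, then subtract from Q1 or add to Q3
    return q1 - step, q3 + step


data = [1, 2, 3, 4, 5, 6, 7, 8, 30]  # hypothetical observations

lower, upper = outlier_fences(data)
outliers = [x for x in data if x < lower or x > upper]

print(lower, upper)  # -3.0 13.0
print(outliers)      # [30]
```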

Variance of a population

[latex]\sigma^{2}=\dfrac{\sum\left(x-\mu\right)^{2}}{n}[/latex], where [latex]\sum[/latex] is the summation of [latex]{\left(x-\mu\right)^2}[/latex] for each observation, [latex]\left(x\right)[/latex] is the observation in the data set, [latex]\left(\mu\right)[/latex] is the mean, and [latex]\left({n}\right)[/latex] is the number of observations.

Variance of a sample

[latex]s^{2}=\dfrac{\sum\left(x-\bar{x}\right)^{2}}{n-1}[/latex], where [latex]\sum[/latex] is the summation of [latex]{\left(x-\bar{x}\right)^2}[/latex] for each observation, [latex]\left(x\right)[/latex] is the observation in the data set, [latex]\left(\bar{x}\right)[/latex] is the mean, and [latex]\left({n}\right)[/latex] is the number of observations.

Glossary

[latex]s[/latex]
the standard deviation of a sample of observations
[latex]\sigma[/latex]
the standard deviation of a population of observations
[latex]s^{2}[/latex]
the variance of a sample of observations
[latex]\sigma^{2}[/latex]
the variance of a population of observations
deviation from the mean
the difference between an observation ([latex]{x}[/latex]) in a data set and the mean [latex]\left(\bar{x}\right)[/latex] of the data set
Empirical Rule
a guideline that predicts the percentage of observations within a certain number of standard deviations of the mean. Also known as the [latex]\textbf{68-95-99.7}[/latex] Rule, it states that in a bell-shaped, unimodal, symmetric distribution, almost all of the observed data values, [latex]x[/latex], lie within three standard deviations, [latex]\sigma[/latex], to either side of the mean, [latex]\mu[/latex]. More specifically, about [latex]68\%[/latex] of observations in a data set will be within one standard deviation of the mean [latex]\left(\mu\pm\sigma\right)[/latex], about [latex]95\%[/latex] of the observations in a data set will be within two standard deviations of the mean [latex]\left(\mu\pm2\sigma\right)[/latex], and about [latex]99.7\%[/latex] of the observations in a data set will be within three standard deviations of the mean [latex]\left(\mu\pm3\sigma\right)[/latex].
first quartile
the value below which one quarter of the data lies, also equal to the [latex]25[/latex]th percentile. Sometimes denoted [latex]Q1[/latex]
five-number summary
the collection of the minimum, first quartile, median, third quartile, and maximum of the variable
interquartile range
the quantity [latex]Q3-Q1[/latex]. Sometimes denoted [latex]IQR[/latex]
left-skewed (negative skew)
most of the data is bunched up to the right of the graph with a “tail” of infrequent values on the left (lower) end of the distribution
lower outlier
an observation that is less than [latex]Q1-1.5(IQR)[/latex]
mean
an average of a set of values calculated by adding the values and then dividing the total by the number of values in the data set
median
the “middlemost” value of a set of values listed in numerical order
outlier
an unusual or extreme value, given the other values in the data set
range
the maximum (or largest) value minus the minimum (or smallest) value
resistant
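not strongly affected by extreme values (outliers) or by the skewness of a distribution; the median and IQR are resistant measures, while the mean and standard deviation are not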
right-skewed (positive skew)
most of the data is bunched up to the left of the graph with a “tail” of infrequent values on the right (upper) end of the distribution
standard deviation
a measure of how spread out observations are from the mean
standardized value
the number of standard deviations an observation is away from the mean. Also referred to as a z-score
symmetric
the left and right sides of the distribution (closely) mirror each other. If you drew a vertical line down the center of the distribution and folded the distribution in half, the left and right sides would closely match one another
third quartile
the value below which three quarters of the data lie, also equal to the [latex]75[/latex]th percentile. Sometimes denoted [latex]Q3[/latex]
upper outlier
an observation that is greater than [latex]Q3+1.5(IQR)[/latex]
variability
a measure of how dispersed (spread out) the data are. It is often referred to as the spread, or dispersion, of a data set
variance
the standard deviation squared