Describe the differences in variability in histograms and dotplots.
Calculate and describe standard deviation.
Deciding Which Measurements to Use
We now have a choice between two measurements of center and spread: We can use the median with the interquartile range, or we can use the mean with the standard deviation. How do we decide which measurements to use?
Our next examples show that the shape of the distribution and the presence of outliers help us answer this question.
This boxplot is a summary of homework scores earned by a student. Notice that the distribution of scores has an outlier. This student has mostly high homework scores with one score of [latex]0[/latex].
Figure 1. Boxplot showing a student’s homework scores, with most scores clustered high and one outlier at 0, indicating a much lower value than the rest.
Here are some observations about the homework data:
The “typical” range of scores based on the first and third quartiles is [latex]82[/latex] to [latex]89[/latex].
The typical range of scores based on mean ± SD is [latex]64.2[/latex] to [latex]99.4[/latex]. (Here’s how we calculated this: the mean is [latex]81.8[/latex] and the SD is [latex]17.6[/latex], so [latex]81.8 - 17.6 = 64.2[/latex] and [latex]81.8 + 17.6 = 99.4[/latex].)
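The two ranges above can be reproduced from the summary values quoted in the text. The short Python sketch below uses only those values (mean 81.8, SD 17.6, quartiles 82 and 89); the variable names are illustrative and not part of the original lesson.

```python
# Summary values quoted in the text for the homework scores.
mean, sd = 81.8, 17.6
q1, q3 = 82, 89

# "Typical" range based on mean ± SD, rounded to one decimal place
# to hide floating-point noise.
mean_sd_range = (round(mean - sd, 1), round(mean + sd, 1))
mean_sd_width = round(mean_sd_range[1] - mean_sd_range[0], 1)

# "Typical" range based on the quartiles; its width is the IQR.
quartile_range = (q1, q3)
iqr = q3 - q1

print("mean ± SD range:", mean_sd_range, "width =", mean_sd_width)
print("quartile range: ", quartile_range, "width (IQR) =", iqr)
```

Comparing the two widths shows that the mean ± SD range is about five times as wide as the range between the quartiles.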
Which is the better summary of the student’s performance on homework?
The typical range based on the mean and standard deviation is not a good summary of this student’s homework scores. The outlier pulls the mean down, so the mean is too low to represent this student’s typical performance. The outlier also increases the standard deviation, which gives the impression of wide variability in scores. This makes sense because the standard deviation measures, roughly, the typical distance of the data from the mean, so a point that lies far from the mean increases it. In this example, a single score is responsible for giving the impression that the student’s typical homework scores are lower and more variable than they really are.
The typical range based on the first and third quartiles gives a better summary of this student’s performance on homework because the outlier does not affect the quartile marks.
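The effect of a single outlier can be checked numerically. The sketch below uses a small HYPOTHETICAL set of scores, invented for illustration (the lesson does not list the student's actual scores), and Python's standard `statistics` module. Note that `statistics.quantiles` uses one of several common quartile conventions, so a textbook or calculator may give slightly different quartile values.

```python
import statistics

def summarize(scores):
    """Return mean, SD, median, quartiles, and IQR for a list of scores."""
    q1, median, q3 = statistics.quantiles(scores, n=4)
    return {
        "mean": statistics.mean(scores),
        "sd": statistics.stdev(scores),  # sample standard deviation
        "median": median,
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
    }

# HYPOTHETICAL scores, invented for illustration only.
typical_scores = [82, 84, 85, 86, 87, 88, 89, 90, 91]
scores_with_zero = typical_scores + [0]

without_outlier = summarize(typical_scores)
with_outlier = summarize(scores_with_zero)

for name in ("mean", "sd", "median", "iqr"):
    print(f"{name:>6}: without 0 = {without_outlier[name]:6.2f}, "
          f"with 0 = {with_outlier[name]:6.2f}")
```

In this made-up data set, the single score of 0 lowers the mean by several points and multiplies the standard deviation several times over, while the median and IQR each change by less than one point.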
For this homework data set, the better numerical summaries of the student’s performance are the five-number summary (which includes the median), the IQR, and the range.
These examples illustrate some general guidelines for choosing numerical summaries:
Like the mean, the standard deviation is strongly affected by outliers and skew in the data. Therefore, use the mean and the standard deviation as measures of center and spread only for distributions that are reasonably symmetric with a central peak. When outliers are present, the mean and standard deviation are not a good choice.
Use the five-number summary (which includes the median and from which the IQR and range are calculated) for all other cases.
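These guidelines can be sketched as a small helper. The version below is only a rough aid under stated assumptions: it flags outliers with the common 1.5 × IQR rule, which is an assumption of this sketch and is not stated in this section, and it does not test for symmetry, so the shape still has to be judged from a plot. The function names and the data are hypothetical.

```python
import statistics

def has_outliers(data, k=1.5):
    """Flag values more than k * IQR beyond the quartiles (assumed rule)."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return any(x < low or x > high for x in data)

def suggest_summary(data):
    """Suggest summaries; symmetry must still be checked on a plot."""
    if has_outliers(data):
        return "median and IQR (five-number summary)"
    return "mean and SD, if a plot shows a roughly symmetric, single-peaked shape"

# HYPOTHETICAL scores, invented for illustration only.
scores_with_zero = [82, 84, 85, 86, 87, 88, 89, 90, 91, 0]
scores_without_zero = [82, 84, 85, 86, 87, 88, 89, 90, 91]

print(suggest_summary(scores_with_zero))
print(suggest_summary(scores_without_zero))
```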
Both of these examples also highlight another important principle: Always plot the data.
We need to use a graph to determine the shape of the distribution. By looking at the shape, we can determine which measures of center and spread best describe the data.
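Even a rough text plot is enough to reveal shape and outliers. The sketch below is one minimal way to do this in Python, assuming whole-number scores; the function name, the bin width of 10, and the data are all hypothetical choices made for illustration. Empty bins are printed on purpose so that gaps in the data remain visible.

```python
def text_histogram(data, bin_width=10):
    """Return histogram rows as text, keeping empty bins so gaps show."""
    start = (min(data) // bin_width) * bin_width
    stop = (max(data) // bin_width) * bin_width
    rows = []
    for low in range(start, stop + bin_width, bin_width):
        # Count the values that fall in the bin [low, low + bin_width).
        count = sum(low <= x < low + bin_width for x in data)
        rows.append(f"{low:>3}-{low + bin_width - 1:<3} | {'*' * count}")
    return rows

# HYPOTHETICAL scores, invented for illustration only.
scores = [82, 84, 85, 86, 87, 88, 89, 90, 91, 0]

histogram_rows = text_histogram(scores)
print("\n".join(histogram_rows))
```

For this made-up data, the printout shows one isolated star in the lowest bin, a long run of empty bins, and the remaining stars clustered in the top two bins, which is the pattern that signals an outlier.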