Numerical Summaries of Data: Learn It 2

Lumen Learning

Range

Consider these three sets of quiz scores:

Section A: [latex]5, 5, 5, 5, 5, 5, 5, 5, 5, 5[/latex]

Section B: [latex]0, 0, 0, 0, 0, 10, 10, 10, 10, 10[/latex]

Section C: [latex]4, 4, 4, 5, 5, 5, 5, 6, 6, 6[/latex]

All three of these sets of data have a mean of [latex]5[/latex] and median of [latex]5[/latex], yet the sets of scores are clearly quite different. In section A, everyone had the same score; in section B half the class got no points and the other half got a perfect score, assuming this was a [latex]10[/latex]-point quiz. Section C was not as consistent as section A, but not as widely varied as section B.

This scenario demonstrates that while the mean and median provide a sense of the center of the data, they don’t convey how spread out the data is. There are several ways to measure this “spread” of the data. The simplest is called the range.

range

The range is the difference between the maximum value and the minimum value of the data set.

[latex]\begin{array}{l} \text{Range} = \text{maximum value} - \text{minimum value} \\ \quad\quad\quad = \text{largest value} - \text{smallest value} \end{array}[/latex]

Range is a value that can describe the spread of the data set. When the range is larger, it indicates more variability in the data. However, range only utilizes two observations in the entire data set to measure variability, so it is not an ideal measure of spread when used alone.

Using the quiz scores from above,

For section A, the range is [latex]0[/latex] since both maximum and minimum are [latex]5[/latex] and [latex]5 - 5 = 0[/latex].
For section B, the range is [latex]10[/latex] since [latex]10 - 0 = 10[/latex].
For section C, the range is [latex]2[/latex] since [latex]6 - 4 = 2[/latex].

For Sections A, B, and C, the range seems to reveal how spread out the data is. However, suppose we add a fourth section, Section D, with scores [latex]0, 5, 5, 5, 5, 5, 5, 5, 5, 10[/latex]. This section also has a mean and median of [latex]5[/latex]. The range is [latex]10[/latex], the same as Section B, yet this data set is quite different from Section B: only two of its scores are far from the mean, rather than all ten. To better illuminate the differences, we’ll have to turn to more sophisticated measures of variation.
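The range calculations for the four sections can be sketched in a few lines of Python. This is only an illustration: the names `sections` and `data_range` are made up for this example, and the scores are copied from the text above.

```python
# Quiz scores copied from the text; the variable names are illustrative.
sections = {
    "A": [5, 5, 5, 5, 5, 5, 5, 5, 5, 5],
    "B": [0, 0, 0, 0, 0, 10, 10, 10, 10, 10],
    "C": [4, 4, 4, 5, 5, 5, 5, 6, 6, 6],
    "D": [0, 5, 5, 5, 5, 5, 5, 5, 5, 10],
}


def data_range(values):
    """Return the range: largest value minus smallest value."""
    return max(values) - min(values)


ranges = {name: data_range(scores) for name, scores in sections.items()}

for name, r in ranges.items():
    print(f"Section {name}: range = {r}")
```

Sections B and D both come out with a range of 10 even though their scores are distributed very differently, which is the limitation described above.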

Calculating Deviation from the Mean

Let’s consider the sample data set [latex]2, 2, 4, 5, 6, 7, 9[/latex].

The mean of this data set is [latex]\bar{x}=\dfrac{2+2+4+5+6+7+9}{7}=\dfrac{35}{7}=5[/latex].

Here is a dotplot of this data set with the mean marked by the vertical blue line.

Dotplot of data set with the mean marked by vertical blue line at 5

We can see that some data values are close to the mean and others are farther from it.

Since we want to see how the data points deviate from the mean, we determine how far each point is from the mean. We compute the difference between each of these values and the mean. These differences are called the deviations.

deviation

In statistics, deviation is a measure of difference between the observed value [latex]x[/latex] of a variable and the mean [latex]\bar{x}[/latex].

Deviation = [latex]x-\bar{x}[/latex]

[latex]x[/latex]	Deviation [latex]=(x-\bar{x})[/latex]
[latex]2[/latex]	[latex]2 - 5 = -3[/latex]
[latex]2[/latex]	[latex]2 - 5 = -3[/latex]
[latex]4[/latex]	[latex]4 - 5 = -1[/latex]
[latex]5[/latex]	[latex]5 - 5 = 0[/latex]
[latex]6[/latex]	[latex]6 - 5 = 1[/latex]
[latex]7[/latex]	[latex]7 - 5 = 2[/latex]
[latex]9[/latex]	[latex]9 - 5 = 4[/latex]
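The table of deviations can be reproduced with a short Python sketch. The variable names here are illustrative, and the data set is the one introduced above.

```python
data = [2, 2, 4, 5, 6, 7, 9]

# Mean of the data set: 35 / 7 = 5.0
mean = sum(data) / len(data)

# Deviation of each value from the mean: x - mean
deviations = [x - mean for x in data]

for x, d in zip(data, deviations):
    print(f"x = {x}, deviation = {d:+.0f}")

# The deviations always cancel out when added together.
print("Sum of deviations:", sum(deviations))
```

Because the negative and positive deviations cancel, their sum is zero. That is one reason a plain average of the deviations cannot serve as a measure of spread.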

When visualized on a dotplot, these differences are viewed as distances between each point and the mean. A negative difference indicates that the data point is to the left of the mean (shown in blue on the graph below). A positive difference indicates that the data point is to the right of the mean (shown in green on the graph below).

Dotplot where negative differences are shown as data points to the left of the mean; positive differences are shown as data points to the right

Our goal is to develop a single measurement that summarizes a typical distance from the mean.


Standard Deviation

What if we want to have one value to represent all of the deviations? This measurement of variability is called standard deviation, which tells us how spread out observations are from the mean.

standard deviation

The standard deviation is a measure of variation based on measuring how far each data value deviates, or is different, from the mean. A few important characteristics:

Standard deviation is never negative. It is zero only if all the data values are equal, and it gets larger as the data spreads out.
Standard deviation has the same units as the original data.
Standard deviation, like the mean, can be highly influenced by outliers.

The following formulas are used to calculate the standard deviation of a population and a sample:

Standard deviation of a population: [latex]\sigma = \sqrt{\dfrac{\sum \left(x-\mu\right)^2}{n}}[/latex], where [latex]\mu[/latex] represents the population mean.

Standard deviation of a sample: [latex]s=\sqrt{\dfrac{\sum \left(x-\bar{x}\right)^2}{n-1}}[/latex], where [latex]\bar{x}[/latex] represents the sample mean.

How To: To Compute Standard Deviation

Calculate the mean of the population or sample.
Find the deviation of each data value from the mean.
Square each deviation.
Add the squared deviations.
Divide by:
- the total number of observations ([latex]n[/latex]) in the case of a population
- [latex]1[/latex] fewer than the total ([latex]n-1[/latex]) in the case of a sample
Compute the square root of the result.
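The steps above can be sketched as a small Python function. This is a minimal illustration, not a standard library routine: the function name `standard_deviation` and its `sample` flag are invented for this example. Python's built-in `statistics.stdev` (sample) and `statistics.pstdev` (population) are used only as a cross-check.

```python
import math
import statistics


def standard_deviation(data, sample=True):
    """Compute the standard deviation by following the steps in the text."""
    n = len(data)
    mean = sum(data) / n                        # mean of the data
    deviations = [x - mean for x in data]       # deviation of each value
    squared = [d ** 2 for d in deviations]      # square each deviation
    total = sum(squared)                        # add the squared deviations
    divisor = n - 1 if sample else n            # n - 1 for a sample, n for a population
    return math.sqrt(total / divisor)           # square root of the result


data = [2, 2, 4, 5, 6, 7, 9]
s = standard_deviation(data, sample=True)       # treats data as a sample
sigma = standard_deviation(data, sample=False)  # treats data as a population

print(f"sample standard deviation:     {s:.2f}")
print(f"population standard deviation: {sigma:.2f}")

# Cross-check against the standard library.
print(math.isclose(s, statistics.stdev(data)))
print(math.isclose(sigma, statistics.pstdev(data)))
```

For the same data, dividing by [latex]n-1[/latex] always gives a slightly larger result than dividing by [latex]n[/latex].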

In statistics, the terms “population” and “sample” refer to different groups that are studied or analyzed. Here’s a breakdown of the difference between the two:

The population refers to the entire group of individuals, objects, or events that we want to study or draw conclusions about. It represents the larger set or the complete collection of elements that share a common characteristic.
A sample is a subset or a smaller representative group selected from the population. It is a portion of the population that is chosen to gather information and make inferences about the entire population. Samples are often used when it is not feasible or practical to study the entire population.

The key distinction between population and sample lies in their scope. The population encompasses the entire group of interest, while the sample represents a smaller, manageable portion of the population that is used to make inferences or draw conclusions about the larger population.

Let’s consider this small sample data set: [latex]2, 2, 4, 5, 6, 7, 9[/latex].

a) Find the mean and the standard deviation of the data set.

Center and spread: the mean is [latex]5[/latex] and the standard deviation is [latex]2.58[/latex].

b) Find the “typical” range of values for this data set.

The typical range of values is within one standard deviation of the mean: between [latex]5 - 2.58 = 2.42[/latex] and [latex]5 + 2.58 = 7.58[/latex].
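Both parts of this example can be checked with Python's built-in `statistics` module. The sketch below assumes that the "typical" range means one standard deviation on either side of the mean, which is consistent with the values given above. The variable names are illustrative.

```python
import statistics

data = [2, 2, 4, 5, 6, 7, 9]

mean = statistics.mean(data)   # center of the data
s = statistics.stdev(data)     # sample standard deviation (divides by n - 1)

# Assumed definition of the "typical" range: mean minus/plus one standard deviation.
low = mean - s
high = mean + s

print(f"mean = {mean}, standard deviation = {s:.2f}")
print(f"typical values fall between {low:.2f} and {high:.2f}")
```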

Variance

Another measure of spread is called variance. Variance is the square of the standard deviation. As with standard deviation, the larger the variance, the greater the variability in the data set; the smaller the variance, the less variability there is in the data set.

variance

Variance is the standard deviation squared, [latex]\sigma^{2}[/latex] or [latex]s^{2}[/latex].

Variance of a population: [latex]\sigma^{2}=\dfrac{\sum\left(x-\mu\right)^{2}}{n}[/latex]

Variance of a sample: [latex]s^{2}=\dfrac{\sum\left(x-\bar{x}\right)^{2}}{n-1}[/latex]

Let’s consider the small sample data set: [latex]2, 2, 4, 5, 6, 7, 9[/latex] from the previous example. The steps to calculate variance are the same as those for standard deviation, except that you skip the final step of taking the square root. The squared deviations add up to [latex]40[/latex], so the variance of this sample is [latex]\dfrac{40}{7-1}=\dfrac{40}{6}\approx 6.67[/latex].
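As a quick check, the variance can be computed with Python's built-in `statistics` module, where `variance` divides by [latex]n-1[/latex] (sample) and `pvariance` divides by [latex]n[/latex] (population). The variable names below are illustrative.

```python
import math
import statistics

data = [2, 2, 4, 5, 6, 7, 9]

sample_var = statistics.variance(data)       # divides by n - 1
population_var = statistics.pvariance(data)  # divides by n

print(f"sample variance:     {sample_var:.2f}")
print(f"population variance: {population_var:.2f}")

# The variance is the square of the standard deviation.
print(math.isclose(statistics.stdev(data) ** 2, sample_var))
```

The last line confirms the relationship stated above: squaring the sample standard deviation gives the sample variance.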