Name the features of the distribution of a data set using statistical language
Describe the connection between the distribution of a data set and its mean and median
Recall that we think of the mean as the “average” data value and the median as the [latex]50[/latex]th percentile, the value that splits the data in half. Let’s say the mean of a data set is given as [latex]10.5[/latex] and the median as [latex]11[/latex]. Which of the following statements are true? Explain.
The median tells us a typical value for this data set. That is, if we took all the values and spread them evenly about, each value would be about [latex]11[/latex].
About half the data values fall below [latex]11[/latex] and half fall above.
The most common data value appearing is [latex]10.5[/latex].
A typical data value for this set is [latex]10.5[/latex]. That is, if we distributed the sum of all the values evenly, each value would be about [latex]10.5[/latex].
False. The median represents the [latex]50[/latex]th percentile, with about half the values falling above [latex]11[/latex] and half below.
True. The median is [latex]11[/latex].
Cannot be determined. The mode tells us the most common data value. Neither the mean nor the median gives us that information.
True. The mean is [latex]10.5[/latex], which we can consider to be the “average” data value.
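To see why the third statement cannot be determined, it can help to build two small data sets that share the given mean and median but have different modes. The sketch below uses Python's standard `statistics` module. Both data sets are hypothetical values invented for illustration; they do not come from the lesson.

```python
from statistics import mean, median, mode

# Two hypothetical data sets (made up for illustration only).
# Both have mean 10.5 and median 11, but their most common values differ.
set_a = [7, 7, 10, 12, 13, 14]
set_b = [6.0, 10.5, 10.5, 11.5, 12.0, 12.5]

for name, values in (("A", set_a), ("B", set_b)):
    print(name, mean(values), median(values), mode(values))
```

Both sets report a mean of 10.5 and a median of 11, yet the mode is 7 in one and 10.5 in the other, so the mean and median alone do not reveal the most common value.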
Recall the data set about employee salaries.
Suppose that the first data set lists the monthly salaries (in thousands of dollars) for all six employees at a company during the month of January. For example, Employee [latex]1[/latex] made [latex]\$4,000[/latex] in January, Employee [latex]2[/latex] made [latex]\$6,000[/latex], and so on. We’ll consider this amount the regular salary per month for each of these employees.
The second data set lists the monthly salaries (in thousands of dollars) for the same six employees during the month of February.
| Employee | Monthly Salary in January (in thousands of dollars) | Monthly Salary in February (in thousands of dollars) |
|---|---|---|
| Employee 1 | [latex]4[/latex] | [latex]4[/latex] |
| Employee 2 | [latex]6[/latex] | [latex]8[/latex] |
| Employee 3 | [latex]3[/latex] | [latex]3[/latex] |
| Employee 4 | [latex]5[/latex] | [latex]5[/latex] |
| Employee 5 | [latex]6[/latex] | [latex]6[/latex] |
| Employee 6 | [latex]3[/latex] | [latex]3[/latex] |
We saw that the median and the mean employee salaries for January were the same. What can we understand from that information?
The median of the data set implies that ____________ made more than [latex]\$4,500[/latex] in January and _________ made less.
The mean of the data set implies that if the January salaries had been added up and evenly distributed across all six employees, each person would have received ________________.
Half the employees made more than [latex]\$4,500[/latex] and half made less.
Each person would have received [latex]\$4,500[/latex]. That is, the average salary was [latex]\$4,500[/latex] for January.
It is interesting that the mean and the median were identical. This tells us that the salaries were evenly balanced between high and low values; here, the distribution is symmetric, without skew.
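As a quick check of the January figures, the following minimal sketch computes the mean and median with Python's standard `statistics` module. The variable names are illustrative. The pairing step at the end is one simple way to test symmetry and is an addition here, not part of the original activity.

```python
from statistics import mean, median

# January monthly salaries in thousands of dollars, Employees 1 through 6
january = [4, 6, 3, 5, 6, 3]

january_mean = mean(january)      # 27 / 6
january_median = median(january)  # average of the two middle values, 4 and 5
print(january_mean, january_median)  # 4.5 4.5

# Symmetry check: pair the smallest value with the largest, the second
# smallest with the second largest, and so on. In a symmetric data set
# every pair has the same sum.
ordered = sorted(january)
pair_sums = [low + high for low, high in zip(ordered, reversed(ordered))]
print(pair_sums)  # [9, 9, 9, 9, 9, 9]
```

Every pair sums to 9, which is twice the center value of 4.5, so the values mirror each other around the mean and median.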
But what happens if we change one of the values in the data set?
Comparing Mean and Median
Let’s look at the data set of employee salaries from February, which includes a big raise for one employee. How will the mean of the February salaries compare to the mean of the January salaries?
| Employee | Monthly Salary in February (in thousands of dollars) |
|---|---|
| Employee 1 | [latex]4[/latex] |
| Employee 2 | [latex]8[/latex] |
| Employee 3 | [latex]3[/latex] |
| Employee 4 | [latex]5[/latex] |
| Employee 5 | [latex]6[/latex] |
| Employee 6 | [latex]3[/latex] |
Was the mean you calculated for February salaries higher, lower, or similar? What do you think caused that to be true?
The mean is now higher than the median. They were identical in January.
Did the increase in one salary cause the mean to rise?
Would that always happen if a data value increases?
How could we predict mathematically how much the mean would increase under the increase of a single value?
We could predict the increase in the mean mathematically by taking the difference between the January salary and the February salary, then distributing that difference evenly among the employees.
Example: One salary increased by [latex]\$2,000[/latex]. If we divide the [latex]\$2,000[/latex] across all six employees, we’ll have the amount by which the new mean is higher.
[latex]\dfrac{\$2,000}{6}\approx\$333.33[/latex]
For January, [latex]\bar{x}=\$4,500[/latex] and for February, [latex]\bar{x}\approx\$4,833.33[/latex]. The mean increased by about [latex]\$333.33[/latex].
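The prediction can be checked numerically. The sketch below compares the predicted increase (the raise divided by the number of employees) with the actual difference between the two means. The variable names are illustrative, and salaries are in thousands of dollars as in the tables.

```python
import math
from statistics import mean

# Monthly salaries in thousands of dollars, Employees 1 through 6
january = [4, 6, 3, 5, 6, 3]
february = [4, 8, 3, 5, 6, 3]

# Total change in salaries: only Employee 2's raise of 2 (thousand dollars)
raise_amount = sum(february) - sum(january)

# Spread the raise evenly across all six employees
predicted_increase = raise_amount / len(january)

# Compare with the directly computed difference of the means
actual_increase = mean(february) - mean(january)

print(f"{predicted_increase * 1000:.2f}")  # 333.33
print(math.isclose(predicted_increase, actual_increase))  # True
```

The comparison uses `math.isclose` because the two results are floating-point numbers that may differ in their last digits.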
But why did the median stay the same? Would the median always be roughly the same if a data value changes?
As long as the middle value, or the two middle values, stay the same, the median does not change.
What would happen, though, if instead of Employee 2 receiving the raise, Employee 1’s salary had been raised to [latex]\$8,000[/latex]? What would the new median be?
The January median of the data set [latex]3, 3, 4, 5, 6, 6[/latex] is the mean of [latex]4[/latex] and [latex]5[/latex] in thousands of dollars, or [latex]\$4,500[/latex].
Changing one of the salaries from [latex]6[/latex] thousand to [latex]8[/latex] thousand didn’t affect the middle two numbers.
But changing the [latex]4[/latex] to an [latex]8[/latex] would require the reordering of the values.
After reordering, [latex]3, 3, 5, 6, 6, 8[/latex] yields a median of [latex]5.5[/latex] thousand dollars, or [latex]\$5,500[/latex].
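The two scenarios can be compared side by side. This minimal sketch uses Python's standard `statistics.median`, which sorts the values and averages the two middle ones when the count is even. The list names are illustrative.

```python
from statistics import median

# Monthly salaries in thousands of dollars, Employees 1 through 6
january = [4, 6, 3, 5, 6, 3]
raise_for_employee_2 = [4, 8, 3, 5, 6, 3]  # 6 becomes 8 (the actual February data)
raise_for_employee_1 = [8, 6, 3, 5, 6, 3]  # 4 becomes 8 (the "what if" scenario)

print(sorted(january), median(january))                            # middle values 4 and 5
print(sorted(raise_for_employee_2), median(raise_for_employee_2))  # middle values 4 and 5
print(sorted(raise_for_employee_1), median(raise_for_employee_1))  # middle values 5 and 6
```

The median stays at 4.5 when the raise goes to Employee 2 but moves to 5.5 when the 4 becomes an 8, because only the second change alters the two middle values.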
Now let’s consider a slightly different question.
It may take some time before you feel comfortable interpreting means and medians and understanding what they imply about a data set. A key idea to take from this activity is that when one value changes by a large amount, the median tends to stay relatively fixed, while the mean does not. This tells us that the mean is sensitive to the presence of extreme values in the data set.
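One way to see this sensitivity is to push a single salary to more and more extreme values and watch both measures. In the sketch below, the values 20 and 100 for Employee 2 are hypothetical and chosen only for illustration; the values 6 and 8 come from the January and February data.

```python
from statistics import mean, median

# January salaries in thousands of dollars, Employees 1 through 6
base = [4, 6, 3, 5, 6, 3]

results = []
for top in (6, 8, 20, 100):  # 20 and 100 are hypothetical extreme values
    salaries = base.copy()
    salaries[1] = top        # replace Employee 2's salary
    results.append((top, round(mean(salaries), 2), median(salaries)))

for top, avg, mid in results:
    print(f"Employee 2 earns {top}: mean = {avg}, median = {mid}")
```

The mean climbs from 4.5 to about 20.17 as Employee 2's salary grows, while the median stays at 4.5 in every case.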