Data Exploration: Learn It 1

  • Use a spreadsheet to calculate the mean, median, and mode for a data set
  • Use a spreadsheet to calculate the range and standard deviation for a data set
  • Create graphical comparisons between two data sets to identify differences in their characteristics

Data Exploration: [latex]3[/latex]x[latex]3[/latex] Speed Cube Competition

Decorative image of a Rubik's cube

 

A group of [latex]50[/latex] students at a hypothetical math camp was taught a beginner’s algorithm for solving a [latex]3[/latex]x[latex]3[/latex] speed cube. None of them had ever learned a solving method before. After a week of practice, each student solved [latex]10[/latex] cubes and recorded the average solving time. Those average times, in seconds, are listed below.

[latex]129[/latex] [latex]122[/latex] [latex]115[/latex] [latex]100[/latex] [latex]108[/latex] [latex]129[/latex] [latex]131[/latex] [latex]145[/latex] [latex]118[/latex] [latex]96[/latex]
[latex]149[/latex] [latex]95[/latex] [latex]118[/latex] [latex]158[/latex] [latex]131[/latex] [latex]135[/latex] [latex]145[/latex] [latex]98[/latex] [latex]123[/latex] [latex]161[/latex]
[latex]166[/latex] [latex]110[/latex] [latex]145[/latex] [latex]96[/latex] [latex]157[/latex] [latex]158[/latex] [latex]117[/latex] [latex]115[/latex] [latex]103[/latex] [latex]145[/latex]
[latex]143[/latex] [latex]94[/latex] [latex]97[/latex] [latex]154[/latex] [latex]162[/latex] [latex]111[/latex] [latex]110[/latex] [latex]137[/latex] [latex]150[/latex] [latex]145[/latex]
[latex]119[/latex] [latex]155[/latex] [latex]112[/latex] [latex]129[/latex] [latex]140[/latex] [latex]139[/latex] [latex]164[/latex] [latex]158[/latex] [latex]134[/latex] [latex]98[/latex]


We’d like to explore the shape, center, and variability of the data to gain some understanding. Let’s start by opening a new workbook and saving it somewhere safe. You should save your work frequently or enable an auto-save feature. Rename Sheet1 as Speed Cube Times.

Spreadsheet Hands-On: Use a Spreadsheet to Perform Descriptive Statistics

The examples shown below use Microsoft Excel, but you can follow along in a free alternative such as Apache OpenOffice Calc or Google Sheets.

Step [latex]1[/latex]: Load the data set

  1. Type the [latex]50[/latex] numbers from the table above into column A of the spreadsheet. You may be able to copy them from the text page and paste them into the worksheet. If so, you’ll need to transpose the rows into columns: after pasting, you’ll have [latex]5[/latex] rows of [latex]10[/latex] columns. Select the first row, copy it, paste it elsewhere on the sheet, click the (Ctrl) Paste Options button that appears, and choose Transpose (the icon with the perpendicular arrows). Continue until all [latex]50[/latex] values are in column A, in cells A1 through A50. Then delete the tabular data. (A short sketch after this step’s screenshots shows the same flatten-into-one-column idea in code.)
  2. The numbers don’t need to be in any particular order, but if you’d like to arrange them, you can. Select cells A1 – A50 where the data is. Click on Data in the ribbon above the sheet, then click on Sort. Make sure that “My data has headers” is unchecked. Then sort by Column A on Cell Values, Smallest to Largest, and click OK.
Screenshot of Microsoft Excel highlighting the Data and Sort buttons.
  3. Check whether the Excel Data Analysis ToolPak is loaded into your copy of the program. Click Data and look for a Data Analysis button on the ribbon (see the image below). If it is there, you already have the ToolPak. If it is not, click File, then Options. Choose Add-ins in the menu that opens, then click Go… Check Analysis ToolPak in the next menu and click OK. Once you add it, you won’t have to add it again.
Screenshot of Excel highlighting the Data and Data Analysis buttons
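If you are curious what that reshaping step looks like outside a spreadsheet, here is a minimal, optional Python sketch. The names rows and times are placeholders for this illustration; the flattening line does the same job as the paste-and-transpose steps above.

```python
# Optional sketch (not part of the Excel steps): flatten the five pasted rows
# into one column of values, the same thing the paste-transpose step accomplishes.
rows = [
    [129, 122, 115, 100, 108, 129, 131, 145, 118, 96],
    [149, 95, 118, 158, 131, 135, 145, 98, 123, 161],
    [166, 110, 145, 96, 157, 158, 117, 115, 103, 145],
    [143, 94, 97, 154, 162, 111, 110, 137, 150, 145],
    [119, 155, 112, 129, 140, 139, 164, 158, 134, 98],
]

times = [t for row in rows for t in row]   # one "column" of 50 values, like column A
print(len(times))                          # 50
print(sorted(times)[:5])                   # the five fastest averages, like the Sort step
```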

 

The next step is to get a list of descriptive statistics from the Data Analysis ToolPak.

Step [latex]2[/latex]: Obtain the descriptive statistics from the ToolPak.

  1. Excel and other spreadsheets have functions that let you obtain descriptive statistics one at a time. That’s time-consuming, so we’ll use the Descriptive Statistics feature of the Analysis ToolPak instead. Click Data, then Data Analysis, and choose Descriptive Statistics.
  2. Click into the box next to Input Range to give your data to the tool. Then select your data, or just type $A$1:$A$50 (assuming your data is in column A in rows 1 – 50).
  3. Click in the Output Range box and choose a cell to put your summary in, like $C$1.
  4. Check the boxes for Summary statistics, Kth Largest (set to [latex]1[/latex]), and Kth Smallest (set to [latex]1[/latex]), then click OK. A column of statistics labeled Column [latex]1[/latex] should appear. You can click that label and rename it if you like.
A screenshot of the summary statistics in Excel

 

  5. Note that the Minimum value is [latex]94[/latex] and the Maximum is [latex]166[/latex]. We also checked the boxes to be given the Largest ([latex]1[/latex]), which is [latex]166[/latex], and the Smallest ([latex]1[/latex]), which is [latex]94[/latex]. You could have chosen any value of K, say [latex]2[/latex] or [latex]3[/latex], to see values a few places in from the ends of the range and get a feel for any outliers (unusually large or small values) in the set.
  6. We can read the mean, median, and mode from the output, along with the standard deviation and the range. Another statistic of interest will be skewness.
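If you would like to verify the ToolPak’s output another way, here is a minimal Python sketch using only the standard library. The list name times is an assumption for this illustration, and the skewness line uses the adjusted Fisher-Pearson formula, which is the one Excel documents for its SKEW() function.

```python
# Optional check of the Descriptive Statistics output, using only the standard library.
from statistics import mean, median, mode, stdev

times = [129, 122, 115, 100, 108, 129, 131, 145, 118, 96,
         149, 95, 118, 158, 131, 135, 145, 98, 123, 161,
         166, 110, 145, 96, 157, 158, 117, 115, 103, 145,
         143, 94, 97, 154, 162, 111, 110, 137, 150, 145,
         119, 155, 112, 129, 140, 139, 164, 158, 134, 98]

n = len(times)
xbar = mean(times)
s = stdev(times)   # sample standard deviation, the same statistic the ToolPak reports

# Adjusted Fisher-Pearson skewness, the formula behind Excel's SKEW()
skew = n / ((n - 1) * (n - 2)) * sum(((x - xbar) / s) ** 3 for x in times)

print("Mean:", round(xbar, 1))
print("Median:", median(times))
print("Mode:", mode(times))
print("Standard deviation:", round(s, 1))
print("Range:", max(times) - min(times))
print("Skewness:", round(skew, 2))
```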

Now that our data is loaded and groomed, we’d like to start looking at it from different angles. Getting practice looking at various representations of data will help you learn what views work best with different types of information. We will create a frequency table and a histogram first. 

You learned about creating Frequency Tables and Histograms by hand earlier in the text. Now we’ll let the program do the heavy lifting for us. But there is some preparation we must do first. You may recall that a frequency table has two columns: one for the categories and one for the frequencies. We’ll need to decide on some categories that make sense for our particular quantitative data and the class interval for each.

Step [latex]3[/latex]: Create a Frequency Table and a Histogram

Recall the details of our data set. The numbers represent the average times of novice cube solvers after a week of instruction. Perhaps we are interested in how our class of [latex]50[/latex] students has progressed and how we might tweak the subsequent instruction. Different representations of the spread and center of the data may yield some insight. A histogram could help us get a sense of how long it’s taking the students to solve a cube, on average, by grouping the times into equal intervals measured in seconds. But what is the best way to divide them up? Examining the range is a good place to start.

The range of the data is [latex]72[/latex] seconds. That is, counting both the fastest time of [latex]94[/latex] seconds and the slowest time of [latex]166[/latex] seconds, there are [latex]73[/latex] possible integer values to distribute among the bins. We want to choose class intervals (the width of each bin) that make sense for our data, but choosing the intervals is a subjective task. A good rule of thumb is that the bins should be neither so narrow nor so wide that the shape of the distribution becomes difficult to see. Sometimes it’s best to play around with different widths to find the best representation. There are also statistical rules that can help choose the number of bins. Two that are commonly used are Rice’s Rule and Sturges’ Rule.

Handling roots in the calculator

Recall that [latex]{\sqrt[n]{a}}^{m} = a^{\frac{m}{n}}[/latex]

This allows us to write [latex]\sqrt[3]{a} = {a}^{\frac{1}{3}}[/latex]

That is, to evaluate a cube root such as [latex]\sqrt[3]{8}[/latex] in the calculator, raise [latex]8[/latex] to the [latex]\frac{1}{3}[/latex] power. Try it: [latex]8\wedge(1/3)[/latex]. You should get [latex]2[/latex].

  • Rice’s Rule states that the number of bins should be twice the cube root of the sample size, [latex]n[/latex]. That’s [latex]2\ast \sqrt[3]{n}[/latex]. In this case, [latex]2\ast \sqrt[3]{50}\approx7.4[/latex]
  • Sturges’ Rule states that the number of bins should be approximately [latex]1+3.3\log n[/latex]. In this case, [latex]1+3.3\log50 \approx 6.6[/latex]

We can compromise on [latex]7[/latex] bins, which gives a class interval of about [latex]10.4[/latex]. So we’ll round up, make each bin [latex]11[/latex] seconds wide, and run the bins from [latex]92[/latex] to [latex]168[/latex].
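If you’d like to check those rule-of-thumb values and the resulting class interval without reaching for a calculator, here is a quick Python sketch; nothing in it is specific to Excel, and the only inputs are the sample size of [latex]50[/latex] and the [latex]73[/latex] possible values noted above.

```python
# Quick check of the two bin-count rules and the resulting class interval.
import math

n = 50                                  # sample size
rice = 2 * n ** (1 / 3)                 # Rice's Rule: twice the cube root of n
sturges = 1 + 3.3 * math.log10(n)       # Sturges' Rule: 1 + 3.3 * log10(n)
interval = 73 / 7                       # 73 possible values spread over 7 bins

print(round(rice, 1), round(sturges, 1), round(interval, 1))   # 7.4  6.6  10.4
```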

  1. Excel will only accept the rightmost edge of the bin, so we’ll need two columns. Type out the bins and rightmost bin edges for Excel in columns F and G as shown in the image below:
Screenshot of the data in Excel

 

  2. Next, click on Data Analysis and choose Histogram. Set the Input Range to $A$1:$A$50 to feed it the data. Set the Bin Range to $G$1:$G$8 and uncheck the Labels box. Set the Output Range to $I$1 and check Chart Output. Click OK.
  3. Next, create the Frequency Table. Select and copy the Frequency column the tool produced and paste it somewhere in the middle of the sheet. Then copy the Times column from column F and paste it adjacent to the pasted frequencies. Adjust your formatting for font and style. (The sketch after the next screenshot computes the same frequencies in code as a check.)
  4. We need to do a little housekeeping on the histogram to remove the blank spaces between the bars. Right-click on a data bar in the histogram, choose Format Data Series, and select Series Options (it may already be selected for you). Set the Gap Width to [latex]0\%[/latex]. Then click the paint bucket (Fill & Line) icon and, under Border, choose Solid line.
Screenshot of the data, summary statistics, frequency, and a histogram in Excel
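As a quick sanity check on the frequency column Excel produces, here is a short Python sketch that counts the times falling into each of the seven [latex]11[/latex]-second bins chosen above; the variable names are assumptions for this illustration.

```python
# Optional check of the frequency table: count how many times fall in each 11-second bin.
times = [129, 122, 115, 100, 108, 129, 131, 145, 118, 96,
         149, 95, 118, 158, 131, 135, 145, 98, 123, 161,
         166, 110, 145, 96, 157, 158, 117, 115, 103, 145,
         143, 94, 97, 154, 162, 111, 110, 137, 150, 145,
         119, 155, 112, 129, 140, 139, 164, 158, 134, 98]

left_edges = range(92, 169, 11)                      # 92, 103, 114, 125, 136, 147, 158
for lo in left_edges:
    hi = lo + 10                                     # rightmost value in the bin, e.g. 92-102
    count = sum(1 for t in times if lo <= t <= hi)
    print(f"{lo}-{hi}: {count}")
```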

 

  5. The frequency table and histogram both indicate that the data are fairly evenly distributed in frequency. The mean and median agree at about [latex]130[/latex], which indicates that the data are distributed symmetrically about the center, with a standard deviation of about [latex]22[/latex]. The skewness also indicates a symmetric distribution, since the skew value is very close to zero. That means we don’t have any large outliers pulling the mean in one direction or the other. Since about [latex]68\%[/latex] of the data in a normal distribution lie within [latex]1[/latex] standard deviation of the center, we can be fairly certain that most of the students are solving the cube in roughly [latex]130-22=108[/latex] to [latex]130+22=152[/latex] seconds, on average. The mode is [latex]145[/latex], so there is a peak of students at that mark, on average. That doesn’t mean that most students are solving the cube in [latex]145[/latex] seconds; it just means that of all the recorded solving times, [latex]145[/latex] came up the most. A dot plot would make the mode easier to see.

Step [latex]4[/latex]: Create a dot plot.

  1. The dot plot will have a horizontal axis containing all the integers from [latex]94[/latex] to [latex]166[/latex]. We’ll need a running count for each number appearing in the data set: essentially, we move through the set, number by number, and place a dot on the chart above each value, stacking repeats. We’ll have Excel do that for us with a function called COUNTIF(). Click into cell B1.
  2. In cell B1, type =COUNTIF($A$1:$A1,A1) and press Enter. This counts how many times the value on the current row has appeared in column A so far, so repeated values get running counts of [latex]1, 2, 3, \ldots[/latex] as you move down the column. Select the cell again and drag the formula down through the column to B50. (The sketch after the screenshot below walks through the same running-count idea.)
  3. Now highlight columns A and B down through row [latex]50[/latex]. Click Insert in the ribbon, then Charts, and choose the scatter chart (the one without lines or curves). The mode stands out as the tallest vertical stack of points on the chart.
Screenshot of the data, summary statistics, frequency, a histogram, and a dotplot in Excel
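If the running-count trick feels mysterious, here is a tiny Python sketch of the same idea. The values list is a short, made-up example (not the actual data set), just to show how repeated values stack up.

```python
# Illustration of what =COUNTIF($A$1:$A1,A1) does as it is dragged down the column:
# each value gets a count of how many times it has appeared so far.
values = [145, 129, 145, 98, 145]      # hypothetical short list, not the full data set

seen = {}
running_counts = []
for v in values:
    seen[v] = seen.get(v, 0) + 1       # occurrences of v so far, including this one
    running_counts.append(seen[v])

print(list(zip(values, running_counts)))
# [(145, 1), (129, 1), (145, 2), (98, 1), (145, 3)]  -> the 145s stack up like dots
```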

 

There are more explorations we could make of this data, but they would mostly uncover what we’ve already found. Most of our students have done well and, after one week, are solving the cube somewhere between [latex]108[/latex] and [latex]152[/latex] seconds on average, with a handful of quick learners and a handful of students lagging behind. That’s still a spread of nearly [latex]\frac{3}{4}[/latex] of a minute around the center, though, and no clear pattern emerges for the class as a whole. You may wish to revisit this data after practicing some other methods in the next sections to see if you gain more insight.