Data Organization: Cheat Sheet


Essential Concepts

  • When we have information that can be put into different categories, we can use graphs and charts to help us understand the data better.
  • One type of graph is called a bar graph. It uses bars to represent each category, and the length of each bar shows how many items are in that category.
  • Sometimes it’s helpful to order the bars from the category with the most items to the category with the fewest. A bar graph ordered this way is called a Pareto chart. It makes it easier to compare the categories and see which ones have more or fewer items.
  • Pie charts look like a circle cut into slices, like a pizza. The size of each slice represents the share of all items that fall in that category. When we add up the percentages for all the slices, the total should always equal [latex]100\%[/latex].
  • When we want to compare a category across different groups, we can use side-by-side bar charts or stacked bar charts. These charts help us see how the category changes between different populations or groups.
    • Side-by-side bar charts show data for two or more categories from multiple groups. Each group has multiple bars, one for each category. This helps us compare the categories between the groups.
    • Stacked bar charts also compare categories between groups. Each bar represents a group, and different colors within the bar represent different categories. The height of each color shows the percentage of responses in that category, and when we add up all the colors, it equals [latex]100\%[/latex] for that group.
  • Histograms are a type of graph that is useful when we have a lot of numerical data to show. They group the data into equal-width “bins,” which are drawn as bars. The bins can be made wider or narrower depending on how many observations we have. The height of each bar shows the number of observations that fall within the range of that bin. Each bin includes its left edge but not its right edge, so an observation that falls exactly on a boundary is counted in the bin to the right.
  • Dotplots are a type of graph that shows how many times each value appears in a set of data. Each observation is represented by a dot on the graph.
  • The mean is also known as the average. To find the mean of a set of numbers, you add up all the numbers in the set and then divide the sum by the total number of numbers. It gives you a sense of the typical value in the data set.
  • The median is the middle value in a set of numbers when they are arranged in order from least to greatest. If there is an odd number of values, the median is the number in the middle. If there is an even number of values, the median is the average of the two middle numbers. The median gives you an idea of the value that is right in the middle.
  • The mode is the value that appears most frequently in a set of numbers. It represents the value that occurs the most often. It’s possible to have more than one mode if there are multiple values that appear with the same highest frequency.
  • The range tells us how spread out the values in a set of numbers are. To find the range, we look at the highest value in the set and subtract the lowest value from it.
  • To calculate deviation from the mean, we find how far each data point is from the average (mean) of the data set by subtracting the mean from each data point. These differences are called deviations.
  • Standard deviation is a measure of how spread out the numbers are from the average. It is never negative, and it equals zero only when every value in the data set is the same. Because it is based on distances from the mean, it can be strongly influenced by outliers.
    • Standard deviation of a population: [latex]\sigma = \sqrt{\dfrac{\sum \left(x-\mu\right)^2}{n}}[/latex]
    • Standard deviation of a sample: [latex]s=\sqrt{\dfrac{\sum \left(x-\bar{x}\right)^2}{n-1}}[/latex]
  • Variance is another measure of how spread out the numbers are in a data set. It equals the square of the standard deviation. A larger variance means more variability in the data, while a smaller variance means less variability.
    • Variance of a population: [latex]\sigma^{2}=\dfrac{\sum\left(x-\mu\right)^{2}}{n}[/latex]
    • Variance of a sample: [latex]s^{2}=\dfrac{\sum\left(x-\bar{x}\right)^{2}}{n-1}[/latex]
  • The five-number summary is a way to summarize a set of numbers using five key values. These values are the minimum, the first quartile (Q1), the median (Q2), the third quartile (Q3), and the maximum. The minimum is the smallest number in the data set, while the maximum is the largest number. The median is the middle value when the numbers are arranged in order. The first quartile is the median of the lower half of the numbers, and the third quartile is the median of the upper half of the numbers. The five-number summary helps us understand the range and distribution of the data.
    • A boxplot is used to show this visually.
  • Skew refers to the asymmetry of a distribution of data. A distribution can take one of three general shapes: left skewed, symmetric (no skew), or right skewed.
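
The measures of center and spread above, along with the histogram bin rule, can be sketched in a few lines of Python. This is only an illustration: the data values, the helper name `bin_counts`, and its parameters are made up for this example, and the standard-library `statistics` module is just one convenient way to do the arithmetic.

```python
from collections import Counter
from statistics import mean, median, multimode

# Hypothetical data, chosen only for illustration.
data = [2, 4, 4, 5, 7, 9, 11]

data_mean = mean(data)              # sum of the values divided by how many there are
data_median = median(data)          # middle value once the data are sorted
data_modes = multimode(data)        # list of every value tied for the highest frequency
data_range = max(data) - min(data)  # highest value minus lowest value

# Deviation from the mean: each data point minus the mean.
deviations = [x - data_mean for x in data]


def bin_counts(values, start, width, n_bins):
    """Count values in equal-width bins; a bin includes its left edge, not its right."""
    counts = Counter()
    for v in values:
        index = int((v - start) // width)  # which bin the value lands in
        if 0 <= index < n_bins:            # ignore values outside the binned range
            counts[index] += 1
    return [counts[i] for i in range(n_bins)]


# Bins cover 0 to 4, 4 to 8, and 8 to 12.
histogram = bin_counts(data, start=0, width=4, n_bins=3)
```

With this data, the two observations equal to 4 sit exactly on a bin boundary, so they are counted in the second bin (4 up to, but not including, 8) rather than the first.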

Glossary

first quartile

the median of the values that lie below the median for the whole data set

left skewed

a cluster of data on the right with a tail of data tapering off to the left

mean

the mean of a set of [latex]n[/latex] numbers is their arithmetic average: the sum of the numbers divided by [latex]n[/latex]

median

the middle value of an ordered data set; half the data values are less than or equal to the median and half the data values are greater than or equal to the median

mode

the value with the highest frequency; a data set has more than one mode when several values tie for the highest frequency

right skewed

a cluster of data on the left with a tail of data tapering off to the right

symmetric

a cluster of data where the left and right sides of the distribution closely mirror each other

third quartile

the median of the values that lie above the median for the whole data set

[latex]5[/latex]-number summary

Minimum, [latex]Q1[/latex], Median, [latex]Q3[/latex], Maximum
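
As a rough sketch of how the quartile definitions in this glossary turn into a calculation, the Python below splits the sorted data into a lower half and an upper half and takes the median of each. The function name `five_number_summary` and the data are hypothetical, and the sketch splits the data by position, leaving the middle value out of both halves when the count is odd. Calculators and software packages sometimes use other quartile conventions, so their answers may differ slightly.

```python
from statistics import median


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using the median-of-halves rule.

    Assumes at least two values so that both halves are non-empty.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]          # values before the middle position
    upper = ordered[half + n % 2:]  # values after the middle position
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])


# Hypothetical data, chosen only for illustration.
summary = five_number_summary([1, 3, 4, 6, 8, 9, 12])
# summary is (1, 3, 6, 9, 12)
```

These are the five values that a boxplot draws: the ends of the whiskers, the two edges of the box, and the line inside the box.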

Key Equations

mean

[latex]\text{mean}={\Large\frac{\text{sum of values in data set}}{n}}[/latex]

standard deviation of a population

[latex]\sigma = \sqrt{\dfrac{\sum \left(x-\mu\right)^2}{n}}[/latex]

standard deviation of a sample

[latex]s=\sqrt{\dfrac{\sum \left(x-\bar{x}\right)^2}{n-1}}[/latex]

variance of a population

[latex]\sigma^{2}=\dfrac{\sum\left(x-\mu\right)^{2}}{n}[/latex]

variance of a sample

[latex]s^{2}=\dfrac{\sum\left(x-\bar{x}\right)^{2}}{n-1}[/latex]
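
To see how the population and sample formulas differ only in the divisor, here is a minimal Python sketch that applies them directly and then cross-checks the results against the standard-library `statistics` module. The data set and the helper name `sum_squared_deviations` are assumptions made for this example.

```python
from math import sqrt
from statistics import mean, pstdev, pvariance, stdev, variance


def sum_squared_deviations(values):
    """Add up the squared deviations of each value from the mean."""
    m = mean(values)
    return sum((x - m) ** 2 for x in values)


# Hypothetical data, chosen only for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)
ss = sum_squared_deviations(data)           # numerator shared by all four formulas

population_variance = ss / n                # divide by n
sample_variance = ss / (n - 1)              # divide by n - 1
population_sd = sqrt(population_variance)   # square root of the population variance
sample_sd = sqrt(sample_variance)           # square root of the sample variance

# Cross-check against the standard library's versions of the same formulas.
library_values = (pvariance(data), variance(data), pstdev(data), stdev(data))
```

For this data the mean is 5 and the squared deviations add up to 32, so the population variance is 32 / 8 = 4 and the population standard deviation is 2. The sample versions divide by 7 instead, which makes them slightly larger.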