Statistical Thinking: Learn It 2—Statistical Significance

Statistical Significance and P-Values

Even when we find patterns in data, some uncertainty usually remains. For example, measurements can contain error (even your own body temperature can fluctuate by almost 1 °F over the course of the day), or we may need to generalize about an entire population from a small sample. In such cases, we use statistics to judge how likely it is that a pattern reflects something real rather than chance.

Example: Do Babies Prefer Helpers?

In a classic study published in Nature (Hamlin, Wynn, & Bloom, 2007), researchers asked a fascinating question:

Can infants judge others based on helpful or harmful behavior?

In one version of the experiment, 10-month-old babies watched a simple puppet show:

  • A wooden “climber” with googly eyes tried—and failed—to climb a hill.
  • In one scenario, another character helped push the climber up the hill.
  • In another, a different character pushed the climber back down.

After several repetitions, the infants were shown both toys—the helper and the hinderer—and invited to choose one to play with.[1]

Images of the little figures shown to infants in the experiment: a red circle with googly eyes is the main character. In the first situation, a blue square helps it up a hill; in the second situation, a yellow triangle pushes it down the hill.
Figure 1. In the research study, babies were shown a character trying to climb up a hill. In this case, the red circle was trying to climb the hill and the blue square helped it up, while the yellow triangle pushed it down.

Of the 16 infants who made a clear choice, 14 picked the helper toy.

That seems like a strong preference—but scientists must rule out other explanations before drawing conclusions.

The researchers controlled for:

  • Color and shape: Each toy took turns being the helper or hinderer.
  • Position: Half of the infants saw the helper on the left, half on the right.
  • Familiarity: All infants saw the same number of helping and hindering acts.

After accounting for these factors, one question remained: Could the result still be due to random chance?

Understanding Randomness in Research

Even if all infants had no real preference, we wouldn’t expect exactly half to pick each toy every time. Random variation happens naturally.

For instance, if each baby’s choice were like flipping a coin—50% chance helper, 50% chance hinderer—we might sometimes get 9 “helper” choices out of 16, sometimes 7, sometimes 11. But getting 14 out of 16 “helper” choices would be very unusual if the choices were purely random.
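The coin-flip idea above can be tried out directly. The following is a minimal sketch in Python, assuming each infant's choice is an independent fair coin flip; the function name, the number of simulated experiments, and the seed are illustrative choices, not part of the original study.

```python
import random

def simulate_helper_counts(n_infants=16, n_experiments=100_000, seed=0):
    """Simulate many experiments in which every infant chooses at random."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    counts = []
    for _ in range(n_experiments):
        # Each infant is a fair coin flip: True means "chose the helper".
        helpers = sum(rng.random() < 0.5 for _ in range(n_infants))
        counts.append(helpers)
    return counts

counts = simulate_helper_counts()
share_14_plus = sum(c >= 14 for c in counts) / len(counts)
print(f"Share of simulated experiments with 14+ helper choices: {share_14_plus:.4f}")
```

Most simulated experiments land near 8 helper choices, and only a small fraction reach 14 or more. The exact share varies slightly with the seed and the number of simulated experiments.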

P-value

Getting 14 (or more) heads in 16 tosses has a probability of about 0.0021, roughly the same as tossing a coin and getting 9 heads in a row. This probability is referred to as a p-value. The p-value is the probability of seeing a result at least as extreme as the one observed if chance alone were at work.
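The comparison above can be checked with a short calculation. This sketch assumes a fair coin and independent tosses, and counts only results in one direction (14 or more heads); `binomial_tail` is a helper name chosen for illustration.

```python
from math import comb

def binomial_tail(k, n, p=0.5):
    """Probability of k or more successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

p_value = binomial_tail(14, 16)  # 14 or more heads in 16 tosses
nine_heads = 0.5 ** 9            # nine heads in a row

print(f"P(14 or more of 16) = {p_value:.4f}")    # 0.0021
print(f"P(9 heads in a row) = {nine_heads:.4f}")  # 0.0020
```

The two probabilities are 137/65,536 and 1/512, both close to 2 in 1,000.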

What Does “Statistically Significant” Mean?

Psychologists typically use a standard of p < .05 to decide whether results are statistically significant. This means that a result at least as extreme as the one observed would occur less than 5% of the time if only chance were at work.

If the probability of getting the observed result by chance is smaller than 5%, researchers conclude that the effect is unlikely to be random. Because p = 0.0021 (the chance of 14 or more helper choices out of 16 if every infant chose at random) is far smaller than .05, the researchers concluded that infants showed a genuine preference for the helper toy. In other words, the babies’ choices likely reflect something real about social evaluation, not luck.
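To see where the .05 cutoff falls for a group of 16 infants, the sketch below compares the chance probability of several possible outcomes with the threshold. It assumes each choice is an independent fair coin flip and counts only outcomes that favor the helper (a one-sided probability); `ALPHA` and `chance_probability` are illustrative names.

```python
from math import comb

ALPHA = 0.05  # conventional significance threshold

def chance_probability(k, n=16):
    """Probability of k or more helper choices out of n under coin-flip guessing."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2**n

for k in range(10, 17):
    p = chance_probability(k)
    verdict = "significant" if p < ALPHA else "not significant"
    print(f"{k} of 16: p = {p:.4f} ({verdict})")
```

Under these assumptions, 11 helper choices out of 16 would not meet the p < .05 standard (p is about .105), while 12 or more would (p is about .038 for 12). The observed 14 of 16 falls well below the threshold.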

Statistical significance doesn’t “prove” a theory; it simply tells us that the pattern we observed would be very unlikely if only chance were at work. Researchers still need to replicate studies and consider effect size, design quality, and potential biases before drawing firm conclusions.


  1. Hamlin, J. K., Wynn, K., Bloom, P., & Mahajan, N. (2011). How infants and toddlers react to antisocial others. Proceedings of the National Academy of Sciences, 108(50), 19931-19936. https://doi.org/10.1073/pnas.1110306108