The Replication Crisis: Learn It 4—The “Crisis”

The “Replication Crisis”

In recent years, there has been increased effort in the sciences (psychology, medicine, economics, etc.) to redo previous experiments to test their reliability. The findings have been disappointing at times, particularly in the field of social psychology.

The Reproducibility Project attempted to replicate 100 psychology studies that had been published with statistically significant results and found that many of them did not replicate well. Some did not reach statistical significance when repeated. Others reached statistical significance, but with much weaker effects than in the original study.

The Reproducibility Project, led by University of Virginia psychologist Brian Nosek, reported in 2015 the percentage of findings that replicated from several prestigious psychology journals:

  • Journal of Personality and Social Psychology: 23%
  • Journal of Experimental Psychology: Learning, Memory, and Cognition: 48%
  • Psychological Science, social articles: 29%
  • Psychological Science, cognitive articles: 53%
  • Overall: 36%

 

How could this happen?

  • Chance. Psychologists use statistics to assess how likely it is that their results occurred simply by chance. Within psychology, the most common standard for p-values is “p < .05”. A p-value below .05 means that, if there were really no effect, results at least as extreme as the ones observed would be expected to occur by chance less than 5% of the time. It does not mean there is a 95% probability that the finding is true; a low p-value only tells us the result would be surprising if chance alone were at work.
    • Even though a published study may report statistically significant results, there is still a possibility that those results arose purely by chance. This is more likely to be an issue in experiments with small sample sizes.
      • For example, if you ask five people if they believe that aliens from other planets visit Earth and regularly abduct humans, you may get three people who agree with this notion—simply by chance. Their answers may, in fact, not be at all representative of the larger population. On the other hand, if you survey one thousand people, there is a higher probability that their belief in alien abductions reflects the actual attitudes of society.
    • Now consider this scenario in the context of replication: if you try to replicate the first study (the one in which you asked only five people), there is only a small chance that you will randomly draw five new people with the same or similar attitudes. You are far more likely to replicate findings that came from a large sample, because results from large samples are less distorted by chance and more likely to reflect the population.
  • Publication bias. Psychology research journals are far more likely to publish studies that find statistically significant results than studies that fail to find them. As a result, studies with non-significant results often never appear in the published literature.
    • Let’s say that twenty researchers are all studying the same phenomenon. Out of the twenty, one gets statistically significant results, while the other nineteen get non-significant results. That one significant result was probably just due to chance, but because of publication bias, it is far more likely to be published than the results of the other nineteen (this scenario is sketched in the simulation after this list).
  • Falsified results. Faked results are only one reason studies may not replicate, but they are the most disturbing one. We hope faking is rare, but over the past decade a few shocking cases have come to light in which researchers falsified their data or results.
  • Changing times. Another reason for non-replication is that, while the findings of an original study may be true, they may hold only for some people in some circumstances and not be universal or enduring. Imagine that a survey in the 1950s found that a strong majority of respondents trusted government officials. Now imagine the same survey administered today, with vastly different results. This kind of non-replication does not invalidate the original results; rather, it suggests that attitudes have shifted over time.
  • Poor replication. Non-replication might be the product of scientist error, with the newer investigation not following the original procedures closely enough. Similarly, the attempted replication study might, itself, have too small a sample size or insufficient statistical power to find significant results.
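
The “chance” and “publication bias” explanations can be made concrete with a short simulation. The sketch below is written in Python with NumPy and SciPy, and the sample sizes and other numbers are invented for illustration. It runs the twenty-researchers scenario from the list above: every lab studies an effect that is actually zero, yet roughly one in twenty will still cross the p < .05 threshold by chance, and publication bias ensures that the lucky study is the one readers see.

```python
# A minimal sketch (hypothetical sample sizes) of twenty labs studying an
# effect that does not exist. About 5% of them will still get p < .05,
# and only those "significant" results pass the publication filter.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

n_labs = 20        # independent research teams studying the same question
n_per_group = 25   # small sample in each condition
alpha = 0.05       # conventional significance threshold

published = []
for lab in range(n_labs):
    # Both groups are drawn from the SAME population: the true effect is zero.
    control = rng.normal(loc=0.0, scale=1.0, size=n_per_group)
    treatment = rng.normal(loc=0.0, scale=1.0, size=n_per_group)
    t_stat, p_value = stats.ttest_ind(treatment, control)
    if p_value < alpha:          # the publication filter: only "significant" studies
        published.append((lab, p_value))

print(f"{len(published)} of {n_labs} labs found p < .05 even though the true effect is zero")
for lab, p in published:
    print(f"  Lab {lab}: p = {p:.3f}  <- the study most likely to appear in a journal")
```

On a typical run, zero, one, or two of the twenty null studies come out “significant,” which is exactly the false-positive rate the p < .05 convention allows. Attempts to replicate such a study would usually fail, because there was never an effect to find.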

Note that the “replication crisis” does not, by itself, mean that the original studies were bad, fraudulent, or even wrong. What it means, at its core, is that replication attempts produced results different from those of the original studies, different enough that we can no longer be confident about what the original findings mean. Further replication and testing in other directions might give us a better understanding of why the results differed, but that too will require time and resources.

What later projects found

After the 2015 Reproducibility Project, other large research teams decided to see whether the same problem showed up in different areas of psychology and related sciences. These “Many Labs” and replication projects were designed to be much bigger, more systematic, and more transparent than most individual studies.

Here’s what they found:

  • Many Labs 2 (2018)

    This international collaboration repeated 28 well-known psychology experiments—everything from social priming to moral decision-making—across 125 independent samples in over 20 countries. The researchers found that some classic effects replicated consistently, meaning they were observed again across settings and cultures. However, other effects appeared weaker than the originals or depended on local context, such as cultural background or how questions were worded. This showed that some psychological effects may be real but not universal—they work under certain conditions, not all.[1]

  • Social Sciences Replication Project (2018)

    This team re-ran 21 high-profile studies that had originally appeared in prestigious journals like Nature and Science. They found that about 62% of the studies replicated, but their effects were usually smaller than the first time around. In other words, the general patterns were often correct, but the original studies had likely overestimated the strength of the effects.[2]

  • Experimental Economics Replication Project (2016)

    A similar project in economics repeated 18 well-cited laboratory experiments. They successfully reproduced 61% of them—again, often with smaller or weaker effects than the initial reports.[3]

Overall, across disciplines, replication rates were found to be around 50%, with effects commonly smaller on replication—a sign that original estimates were frequently inflated.[4]

Together, these large-scale projects suggest that many psychological findings are partly reliable but exaggerated when first published. In other words, the replication crisis isn’t a total failure—it’s a recalibration. By repeating studies across many samples and countries, researchers are getting more realistic estimates of how strong (and how generalizable) certain psychological effects actually are.
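
The “real but smaller” pattern these projects found also follows from the selection process described earlier. The sketch below, again in Python with invented numbers, simulates a modest true effect studied by many small, underpowered labs: only the studies that happen to reach p < .05 get “published,” so the average published effect size comes out much larger than the true effect, while a single well-powered replication recovers an estimate close to the truth.

```python
# A minimal sketch (hypothetical parameters) of how selecting for significance
# inflates published effect sizes, and why replications tend to find smaller effects.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

true_effect = 0.3     # true standardized mean difference (Cohen's d)
n_small = 20          # per-group sample size in the original, underpowered studies
n_replication = 200   # per-group sample size in a well-powered replication
n_studies = 1000      # many small labs studying the same real effect

def cohens_d(a, b):
    """Standardized mean difference using a pooled standard deviation."""
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (a.mean() - b.mean()) / pooled_sd

published_effects = []
for _ in range(n_studies):
    treatment = rng.normal(true_effect, 1.0, n_small)
    control = rng.normal(0.0, 1.0, n_small)
    _, p = stats.ttest_ind(treatment, control)
    if p < 0.05:                               # the publication filter
        published_effects.append(cohens_d(treatment, control))

# One large replication of the same effect.
treatment = rng.normal(true_effect, 1.0, n_replication)
control = rng.normal(0.0, 1.0, n_replication)

print(f"True effect:                            d = {true_effect:.2f}")
print(f"Average published small-study estimate: d = {np.mean(published_effects):.2f}")
print(f"Single large replication estimate:      d = {cohens_d(treatment, control):.2f}")
```

Because only the small studies that happen to overshoot the true effect clear the significance bar, the published literature overstates an effect that is nevertheless real, which matches the “partly reliable but exaggerated” pattern described above.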

The replication crisis does not mean psychology is broken or that original studies were “all wrong.” It means we needed—and now have—better tools to verify, calibrate, and understand effects across contexts.

 


  1. Klein, R. A., Vianello, M., Hasselman, F., Adams, B. G., Adams, R. B., Alper, S., Aveyard, M., Axt, J. R., Bahník, Š., Batra, R., Berkics, M., Bernstein, M. J., Berry, D. R., Bialobrzeska, O., Binan, E. D., Bocian, K., Brandt, M. J., Busching, R., … Nosek, B. A. (2018). Many Labs 2: Investigating variation in replicability across samples and settings. Advances in Methods and Practices in Psychological Science, 1(4), 443–490. https://doi.org/10.1177/2515245918810225
  2. Camerer, C. F., Dreber, A., Holzmeister, F., Ho, T.-H., Huber, J., Johannesson, M., Kirchler, M., Nave, G., Nosek, B. A., Pfeiffer, T., Altmejd, A., Buttrick, N., Chan, T., Chen, Y., Forsell, E., Gampa, A., Heikensten, E., Hummer, L., Imai, T., … Wu, H. (2018). Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nature Human Behaviour, 2(9), 637–644. https://doi.org/10.1038/s41562-018-0399-z
  3. Camerer, C. F., Dreber, A., Forsell, E., Ho, T.-H., Huber, J., Johannesson, M., Kirchler, M., Almenberg, J., Altmejd, A., Chan, T., Heikensten, E., Holzmeister, F., Imai, T., Isaksson, S., Nave, G., Pfeiffer, T., Razen, M., & Wu, H. (2016). Evaluating replicability of laboratory experiments in economics. Science, 351(6280), 1433–1436. https://doi.org/10.1126/science.aaf0918
  4. Dreber, A., & Johannesson, M. (2025). A framework for evaluating reproducibility and replicability in economics. Economic Inquiry, 63(2), 338–356. https://doi.org/10.1111/ecin.13244