The replication crisis is good for science

Some studies don't hold up to added scrutiny. Credit: PORTRAIT IMAGES ASIA BY NONWARIT

Science is in the midst of a crisis: A surprising fraction of published studies fail to replicate when the procedures are repeated.

For example, take the study, published in 2007, that claimed that tricky math problems requiring careful thought are easier to solve when presented in a fuzzy font. The finding, from a small study, that a fuzzy font improved accuracy supported the claim that encountering perceptual challenges can induce people to reflect more carefully.

However, 16 attempts to replicate the result failed, all but demonstrating that the original claim was erroneous. Plotted together on a graph, the studies formed a bell curve centered on zero effect. As is frequently the case with failures to replicate, of the 17 total attempts, the original had both the smallest sample size and the most extreme result.

The Reproducibility Project, a collaboration of 270 psychologists, has attempted to replicate 100 psychology studies, while a 2018 report examined studies published in the prestigious scholarly journals Nature and Science between 2010 and 2015. These efforts find that about two-thirds of studies do replicate to some degree, but that the effects are often weaker than originally claimed.

Is this bad for science? It's certainly uncomfortable for many scientists whose work gets undercut, and the rate of failures may currently be unacceptably high. But, as a psychologist and a statistician, I believe confronting the crisis is good for science as a whole.

Practicing good science

First, these replication attempts are examples of good science operating as it should. They are focused applications of the scientific method: careful experimentation and observation in the pursuit of reproducible results.

Many people incorrectly assume that, due to the "p<.05" threshold for statistical significance, only 5% of discoveries will prove to be errors. However, 15 years ago, physician John Ioannidis pointed to some fallacies in that assumption, arguing that false discoveries made up the majority of the published literature. Replication efforts are confirming that the false discovery rate is much higher than 5%.
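Ioannidis's point can be made concrete with a little arithmetic. In the sketch below, the 10% prior probability that a tested hypothesis is true and the 40% statistical power are illustrative assumptions, not figures from the article:

```python
# False discovery rate among "significant" results depends on the
# prior odds that a hypothesis is true and on statistical power --
# not just on the p < .05 threshold.

def false_discovery_rate(prior_true, power, alpha=0.05):
    """Fraction of significant results that are false positives."""
    true_positives = prior_true * power          # real effects detected
    false_positives = (1 - prior_true) * alpha   # null effects passing p < .05
    return false_positives / (true_positives + false_positives)

# Illustrative assumptions: 10% of tested hypotheses are true, 40% power.
fdr = false_discovery_rate(prior_true=0.10, power=0.40)
print(f"false discovery rate: {fdr:.2f}")  # → false discovery rate: 0.53
```

With these assumptions, more than half of the "discoveries" are false, even though every one of them cleared the 5% significance bar.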

Awareness about the replication crisis appears to be promoting better behavior among scientists. Twenty years ago, the cycle for publication was basically complete after a scientist convinced three reviewers and an editor that the work was sound. Yes, the published research would become part of the literature, and therefore open to review – but that was a slow-moving process.

Today, the stakes have been raised for researchers. They know that there's the possibility that their study might be reviewed by thousands of opinionated commenters on the internet or by a high-profile group like the Reproducibility Project. Some journals now require scientists to make their data and computer code available, which makes it likelier that others will catch errors in their work. What's more, some scientists can now "preregister" their hypotheses before starting their study – the equivalent of calling your shot before you take it.

Combined with open sharing of materials and data, preregistration improves the transparency and reproducibility of science, hopefully ensuring that a smaller fraction of future studies will fail to replicate.

While there are signs that scientists are indeed reforming their ways, there is still a long way to go. Of the 1,500 accepted presentations at the annual meeting of the Society for Behavioral Medicine in March, only 1 in 4 authors reported using these open science techniques in the work they presented.

Improving statistical intuition

Finally, the replication crisis is helping improve scientists' intuitions about statistical inference.

Researchers now better understand how weak designs with high uncertainty – in combination with choosing to publish only when results are statistically significant – produce exaggerated results. In fact, it is one of the reasons more than 800 scientists recently argued in favor of abandoning statistical significance testing.
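This exaggeration effect is easy to demonstrate by simulation. The numbers below are assumptions for illustration (a true effect of 0.2 standard deviations, 20 participants per study); only the "studies" that reach significance get "published":

```python
# Simulate many small, noisy studies of a small true effect, and
# "publish" only the ones that reach statistical significance.
import math
import random

random.seed(1)
TRUE_EFFECT = 0.2   # true effect, in standard-deviation units (assumed)
N = 20              # small sample per study (assumed)

published = []
for _ in range(10_000):
    sample = [random.gauss(TRUE_EFFECT, 1.0) for _ in range(N)]
    mean = sum(sample) / N
    z = mean * math.sqrt(N)      # z-test with known sd = 1
    if z > 1.96:                 # "significant" in the expected direction
        published.append(mean)

avg_published = sum(published) / len(published)
print(f"true effect: {TRUE_EFFECT}, average published effect: {avg_published:.2f}")
```

The significant studies systematically overestimate the true effect, by roughly a factor of two to three in this setup: exactly the exaggeration the text describes.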

We also better appreciate how isolated research findings fit into the broader pattern of results. In another study, Ioannidis and oncologist Jonathan Schoenfeld surveyed the epidemiology literature for studies associating 40 common food ingredients with cancer. There were some broad consistent trends – unsurprisingly, bacon, salt and sugar are never found to be protective against cancer.

But plotting the effects from 264 studies produced a confusing pattern. The magnitudes of the reported effects were highly variable. In other words, one study might say that a given ingredient was very bad for you, while another might conclude that the harms were small. In many cases, the studies even disagreed on whether a given ingredient was harmful or beneficial.

Each of the studies had at some point been reported in isolation in a newspaper or a website as the latest finding in health and nutrition. But taken as a whole, the evidence from all the studies was not nearly as definitive as each single study may have appeared.

Schoenfeld and Ioannidis also graphed the 264 published effect sizes. Unlike the fuzzy font replications, their graph of published effects looked like the tails of a bell curve. It was centered at zero with all the nonsignificant findings carved out. The unmistakable impression from seeing all the published nutrition results presented at once is that many of them might be like the fuzzy font result – impressive in isolation, but anomalous under replication.

The breathtaking possibility that a large fraction of published research findings might simply be chance findings is exactly why people speak of the replication crisis. But it's not really a scientific crisis, because the awareness is bringing improvements in research practice, new understandings about statistical inference and an appreciation that isolated findings must be interpreted as part of a larger pattern.

Rather than undermining science, I feel that this is reaffirming the best practices of the scientific method.


Provided by The Conversation

This article is republished from The Conversation under a Creative Commons license. Read the original article.

Citation: The replication crisis is good for science (2019, April 8) retrieved 22 July 2019 from

User comments

Apr 09, 2019
Many people incorrectly assume that, due to the "p<.05" threshold for statistical significance, only 5% of discoveries will prove to be errors.

Consider: if you run a public health campaign of HIV screening with an error rate of 5% both ways, and 1% of the population actually has HIV, then for every 2,000 people you sample you would expect about 118 positive test results, of which only 19 are genuine cases, with 1 infection missed.

So as you screen people, the odds are about 5 to 1 that someone who tests positive isn't actually HIV positive. Your noise exceeds the signal when the effect is weak, and based on this test alone you'd conclude there's an HIV epidemic.
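Working through that screening arithmetic (using the commenter's own figures: 2,000 people, 1% prevalence, 5% error both ways):

```python
# Checking the screening arithmetic: 2,000 people, 1% prevalence,
# 5% false-positive and 5% false-negative rates.
population = 2000
prevalence = 0.01
sensitivity = 0.95   # 5% false negatives
specificity = 0.95   # 5% false positives

infected = population * prevalence                             # 20 people
true_positives = infected * sensitivity                        # 19
missed = infected - true_positives                             # 1
false_positives = (population - infected) * (1 - specificity)  # 99

total_positives = true_positives + false_positives             # 118
print(f"{total_positives:.0f} positive tests, {true_positives:.0f} real cases, "
      f"{missed:.0f} missed")  # → 118 positive tests, 19 real cases, 1 missed
```

Of roughly 118 positive tests, only 19 are genuine: about 5 false alarms per real case, which is where the 5:1 odds come from.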

This isn't very difficult to see – it's just that social sciences and all the other "soft sciences" deliberately choose to apply more relaxed criteria, or they couldn't publish much of anything. Some of the blame lies with the people who use academia to drive political causes.

Apr 09, 2019
So you get a second opinion. Does it not work that way in your country, @Eikka?

Or is it that you ignore clinical outcomes and insist on applying dangerous therapies based on the results of single tests?

Apr 09, 2019
There's no real surprise in this article, as psychology (and even medicine) has the lowest standards for accepting something as a finding (95% – i.e. 2 sigma – confidence is enough...which means that up to 1 in 20 studies are going to report a false positive, anyhow. Plus studies where someone makes mistakes. Scientists aren't infallible gods.)

This isn't because PhD students/postdocs in psychology or pharma are bad at doing science but because the studies can't be done – within the bounds of "reasonable effort" – to the standards of, say, physics, which usually aims for a six-sigma figure (roughly a one-in-a-billion chance of being wrong) before claiming a discovery.

It's easy to see that having several billion participants for a clinical/psychological study (and isolating them from all external factors to boot) isn't feasible, whereas smashing several billion particles together and looking at the results isn't as much of an issue once you have a collider up and running.

Apr 09, 2019
So you get a second opinion. Does it not work that way in your country, @Eikka?

Yes. That's called replication. The crisis is that replication isn't, or hasn't been done, and everyone has been taking the first result as gospel - especially as it has been confirming their established views, or the agenda they've been trying to push, but mostly because the social sciences are to a large extent diploma mills and the quality of their science is rather secondary.

Apr 09, 2019
(95% - i.e. 2 sigma - confidence is enough...which means that up to 1 in 20 studies are going to report a false positive

That's not quite how it works. Read the article again:

Many people incorrectly assume that, due to the "p<.05" threshold for statistical significance, only 5% of discoveries will prove to be errors.

The significance threshold is the probability of rejecting the null hypothesis ("no effect") given that it is true. Given that the null hypothesis is NOT true, the statistical significance test doesn't say anything about the confidence in the actual results of the study. You may still be grossly in error, e.g. by measuring five times more HIV cases than there really are, and raising an alarm for an epidemic over noisy measurements or non-representative samples.

The statistical significance test doesn't say whether your science is correct; it merely tells you whether you have enough statistical evidence to say that there's -anything- in there.

Apr 09, 2019
Well, psychology is the worst-affected science, but Ioannidis was wrong in allowing for a majority of false discoveries. The error rate is something like 30%, and much better in traditional sciences.

But even those fields, where signals are stronger and/or large samples are cheaper, can of course be improved. There the current discussion is on publishing negative results fairly, in an unbiased way.
