How big data has created a big crisis in science

December 13, 2018 by Kai Zhang, The Conversation
Scientists are facing a reproducibility crisis. Credit: Y Photo Studio/

There's an increasing concern among scholars that, in many areas of science, famous published results tend to be impossible to reproduce.

This crisis can be severe. For example, in 2011, Bayer HealthCare reviewed 67 in-house projects and found that they could replicate less than 25 percent. Furthermore, over two-thirds of the projects had major inconsistencies. More recently, in November, an investigation of 28 major psychology papers found that only half could be replicated.

Similar findings are reported across other fields, including medicine and economics. These striking results cast serious doubt on the credibility of scientific research as a whole.

What is causing this big problem? There are many contributing factors. As a statistician, I see huge issues with the way science is done in the era of big data. The reproducibility crisis is driven in part by invalid statistical analyses that stem from data-driven hypotheses – the opposite of how things are traditionally done.

Scientific method

In a classical experiment, the statistician and scientist first together frame a hypothesis. Then scientists conduct experiments to collect data, which are subsequently analyzed by statisticians.

A famous example of this process is the "lady tasting tea" story. Back in the 1920s, at a party of academics, a woman claimed to be able to tell, by taste alone, whether the milk or the tea had been added to a cup first. Statistician Ronald Fisher doubted that she had any such talent. He hypothesized that, out of eight cups of tea, prepared so that four cups had milk added first and the other four had tea added first, the number of cups she identified correctly would follow a probability model called the hypergeometric distribution.

Such an experiment was done with eight cups of tea sent to the lady in a random order – and, according to legend, she categorized all eight correctly. This was strong evidence against Fisher's hypothesis. The chances that the lady had achieved all correct answers through random guessing was an extremely low 1.4 percent.
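Fisher's null-hypothesis calculation is easy to check directly. The short sketch below (an illustration added here, not code from the article) computes the chance that a pure guesser identifies all eight cups correctly:

```python
from math import comb

# Under Fisher's null hypothesis, a guesser who knows that exactly four
# of the eight cups are milk-first simply picks a 4-cup subset at random.
# She gets all eight cups right only when she picks the one correct
# subset out of the C(8, 4) equally likely possibilities.
n_subsets = comb(8, 4)          # 70 ways to choose 4 cups out of 8
p_all_correct = 1 / n_subsets   # probability of acing the test by luck
print(f"{p_all_correct:.3f}")   # → 0.014, the 1.4 percent in the text
```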

That process – hypothesize, then gather data, then analyze – is rare in the big data era. Today's technology can collect huge amounts of data, on the order of 2.5 exabytes a day.

While this is a good thing, science often advances at a much slower speed, and so researchers may not know how to specify the right hypothesis before analyzing the data. For example, scientists can now measure tens of thousands of gene expressions from people, but it is very hard to decide whether a particular gene should be included in or excluded from the hypothesis. In this case, it is appealing to form the hypothesis based on the data. While such hypotheses may appear compelling, conventional inferences drawn from them are generally invalid. This is because, in contrast to the "lady tasting tea" process, the order of building the hypothesis and seeing the data has been reversed.

Data problems

Why can this reversal cause a big problem? Let's consider a big data version of the tea lady: a "100 ladies tasting tea" example.

Suppose there are 100 ladies who cannot tell the difference between the teas, but each takes a guess after tasting all eight cups. There's actually about a 76 percent chance that at least one lady would luckily guess all of the orders correctly.
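That figure follows from the complement rule: compute the chance that every guesser misses, and subtract from one. The sketch below (an illustration added here, not code from the article) uses the exact single-guesser probability of 1/70, which gives roughly 76 percent; rounding that probability to 1.4 percent first yields the slightly smaller figure of about 75.6 percent sometimes quoted:

```python
from math import comb

p_one = 1 / comb(8, 4)        # a single guesser aces all 8 cups: 1/70
p_none = (1 - p_one) ** 100   # all 100 independent guessers miss
p_at_least_one = 1 - p_none   # complement: at least one lucky guesser
print(f"{p_at_least_one:.3f}")
```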

Now, if a scientist saw some lady with a surprising outcome – all cups correct – and ran a statistical analysis for her with the same hypergeometric distribution above, then he might conclude that this lady had the ability to tell the difference between each cup. But this result isn't reproducible. If the same lady did the experiment again, she would very likely sort some cups wrongly – not getting as lucky as her first time – since she couldn't really tell the difference between them.

This small example illustrates how scientists can "luckily" see interesting but spurious signals from a dataset. They may formulate hypotheses after seeing these signals, then use the same dataset to draw conclusions, claiming the signals are real. It may be a while before they discover that their conclusions are not reproducible. The problem is particularly common in big data analysis: in a large enough dataset, some spurious signals will "luckily" occur just by chance.

What's worse, this process may allow scientists to manipulate the data to produce the most publishable result. Statisticians joke about such a practice: "If you torture the data hard enough, it will tell you something." However, is this "something" valid and reproducible? Probably not.

Stronger analyses

How can scientists avoid the above problem and achieve reproducible results in big data analysis? The answer is simple: Be more careful.

If scientists want reproducible results from data-driven hypotheses, then they need to carefully account for the data-driven process in their analysis. Statisticians need to design new procedures that provide valid inferences, and a few are already underway.
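One standard safeguard in this spirit is sample splitting: select a hypothesis on one portion of the data, then test it on a fresh, untouched portion, so the test is not biased by the selection. The sketch below is a hypothetical illustration (the article does not prescribe a specific procedure), applied to the 100-guessing-ladies example:

```python
import random

random.seed(0)  # fixed seed for a repeatable illustration

TRUTH = [0, 0, 0, 0, 1, 1, 1, 1]  # which of the 8 cups are truly milk-first

def taste_session():
    """Cups labeled correctly by a pure guesser (every lady here is one)."""
    picks = random.sample(range(8), 4)   # the 4 cups she calls milk-first
    k = sum(TRUTH[i] for i in picks)     # correct milk-first picks
    return 2 * k                         # each one also fixes a tea-first label

# Selection phase: screen 100 guessing ladies and keep the best performer.
first_round = [taste_session() for _ in range(100)]
best = max(range(100), key=lambda i: first_round[i])

# Confirmation phase: test ONLY the selected lady on fresh, independent data.
# Because she is really guessing, her confirmation score is typically far
# below her cherry-picked selection score; the spurious signal vanishes.
confirmation = taste_session()
print(first_round[best], confirmation)
```

With 100 guessers, the best first-round score is often a perfect 8 out of 8, while the confirmation run behaves like a single fresh guess, whose most likely outcome is only 4 correct cups.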

Statistics is about the optimal way to extract information from data. By its nature, it is a field that evolves as data evolve. The problems of the big data era are just one example of such evolution. I think that scientists should embrace these changes, as they will lead to opportunities to develop novel statistical techniques, which will in turn yield valid and interesting scientific discoveries.
