August 6, 2015

New algorithm aimed at combating science's reproducibility problem

Big data sets are important tools of modern science. Mining for correlations between millions of pieces of information can reveal vital relationships or predict future outcomes, such as risk factors for a disease or structures of new chemical compounds.

These mining operations are not without risk, however. Researchers can have a tough time telling when they have unearthed a nugget of truth, or what amounts to fool's gold: a correlation that seems to have predictive value but actually does not, as it results just from random chance.

A research team that bridges academia and industry has developed a new mining tool that can help tell these nuggets apart. In a study published in Science, they have outlined a method for successively testing hypotheses on the same data set without compromising statistical assurances that their conclusions are valid.

Existing checks on this kind of "adaptive analysis," where new hypotheses based on the results of previous ones are repeatedly tested on the same data, can only be applied to very large datasets. Acquiring enough data to run such checks can be logistically challenging or cost prohibitive.

The researchers' method could increase the power of analysis done on smaller datasets, by flagging ways researchers can come to a "false discovery," where a finding appears to be statistically significant but can't be reproduced in new data.

For each hypothesis that needs testing, it could act as a check against "overfitting," where predictive trends only apply to a given dataset and can't be generalized.

The study was conducted by Cynthia Dwork, distinguished scientist at Microsoft Research; Vitaly Feldman, research scientist at IBM's Almaden Research Center; Moritz Hardt, research scientist at Google; Toniann Pitassi, professor in the Department of Computer Science at the University of Toronto; Omer Reingold, principal researcher at Samsung Research America; and Aaron Roth, assistant professor in the Department of Computer and Information Science in the University of Pennsylvania's School of Engineering and Applied Science.

Adaptive analysis, where multiple tests on a dataset are combined to increase their predictive power, is an increasingly common technique. It also has the ability to deceive.

A video overview of the results discussed in the Dwork et al. paper. Credit: Aaron Roth

Imagine receiving an anonymous tip via email one morning saying the price of a certain stock will rise by the end of the day. At the closing bell, the tipster's prediction is borne out, and another prediction is made. After a week of unbroken success, the tipster begins charging for his proven prognostication skills.

Many would be inclined to take up the tipster's offer and fall for this scam. Unbeknownst to his victims, the tipster started by sending random predictions to thousands of people, and only repeated the process with the ones that ended up being correct by chance. While only a handful of people might be left by the end of the week, each sees what appears to be a powerfully predictive correlation that is actually nothing more than a series of lucky coin-flips.
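The arithmetic behind the scam is easy to check. The Python sketch below is illustrative only: the 4,096 recipients and the five trading days are assumed figures, not numbers from the study, and the function names are hypothetical.

```python
import random


def expected_survivors(recipients: int, days: int) -> float:
    """Average number of recipients who see an unbroken streak of correct tips."""
    # Each random tip is right with probability 1/2, independently each day.
    return recipients * 0.5 ** days


def simulate_scam(recipients: int, days: int, seed: int = 0) -> int:
    """Simulate the tipster: send random tips, keep only the lucky recipients."""
    rng = random.Random(seed)
    survivors = recipients
    for _ in range(days):
        stock_went_up = rng.random() < 0.5
        # Every remaining recipient gets an independent random prediction;
        # only those whose prediction matched the market hear from the tipster again.
        survivors = sum(
            1 for _ in range(survivors) if (rng.random() < 0.5) == stock_went_up
        )
    return survivors


if __name__ == "__main__":
    print("expected:", expected_survivors(4096, 5))
    print("simulated:", simulate_scam(4096, 5))
```

Under these assumptions roughly one recipient in 32 sees five correct calls in a row, even though no individual prediction was better than a coin flip.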

In the same way, "adaptively" testing many hypotheses on the same data, each new one influenced by the last, can make random noise seem like a signal: what is known as a false discovery. Because the correlations of these false discoveries are idiosyncratic to the dataset in which they were generated, they can't be reproduced when other researchers try to replicate them with new data.

The traditional way to check that a purported signal is not just coincidental noise is to use a "holdout." This is a data set that is kept separate while the bulk of the data is analyzed. Hypotheses generated about correlations between items in the bulk data can be tested on the holdout; real relationships would exist in both sets, while false ones would fail to be replicated.

The problem with using holdouts in that way is that, by nature, they can be reused only if each hypothesis is independent of the others. Even a few additional hypotheses chained off one another could quickly lead to false discovery.

To get around this limitation, the researchers developed a tool known as a "reusable holdout." Instead of testing hypotheses on the holdout set directly, scientists would query it through a "differentially private" algorithm.

The "different" in its name is a reference to the guarantee that a differentially private algorithm makes. Its analyses should remain functionally identical when applied to two different datasets: one with and one without the data from any single individual. This means that any findings that would rely on idiosyncratic outliers of a given set would disappear when looking at data through a differentially private lens.
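One way to picture querying a holdout through such a mechanism is sketched below in Python. This is a simplified illustration in the spirit of the approach, not the researchers' implementation: the class name, the threshold of 0.04, the noise scale of 0.01, and the absence of any limit on the number of queries are all assumptions made for the example. The idea is that the analyst only learns something from the holdout when the training and holdout answers disagree by more than a noisy threshold, and even then sees only a noise-blurred value.

```python
import random
from typing import Callable, Sequence

Query = Callable[[float], float]  # maps one record to a value in [0, 1]


def _mean(values) -> float:
    values = list(values)
    return sum(values) / len(values)


class ReusableHoldoutSketch:
    """Hypothetical, simplified holdout wrapper; all parameters are assumptions."""

    def __init__(self, train: Sequence[float], holdout: Sequence[float],
                 threshold: float = 0.04, noise_scale: float = 0.01,
                 seed: int = 0) -> None:
        self.train = train
        self.holdout = holdout
        self.threshold = threshold
        self.noise_scale = noise_scale
        self.rng = random.Random(seed)

    def _laplace(self) -> float:
        # The difference of two exponential draws is Laplace-distributed.
        rate = 1.0 / self.noise_scale
        return self.rng.expovariate(rate) - self.rng.expovariate(rate)

    def query(self, phi: Query) -> float:
        train_mean = _mean(phi(x) for x in self.train)
        holdout_mean = _mean(phi(x) for x in self.holdout)
        if abs(train_mean - holdout_mean) > self.threshold + self._laplace():
            # The sets disagree: the training estimate is overfit, so answer
            # from the holdout, blurred so no single record can be inferred.
            return holdout_mean + self._laplace()
        # The sets agree: the training estimate is good enough, and
        # returning it reveals nothing new about the holdout.
        return train_mean
```

For example, with a training set of all ones and a holdout of all zeros, `ReusableHoldoutSketch(train, holdout).query(lambda x: x)` would return a value close to 0 rather than the misleading training average of 1.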

To test their algorithm, the researchers performed adaptive data analysis on a set rigged so that it contained nothing but random noise. The set was abstract, but could be thought of as one that tested 20,000 patients on 10,000 variables, such as variants in their genomes, for ones that were predictive of lung cancer.

Though, by design, none of the variables in the set were predictive of cancer, reuse of a holdout set in the standard way showed that 500 of them had significant predictive power. Performing the same analysis with the researchers' reusable holdout tool, however, correctly showed the lack of meaningful correlations.
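The failure of the standard approach can be reproduced on a much smaller scale. The Python sketch below is a toy version loosely modelled on the description above, not the study's actual experiment: the 200 records, 1,000 variables, selection rule, and function names are all assumptions. Variables are kept only if they appear correlated with the label in both the training set and the holdout, and the resulting predictor is then scored on that same holdout and on fresh data.

```python
import random


def make_noise(n, d, rng):
    """n records of d random +/-1 variables, each with a random +/-1 label."""
    X = [[1 if rng.random() < 0.5 else -1 for _ in range(d)] for _ in range(n)]
    y = [1 if rng.random() < 0.5 else -1 for _ in range(n)]
    return X, y


def correlations(X, y):
    """Average product of each variable with the label."""
    n, d = len(X), len(X[0])
    return [sum(X[i][j] * y[i] for i in range(n)) / n for j in range(d)]


def accuracy(X, y, weights):
    """Fraction of labels matched by the sign of the weighted vote."""
    correct = 0
    for row, label in zip(X, y):
        score = sum(w * row[j] for j, w in weights.items())
        correct += (1 if score >= 0 else -1) == label
    return correct / len(y)


def naive_holdout_reuse(n=200, d=1000, seed=0):
    rng = random.Random(seed)
    train, holdout, fresh = (make_noise(n, d, rng) for _ in range(3))
    c_train = correlations(*train)
    c_hold = correlations(*holdout)
    cutoff = 1 / n ** 0.5  # about one standard deviation of pure noise
    # Adaptive step: peek at the holdout to decide which variables to keep.
    weights = {
        j: (1 if c_train[j] > 0 else -1)
        for j in range(d)
        if abs(c_train[j]) > cutoff
        and abs(c_hold[j]) > cutoff
        and c_train[j] * c_hold[j] > 0
    }
    return accuracy(*holdout, weights), accuracy(*fresh, weights)


if __name__ == "__main__":
    holdout_acc, fresh_acc = naive_holdout_reuse()
    print(f"holdout accuracy: {holdout_acc:.2f}, fresh-data accuracy: {fresh_acc:.2f}")
```

Because the holdout was consulted when choosing variables, its accuracy estimate comes out well above chance, while accuracy on fresh data stays near 50 percent, which is all that pure noise can support.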

An experiment with a second rigged dataset depicted a more realistic scenario. There, some of the variables did have predictive power, but traditional holdout use produced a combination of variables that wildly overestimated this power. The reusable holdout tool correctly identified the 20 that had true statistical significance.

Beyond pointing out the dangers of accidental overfitting, the reusable holdout algorithm could warn users when they were exhausting the validity of a dataset. This is a red flag for what is known as "p-hacking," or intentionally gaming the data to get a publishable level of significance.

Implementing the reusable holdout algorithm will allow scientists to generate stronger, more generalizable findings from smaller amounts of data.

More information: "The reusable holdout: Preserving validity in adaptive data analysis," by C. Dwork et al. Science, www.sciencemag.org/lookup/doi/ … 1126/science.aaa9375

Journal information: Science

Provided by University of Pennsylvania

Citation: New algorithm aimed at combating science's reproducibility problem (2015, August 6) retrieved 17 July 2024 from https://phys.org/news/2015-08-algorithm-aimed-combating-science-problem.html

