Wide-Open accelerates release of scientific data by identifying overdue datasets

June 8, 2017, Public Library of Science
The use of Wide-Open on the Gene Expression Omnibus (GEO) led to a dramatic drop in overdue datasets, with 400 datasets released within the first week. Credit: Maxim Grechkin, Roli Roberts

Advances in genetic sequencing and other technologies have led to an explosion of biological data, and decades of openness (both spontaneous and enforced) mean that scientists routinely deposit data in online repositories. But researchers are only human and may forget to tell a repository to release the data when a paper is published.

A new tool, developed by University of Washington and Microsoft researchers Maxim Grechkin, Hoifung Poon and Bill Howe, and described in a Community Page article publishing June 8 in the open access journal PLOS Biology, aims to get around this problem and help advance open science by automatically detecting datasets that are overdue for publication.

Open data is a vital pillar of open science, enabling other researchers to reproduce results and use the same datasets to produce novel discoveries. While many scientific journals now require published authors to make the data underlying their findings publicly available, these policies often go unenforced. The challenge is substantial - the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) alone contains 80,985 public datasets, spanning hundreds of tissue types in thousands of organisms - and the rapid growth in data makes it difficult for journals or data repositories to "police" whether datasets that should be made publicly available actually are.

The Wide-Open system is available on GitHub; it uses text mining to identify references to datasets in published scientific articles, which should therefore be publicly accessible, and then parses query results from repositories to determine if those datasets remain private.
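The two steps described above can be illustrated with a minimal sketch. This is not the Wide-Open code itself: the function names `extract_accessions` and `find_overdue`, the `is_public` lookup and the sample text are all hypothetical, and the patterns cover only GEO series accessions ("GSE" plus digits) and SRA study accessions ("SRP" plus digits). A real system would replace the lookup with a query to the repository and would need to handle many more accession formats.

```python
import re
from typing import Callable, Iterable, List

# Assumption: only two accession styles are matched here.
# GEO series IDs look like "GSE12345"; SRA study IDs look like "SRP067890".
ACCESSION_RE = re.compile(r"\b(GSE\d+|SRP\d+)\b")


def extract_accessions(article_text: str) -> List[str]:
    """Return the unique accession IDs cited in the text, in order of first appearance."""
    seen = {}
    for match in ACCESSION_RE.finditer(article_text):
        seen.setdefault(match.group(1), None)
    return list(seen)


def find_overdue(accessions: Iterable[str], is_public: Callable[[str], bool]) -> List[str]:
    """Return accessions cited in a published article that are not yet public.

    `is_public` is a hypothetical stand-in for parsing a repository query result.
    """
    return [acc for acc in accessions if not is_public(acc)]


if __name__ == "__main__":
    # Made-up article text and release status, for illustration only.
    text = (
        "Expression data are available at GEO (GSE12345) and "
        "SRA (SRP067890); see also GSE12345."
    )
    released = {"GSE12345"}
    cited = extract_accessions(text)
    print(find_overdue(cited, lambda acc: acc in released))
```

Passing the status check in as a function keeps the text-mining step testable without network access; the repository lookup can then be swapped in separately.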

Grechkin and his team tested their tool on two popular data repositories maintained by the NCBI - GEO and the Sequence Read Archive (SRA). Wide-Open identified a large number of overdue datasets, which spurred repository administrators to respond by releasing 400 datasets in one week.

"We developed a simple yet effective system that has already helped make hundreds of datasets public," said lead author Maxim Grechkin. "Having an impartial and automated system enforce open data policies can help level the playing field among scientists and generate new opportunities for discovery."

More information: Grechkin M, Poon H, Howe B (2017) Wide-Open: Accelerating public data release by automating detection of overdue datasets. PLoS Biol 15(6): e2002477. doi.org/10.1371/journal.pbio.2002477
