New text-mining algorithm prioritizes research on chemicals and disease for public database

Apr 17, 2013

Keeping up with current scientific literature is a daunting task, considering that hundreds to thousands of papers are published each day. Now researchers from North Carolina State University have developed a computer program to help them evaluate and rank scientific articles in their field.

The researchers use a text-mining algorithm to prioritize the research papers they read and include in their Comparative Toxicogenomics Database (CTD), a public database whose staff manually curate and code data from the scientific literature describing how environmental chemicals interact with genes to affect human health.

"Over 33,000 scientific papers have been published on heavy metal toxicity alone, going as far back as 1926," explains Dr. Allan Peter Davis, a biocuration project manager for CTD at NC State who worked on the project and co-lead author of an article on the work. "We simply can't read and code them all. And, with the help of this new algorithm, we don't have to."

To help select the most relevant papers for inclusion in the CTD, Thomas Wiegers, a research bioinformatician at NC State and the other co-lead author of the report, developed a sophisticated algorithm as part of a text-mining process. The application evaluates the text from thousands of papers and assigns a relevancy score to each document. "The score ranks the set of articles to help separate the wheat from the chaff, so to speak," Wiegers says.
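The article does not detail the scoring method itself, but the general idea, assigning each paper a relevancy score and ranking the set by it, can be sketched in a few lines of Python. Everything below, including the term vocabularies, weights, and field names, is an illustrative assumption rather than CTD's published algorithm, which draws on far richer evidence.

```python
# Minimal sketch of a keyword-weighted relevancy scorer. The vocabularies
# and weights below are hypothetical stand-ins; a real system would use
# CTD's full controlled vocabularies for chemicals, genes, and diseases.
import re

CHEMICAL_TERMS = {"cadmium": 2.0, "arsenic": 2.0, "benzene": 1.5}
GENE_TERMS = {"tp53": 1.5, "cyp1a1": 1.5, "nrf2": 1.0}
DISEASE_TERMS = {"carcinoma": 1.0, "nephrotoxicity": 2.0}

def relevancy_score(text):
    """Sum the weight of every vocabulary term occurrence in the text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = {}
    for tok in tokens:
        counts[tok] = counts.get(tok, 0) + 1
    score = 0.0
    for vocab in (CHEMICAL_TERMS, GENE_TERMS, DISEASE_TERMS):
        for term, weight in vocab.items():
            score += weight * counts.get(term, 0)
    return score

def rank_articles(articles):
    """Sort articles from most to least relevant by abstract score."""
    return sorted(articles, key=lambda a: relevancy_score(a["abstract"]),
                  reverse=True)

papers = [
    {"pmid": "001", "abstract": "Cadmium exposure induces TP53-mediated nephrotoxicity."},
    {"pmid": "002", "abstract": "A survey of undergraduate chemistry curricula."},
]
for paper in rank_articles(papers):
    print(paper["pmid"], relevancy_score(paper["abstract"]))
# 001 scores 5.5 (cadmium + tp53 + nephrotoxicity); 002 scores 0.0.
```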

But how good is the algorithm at determining the best papers? To test that, the researchers text-mined 15,000 articles and sent a representative sample to their team of biocurators to manually read and evaluate on their own, blind to the computer's score. "The results were impressive," Davis says. The biocurators concurred with the algorithm 85 percent of the time with respect to the highest-scored papers.
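The article reports the agreement figure without further detail. As a rough illustration of the kind of check described, one could compare blinded curator verdicts against the algorithm's top-ranked papers; the data structures and parameter values here are hypothetical.

```python
# Sketch of the agreement check described above: compare the algorithm's
# highest-scored papers against blinded curator verdicts.

def agreement_rate(scored_papers, curator_verdicts, top_n=100):
    """Fraction of the top-N scored papers that curators also judged relevant.

    scored_papers: list of (paper_id, score) tuples
    curator_verdicts: dict mapping paper_id to True/False (relevant or not)
    """
    top = sorted(scored_papers, key=lambda p: p[1], reverse=True)[:top_n]
    agree = sum(1 for paper_id, _ in top if curator_verdicts.get(paper_id, False))
    return agree / len(top)

# If curators marked 85 of the algorithm's 100 highest-scored papers as
# relevant, agreement_rate returns 0.85, the figure reported here.
```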

Using the algorithm to rank papers allowed biocurators to focus on the most relevant papers, increasing productivity by 27 percent and novel data content by 100 percent. "It's a tremendous time-saving step," Davis explains. "With this we can allocate our resources much more effectively by having the team focus on the most informative papers."

There are always outliers in such experiments: occasions when the algorithm assigns a very high score to an article that a human biocurator quickly dismisses as irrelevant. When the team examined those outliers, it could often spot a pattern explaining why the algorithm had mistakenly flagged a paper as important. "Now, we can go back and tweak the algorithm to account for this and fine-tune the system," Wiegers says.
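The article does not say how the outlier review was organized. One simple way to surface such cases for inspection, reusing the hypothetical score and verdict structures from the sketches above, is to filter for high-scoring papers that a curator rejected:

```python
# Hypothetical helper: flag papers the algorithm scored highly but a
# curator dismissed, so the patterns behind the mistakes can be studied
# and the scoring weights adjusted. The cutoff value is illustrative.

def find_false_positives(scored_papers, curator_verdicts, score_cutoff=5.0):
    """Return (paper_id, score) pairs scoring at or above the cutoff
    that a curator explicitly judged irrelevant."""
    return [
        (paper_id, score)
        for paper_id, score in scored_papers
        if score >= score_cutoff and curator_verdicts.get(paper_id) is False
    ]
```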

"We're not at the point yet where a computer can read and extract all the relevant data on its own," Davis concludes, "but having this text-mining process to direct us toward the most informative articles is a huge first step."

More information: Davis AP, Wiegers TC, Johnson RJ, Lay JM, Lennon-Hopkins K, et al. (2013) Text Mining Effectively Scores and Ranks the Literature for Improving Chemical-Gene-Disease Curation at the Comparative Toxicogenomics Database. PLOS ONE 8(4): e58201. doi:10.1371/journal.pone.0058201

User comments (2)

praos
1.8 / 5 (4) Apr 21, 2013
If all researchers use the same algorithm, they will read essentially the same papers, virtually erasing all the rest. That would make the chance rediscovery of an already published but neglected finding nearly impossible, so some random factor should be incorporated into the algorithm.
manifespo
1 / 5 (2) Apr 23, 2013
Wish the owners of information would invest in freeing all scientific studies for anyone to read. $30 to buy a digital ten-page study that costs $0.000001 to copy at the margin. Aaron, R.I.P.