Mining the language of science

November 18, 2011

Categorising textual information. Credit: iStockphoto/Enot Poluskun

(PhysOrg.com) -- Scientists are developing a computer that can read vast amounts of scientific literature, make connections between facts and develop hypotheses.

Ask any biomedical scientist whether they manage to keep on top of reading all of the publications in their field, let alone an adjacent field, and few will say yes. New publications are appearing at a double-exponential rate, as measured by MEDLINE – the US National Library of Medicine’s biomedical bibliographic database – which now lists over 19 million records and adds up to 4,000 new records daily.

For a prolific field such as cancer research, the number of publications could quickly become unmanageable and important hypothesis-generating evidence may be missed. But what if scientists could instruct a computer to help them?

To be useful, a computer would need to trawl through the literature in the same way that a scientist would: reading it to uncover new knowledge, evaluating the quality of the information, looking for patterns and connections between facts, and then generating hypotheses to test. Not only might such a program speed up the progress of scientific discovery but, with the capacity to consider vast numbers of factors, it might even discover information that could be missed by the human brain.

The aim of Dr. Anna Korhonen and researchers in the Natural Language and Information Processing Group in the University of Cambridge’s Computer Laboratory is to develop computers that can understand written language in the same way that humans do. One of the projects she is involved in has recently developed a method of ‘text mining’ one of the most literature-dependent areas of biomedicine: cancer risk assessment of chemicals.

Every year, thousands of new chemicals are developed, any one of which might pose a potential risk to human health. Complex risk assessment procedures are in place to determine the relationship between exposure and the likelihood of developing cancer, but it’s a lengthy process, as Royal Society University Research Fellow Dr Korhonen explained: “The first stage of any risk assessment is a literature review. It’s a major bottleneck. There could be tens of thousands of articles for a single chemical. Performed manually, it’s expensive and, because of the rising number of publications, it’s becoming too challenging to manage.”

CRAB, the tool her team has developed in collaboration with Professor Ulla Stenius’ group at the Institute of Environmental Medicine at Sweden’s Karolinska Institutet, is a novel approach to cancer risk assessment that could help risk assessors move beyond manual literature review.

The approach is based on text-mining technology, which has been pioneered by computer scientists, and involves developing programs that can analyse natural language texts, despite their complexity, inconsistency and ambiguity. The tool Dr. Korhonen has developed with her colleagues is the first text-mining tool aimed at aiding literature review in chemical risk assessment.

At the heart of CRAB, the development of which was funded by the Medical Research Council and the Swedish Research Council among others, is a taxonomy that specifies scientific evidence used in cancer risk assessment, including key events that may result in cancer formation. The system takes the textual content of each relevant MEDLINE abstract and classifies it according to the taxonomy. At the press of a button, a profile is rapidly built for any particular chemical using all of the available literature, describing highly specific patterns of connections between chemicals and toxicity.
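The classify-then-profile idea described above can be illustrated with a toy sketch. This is not CRAB's actual method, which the article does not detail: the three taxonomy categories, the keyword lists, the function names and the sample abstracts below are all hypothetical, and simple keyword matching stands in for whatever classifier the real system uses.

```python
from collections import Counter

# Hypothetical, heavily simplified taxonomy: category -> trigger keywords.
# The real CRAB taxonomy is not given in the article.
TAXONOMY = {
    "genotoxicity": ("mutation", "dna damage", "strand break"),
    "cell proliferation": ("proliferation", "hyperplasia"),
    "oxidative stress": ("oxidative stress", "reactive oxygen"),
}


def classify(abstract):
    """Return the set of taxonomy categories whose keywords appear in the abstract."""
    text = abstract.lower()
    return {
        category
        for category, keywords in TAXONOMY.items()
        if any(keyword in text for keyword in keywords)
    }


def build_profile(abstracts):
    """Count, per category, how many abstracts were assigned to it."""
    counts = Counter()
    for abstract in abstracts:
        # classify() returns a set, so each abstract counts once per category.
        counts.update(classify(abstract))
    return dict(counts)


# Invented sample abstracts for one imaginary chemical.
sample_abstracts = [
    "Exposure induced DNA damage and strand breaks in liver cells.",
    "Treated rats showed hyperplasia and increased proliferation.",
    "The compound raised reactive oxygen species and caused DNA damage.",
    "No adverse effects were observed.",
]

profile = build_profile(sample_abstracts)
for category, count in sorted(profile.items()):
    print(f"{category}: {count} abstract(s)")
```

The resulting profile is just a count of abstracts per category. A real system would need to handle the ambiguity and inconsistency of natural language that keyword matching ignores.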

“Although still under development, the system can be used to make connections that would be difficult to find, even if it had been possible to read all the documents,” added Dr. Korhonen. “In a recent experiment, we studied a group of chemicals with unknown mode of action and used the CRAB tool to suggest a new hypothesis that might explain their male-specific carcinogenicity in the pancreas.”

The tool will be available for end-users via an online web interface. However, research into improving text mining will continue. One of the biggest current challenges is to develop adaptive technology that can be ported easily between different text types, tasks and scientific fields.

One day, rather than being at the mercy of the flourishing rate of publication, scientists will have at their fingertips a system to work alongside them that will not only point them towards those references that are relevant to their search, but will also tell them why.

Provided by University of Cambridge


Nerdyguy
Nov 18, 2011

Rank: 1.3 / 5 (3)
"Scientists are developing a computer that can read vast amounts of scientific literature, make connections between facts and develop hypotheses."

Wait. Stop there. Who gets to decide which are the "facts" upon which this system will build its ultimate conclusions? Setting aside for a moment such emotionally debated topics as climate change, what about the volume of just poorly-designed studies which are later shown to be invalid? Presumably, this would just be a faster method of dispensing bad science to the world. On the other hand, that would make it no different from the current system, except for the speed.
Jeddy_Mctedder
Nov 18, 2011

Rank: not rated yet
combine this with a generalist creative thinker like ken jennings -- let him learn to use this as a tool. the combo could be wildly powerful
Jotaf
Nov 18, 2011

Rank: not rated yet
Nerdyguy: In natural language processing, a crucial issue is dealing with inconsistencies/noise (as pointed out in the article). You simply can't do any processing without it, because you can't model all the subtleties of language and human thinking.

The inconsistencies in the data are presumably treated in the same way as in the natural language (they're bundled together in the articles). So any outlier which doesn't agree with most other studies won't be taken into account, same way as a portion of writing that doesn't make sense to the system.

In the worst-case scenario that the majority of studies are wrong but in exactly the same way, the system can't do better than human scientists would. The general assumption is that repeatable experiments are correct, and you need a very, very good reason to dispute that.

I certainly wouldn't mind an automated companion to give me a general overview of a lot of papers at once!
Nerdyguy
Nov 18, 2011

Rank: 5 / 5 (1)
The inconsistencies in the data are presumably treated in the same way as in the natural language (they're bundled together in the articles). So any outlier which doesn't agree with most other studies won't be taken into account, same way as a portion of writing that doesn't make sense to the system.

In the worst-case scenario that the majority of studies are wrong but in exactly the same way, the system can't do better than human scientists would. The general assumption is that repeatable experiments are correct, and you need a very, very good reason to dispute that.

I certainly wouldn't mind an automated companion to give me a general overview of a lot of papers at once!


Makes sense to me. May still have the same biases as humans, but will be faster.
Sean_W
Nov 18, 2011

Rank: 1 / 5 (2)
What if it comes up with the hypothesis that scientists are bad at statistics or that meta-analysis is a crock or that the peer review process is in terrible need of reform? Will we be able to fire it?
rwinners
Nov 19, 2011

Rank: 1 / 5 (1)
Hey, a thinking computer. What a concept!
hush1
Nov 19, 2011

Rank: not rated yet
...develop computers that can understand written language in the same way that humans do


An ambitious goal. Let me understand how humans understand written language first. Then use this knowledge to help humans understand written language. Then develop computers that can understand written language.

Of course text mining is orders of magnitude below this ambitious goal.
Seeker2
Dec 08, 2011

Rank: not rated yet
Presumably, this would just be a faster method of dispensing bad science to the world.

Maybe we could dispense with bad science, period.
Seeker2
Dec 08, 2011

Rank: not rated yet
What if it comes up with the hypothesis that scientists are bad at statistics or that meta-analysis is a crock or that the peer review process is in terrible need of reform? Will we be able to fire it?

I think I can draw my own conclusions, if appropriate. Just give me the relevant facts. As for making hypotheses, man, that would be scary. Should be good for laughs though.
Seeker2
Dec 08, 2011

Rank: not rated yet
Who gets to decide which are the "facts" upon which this system will build its ultimate conclusions?
A system should be able to gather some facts. Watch out for those ultimate conclusions.
Setting aside for a moment such emotionally debated topics as climate change, what about the volume of just poorly-designed studies which are later shown to be invalid?
Poorly designed studies should be identified before they can spread. Especially controversial issues like climate change.