Digging for data with Chemlist and ChemSpider

Mar 22, 2010

Just like the rest of us, scientists today are swamped with information. As more chemical resources become freely available, text mining applications - previously focused on correctly identifying gene and protein names - are now shifting towards also correctly identifying chemical names. Now database experts have compared two chemical name dictionaries head to head, and report on the payoffs of manual versus automatic data curation in the open access publication, Journal of Cheminformatics.

Chemlist's creators wanted to investigate the effect extensive manual curation of a multi-source chemical dictionary would have on chemical term identification in text. Kristina Hettne and her team based in the Netherlands, together with US-based colleagues, compared Chemlist, a dictionary for identifying small molecules and drugs in text automatically generated from a number of publicly available databases, with a second dictionary extracted from the ChemSpider database which has been curated manually to establish valid chemical name to structure relationships. To compare automatic curation with manual curation, the authors used only the ChemSpider component containing manually curated names and synonyms in their research.

The researchers tested the dictionary from ChemSpider on an annotated corpus and compared the results with those for the Chemlist dictionary. The ChemSpider dictionary of some 80,000 names was less than a third of the size of Chemlist at around 300,000. The ChemSpider dictionary had a precision of 0.43 and recall of 0.19 before filtering and disambiguation, with results of 0.87 and 0.19 after filtering and disambiguation. Meanwhile the Chemlist dictionary scored 0.20 for precision and 0.47 for recall before filtering and disambiguation, and 0.67 and 0.40 for these two measures afterwards.

This means that although ChemSpider achieved the best precision, the Chemlist had a higher recall and the best F-score, a function of a test's accuracy incorporating both precision and recall. "Rule-based filtering and disambiguation is necessary to achieve high precision for both automatically generated and the manually curated dictionaries," Hettne concludes. Antony Williams, project lead for ChemSpider comments "Such validated name-structure dictionaries studied in this work provide a strong foundation for semantic markup technologies, interlinking and various online resources." Both ChemSpider and the chemical databases included in Chemlist continue to grow at high speed, and further investigation is needed to see how this growth affects the performance of the dictionaries.

Explore further: Sweet-smelling breath to help diabetes diagnosis in children

More information:
ChemSpider is available at www.chemspider.com/ and the Chemlist dictionary is freely available as an XML file in Simple Knowledge Organization System format on the web at www.biosemantics.org/chemlist

Automatic vs. manual curation of a multi-source chemical dictionary: the impact on text mining, Kristina M Hettne, Antony J Williams, Erik M van Mulligen, Jos Kleinjans, Valery Tkachenko and Jan A Kors, Journal of Cheminformatics (in press), www.jcheminf.com/

add to favorites email to friend print save as pdf

Related Stories

Researchers create search engine to hunt molecules online

Jul 26, 2007

ChemxSeer, the first publicly available search engine designed specifically for chemical formulae, can sort out when "He" refers to helium and not a person more than nine times out of 10, according to the Penn State College ...

Future of biology rests in harnessing data avalanche

Sep 04, 2008

(PhysOrg.com) -- Like most sciences, biology is inundated with data. However, a group of researchers warns in a Nature feature that the avalanche of biological information is at the point where the discip ...

Patient privacy assured by electronic censor

Jul 24, 2008

Newly developed software will help to allay patients' fears about who has access to their confidential data. Research published today in the open access journal BMC Medical Informatics and Decision Making describes a comp ...

Recommended for you

Heat-conducting plastic developed

11 hours ago

The spaghetti-like internal structure of most plastics makes it hard for them to cast away heat, but a University of Michigan research team has made a plastic blend that does so 10 times better than its conventional ...

Electronic switches on the molecular scale

17 hours ago

A molecular electronic switch is a junction created from individual molecules that can alternate between two or more stable states, making the switch act as a conductor or an insulator. These switches show ...

Mimicking photosynthesis with man-made leaves

17 hours ago

Scientists have long been trying to emulate the way in which plants harvest energy from the sun through photosynthesis. Plants are able to absorb photons from even weak sunlight using light antennae made ...

User comments : 0

Please sign in to add a comment. Registration is free, and takes less than a minute. Read more

Click here to reset your password.
Sign in to get notified via email when new comments are made.