October 29, 2018

Toward language inference in medicine

by Chaitanya Shivade, IBM

Recent years have seen significant progress in natural language understanding by AI, in tasks such as machine translation and question answering. A vital reason behind these developments is the creation of datasets, which machine learning models use to learn and perform a specific task. In the open domain, such datasets often consist of text originating from news articles, followed by human annotations collected on crowd-sourcing platforms such as Crowdflower or Amazon Mechanical Turk.

However, language used in specialized domains such as medicine is very different. The vocabulary a physician uses when writing a clinical note is quite unlike the words in a news article. Language tasks in these knowledge-intensive domains therefore cannot be crowd-sourced, since the annotations demand domain expertise, and collecting annotations from domain experts is very expensive. Moreover, clinical data is privacy-sensitive and hence cannot be shared easily. These hurdles have inhibited the contribution of language datasets in the medical domain. As a result, high-performing algorithms from the open domain have remained largely unvalidated on clinical data.

To address these gaps, we worked with the Massachusetts Institute of Technology to build MedNLI, a dataset annotated by doctors for a natural language inference (NLI) task and grounded in the medical history of patients. Most importantly, we have made it publicly available so that researchers can advance natural language processing in medicine.

We built the dataset with the MIT Critical Data research labs, using clinical notes from their "Medical Information Mart for Intensive Care" (MIMIC) database, which is arguably the largest publicly available database of patient records. The clinicians on our team suggested that the past medical history of a patient contains vital information from which useful inferences can be drawn. Therefore, we extracted the past medical history from clinical notes in MIMIC and presented a sentence from this history to a clinician as a premise. The clinician was then asked to draw on medical expertise and write three sentences: one that was definitely true about the patient given the premise, one that was definitely false, and one that could possibly be true.

Over a few months, we randomly sampled 4,683 such premises and worked with four clinicians to construct MedNLI, a dataset of 14,049 premise-hypothesis pairs. In the open domain, similarly built datasets include the Stanford Natural Language Inference dataset, which was curated with the help of 2,500 workers on Amazon Mechanical Turk and consists of roughly 0.5 million premise-hypothesis pairs whose premise sentences were drawn from captions of Flickr photos. MultiNLI is another such dataset; its premise text comes from specific genres such as fiction, blogs and phone conversations.
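The construction described above, in which each premise yields three hypotheses, can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used to build MedNLI: the names NLIPair, expand_premise and expected_pair_count are hypothetical, and the label names "entailment", "contradiction" and "neutral" are the usual NLI convention, assumed here to correspond to the "definitely true", "definitely false" and "possibly true" sentences.

```python
from dataclasses import dataclass
from typing import List

# Assumed label names (standard NLI convention), in the order
# "definitely true", "definitely false", "possibly true".
LABELS = ("entailment", "contradiction", "neutral")


@dataclass(frozen=True)
class NLIPair:
    """One premise-hypothesis pair with its label (hypothetical structure)."""
    premise: str
    hypothesis: str
    label: str


def expand_premise(premise: str,
                   definitely_true: str,
                   definitely_false: str,
                   possibly_true: str) -> List[NLIPair]:
    """Turn one premise and its three clinician-written sentences into three pairs."""
    hypotheses = (definitely_true, definitely_false, possibly_true)
    return [NLIPair(premise, hyp, label) for hyp, label in zip(hypotheses, LABELS)]


def expected_pair_count(num_premises: int) -> int:
    """Each premise contributes exactly one pair per label."""
    return num_premises * len(LABELS)


if __name__ == "__main__":
    # Placeholder strings stand in for real clinical text.
    pairs = expand_premise("<premise>", "<true sentence>",
                           "<false sentence>", "<possible sentence>")
    for pair in pairs:
        print(pair.label, "|", pair.hypothesis)
    print(expected_pair_count(4683))
```

With three hypotheses per premise, the 4,683 sampled premises account for the 14,049 pairs reported above.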

Dr. Leo Anthony Celi (Principal Scientist for MIMIC) and Dr. Alistair Johnson (Research Scientist) from MIT Critical Data worked with us to make MedNLI publicly available. They created the MIMIC Derived Data repository, to which MedNLI was the first natural language processing dataset contributed. Any researcher with access to MIMIC can also download MedNLI from this repository.

Although of a modest size compared with the open domain datasets, MedNLI is large enough to inform researchers as they develop new machine learning models for language inference in medicine. Most importantly, it presents interesting challenges that call for innovative ideas. Consider a few examples from MedNLI:

In order to conclude entailment in the first example, one should be able to expand the abbreviations ALT, AST, and LFTs; understand that they are related; and further conclude that an elevated measurement is abnormal. The second example depicts a subtle inference of concluding that emergence of an infant is a description of its birth. Finally, the last example shows how common world knowledge is used to derive inferences.

State-of-the-art deep learning algorithms can perform well on language tasks because they can become very good at learning an accurate mapping from inputs to outputs. Thus, training on a large dataset built from crowd-sourced annotations is often a recipe for success. However, these models still lack the ability to generalize to conditions that differ from the ones encountered during training. This is even more challenging in specialized and knowledge-intensive domains such as medicine, where training data is limited and language is much more nuanced.

Finally, although great strides have been made in learning a language task end-to-end, there is still a need for additional techniques that can incorporate expert-curated knowledge bases into these models. For example, SNOMED-CT is an expert-curated medical terminology with 300K+ concepts and the relations between them. In our work on MedNLI, we made simple modifications to existing deep neural network architectures to infuse knowledge from knowledge bases such as SNOMED-CT. However, a large amount of knowledge still remains untapped.
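To give a rough sense of what "infusing" a knowledge base can mean, the sketch below derives a simple relatedness signal between premise and hypothesis tokens from a tiny hand-written lookup table. It is a toy illustration under stated assumptions and not the method used with MedNLI: TOY_KB is a hypothetical stand-in for a resource such as SNOMED-CT, and related and relatedness_matrix are made-up helper names. A matrix like this could, for instance, be supplied to a model as an extra feature alongside word embeddings.

```python
from typing import Dict, List, Set

# Hypothetical toy knowledge base: term -> broader concepts it belongs to.
# A real system would query a curated terminology instead.
TOY_KB: Dict[str, Set[str]] = {
    "alt": {"liver function test"},
    "ast": {"liver function test"},
    "lfts": {"liver function test"},
}


def related(a: str, b: str, kb: Dict[str, Set[str]] = TOY_KB) -> bool:
    """True if both terms share at least one broader concept in the toy KB."""
    return bool(kb.get(a.lower(), set()) & kb.get(b.lower(), set()))


def relatedness_matrix(premise_tokens: List[str],
                       hypothesis_tokens: List[str]) -> List[List[float]]:
    """Rows follow premise tokens, columns follow hypothesis tokens.

    An entry is 1.0 when the toy KB links the two tokens, else 0.0.
    """
    return [[1.0 if related(p, h) else 0.0 for h in hypothesis_tokens]
            for p in premise_tokens]


if __name__ == "__main__":
    matrix = relatedness_matrix(["ALT", "elevated"], ["LFTs", "abnormal"])
    for row in matrix:
        print(row)
```

Here the only link comes from the shared concept between "ALT" and "LFTs"; words absent from the table contribute nothing, which hints at how much knowledge a small lookup leaves untapped.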

We hope MedNLI opens up new directions of research in the natural language processing community.

Provided by IBM

This story is republished courtesy of IBM Research. Read the original story here.

Citation: Toward language inference in medicine (2018, October 29) retrieved 19 April 2024 from https://phys.org/news/2018-10-language-inference-medicine.html

