January 29, 2015

New algorithm can separate unstructured text into topics with high accuracy and reproducibility

by Emily Ayshford, Northwestern University

Much of the data we accumulate sits in large databases of unstructured text. Finding insights among emails, text documents, and websites is extremely difficult unless we can search, characterize, and classify their text in a meaningful way.

One of the leading big data algorithms for finding related topics within unstructured text (an area called topic modeling) is latent Dirichlet allocation (LDA). But when Northwestern University professor Luis Amaral set out to test LDA, he found that it was neither as accurate nor as reproducible as a leading topic modeling algorithm should be.

Using his network analysis background, Amaral, professor of chemical and biological engineering in Northwestern's McCormick School of Engineering and Applied Science, developed a new topic modeling algorithm that has shown very high accuracy and reproducibility during tests. His results, with co-author Konrad Kording, associate professor of physical medicine and rehabilitation, physiology, and applied mathematics at Northwestern, were published Jan. 29 in Physical Review X.

Topic modeling algorithms take unstructured text and find a set of topics that can be used to describe each document in the set. They are the workhorses of big data science, used as the foundation for recommendation systems, spam filtering, and digital image processing. The LDA topic modeling algorithm was developed in 2003 and has been widely used for academic research and for commercial applications, like search engines.

When Amaral explored how LDA worked, he found that the algorithm produced different results each time it was run on the same set of data, and that those results were often inaccurate. Amaral and his group tested LDA by running it on documents they created that were written in English, French, Spanish, and other languages. Because each document was written in a single language, the group was able to prevent text overlap among documents from different groups.

"In this simple case, the algorithm should be able to perform at 100 percent accuracy and reproducibility," he said. But when LDA was used, it separated these documents into similar groups with only 90 percent accuracy and 80 percent reproducibility. "While these numbers may appear to be good, they are actually very poor, since they are for an exceedingly easy case," Amaral said.
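The article does not say how reproducibility was scored. As a rough illustration only, one common way to compare two runs of a grouping algorithm is the fraction of document pairs on which both runs agree (the Rand index). The function name pairwise_agreement and the sample runs below are hypothetical and are not taken from the study.

```python
from itertools import combinations


def pairwise_agreement(labels_a, labels_b):
    """Fraction of document pairs that two runs treat the same way.

    A pair counts as an agreement when both runs put the two documents
    in the same group, or both runs put them in different groups.
    This is the Rand index, used here as an assumed stand-in for
    whatever reproducibility measure the study used.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both runs must label the same documents")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 1.0
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Hypothetical group labels for six documents from three runs.
run_1 = [0, 0, 1, 1, 2, 2]
run_2 = [1, 1, 0, 0, 2, 2]  # same grouping as run_1, group names permuted
run_3 = [0, 0, 1, 2, 2, 2]  # one document assigned to a different group

print(pairwise_agreement(run_1, run_2))  # identical groupings
print(pairwise_agreement(run_1, run_3))  # partial agreement
```

Because the score looks only at whether pairs of documents are grouped together, it does not change when an algorithm numbers its groups differently from one run to the next.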

To create a better algorithm, Amaral took a network approach. The result, called TopicMapping, begins by preprocessing data to replace words with their stems (so "star" and "stars" would be considered the same word). It then builds a network of connected words and identifies "communities" of related words (just as one could look for communities of people on Facebook). The words within a given community define a topic.
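The steps described above (stem the words, link words that appear together, find groups of linked words) can be sketched in a few lines. This is a toy illustration under loud assumptions, not the TopicMapping implementation: the stem function is a crude suffix-stripper, and connected components stand in for real community detection, which is only adequate because the sample documents share no words across languages. All function names and sample sentences are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def stem(word):
    # Crude stand-in for a real stemmer: drop one trailing "s" from longer words.
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def build_word_network(documents):
    """Link every pair of stems that occur in the same document."""
    edges = defaultdict(set)
    for doc in documents:
        stems = {stem(w) for w in doc.lower().split()}
        for s in stems:
            edges[s]  # register the node even if it ends up with no links
        for a, b in combinations(sorted(stems), 2):
            edges[a].add(b)
            edges[b].add(a)
    return edges


def word_communities(edges):
    """Connected components, an assumed simplification of community detection."""
    seen, groups = set(), []
    for start in sorted(edges):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            group.add(node)
            stack.extend(edges[node] - seen)
        groups.append(group)
    return groups


documents = [
    "cats sleep",
    "the cat sleeps",
    "gatos duermen",
    "el gato duerme",
]
topics = word_communities(build_word_network(documents))
for topic in topics:
    print(sorted(topic))
```

On real text, words from different topics do co-occur, so nearly everything would fall into one connected component; separating such a network requires a genuine community detection method rather than this shortcut.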

The algorithm was able to perfectly separate the documents according to language and was able to reproduce its results. It also had high accuracy and reproducibility when separating 23,000 scientific papers and 1.2 million Wikipedia articles by topic.

These results show the need for more testing of big data algorithms and more research into making them more accurate and reproducible, Amaral said.

"Companies that make products must show that their products work," he said. "They must be certified. There is no such case for algorithms. We have a lot of uninformed consumers of big data algorithms that are using tools that haven't been tested for reproducibility and accuracy."

Journal information: Physical Review X

Provided by Northwestern University

Citation: New algorithm can separate unstructured text into topics with high accuracy and reproducibility (2015, January 29) retrieved 1 May 2024 from https://phys.org/news/2015-01-algorithm-unstructured-text-topics-high.html

This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no part may be reproduced without the written permission. The content is provided for information purposes only.
