November 9, 2012

Researcher develops computational text analysis methods that work regardless of language or domain


The Internet is awash with text, and databases grow larger by the minute. How can this vast amount of textual data be systematically analysed and managed when the number of languages, domains, styles and dialects is practically countless? The task is too much for the human brain, and traditional methods of textual analysis fall short. What is needed are statistical methods, data mining and machine learning.

Mari-Sanna Paukkeri has studied how textual data can be processed and analysed automatically with machine learning methods. In her doctoral dissertation for the Aalto University Department of Information and Computer Science, Paukkeri has developed computational methods for text processing independent of language or domain.

With these methods, algorithms mine textual data sets for statistical dependencies and structures, from which specific properties of the texts can then be extracted.

"Languages appear to be alike: sequential symbols form words, which build up to sentences. Large masses of text are examined for co-occurrences and structures in the use of language in order to make sense of individual sentences and words," says Paukkeri, summing up the principle of computational linguistics.

"Precisely these co-occurrences in language enable the computational study of texts regardless of the language or domain."
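As a rough illustration of what counting co-occurrences can look like in practice, the following minimal Python sketch tallies how often two words appear within a short window of each other. It is not taken from the dissertation: the function names, the window size and the toy sentences are assumptions made for this example, and splitting on word characters is a simplification that would not suit languages written without spaces between words.

```python
from collections import Counter
import re


def tokenize(text):
    # Unicode-aware split on word characters; no language-specific rules.
    return re.findall(r"\w+", text.lower())


def cooccurrence_counts(sentences, window=2):
    """Count unordered word pairs that occur within `window` tokens of each other."""
    counts = Counter()
    for sentence in sentences:
        tokens = tokenize(sentence)
        for i, left in enumerate(tokens):
            # Look only forward so each pair of positions is counted once.
            for right in tokens[i + 1 : i + 1 + window]:
                if left != right:
                    counts[tuple(sorted((left, right)))] += 1
    return counts


# Toy corpus (hypothetical); the same code runs unchanged on English and Finnish.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "kissa istui matolla",
]
counts = cooccurrence_counts(corpus, window=2)
print(counts.most_common(5))
```

Nothing in the sketch depends on a dictionary or a grammar, which is the property the quote points to: the only input is the text itself.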

Unsupervised machine learning extracts relevant information from massive textual data sets

Paukkeri has studied in particular the applicability of unsupervised machine learning to natural language processing. The field has traditionally relied on rule-based methods, in which the words and structures to look for are manually predefined in the data processing models.

"In unsupervised machine learning methods, the data set is not manually pre-processed in any way: the algorithms are left to their own devices to find out what the data is like and what kind of statistical dependencies and structures it holds. The methods are not told whether they are performing correctly or not; they work independently, without manual labour," explains Paukkeri.

Paukkeri finds an analogy for unsupervised machine learning in the way a child learns to use language.

"A child does not dabble in language grammar first, but imitates, experiments and combines fragments."

In her dissertation, Paukkeri applies a method called Likey, co-developed with her colleagues and her supervisor, Docent Timo Honkela of Aalto University, to keyphrase and keyword extraction from text documents in 11 different languages.

"Likey finds out how common certain words and pairs, threes and fours of words are in a data set. This way it defines the keywords and phrases for a specific document – solely on the basis of their frequency and context in the text."
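The idea of ranking words and word sequences by how common they are can be sketched in a few lines of Python. This is not the Likey implementation, and its scoring is an assumption: the sketch counts sequences of one to four words and ranks each by its relative frequency in the document divided by its smoothed relative frequency in a reference text. The function names, the add-one smoothing and the sample texts are all hypothetical.

```python
from collections import Counter
import re


def tokenize(text):
    # Unicode-aware split on word characters; no language-specific rules.
    return re.findall(r"\w+", text.lower())


def ngram_counts(tokens, max_n=4):
    """Count all word sequences of length 1..max_n."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i : i + n])] += 1
    return counts


def keyphrases(document, reference, top_k=5, max_n=4):
    """Rank phrases that are frequent in `document` but rare in `reference`."""
    doc = ngram_counts(tokenize(document), max_n)
    ref = ngram_counts(tokenize(reference), max_n)
    doc_total = sum(doc.values())
    ref_total = sum(ref.values())
    scores = {}
    for phrase, count in doc.items():
        doc_freq = count / doc_total
        # Add-one smoothing (an assumption) so unseen phrases do not divide by zero.
        ref_freq = (ref[phrase] + 1) / (ref_total + 1)
        scores[phrase] = doc_freq / ref_freq
    # Highest score first; ties are broken alphabetically for a stable order.
    return sorted(scores, key=lambda p: (-scores[p], p))[:top_k]


# Hypothetical sample texts.
document = "machine learning finds structure in text. machine learning needs no rules."
reference = "the cat sat in the house. the dog needs no food in the house. text is text."
print(keyphrases(document, reference, top_k=3))
```

With a realistically large reference text, everyday words score low because they are common everywhere, while phrases characteristic of the document rise to the top.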

An everyday example of highly refined unsupervised computational text processing is Google's translation application. Its translations are based on automatic analysis of the enormous amount of text the search engine has at its disposal.

"Companies also have an awful lot of text tucked away in their databases, usually with only a simple search functionality to utilise them. These databases exceed human management abilities, but with my methods they could be categorised and analysed."

Global companies in particular could benefit from methods with which to process their textual data in all of their working languages around the world.

The subjective variation of language use is also within the grasp of computational methods. Paukkeri has studied the automatic assessment of the difficulty and comprehensibility of texts aimed at both experts and lay people. Her experiments use medical texts, but the method is, again, independent of the domain.

"A search engine could predict the knowledge level of each user and customise the difficulty of the search results accordingly."

Language-independent text mining could also, according to Paukkeri, contribute to the discovery of language universals, features that are common to all languages. They could be mined from data sets consisting of hundreds or even thousands of languages.

"Who says we cannot apply our knowledge of structural and lexical similarities between languages for machine learning systems? That is what people do as well when learning new languages," ponders Paukkeri.

Provided by Aalto University

Citation: Researcher develops computational text analysis method made possible regardless of language or domain (2012, November 9) retrieved 16 July 2024 from https://phys.org/news/2012-11-text-analysis-method-language-domain.html

