Researcher develops computational text analysis methods that work regardless of language or domain

Nov 09, 2012

The Internet is awash with text. Databases swell by the minute. How can this vast amount of textual data be systematically analysed and managed when the number of languages, domains, styles and dialects is practically countless? The task is too much for the human brain, and traditional methods of textual analysis fall short. What we need are statistical methods, data mining and machine learning.

Mari-Sanna Paukkeri has studied how textual data can be processed and analysed automatically with machine learning methods. In her doctoral dissertation for the Aalto University Department of Information and , Paukkeri has developed computational methods for text processing independent of language or domain.

With these methods, textual data sets are mined with algorithms for statistical dependencies and structures, from which specific properties of texts can then be extracted.

"Languages appear to be alike: sequential symbols form words, which build up to sentences. Large masses of text are examined for co-occurrences and structures in the use of language in order to make sense of individual and words," says Paukkeri, summing up the principle.

"Precisely these co-occurrences in language enable the computational study of texts regardless of the language or domain."
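The idea of mining co-occurrences without language-specific rules can be illustrated with a minimal Python sketch. It is not taken from the dissertation: the function name `cooccurrence_counts`, the whitespace tokenisation, the window size and the toy English and Finnish sentences are all assumptions made for illustration.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count unordered word pairs that occur within `window` tokens of each other.

    Hypothetical helper: tokens are split on whitespace only, so no
    language-specific rules or dictionaries are involved.
    """
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, left in enumerate(tokens):
            # Pair each token with the next `window` tokens to its right.
            for right in tokens[i + 1 : i + 1 + window]:
                if left != right:
                    counts[tuple(sorted((left, right)))] += 1
    return counts

# Toy corpus (assumed data): the same procedure runs on English and Finnish.
corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "kissa juo maitoa",
    "koira juo vettä",
]
counts = cooccurrence_counts(corpus)
print(counts.most_common(3))
```

The same code runs unchanged on both languages, which is the point of the quote: only the statistics of which symbols appear together are used.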

Unsupervised machine learning extracts relevant information from massive textual data sets

Paukkeri has especially studied the applicability of unsupervised machine learning to natural language processing. The field has traditionally relied on rule-based methods, in which the words and structures to be sought are manually pre-defined for the data processing models.

"In unsupervised machine learning methods, the data set is not manually pre-processed in any way: the algorithms are left to their own devices to find out what the data is like and what kind of statistical dependencies and structures it holds. The methods are not told whether they are performing correctly or not; they work independently, without manual labour," explains Paukkeri.

Paukkeri finds an analogy for unsupervised machine learning in the way a child learns to use language.

"A child does not dabble in language grammar first, but imitates, experiments and combines fragments."

In Paukkeri's dissertation, a method called Likey, co-developed with her colleagues and her supervisor, Docent Timo Honkela of Aalto University, is applied to keyphrase and keyword extraction from text documents in 11 different languages.

"Likey finds out how common certain words and pairs, threes and fours of words are in a data set. This way it defines the keywords and phrases for a specific document – solely on the basis of their frequency and context in the text."
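A rough Python sketch can make the frequency-based idea concrete. This is not the Likey implementation, and the article does not describe its scoring; the comparison against a reference text, the add-one smoothing, and the names `ngram_counts` and `score_phrases` are assumptions chosen for illustration.

```python
from collections import Counter

def ngram_counts(tokens, max_n=4):
    """Count all word sequences of length 1 to max_n (words, pairs, threes, fours)."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i : i + n])] += 1
    return counts

def score_phrases(document, reference, max_n=4):
    """Rank phrases by how much more frequent they are in the document
    than in a reference text (assumed scoring, not the published method)."""
    doc_counts = ngram_counts(document.lower().split(), max_n)
    ref_counts = ngram_counts(reference.lower().split(), max_n)
    doc_total = sum(doc_counts.values())
    ref_total = sum(ref_counts.values())
    scores = {}
    for phrase, count in doc_counts.items():
        doc_rate = count / doc_total
        # Add-one smoothing so phrases unseen in the reference do not divide by zero.
        ref_rate = (ref_counts[phrase] + 1) / (ref_total + 1)
        scores[phrase] = doc_rate / ref_rate
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy inputs (assumed data).
document = "text mining finds structure in text and text mining scales"
reference = "the cat sat in the house and the dog sat in the garden"
ranked = score_phrases(document, reference)
print(ranked[:3])
```

Phrases that are common in the document but rare in the reference text rise to the top, while everyday words such as "in" and "and" sink. Because only counts are used, the same sketch applies to any language that can be split into tokens.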

An everyday example of highly refined unsupervised computational text processing is Google's translation application. Its translations are based on automatic analysis of the enormous amount of text the search engine has at its disposal.

"Companies also have an awful lot of text tucked away in their databases, usually with only a simple search functionality to utilise them. These databases exceed human management abilities, but with my methods they could be categorised and analysed."

Global companies in particular could benefit from methods with which to process their textual data in all of their working languages around the world.

The subjective variation of language use is also within the grasp of these methods. Paukkeri has studied the automatic assessment of the difficulty and comprehensibility of texts aimed at both experts and lay people. Her experiments used medical texts, but the method is, again, independent of the domain.

"A search engine could predict the knowledge level of each user and customise the difficulty of the search results accordingly."

Language-independent text mining could also, according to Paukkeri, contribute to the discovery of universals, features that are common to all languages. They could be mined from data sets consisting of hundreds or even thousands of languages.

"Who says we cannot apply our knowledge of structural and lexical similarities between languages for systems? That is what people do as well when learning new languages," ponders Paukkeri.
