# In search of the key word: Bursts of certain words within a text are what make them keywords

##### Jul 17, 2012

Human beings have the ability to convert complex phenomena into a one-dimensional sequence of letters and put it down in writing. In this process, keywords serve to convey the content of the text. How letters and words correlate with the subject of a text is something Eduardo Altmann and his colleagues from the Max Planck Institute for the Physics of Complex Systems have studied with the help of statistical methods. They discovered that what denotes keywords is not the fact that they appear very frequently in a given text. It is that they are found in greater numbers only at certain points in the text. They also discovered that relationships exist between sections of text which are distant from each other, in the sense that they preferentially use the same words and letters.

The Dresden-based scientists mathematically studied the semantic properties of texts by translating ten different English texts into various codes. One of the chosen texts was the English edition of Leo Tolstoy's "War and Peace".

One example of what the scientists did was translate the letters in a text into a binary sequence. They replaced all vowels with 1 and all consonants with 0. By employing additional codes, the scientists examined different levels of the text, from individual vowels and letters to whole words. In so doing, it was possible to identify repeating patterns within the text as a whole. Such correlation within a text is referred to as long-range correlation. It indicates whether two letters located at arbitrarily distant points in the text are connected with each other. For example, when we find a letter "W" at a certain point, there is a measurably higher probability that we will find the letter "W" again a few pages later.
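As a rough illustration of the binary translation and of what correlation at a distance means, the sketch below maps vowels to 1 and consonants to 0 and computes a plain sample autocorrelation at a chosen lag. The function names, the treatment of "y" as a consonant, the skipping of non-letters and the choice of estimator are all assumptions made for this example; the article does not say which statistic the researchers actually used.

```python
import string

VOWELS = set("aeiou")  # assumption: "y" is counted as a consonant


def to_binary(text):
    """Map each ASCII letter to 1 (vowel) or 0 (consonant); skip everything else."""
    return [
        1 if ch in VOWELS else 0
        for ch in text.lower()
        if ch in string.ascii_lowercase
    ]


def autocorrelation(seq, lag):
    """Sample autocorrelation of a numeric sequence at the given lag."""
    n = len(seq)
    if not 0 < lag < n:
        raise ValueError("lag must satisfy 0 < lag < len(seq)")
    mean = sum(seq) / n
    var = sum((x - mean) ** 2 for x in seq) / n
    if var == 0:
        return 0.0  # a constant sequence carries no correlation signal
    cov = sum((seq[i] - mean) * (seq[i + lag] - mean) for i in range(n - lag)) / (n - lag)
    return cov / var


if __name__ == "__main__":
    bits = to_binary("War and Peace")
    print(bits)
    print(autocorrelation(bits, 2))
```

On a full-length text one would evaluate `autocorrelation` over a wide range of lags; a value that stays measurably above zero at large lags is the kind of signal the term long-range correlation refers to.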

"Understandably enough, if a certain point in the book talks about war, there is a high probability that the word war will also appear a few pages later. What is surprising is that we also find this higher probability at the level of individual letters," says Altmann.

## Keywords are more frequent in certain passages of text

The scientists found this long-range correlation not only between letters, but also within higher linguistic levels, such as words. Within individual levels, the correlation remains when looking at different texts. "What we find much more interesting is to examine how the correlation changes between the levels," says Altmann. Long-range correlation enables the scientists to draw conclusions about the extent to which certain words are connected to a topic. "Even the connection between a word and the letters it is composed of can be analysed in this way," explains Altmann.

Furthermore, the scientists also studied what is known as "burstiness", which describes whether a pattern of characters occurs with increased frequency within a passage of text. It shows, for instance, whether a word comes up more often in a certain section of the text than elsewhere. The more frequently a certain word is used in a passage, the more likely it is that the word is representative of a certain subject.
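One way to put a number on burstiness is to look at the gaps between successive occurrences of a word. The sketch below uses a commonly used coefficient, B = (σ − μ) / (σ + μ), where μ and σ are the mean and standard deviation of those gaps. This is a stand-in chosen for illustration, not necessarily the measure used in the study, and the function names and the simple tokenizer are assumptions.

```python
import re
from statistics import mean, pstdev


def word_positions(text, word):
    """Return the token indices at which `word` occurs (case-insensitive)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    target = word.lower()
    return [i for i, tok in enumerate(tokens) if tok == target]


def burstiness(positions):
    """Burstiness coefficient of the gaps between sorted occurrence positions.

    Returns a value in [-1, 1): -1 for perfectly regular spacing,
    around 0 for random spacing, and positive values for clustered occurrences.
    """
    if len(positions) < 2:
        raise ValueError("need at least two occurrences")
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    mu = mean(gaps)
    sigma = pstdev(gaps)
    return (sigma - mu) / (sigma + mu)


if __name__ == "__main__":
    print(burstiness([0, 10, 20, 30]))                      # evenly spaced
    print(burstiness([0, 1, 2, 3, 100, 101, 102, 103]))     # two tight clusters
```

Evenly spaced occurrences give the lowest possible value, while occurrences packed into a few passages push the coefficient above zero.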

The scientists demonstrated that certain words come up repeatedly throughout a text but do not appear in bursts in any given passage. Although these words do exhibit long-range correlation, they are not closely related to the topic at hand. "Articles are the best examples of these. They come up very frequently in every text, but they are not crucial in conveying a given topic," says Altmann.
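The contrast between a frequent but evenly spread word and a bursty one can be illustrated by cutting a text into fixed-size chunks and comparing the counts per chunk. The sketch below uses the variance-to-mean ratio of those counts (the index of dispersion) as a simple indicator. The toy token list, the chunk size and the function names are invented for this example, and the statistic is an illustrative choice rather than the researchers' method.

```python
from statistics import mean, pvariance


def chunk_counts(tokens, word, chunk_size):
    """Count occurrences of `word` in consecutive chunks of `chunk_size` tokens."""
    return [
        tokens[i:i + chunk_size].count(word)
        for i in range(0, len(tokens), chunk_size)
    ]


def dispersion_index(counts):
    """Variance-to-mean ratio of the counts; higher means more clustered."""
    mu = mean(counts)
    if mu == 0:
        raise ValueError("word does not occur in the text")
    return pvariance(counts) / mu


if __name__ == "__main__":
    # Toy data: "the" appears twice in every chunk, "war" only in the first one.
    war_passage = ["the", "war", "the", "war", "war", "war"]
    ball_passage = ["the", "ball", "the", "dance", "music", "night"]
    tokens = war_passage + ball_passage * 3

    for word in ("the", "war"):
        counts = chunk_counts(tokens, word, chunk_size=6)
        print(word, counts, dispersion_index(counts))
```

In this toy text the article is the more frequent word overall, yet its counts do not vary between chunks, whereas the topical word is concentrated in a single chunk and therefore scores much higher.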

## Statistical text analysis works irrespective of language

Whereas both letters and words exhibit long-range correlation, it is rare for letters to appear in bursts at certain points in a text. "It is, in fact, very rare for a letter to be as closely connected with a topic as the word it forms a part of. In a manner of speaking, letters can be used more flexibly," explains Altmann. An "a", for example, can be a part of a great many words that have no connection with one and the same topic.

The scientists employed statistical text analysis as an easy way of identifying the defining words of a given text. "By so doing, it is absolutely irrelevant which language the text is written in. The only thing that matters is the story and not language-specific rules," says Altmann. Their findings could be used in future to improve Internet search engines, and they could also help to analyse texts and identify plagiarism.


More information: Eduardo G. Altmann, Giampaolo Cristadoro and Mirko Degli Esposti, On the origin of long-range correlations in texts, PNAS, July 2, 2012, doi: 10.1073/pnas.1117723109
