Linguists to re-think reason for short words

Jan 25, 2011 by Lin Edwards

(PhysOrg.com) -- Linguists have thought for many years that the length of words is related to their frequency of use, with short words used more often than long ones. Now researchers in the US have shown that word length is more closely related to the amount of information words carry than to how often they are used.

A link between the length of words and how frequently they are used was first proposed in 1935 by George Kingsley Zipf, a Harvard University linguist and philologist. Zipf's idea was that people would tend to shorten words they used often, to save time in writing and speaking. The relationship seems intuitive, and it seems to apply to many languages, with short words such as “the”, “a”, “to”, “and”, “so” (and equivalents in other languages) being frequently used.

Researchers at the Massachusetts Institute of Technology (MIT), led by Steven Piantadosi, tested the Zipf relationship by analyzing word use in 11 European languages. They analyzed digitized texts for correlations between words by counting how often all pairs of words occurred in sequence. This information was then used to estimate the probability of a word occurring after a given previous word or sequence of words. They assumed that the more predictable a word is, the less information it conveys, and estimated each word's information content using information theory, in which the information content is proportional to the negative logarithm of the probability of the word occurring.
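The counting procedure described above can be illustrated with a minimal sketch. This is not the researchers' code: the toy corpus, the function name bigram_surprisal, and the restriction to a single preceding word are all assumptions made for illustration. It estimates the probability of each word given the word before it, converts that probability to bits with the negative base-2 logarithm, and averages the result over every occurrence of the word.

```python
import math
from collections import Counter, defaultdict


def bigram_surprisal(tokens):
    """Mean information content (in bits) of each word, averaged over all
    of its occurrences, with each occurrence conditioned on the previous word.

    Hypothetical helper for illustration only; probabilities are plain
    maximum-likelihood estimates with no smoothing.
    """
    # How often each word appears as a predecessor (the last token has no successor).
    context_counts = Counter(tokens[:-1])
    # How often each (previous word, word) pair occurs in sequence.
    pair_counts = Counter(zip(tokens, tokens[1:]))

    total_bits = defaultdict(float)
    occurrences = Counter()
    for (prev, word), n in pair_counts.items():
        p = n / context_counts[prev]           # estimate of P(word | prev)
        total_bits[word] += n * -math.log2(p)  # information = -log2(probability)
        occurrences[word] += n

    return {w: total_bits[w] / occurrences[w] for w in total_bits}


# Toy corpus (an assumption; real estimates need very large digitized texts).
TOY_CORPUS = "the cat sat on the mat and the dog sat on the rug".split()
info = bigram_surprisal(TOY_CORPUS)

for word in sorted(info, key=info.get):
    print(f"{word:>4}  length={len(word)}  bits={info[word]:.2f}")
```

In this toy corpus "the" always follows "on" or "and", so it is fully predictable and carries 0 bits, while "cat", "mat", "dog" and "rug" each follow "the" one time in four and carry 2 bits. A corpus this small says nothing about real languages; it only shows how predictability is turned into an information estimate.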

Piantadosi said that if word length is directly related to information content, this would make the transmission of information through language more efficient and also make speech and written texts easier to understand. This is because shorter words, carrying less information, would be scattered through the speech, essentially “smoothing out” the information density and delivering the important information at a steady rate.

The studies suggest that short words are in fact the least informative and most predictable words rather than the most often used, and that word length is more closely related to the information words contain.

The paper is soon to be published in the Proceedings of the National Academy of Sciences (PNAS). Steven Piantadosi is in the PhD program at MIT’s Department of Brain and Cognitive Sciences.

More information: Piantadosi, S. T., et al. Proceedings of the National Academy of Sciences (2011). PNAS paper will appear online at dx.doi.org/10.1073/pnas.1012551108

User comments : 8

A_Paradox
not rated yet Jan 25, 2011
I wonder how this relates to the need for a certain degree of redundancy in natural languages. Redundancy is required because speech often occurs in a noisy context. Noisy sound is unpredictable and can take various forms in that it may be pulses of sound which can have different durations and different dynamic variations, or it can be persistent sound which masks certain frequencies.

One example of simple redundancy is the prepositions [in, at, on, near, into, out, etc] which can sometimes add essential clarity to utterances but often do not add much meaning. And in European languages for example there are 'necessary' agreements - embodied as inflections - between adjectives and the nouns they are qualifying. The success of English with its relatively few such inflections [compared to Russian for example] shows that much of this is just redundant. So why is it there?
that_guy
1 / 5 (1) Jan 25, 2011
This article illustrates throwing out the baby with the bathwater. Scientists should be the most open to seeing grays instead of black and whites.

Shouldn't this scientist consider all possible explanations? Certainly compound words and words with suffixes and prefixes are going to be longer and convey more meaning through building blocks (Grandmother, defensive, flammable, chemical names).

The usage argument still applies, because less "meaningful" words are necessary to use more often in speech to give it context.

Sure better examples
or
I'm sure that there are better examples than this one

Longer words with more information have to be used less, because their specific meanings don't apply as often. I would use 'inflammable' less than 'to' because it doesn't apply in as many situations.

and...lets face it, we think of simple words as simple things. Car and automobile mean the same thing even though this study would consider one to have more meaning
antialias
not rated yet Jan 26, 2011
While the placements of the words might not convey information many of the 'small' words denote relationships between the intrinsically information carrying (i.e. long) words. Such modifiers can have huge impact on the _type_ of information conveyed without changing the _amount_ of information conveyed
(e.g. "the tire is in the trunk" vs. "the tire is on the trunk")

The information carrying capacity of words (especially the short ones) can't be judged only by probability of occurrence since they are always embedded (quite literally) in a context.
A_Paradox
5 / 5 (1) Jan 27, 2011
@that_guy
and...lets face it, we think of simple words as simple things. Car and automobile mean the same thing even though this study would consider one to have more meaning


Long ago, when I was a kid, the wonder of television revealed to my little British soul that Americans always said "automobile" when they meant "car". Nowadays I never hear the word automobile in an American movie or TV program, it is always "car".

Perhaps this change has something to do with the [apparent] fact that Americans also commonly use[d?] the word "car" when referring to railway carriages and trams [as in "Streetcar named Desire"]. The decline of railways as a people transport medium maybe opened the way for car to displace automobile as it has.
po6ert
not rated yet Jan 30, 2011
Latin, a highly precise language, uses many longer words to make very subtle distinctions.
De gustibus non disputandum est covers a lot of territory in few words
A_Paradox
not rated yet Feb 01, 2011
po6ert,

the year I studied Latin [care of some aged soul we called "Yob"] I achieved an overall mark of 36%. Luckily nobody of importance to me thought that this was of any particular importance.

On-line translators this evening seem to imply that "est" in the quote should come before "disputandum". 36% notwithstanding, I think that makes better sense too ... :-)
frajo
not rated yet Feb 01, 2011
De gustibus non disputandum est

On-line translators this evening seem to imply that "est" in the quote should come before "disputandum".
Online translators cannot be trusted for more than single words.
In fact, you have to realize that ancient times were essentially times of remembering the spoken word. No citation databases, no books "Latin for dummies". Thus, the sound of the spoken word was of utmost importance. (The oldest European literature, Odyssey and Iliad, is completely written in hexameters.)
And now listen to yourself reciting first "de gustibus non disputandum est" and then "De gustibus non est disputandum". The first is a hexameter, sticking to the ear, while the second is of the kind you have forgotten before the next adage.
eryksun
not rated yet Feb 06, 2011
It's RISC for human language instead of processor instruction sets.
