Linguists to re-think reason for short words

Linguists have long thought that the length of words is related to how frequently they are used, with short words used more often than long ones. Now researchers in the US have shown that word length is more closely related to the amount of information words carry than to their frequency of use.

A link between the length of words and how frequently they are used was first proposed in 1935 by George Kingsley Zipf, a Harvard University linguist and philologist. Zipf's idea was that people would tend to shorten words they used often, to save time in writing and speaking. The relationship seems intuitive, and it seems to apply to many languages, with short words such as “the”, “a”, “to”, “and”, “so” (and equivalents in other languages) being frequently used.

Researchers at the Massachusetts Institute of Technology (MIT), led by Steven Piantadosi, tested the Zipf relationship by analysing word use in 11 European languages. They analyzed digitized texts for correlations between words by counting how often all pairs of words occurred in sequence. This information was then used to estimate the probability of words occurring after given previous words or sequences of words. They made the assumption that the more predictable a word is, the less information it conveys, and estimated the information content from information theory, which says the information content is proportional to the negative logarithm of the probability of a word occurring.
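The estimate described above can be illustrated with a minimal sketch. It is not the researchers' code: the toy sentence and the function name `average_information` are hypothetical, and only the immediately preceding word is used as context, whereas the study also considered longer sequences. For each word it averages the negative base-2 logarithm of the word's estimated probability given the previous word.

```python
import math
from collections import Counter, defaultdict

def average_information(tokens):
    """Mean information (in bits) of each word given the word before it."""
    # How often each (previous word, word) pair occurs in sequence.
    pair_counts = Counter(zip(tokens, tokens[1:]))
    # How often each word appears as a preceding context.
    context_counts = Counter(tokens[:-1])

    totals = defaultdict(float)   # summed information per word
    occurrences = Counter()       # occurrences of the word with a known context
    for (prev, word), n in pair_counts.items():
        p = n / context_counts[prev]          # estimate of P(word | prev)
        totals[word] += n * -math.log2(p)     # information = -log2(probability)
        occurrences[word] += n
    return {w: totals[w] / occurrences[w] for w in totals}

# Hypothetical toy corpus, far too small for any real conclusion.
tokens = "the cat sat on the mat and the dog sat on the rug".split()
info = average_information(tokens)
for word in sorted(info, key=info.get):
    print(f"{word:>4}  length={len(word)}  bits={info[word]:.2f}")
```

In this toy sentence, words that are fully predictable from the preceding word (such as "the" after "on") come out at zero bits, while the nouns that follow "the" carry more. Real estimates would need large corpora and some handling of word pairs never seen in the data.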

Piantadosi said that if word length is directly related to information content, this would make the transmission of information through language more efficient and also make speech and written texts easier to understand. This is because shorter words, carrying less information, would be scattered through the speech, essentially “smoothing out” the information density and delivering the important information at a steady rate.

The studies suggest that the short words are in fact the least informative and most predictable words rather than the most often used, and that word length is more closely related to the information they contain.

The paper is soon to be published in the Proceedings of the National Academy of Sciences (PNAS). Steven Piantadosi belongs to the PhD program with MIT’s Department of Brain and Cognitive Sciences.


More information: Piantadosi, S. T., et al. Proceedings of the National Academy of Sciences (2011). PNAS paper will appear online at


Citation: Linguists to re-think reason for short words (2011, January 25) retrieved 20 October 2019 from

User comments

Jan 25, 2011
I wonder how this relates to the need for a certain degree of redundancy in natural languages. Redundancy is required because speech often occurs in a noisy context. Noisy sound is unpredictable and can take various forms in that it may be pulses of sound which can have different durations and different dynamic variations, or it can be persistent sound which masks certain frequencies.

One example of simple redundancy is the prepositions [in, at, on, near, into, out, etc] which can sometimes add essential clarity to utterances but often do not add much meaning. And in European languages for example there are 'necessary' agreements - embodied as inflections - between adjectives and the nouns they are qualifying. The success of English with its relatively few such inflections [compared to Russian for example] shows that much of this is just redundant. So why is it there?

Jan 25, 2011
This article illustrates throwing out the baby with the bathwater. Scientists should be the most open to seeing grays instead of black and whites.

Shouldn't this scientist consider all possible explanations? Certainly compound words and words with suffixes and prefixes are going to be longer and convey more meaning through building blocks (Grandmother, defensive, flammable, chemical names).

The usage argument still applies, because less "meaningful" words are necessary to use more often in speech to give it context.

I'm sure that there are better examples than this one.

Longer words with more information have to be used less, because their specific meanings don't apply as often. I would use 'inflammable' less than 'to' because it doesn't apply in as many situations.

And... let's face it, we think of simple words as simple things. Car and automobile mean the same thing even though this study would consider one to have more meaning.

Jan 26, 2011
While the placements of the words might not convey information, many of the 'small' words denote relationships between the intrinsically information carrying (i.e. long) words. Such modifiers can have huge impact on the _type_ of information conveyed without changing the _amount_ of information conveyed
(e.g. "the tire is in the trunk" vs. "the tire is on the trunk")

The information carrying capacity of words (especially the short ones) can't be judged only by probability of occurrence since they are always embedded (quite literally) in a context.

Jan 27, 2011
"And... let's face it, we think of simple words as simple things. Car and automobile mean the same thing even though this study would consider one to have more meaning."

Long ago, when I was a kid, the wonder of television revealed to my little British soul that Americans always said "automobile" when they meant "car". Nowadays I never hear the word automobile in an American movie or TV program, it is always "car".

Perhaps this change has something to do with the [apparent] fact that Americans also commonly use[d?] the word "car" when referring to railway carriages and trams [as in "Streetcar named Desire"]. The decline of railways as a people-transport medium maybe opened the way for car to displace automobile as it has.

Jan 30, 2011
Latin, a highly precise language, uses many longer words to make very subtle distinctions.
"De gustibus non disputandum est" covers a lot of territory in few words.

Feb 01, 2011

The year I studied Latin [care of some aged soul we called "Yob"] I achieved an overall mark of 36%. Luckily nobody of importance to me thought that this was of any particular importance.

On-line translators this evening seem to imply that "est" in the quote should come before "disputandum". 36% notwithstanding, I think that makes better sense too ... :-)

Feb 06, 2011
It's RISC for human language instead of processor instruction sets.
