Unzipping Zipf's Law: Solution to a century-old linguistic problem

August 10, 2017, Radboud University
Figure 1. Zipfian distribution of the frequency (vertical axis) and the rank in the frequency table (horizontal axis) of the first hundred words of Melville's Moby Dick. The line was predicted by Zipf's law, and the dots depict the actual word frequencies in the text. Credit: Radboud University

Did you know that in every language, the most frequent word occurs about twice as often as the second most frequent word? This phenomenon, called 'Zipf's law', is more than a century old, but until now scientists have not been able to explain it fully. Sander Lestrade, a linguist at Radboud University in The Netherlands, proposes a new solution to this notorious problem in PLOS ONE.

Zipf's law describes how the frequency of a word in natural language depends on its rank in the frequency table. The most frequent word occurs twice as often as the second most frequent word, three times as often as the third, and so on down to the least frequent word (see Figure 1). The law is named after the American linguist George Kingsley Zipf, who was the first to try to explain it, around 1935.
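To make the rank-frequency relation concrete, here is a minimal Python sketch. It is not taken from the study: the function names `rank_frequency` and `zipf_prediction` and the sample sentence are assumptions made for illustration. It builds a frequency table for a text and compares each word's count with the idealized Zipf value, which is the top count divided by the rank.

```python
from collections import Counter
import re


def rank_frequency(text):
    """Return (word, count) pairs sorted from most to least frequent."""
    # Crude tokenizer (an assumption): lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common()


def zipf_prediction(top_count, rank):
    """Idealized Zipf frequency: the count of the top word divided by the rank."""
    return top_count / rank


# Made-up sample sentence, far too short to show a real Zipfian curve.
sample = "the cat and the dog and the bird saw the fox"
table = rank_frequency(sample)
top_count = table[0][1]

for rank, (word, count) in enumerate(table, start=1):
    # Columns: rank, word, observed count, idealized Zipf count.
    print(rank, word, count, round(zipf_prediction(top_count, rank), 2))
```

On a full-length text such as a novel, plotting the observed counts against the idealized values gives the kind of comparison shown in Figure 1.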

Biggest mystery in computational linguistics

"I think it's safe to say that Zipf's law is the biggest mystery in computational linguistics," says Sander Lestrade, linguist at Radboud University in Nijmegen, the Netherlands. "In spite of decades of theorizing, its origins remain elusive." Lestrade now shows that Zipf's law can be explained by the interaction between the structure of sentences (syntax) and the meaning of words (semantics) in a text. Using computer simulations, he was able to show that neither syntax nor semantics suffices to induce a Zipfian distribution on its own, but that syntax and semantics 'need' each other for that.

"In the English language, but also in Dutch, there are only three articles, and tens of thousands of nouns," Lestrade explains. "Since you use an article before almost every noun, articles occur way more often than nouns." But that is not enough to explain Zipf's law. "Within the nouns, you also find big differences. The word 'thing', for example, is much more common than 'submarine', and thus can be used more frequently. But in order to actually occur frequently, a word should not be too general either. If you multiply the differences in meaning within word classes, with the need for every word class, you find a magnificent Zipfian distribution. And this distribution only differs a little from the Zipfian ideal, just like natural language does, as you can see in Figure 1."
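The multiplication Lestrade describes can be illustrated with a toy Python sketch. This is not his simulation: the two word classes, every number, and the names `class_need`, `within_class` and `expected_shares` are invented for illustration. The sketch assumes only that a word's expected share of a text is the share of word slots its class needs, multiplied by the word's weight inside that class.

```python
# All numbers are invented for illustration; none come from the paper.
# Hypothetical share of word slots that each word class needs in a text.
class_need = {"article": 0.5, "noun": 0.5}

# Hypothetical usage weights within each class (each class sums to 1).
within_class = {
    "article": {"the": 0.6, "a": 0.4},
    "noun": {"thing": 0.5, "boat": 0.3, "submarine": 0.2},
}


def expected_shares(class_need, within_class):
    """Multiply class-level need by within-class weight and rank the words."""
    shares = {}
    for word_class, words in within_class.items():
        for word, weight in words.items():
            shares[word] = class_need[word_class] * weight
    # Highest expected share first.
    return sorted(shares.items(), key=lambda item: item[1], reverse=True)


ranked = expected_shares(class_need, within_class)
for rank, (word, share) in enumerate(ranked, start=1):
    print(rank, word, share)
```

With so few words the ranking is only suggestive. The point is that a word's position depends on both factors at once: how often its class is needed, and how widely usable the word is within that class.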

Not only are predictions based on Lestrade's new model completely consistent with phenomena found in natural language, his theory also holds for almost every language in the world, not only for English or Dutch. Lestrade: "I am overjoyed with this finding, and I am convinced of my theory. Still, its confirmation must come from other linguists."

More information: Sander Lestrade et al. Unzipping Zipf's law, PLOS ONE (2017). DOI: 10.1371/journal.pone.0181987



Spaced out Engineer
not rated yet Aug 10, 2017
Maybe it is far simpler than semantics. Perhaps it is a transactional and symbolic manipulation account. Minimal syllables for maximal differentiation. Evolution should aspire towards effective communication.

Water, way, people, to. Walk, talk, bark, spit.
So a lazy phonemic means to quickly get to the doing.
Even in a society of recreational storytelling this could work to maximize the rate of immersion.
5 / 5 (1) Aug 11, 2017
"If you multiply the differences in meaning within word classes, with the need for every word class, . . . " I am not a linguist, but have an MA in English, so I'm thinking I can easily look up "word classes" as I've come across that term before. But "the NEED? for every word class" that really has me stumped. What could it be? I can't then do the multiplication.
