Wiki ranking: Bayesian statistics can score Wikipedia entries

August 6, 2014, Inderscience Publishers

Wikipedia, the free online collaborative encyclopedia, is an important source of information. However, while its team of volunteer editors endeavors to maintain high standards, there are occasional problems with the veracity of content, deliberate vandalism and incomplete entries. Writing in the International Journal of Information Quality, computer scientists in China describe a software algorithm they have devised that can automatically check a particular entry and rank it according to quality.

Jingyu Han and Kejia Chen of Nanjing University of Posts and Telecommunications explain that the quality of data on Wikipedia has for many years been a focus of user attention. Its detractors suggest that it can never be a valid information source in the way that a proprietary encyclopedia might be, because the contributors and editors are not under the direct control of a single publisher with a vested interest in . Its supporters suggest that the social nature of contributions and edits, together with the online tracking of changes, is one of Wikipedia's greatest strengths rather than a weakness.

Nevertheless, it would quiet the detractors if there were a way to quantify the quality of Wikipedia entries in an objective and automated manner. Now, Han and Chen have turned to Bayesian statistics to help them create just such a system. The notion of weighing evidence through an analysis of probabilities was first described by the 18th-century mathematician and theologian Thomas Bayes. Pierre-Simon Laplace then used Bayesian probabilities to pioneer a new statistical method. Today, Bayesian analysis is commonly used to assess the content of emails: it estimates the probability that a message is spam, or junk mail, and filters the message from the user's inbox if that probability is high.
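The spam-filtering idea can be made concrete with a small worked example of Bayes' rule. The sketch below is only an illustration and is not taken from Han and Chen's paper: the function name, the probabilities and the filtering threshold are all made-up assumptions.

```python
def posterior_spam(p_word_given_spam, p_word_given_ham, prior_spam):
    """Bayes' rule: probability that a message is spam, given that it contains a word.

    All three inputs are probabilities between 0 and 1. "Ham" means legitimate mail.
    """
    joint_spam = p_word_given_spam * prior_spam
    joint_ham = p_word_given_ham * (1.0 - prior_spam)
    return joint_spam / (joint_spam + joint_ham)


# Made-up numbers for illustration: the word appears in 80% of spam,
# in 10% of legitimate mail, and half of all incoming mail is spam.
p = posterior_spam(0.8, 0.1, 0.5)

# Hypothetical policy: filter the message when the posterior exceeds a threshold.
should_filter = p > 0.85
```

With these assumed inputs the posterior works out to 0.4 / (0.4 + 0.05), or about 0.89, so the message would be filtered. A real filter combines evidence from many words rather than one.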

Han and Chen have now used a dynamic Bayesian network (DBN) to analyze the content of Wikipedia entries in a similar manner. They apply multivariate Gaussian distribution modeling to the DBN analysis, which gives them a distribution of the quality of each article so that entries can be ranked. Very low-ranking entries might be flagged for editorial attention to raise their quality. By contrast, high-ranking entries could be marked in some way as definitive so that they are not subsequently overwritten with lower-quality information.
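To give a feel for how a multivariate Gaussian can turn article features into a ranking, here is a much simplified sketch. It is not the authors' method: it omits the dynamic Bayesian network entirely and simply fits one two-dimensional Gaussian to examples of each quality class, then scores an article by how much better the high-quality model explains it. The two features (log10 of word count and number of references), the training values and all function names are hypothetical.

```python
import math


def fit_gaussian_2d(points):
    """Maximum-likelihood mean and covariance of 2-D feature vectors."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    var_x = sum((p[0] - mean_x) ** 2 for p in points) / n
    var_y = sum((p[1] - mean_y) ** 2 for p in points) / n
    cov_xy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / n
    return (mean_x, mean_y), (var_x, cov_xy, var_y)


def log_density(x, mean, cov):
    """Log of the 2-D Gaussian density at x, using the closed-form 2x2 inverse."""
    var_x, cov_xy, var_y = cov
    det = var_x * var_y - cov_xy ** 2
    dx = x[0] - mean[0]
    dy = x[1] - mean[1]
    quad = (var_y * dx * dx - 2.0 * cov_xy * dx * dy + var_x * dy * dy) / det
    return -0.5 * quad - math.log(2.0 * math.pi) - 0.5 * math.log(det)


# Made-up training data: (log10 of word count, number of references).
HIGH_EXAMPLES = [(3.0, 40.0), (3.2, 55.0), (2.9, 35.0), (3.4, 60.0)]
LOW_EXAMPLES = [(1.0, 2.0), (1.5, 5.0), (1.2, 1.0), (0.8, 3.0)]
HIGH_MODEL = fit_gaussian_2d(HIGH_EXAMPLES)
LOW_MODEL = fit_gaussian_2d(LOW_EXAMPLES)


def quality_score(features):
    """Log-likelihood ratio: positive means the high-quality model fits better."""
    return log_density(features, *HIGH_MODEL) - log_density(features, *LOW_MODEL)


def rank_articles(articles):
    """Return article titles sorted from highest to lowest quality score."""
    return sorted(articles, key=lambda title: quality_score(articles[title]), reverse=True)
```

Calling `rank_articles({"Article A": (3.1, 50.0), "Article B": (1.1, 2.0)})` places Article A first, since its features lie close to the high-quality examples. Under this toy scheme, articles with strongly negative scores would be the ones flagged for editorial attention.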

The team has tested its algorithm on sets of several hundred articles, comparing the computer's automated quality assessment with assessment by a human user. The algorithm outperforms a human user by up to 23 percent in correctly classifying the quality rank of a given article in the set, the team reports. A computerized system that provides a quality standard for Wikipedia entries would avoid the need for people to classify each entry subjectively. It could thus raise standards and provide a basis for an improved reputation for the online encyclopedia.


More information: Han, J. and Chen, K. (2014) 'Ranking Wikipedia article's data quality by learning dimension distributions', Int. J. Information Quality, Vol. 3, No. 3, pp.207.


3 / 5 (2) Aug 06, 2014
Can be very misleading in some cases, giving a high quality score to completely wrong thinking shared by a very large number of people!
I have written in Wikipedia, and I can show scientific errors that have remained for many years. My text was steadily censored and modified by lobby writers, so I gave up the war against them to remove the error!
Anything true but disturbing for lobbies is censored in Wikipedia!

3 / 5 (2) Aug 06, 2014
read how wikipedia can be misleading :
not rated yet Aug 07, 2014
One problem (depending on your perspective) with reducing the scoring process to an algorithm is that you now give people who would "optimize" the ranking of their work a simple way to do so. We would see a bias towards those who strongly wish us to think what they want us to and have no ethics about gaming the system. Sometimes that might coincide with truth and accuracy.
