August 14, 2008

Computer users are digitizing books quickly and accurately with Carnegie Mellon method

Millions of computer users collectively transcribe the equivalent of 160 books each day with better than 99 percent accuracy, despite the fact that few spend more than a few seconds on the task and that most do not realize they are doing valuable work, Carnegie Mellon University researchers reported today in Science Express.

They can work so prodigiously because Carnegie Mellon computer scientists led by Luis von Ahn have taken a widely used Web site security measure, called a CAPTCHA, and given it a second purpose — digitizing books produced prior to the computer age. When Web visitors solve one of the distorted-letter puzzles so they can register for email or post a comment on a blog, they simultaneously help turn the printed word into machine-readable text.

More than a year after implementing their version, called reCAPTCHA (recaptcha.net), on thousands of Web sites worldwide, the researchers conclude that their word deciphering process achieves the industry standard for human transcription services: better than 99 percent accuracy. Their report, published online today, will appear in an upcoming issue of the journal Science.

Furthermore, the amount of work that can be accomplished is herculean. More than 100 million CAPTCHAs are solved every day and, though each puzzle takes only a few seconds to solve, the aggregate amount of time translates into hundreds of thousands of hours of human effort that can potentially be tapped. During the reCAPTCHA system's first year of operation, more than 1.2 billion reCAPTCHAs were solved and more than 440 million words were deciphered. That's the equivalent of manually transcribing more than 17,600 books.
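
The figures above can be checked with some quick arithmetic. The sketch below assumes an average of 10 seconds per puzzle (the article says only "a few seconds") and derives the words-per-book ratio implied by the reported first-year totals; the variable names are illustrative only.

```python
# Back-of-the-envelope check of the figures quoted in the article.
# ASSUMPTION: 10 seconds per puzzle; the article says only "a few seconds".
SECONDS_PER_CAPTCHA = 10

captchas_per_day = 100_000_000   # "more than 100 million CAPTCHAs" a day
hours_per_day = captchas_per_day * SECONDS_PER_CAPTCHA / 3600

words_deciphered = 440_000_000   # first-year total
books_equivalent = 17_600        # first-year total
words_per_book = words_deciphered / books_equivalent

books_per_day = 160              # figure from the opening paragraph
words_per_day = books_per_day * words_per_book

print(f"Human effort per day: {hours_per_day:,.0f} hours")   # 277,778 hours
print(f"Implied words per book: {words_per_book:,.0f}")       # 25,000
print(f"Implied words per day: {words_per_day:,.0f}")         # 4,000,000
```

Under that assumption, the daily total falls in the "hundreds of thousands of hours" range, and 160 books a day at about 25,000 words per book works out to roughly 4 million words a day.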

"More Web sites are adopting reCAPTCHAs each day, so the rate of transcription keeps growing," said von Ahn, an assistant professor in the School of Computer Science's Computer Science Department. "More than 4 million words are being transcribed every day. It would take more than 1,500 people working 40 hours a week at a rate of 60 words a minute to match our weekly output."

Von Ahn said reCAPTCHAs are being used to digitize books for the Internet Archive and to digitize newspapers for The New York Times. Digitization allows older works to be indexed, searched, reformatted and stored in the same way as today's online texts.

Old texts are typically digitized by photographically scanning pages and then transforming the text using optical character recognition (OCR) software. But when ink has faded and paper has yellowed, OCR sometimes can't recognize some words — as many as one out of every five, according to the Carnegie Mellon team's tests. Without reCAPTCHA, these words must be deciphered manually at great expense.

Conventional CAPTCHAs, which were developed at Carnegie Mellon, involve letters and numbers whose shapes have been distorted or backgrounds altered so that computers can't recognize them, but humans can. To create reCAPTCHAs, the researchers use images of words from old texts that OCR systems have had trouble reading.
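
As a rough illustration of the selection step, the sketch below separates words that an OCR pass read confidently from those it had trouble with, so that the latter could be shown to people instead. It assumes a hypothetical OCR output format (a guessed word plus a confidence score between 0 and 1) and an arbitrary threshold; neither comes from the article, and this is not a description of the researchers' actual pipeline.

```python
from dataclasses import dataclass


@dataclass
class OcrWord:
    text: str          # OCR's best guess for the word
    confidence: float  # HYPOTHETICAL score in [0, 1]; not from the article


def split_by_confidence(words, threshold=0.8):
    """Separate confidently read words from those needing human review.

    The threshold of 0.8 is an arbitrary, illustrative choice.
    """
    accepted = [w for w in words if w.confidence >= threshold]
    needs_human = [w for w in words if w.confidence < threshold]
    return accepted, needs_human


# Made-up sample data for illustration only.
page = [
    OcrWord("morning", 0.97),
    OcrWord("uponn", 0.41),
    OcrWord("the", 0.99),
    OcrWord("rn0dern", 0.22),
    OcrWord("river", 0.93),
]

accepted, needs_human = split_by_confidence(page)
print([w.text for w in needs_human])
```

In this made-up sample, two of the five words fall below the threshold and would be queued for human deciphering.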

Helping to make old books and newspapers more accessible to a computerized world is something that the researchers find rewarding, but is only part of a larger goal. "We are demonstrating that we can take human effort — human processing power — that would otherwise be wasted and redirect it to accomplish tasks that computers cannot yet solve," von Ahn said.

For instance, he and his students have developed online games, available at www.gwap.com, that analyze photos and audio recordings, tasks beyond the capability of computers. Similarly, University of Washington biologists recently built Fold It (fold.it), a game in which people compete to determine the ideal structure of a given protein.

In addition to von Ahn, authors of the new report include computer science undergraduate Benjamin Maurer, graduate students Colin McMillen and David Abraham, and Manuel Blum, professor of computer science.

Source: Carnegie Mellon University

Citation: Computer users are digitizing books quickly and accurately with Carnegie Mellon method (2008, August 14) retrieved 21 September 2024 from https://phys.org/news/2008-08-users-digitizing-quickly-accurately-carnegie.html

