Engineering students fix common glitch in digitization of books published before 1700

December 28, 2015, Northwestern University
Computer scientists use machine learning to correct glitches.

Digitizing books published before 1700 has created a "black-dot problem" that is both aesthetic and practical: in the transcribed texts, the word "love," for example, can show up as "lo•e."

Taking the digital savvy of today's age one step further, Northwestern University engineering students in the McCormick School of Engineering and Applied Science have come to the rescue of the marred and sometimes indecipherable words that populate the transcribed versions of the early English texts.

Working in conjunction with undergraduates from the Weinberg College of Arts and Sciences, they designed a computer program that uses language modeling, akin to autocorrect and voice-recognition programs, to help fill in the blanks of the incomplete words.

The dots creep into the process because of the difficulties of transcribing aged texts that often are browned, splotchy and cut off at the margins. When transcribers could not read or understand a portion of a text, they replaced each unknown character with a black dot.

Since 1999, about 50,000 texts have been transcribed by the non-profit Text Creation Partnership, but those works contain roughly 5 million incomplete words. The transcriptions of the tattered books were further compromised by poor-quality scans.

Language modeling finds misspellings and "black-dot words" created when the computer encounters an unknown character. Once an error is found, nearby characters are evaluated and replacement suggestions are made, with a probability assigned to each option based on the context.

The word "lo•e" might be "love," but it also might be "lone," "lore," or "lose." A language model uses context to choose the correct option. If the context is "she was in lo•e with him," then the program assumes the missing word is, indeed, "love."
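The idea can be illustrated with a small sketch. It is not the Northwestern team's program: the tiny corpus, the function names and the bigram scoring with add-one smoothing are all assumptions chosen to keep the example short. Each black dot is treated as one unknown letter, matching vocabulary words become candidates, and each candidate gets a probability from how often it appears next to the neighboring words.

```python
import re
from collections import Counter

# Hypothetical toy corpus standing in for clean early English text.
CORPUS = """
she was in love with him and he was in love with her
the lone rider rode on
they would lose the day
old lore of the land
in love with the sea
""".split()

VOCAB = Counter(CORPUS)                     # word counts
BIGRAMS = Counter(zip(CORPUS, CORPUS[1:]))  # adjacent word-pair counts


def candidates(blackdot_word):
    """Return vocabulary words that fit, treating each dot as one unknown letter."""
    pattern = re.compile(
        "".join("[a-z]" if ch == "•" else re.escape(ch) for ch in blackdot_word)
    )
    return sorted(w for w in VOCAB if pattern.fullmatch(w))


def rank(tokens, index):
    """Rank candidates for tokens[index] by add-one smoothed bigram context."""
    prev_word = tokens[index - 1] if index > 0 else None
    next_word = tokens[index + 1] if index + 1 < len(tokens) else None
    scores = {}
    for word in candidates(tokens[index]):
        left = BIGRAMS[(prev_word, word)] + 1   # evidence from the word before
        right = BIGRAMS[(word, next_word)] + 1  # evidence from the word after
        scores[word] = left * right
    total = sum(scores.values())
    # Normalize so the candidate scores sum to 1, then sort best first.
    return sorted(
        ((word, score / total) for word, score in scores.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )


sentence = "she was in lo•e with him".split()
ranking = rank(sentence, sentence.index("lo•e"))
for word, probability in ranking:
    print(f"{word}: {probability:.2f}")
```

With this toy corpus, "love" ranks first because "in love" and "love with" both occur in the sample text, while "lone," "lore" and "lose" never appear in that context. A real system would be trained on far more text and would likely use a richer model than adjacent-word counts.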

Last summer, Weinberg students worked on the language riddles by combing through the options and selecting the correct one. Engineering students, meanwhile, have built a site where humanities scholars can search for words in different texts and fix errors on the spot. Super users then either accept or reject the corrections.

"Machine learners can also learn from that feedback," said project leader Doug Downey, associate professor of electrical engineering and computer science at the McCormick School of Engineering. "A little bit of crowdsourcing like that could go a long way. Eventually, we could have super high-quality transcriptions."

Modern readers could arguably comb through the texts and fix all the errors, but it could take several minutes for a human to fix just one error, said Martin Mueller, professor emeritus of English and classics at Northwestern. To tackle all of the errors, it would take one person years of non-stop work—an impractical, if not humanly impossible, task.

The collaboration's initial results indicate that approximately three-quarters of the incompletely or incorrectly transcribed words can be definitively corrected with a combination of machine learning and machine-assisted editing, without the need to consult the original printed text. This could drastically reduce the human-time cost from minutes to seconds per word.
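A rough back-of-the-envelope calculation shows the scale involved. The 5 million incomplete words and the three-quarters share come from the figures above; the per-word times (3 minutes by hand, 10 seconds with machine assistance) are assumptions picked only to stand in for "several minutes" and "seconds."

```python
SECONDS_PER_YEAR = 60 * 60 * 24 * 365


def nonstop_years(word_count, seconds_per_word):
    """Years of round-the-clock work needed to fix word_count words."""
    return word_count * seconds_per_word / SECONDS_PER_YEAR


INCOMPLETE_WORDS = 5_000_000   # figure reported for the transcribed corpus
MANUAL_SECONDS = 3 * 60        # assumption: "several minutes" taken as 3 minutes
ASSISTED_SECONDS = 10          # assumption: "seconds" taken as 10 seconds
ASSISTED_SHARE = 0.75          # about three-quarters are machine-correctable

manual_years = nonstop_years(INCOMPLETE_WORDS, MANUAL_SECONDS)
assisted_years = nonstop_years(INCOMPLETE_WORDS * ASSISTED_SHARE, ASSISTED_SECONDS)

print(f"All words by hand: {manual_years:.1f} years non-stop")
print(f"Machine-assisted three-quarters: {assisted_years:.1f} years non-stop")
```

Under these assumed timings, fixing every word by hand would take one person about 28.5 years without a break, while the machine-assisted three-quarters would take a little over one year. The remaining quarter would still need closer human attention.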
