April 24, 2019

OCR4all: Modern tool for old texts

Page from a French version of the "Narrenschiff" (Ship of Fools). Such old fonts can be reliably converted into computer-readable text with OCR4all. Credit: Dresden State and University Library, CC BY-SA 4.0

Historians and other humanities scholars often have to deal with challenging research objects: centuries-old printed works that are hard to decipher and often in a poor state of conservation. Many of these documents have now been digitized, usually by photographing or scanning them, and are available online worldwide. For research purposes, this is already a step forward.

However, there is still a challenge to overcome: using text recognition software to bring the digitized old fonts into a modern form that is readable for non-specialists as well as for computers. Scientists at the Center for Philology and Digitality at Julius-Maximilians-Universität Würzburg (JMU) in Bavaria, Germany, have made a significant contribution to further development in this field.

With OCR4all, the JMU research team is making a new tool available to the scientific community. It converts digitized historical prints into computer-readable text with an error rate of less than one percent. It also offers a graphical user interface that requires no IT expertise. With previous tools of this kind, user-friendliness was not always a given, as users mostly had to work with programming commands.

Developed in cooperation with the humanities

The new OCR4all tool was developed under the direction of Christian Reul, together with his computer science colleagues Professor Frank Puppe (Chair of Artificial Intelligence and Applied Computer Science) and Christoph Wick, as well as Uwe Springmann (Digital Humanities expert) and numerous students and assistants.

OCR4all originates from the JMU Kallimachos project, which is funded by the German Federal Ministry of Education and Research. This cooperation between the humanities and computer science will be continued and institutionalized in the newly founded JMU Center for Philology and Digitality.

In developing OCR4all, computer scientists have collaborated with the humanities at JMU—including German and Romance studies and literature studies in the project "Narragonien digital." The aim was to digitize the "Narrenschiff," a moral satire by Sebastian Brant, a bestseller of the 15th century that was translated into many languages. Furthermore, OCR4all has been frequently used in the JMU's Kolleg "Medieval and Early Modern Times."

OCR4all is freely available to the public on the GitHub platform (with instructions and examples): https://github.com/OCR4all

Each print shop had its own font

Christian Reul explains the challenges involved in the development of OCR4all: Automatic text recognition (OCR = Optical Character Recognition) has been working very well for modern fonts for some time now. However, this has not yet been the case for historical fonts.

"One of the biggest problems was typography," says Reul. One of the reasons for this is that the first printers of the 15th century did not use uniform fonts. "Their printing stamps were all carved by themselves, each printing house practically had its own letters."

Error rates below one percent

Whether "e" or "c," whether "v" or "r": such letters are often hard to tell apart in old prints, but software can learn to recognize these subtleties. To do so, it has to be trained on sample material. In his work, Reul has developed methods to make training more efficient. In a case study with six historical prints from the years 1476 to 1572, the average error rate in automatic text recognition was reduced from 3.9 to 1.7 percent.
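The article does not say how the error rate is measured. A common convention in OCR evaluation, assumed here, is the character error rate: the edit distance between the recognized text and a manually checked transcription, divided by the length of that transcription. The following Python sketch illustrates this convention; the function names and the sample strings are illustrative and are not taken from OCR4all.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


def character_error_rate(ocr_output: str, ground_truth: str) -> float:
    """Edit distance relative to the length of the reference text
    (assumed definition; the article does not specify one)."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ocr_output, ground_truth) / len(ground_truth)


def relative_reduction(before: float, after: float) -> float:
    """Share of the original error rate that was eliminated."""
    return (before - after) / before


if __name__ == "__main__":
    # Hypothetical example: the OCR engine confuses "e" with "c".
    print(character_error_rate("narrcnschiff", "narrenschiff"))
    # Figures from the case study: 3.9 percent down to 1.7 percent.
    print(relative_reduction(3.9, 1.7))
```

Under this assumed definition, a single confused letter in a twelve-letter word is already an error rate of about 8 percent, and the drop from 3.9 to 1.7 percent reported for the case study means that roughly 56 percent of the remaining errors were eliminated.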

Not only was the methodology improved; JMU computer scientist Christoph Wick has also decisively refined the technical component by developing the Calamari OCR tool, which is also freely available and has since been fully integrated into OCR4all, promising even better results. Now, error rates of less than one percent can generally be achieved even for the oldest printed works.

Lexical projects

Reul has also convinced external partners of the quality of Würzburg's OCR research. In cooperation with the "Zentrum für digitale Lexikographie der deutschen Sprache" (Berlin), Daniel Sanders' "Wörterbuch der deutschen Sprache" (Dictionary of the German Language) has been digitally indexed, and a scientific publication on this work is currently being prepared. The various lines of this text often contain different fonts, representing different semantic information. Here, the existing approach to character recognition was extended so that not only the text but also the typography, and thus the complex content structure of the lexicon, can be reproduced very precisely.

The computer scientist from Würzburg will soon complete his doctoral thesis, but he wants to continue working with OCR in the future: "The computer science behind OCR is extremely exciting," he says. A possible project in the near future: the creators of the "Idiotikon," a dictionary of the Swiss-German language, have indicated their interest in a collaboration, since they might well need the Würzburg team's specialist knowledge.

More information: github.com/OCR4all
github.com/Calamari-OCR

jlcl.org/content/2-allissues/1 … 18/jlcl_2018-1_1.pdf
jlcl.org/content/2-allissues/1 … 18/jlcl_2018-1_4.pdf

Provided by University of Würzburg

Citation: OCR4all: Modern tool for old texts (2019, April 24) retrieved 3 July 2024 from https://phys.org/news/2019-04-ocr4all-modern-tool-texts.html

This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no part may be reproduced without the written permission. The content is provided for information purposes only.
