September 8, 2015

Linguists use the Bible to develop language technology for small languages

If you speak English or another big language, you can talk to your mobile phone, use search engines, and get machine translation systems to do your translations for you. This has been made possible because English is a huge language with a great number of resources that linguists employ to develop language technology. People who speak Faroese, Welsh or Galician are less fortunate.

"When we develop machine translation systems and search engines, we usually feed huge amounts of manually annotated texts that contain information about the function and meaning of individual words into a computer. For historical reasons, these texts have primarily been newspaper articles in English and other big languages. We do not have access to similarly annotated texts in smaller languages like Faroese, Welsh, Galician and Irish, or even a major African language like Yoruba which is spoken by 28 million people," says Professor Anders Søgaard from the University of Copenhagen.

Anders Søgaard and his colleagues from the project LOWLANDS: Parsing Low-Resource Languages and Domains are using texts that were annotated for big languages to develop language technology for smaller languages. The key is to find translated texts, so that the researchers can transfer knowledge of one language's grammar to another language:

"The Bible has been translated into more than 1,500 languages, even the smallest and most 'exotic' ones, and the translations are extremely conservative; the verses have a completely uniform structure across the many different languages which means that we can make suitable computer models of even very small languages where we only have a couple of hundred pages of biblical text," Anders Søgaard says and elaborates:

"We teach the machines to register what is translated with what in the different translations of biblical texts, which makes it possible to find so many similarities between the annotated and unannotated texts that we can produce exact computer models of 100 different languages - languages such as Swahili, Wolof and Xhosa that are spoken in Nigeria. And we have made these models available for other developers and researchers. This means that we will be able to develop language technology resources for these languages similar to those which speakers of languages such as English and French have."

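The idea described in the quote above, carrying annotations from one language to another through aligned translations, can be illustrated with a minimal Python sketch. It is not the LOWLANDS implementation: the function name `project_tags`, the tag labels and the toy alignments are all assumptions made for illustration, and the word alignments are taken as given rather than learned from the verses. Each annotated source verse votes for the tag of every target word it is aligned to, and the most frequent vote wins.

```python
from collections import Counter


def project_tags(sources, target_length):
    """Project part-of-speech tags onto an unannotated target verse.

    sources: list of (source_tags, alignment) pairs, one per annotated
        source language, where source_tags is a list of tags and
        alignment is a list of (source_index, target_index) pairs.
    target_length: number of words in the target verse.
    Returns one tag per target word, or None where no source word aligns.
    """
    votes = [Counter() for _ in range(target_length)]
    for source_tags, alignment in sources:
        for src_i, tgt_i in alignment:
            votes[tgt_i][source_tags[src_i]] += 1
    # Majority vote per target position; unaligned words stay untagged.
    return [v.most_common(1)[0][0] if v else None for v in votes]


# Hypothetical toy data: three annotated source verses aligned to one
# four-word target verse. Tags and alignments are made up.
source_a = (["DET", "NOUN", "VERB"], [(0, 0), (1, 1), (2, 2)])
source_b = (["DET", "ADJ", "VERB"], [(1, 1)])
source_c = (["NOUN", "VERB"], [(0, 1), (1, 2)])

projected = project_tags([source_a, source_b, source_c], 4)
print(projected)
```

In this toy case the second target word receives two votes for NOUN and one for ADJ, so it is tagged NOUN, while the fourth word has no aligned source word and is left untagged. Using several source languages at once is one plausible way to smooth over individual alignment errors.
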
Anders Søgaard and his colleagues recently presented their results in the article 'If all you have is a bit of the Bible' at the prestigious Annual Meeting of the Association for Computational Linguistics.

Wikipedia as universal dictionary

The user-driven online encyclopaedia Wikipedia has also proved to be a highly useful source for the researchers, who use its texts to develop resources for languages whose speakers do not yet have access to the new language technologies. Wikipedia contains over 35 million articles, but what the researchers find interesting is that as many as 129 languages are represented by more than 10,000 articles each, and that many of these articles concern the same concepts and topics.

"This allows us to do what we call 'inverted indexing', which means that we use the concept that the Wikipedia articles are about to describe the words used in the articles on the concept in different languages. We usually use the words to describe the concept but here we do it in reverse order," Anders Søgaard explains and continues:

"If the English word 'glasses' appears in the English Wikipedia entry on Harry Potter, and the German word 'Brille' is used in the equivalent German entry, it is very likely that the two words will be represented in a similar fashion in our models which form the basis of e.g. machine translation systems. And the advantage of this model is that it can be applied to 100 different languages at the same time, including many languages that have previously been denied the language technology resources that we use every day."

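The 'glasses' and 'Brille' example can be made concrete with a small Python sketch. It is an illustration under stated assumptions, not the method from the article: the names `build_word_vectors` and `cosine`, the two toy concepts and the made-up article texts are all hypothetical. Each word is described by the concepts whose articles it appears in, so words from different languages that occur in articles on the same concept end up with similar representations.

```python
import math
from collections import defaultdict


def build_word_vectors(articles):
    """Describe each word by the concepts whose articles contain it.

    articles: {concept: {language: article_text}}
    Returns {(language, word): {concept: count}}.
    """
    vectors = defaultdict(lambda: defaultdict(int))
    for concept, versions in articles.items():
        for lang, text in versions.items():
            for word in text.lower().split():
                vectors[(lang, word)][concept] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse concept-count vectors."""
    dot = sum(u[c] * v[c] for c in u if c in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical toy "encyclopaedia": two concepts, each with an English
# and a German article. The texts are invented for illustration.
articles = {
    "Harry Potter": {
        "en": "Harry wears round glasses",
        "de": "Harry trägt eine runde Brille",
    },
    "Football": {
        "en": "Players kick the ball",
        "de": "Spieler treten den Ball",
    },
}

vectors = build_word_vectors(articles)
same_concept = cosine(vectors[("en", "glasses")], vectors[("de", "brille")])
other_concept = cosine(vectors[("en", "glasses")], vectors[("de", "ball")])
print(same_concept, other_concept)
```

Here 'glasses' and 'Brille' share a concept and score higher than 'glasses' and 'Ball', which do not. With only two articles the vectors are very coarse; the sketch only shows the direction of the mapping, from concept to word rather than from word to concept.
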
The method is described in the article 'Inverted indexing for cross-lingual NLP', which Anders Søgaard wrote together with researchers from Google London. The article was also presented at the Annual Meeting of the Association for Computational Linguistics.

More information: Annual Meeting of the Association for Computational Linguistics: aclweb.org/anthology/P15-2044

Provided by University of Copenhagen

Citation: Linguists use the Bible to develop language technology for small languages (2015, September 8) retrieved 10 May 2024 from https://phys.org/news/2015-09-linguists-bible-language-technology-small.html
