Research team develops systems that process and understand spoken language, especially Basque

Mar 10, 2008

A research team from the Department of Systems and Automation Engineering of the Polytechnic University School and from the Faculty of Informatics at the Donostia-San Sebastián campus of the University of the Basque Country (UPV/EHU), led by lecturer Miren Karmele Lopez de Ipiña, is developing systems that process and understand spoken language and automatically extract information, particularly from Basque radio and television.

Searching the web for written documents is easy: the word is simply typed into the search tool. However, these searches do not work with speech or with audio archives unless these are accompanied by a written description.

Recognising spoken language and converting it into text is not easy. Words are not clearly separated from one another, intonation has to be taken into account and noise in the physical signal is a further obstacle. For all these reasons, there is a huge market for systems that process and understand spoken language, i.e. systems that convert it into written text. Such systems are mainly integrated into telephone services such as appointment scheduling, product orders and bookings for performances. There are other applications too, for example automatic dictation, i.e. systems that convert speech into written text on the spot. It is on this latter application that the research team at the Department of Systems and Automation Engineering at the UPV/EHU is focusing.

To process speech, the system has to be well trained, i.e. it has to be taught through a training process known as machine learning. First, television or radio audio files are needed, together with reference texts from the same media. The research team at the UPV/EHU, for example, frequently uses recordings of the Gaur Egun and Teleberri programmes (Basque Television news, in Basque and Spanish respectively) to train the system. It is not necessary to know what is being said word for word; the system has to be able to produce a summary of what is heard. Ultimately, the system seeks to learn the relation between the words and the sounds.
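As a rough illustration of what "learning the relation between the words and the sounds" can mean, the following minimal sketch trains a toy nearest-centroid recogniser. It is not the UPV/EHU team's method, which the article does not describe; the function names `train` and `recognise`, the two-dimensional feature vectors and the example words are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def train(examples):
    """Learn one average feature vector (centroid) per word.

    `examples` is a list of (features, word) pairs, where `features` is a
    list of numbers standing in for measurements of a sound. The values
    used below are invented for illustration only.
    """
    sums = {}
    counts = defaultdict(int)
    for features, word in examples:
        if word not in sums:
            sums[word] = [0.0] * len(features)
        for i, value in enumerate(features):
            sums[word][i] += value
        counts[word] += 1
    return {word: [v / counts[word] for v in total] for word, total in sums.items()}


def recognise(model, features):
    """Return the word whose centroid is closest to the given features."""
    return min(model, key=lambda word: math.dist(model[word], features))


# Hypothetical training data: two recordings each of "bai" (yes) and "ez" (no).
training_data = [
    ([1.0, 0.0], "bai"),
    ([0.8, 0.2], "bai"),
    ([0.0, 1.0], "ez"),
    ([0.2, 0.9], "ez"),
]
model = train(training_data)
print(recognise(model, [0.9, 0.1]))
```

Training is the slow part in a real system because it runs over many hours of audio; recognition afterwards only compares new input against what has already been learned.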

Once the training process is complete, the system should be capable of understanding what is said in any edition of Gaur Egun or Teleberri. Although the learning process is very lengthy, once the system has internalised the rules or the information from suitable reference material, the result, in this case written text from speech, is obtained rapidly.

Small and big

In practice, most applications of this type on the market are aimed at the “big” languages, above all English. The research team at the Polytechnic University School in Donostia-San Sebastián, together with the IXA team, GTTS and the Computational Intelligence team from the UPV/EHU, is instead working with the Basque language, Euskera. The main difference between “small” languages and “big” ones is the amount of reference data available. Systems for English can draw on an impressive amount of data, while reference material for Basque is considerably scarcer. The research team is therefore focusing on developing new techniques that make better use of this limited data and apply it with greater precision.

To obtain greater precision, mathematical equations are used. The task is to identify the most important characteristics of the audio files, those that carry useful information. Making this selection, separating useful data from irrelevant information, is not easy. Normally, the UPV/EHU research team takes frequency and intonation into account in order to classify the information gathered (for example, to differentiate a question from a statement).
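To make the idea of using frequency and intonation more concrete, here is a minimal sketch that estimates pitch frame by frame with autocorrelation and then checks whether the pitch rises or falls at the end of an utterance. It is an illustration under stated assumptions, not the team's actual feature set: the function names, the 75 to 400 Hz search range, the frame length and the 5% threshold are all hypothetical choices, and treating a final rise as a hint of a question is a crude heuristic that does not hold for every language or sentence type.

```python
import math


def estimate_pitch(frame, sample_rate, fmin=75.0, fmax=400.0):
    """Estimate the fundamental frequency (Hz) of one frame by autocorrelation.

    Returns 0.0 when no positive correlation peak is found (e.g. silence).
    """
    min_lag = int(sample_rate / fmax)
    max_lag = min(int(sample_rate / fmin), len(frame) - 1)
    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else 0.0


def pitch_contour(samples, sample_rate, frame_len=400):
    """Split the signal into non-overlapping frames and estimate pitch in each."""
    return [
        estimate_pitch(samples[start:start + frame_len], sample_rate)
        for start in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def final_trend(contour, tail=0.25, threshold=1.05):
    """Compare the average pitch of the last `tail` fraction with the rest."""
    voiced = [f for f in contour if f > 0]
    if len(voiced) < 2:
        return "unknown"
    split = min(max(1, int(len(voiced) * (1 - tail))), len(voiced) - 1)
    body = sum(voiced[:split]) / split
    end = sum(voiced[split:]) / (len(voiced) - split)
    if end > body * threshold:
        return "rising"
    if end < body / threshold:
        return "falling"
    return "level"


def tone(freq, sample_rate=8000, length=400):
    """Synthetic test signal: a pure sine wave standing in for a voiced frame."""
    return [math.sin(2 * math.pi * freq * n / sample_rate) for n in range(length)]


# A synthetic "utterance" whose pitch climbs at the end.
signal = tone(150) + tone(150) + tone(160) + tone(200)
print(final_trend(pitch_contour(signal, 8000)))
```

Real speech is far noisier than these synthetic tones, so a practical system would combine pitch with many other features rather than rely on a single threshold.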

These systems are highly language-dependent, and each language requires its own system. The UPV/EHU research team is working not only with Euskera, but also with Spanish and French. In studying the Teleberri and Infozazpi programmes, amongst others, they have two goals: on the one hand, to recognise Spanish and French as well as Basque and, on the other, to detect similarities between the systems for Euskera and those for the other two languages, in order to improve the training of the Basque systems.

In this regard, the UPV/EHU research team is currently running trials to develop a system that works for more than one language. This is precisely the challenge for the future: to develop a system capable of understanding Basque, Spanish and French.

Source: Elhuyar Fundazioa
