Research team develops systems that process and understand spoken language, especially Basque

March 10, 2008

A research team drawn from the Department of Systems and Automation Engineering of the Polytechnic University School and from the Faculty of Informatics at the Donostia-San Sebastián campus of the University of the Basque Country (UPV/EHU), led by lecturer Miren Karmele Lopez de Ipiña, is developing systems that process and understand spoken language and automatically extract information, particularly from Basque radio and television.

Searching the web for written documents is an easy task: the word is simply entered into the search tool. However, such searches do not work with the spoken word or with audio archives unless these have an accompanying written description.

Recognising spoken language and converting it into text is not easy. Words cannot be easily distinguished from one another, intonation has to be taken into consideration and, in addition, noise in the physical signal is an obstacle. Because of all this, there is a huge market for systems that process and understand spoken language, i.e. systems that convert it into written text. Such systems are mainly integrated into telephone services such as appointment booking, product orders, bookings for performances, etc. There are also other applications, for example automatic dictation, i.e. systems that convert speech into written text on the spot. It is on this latter application that the research team at the Department of Systems and Automation Engineering at the UPV/EHU is focusing.

To process speech, the system has to be very well trained, i.e. it has to be taught through a training process known as machine learning. First, television or radio audio files are needed, along with certain reference texts from the same media. The research team at the UPV/EHU, for example, frequently uses files from the Gaur Egun and Teleberri slots (Basque Television news, in Basque and Spanish respectively) in order to train the system. It is not necessary to know what is being said word for word; the system has to be able to produce a summary of what is heard. Ultimately, the system seeks to learn the relationship between the words and the sounds.

Once the training process is complete, the system should be capable of understanding what is heard in any edition of Gaur Egun or Teleberri. Although the learning process is very lengthy, once the system has internalised the rules or the information, i.e. suitable reference material, the result (in this case, written text from speech) is obtained rapidly.

Small and big

In practice, the majority of applications of this type on the market are aimed at the “big” languages, above all English. The research team at the Polytechnic University School in Donostia-San Sebastián, however, together with the IXA team, GTTS and the Computational Intelligence team from UPV/EHU, is working with the Basque language, Euskera. The main difference between “small” languages and “big” ones is the amount of reference data. Systems of this type for English have an impressive amount of data available, while there is considerably less reference material for Basque. For this reason, the research team is focusing on developing new techniques to take better advantage of this limited data and to use it with greater precision.

In order to obtain greater precision, mathematical equations are used. The aim is to identify the most important characteristics of the audio files, those that carry useful information. Making this selection, distinguishing useful data from unsuitable information, is not easy. Normally, the UPV/EHU research team takes frequency and intonation into consideration in order to classify all the information gathered (for example, to differentiate a question from a statement, etc.).
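As a rough illustration of how frequency and intonation can separate a question from a statement, the sketch below estimates the pitch of successive audio frames and labels the contour as rising, falling or flat. It is not the UPV/EHU team's method, which the article does not describe in detail: the autocorrelation pitch estimator, the function names (`synth_tone`, `estimate_pitch`, `classify_contour`), the sample rate, the thresholds and the synthetic sine-wave input are all assumptions made for the example.

```python
import math


def synth_tone(freqs_hz, sample_rate=8000, frame_len=400):
    """Build one pure sine-wave frame per frequency (a toy stand-in for speech)."""
    return [
        [math.sin(2 * math.pi * f * n / sample_rate) for n in range(frame_len)]
        for f in freqs_hz
    ]


def estimate_pitch(frame, sample_rate=8000, fmin=75.0, fmax=400.0):
    """Estimate the fundamental frequency of a frame from its autocorrelation peak."""
    min_lag = int(sample_rate / fmax)  # shortest period considered
    max_lag = int(sample_rate / fmin)  # longest period considered
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Unnormalised autocorrelation at this lag.
        score = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


def classify_contour(pitches, tolerance=0.05):
    """Compare the mean pitch of the first and last halves of an utterance."""
    half = len(pitches) // 2
    start = sum(pitches[:half]) / half
    end = sum(pitches[-half:]) / half
    if end > start * (1 + tolerance):
        return "rising"   # question-like in many languages (an assumption)
    if end < start * (1 - tolerance):
        return "falling"  # statement-like
    return "flat"


def intonation_of(freqs_hz):
    """Synthesize frames, estimate a pitch per frame, and classify the contour."""
    frames = synth_tone(freqs_hz)
    return classify_contour([estimate_pitch(frame) for frame in frames])
```

Calling `intonation_of([100, 125, 160, 200])` feeds in a tone whose pitch climbs from frame to frame, while reversing the list gives a descending one. Real speech is far noisier than these sine waves, so a practical system would combine pitch with many other features rather than rely on a single threshold.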

These systems depend heavily on the language, and each language has its own system. The UPV/EHU research team is working not only with Euskera, but also with Spanish and French. In studying the Teleberri and Infozazpi programmes, amongst others, the team has two goals: on the one hand, to understand Spanish and French as well as Basque and, on the other, to detect the similarities within these systems between Euskera and the other two languages, in order to train the Basque systems further.

In this regard, the UPV/EHU research team is currently undertaking trials to develop a system that is valid for more than one language. This is precisely the challenge for the future: to develop a system that is capable of understanding Basque, Spanish and French.

Source: Elhuyar Fundazioa
