Research team develops systems that process and understand spoken language, especially Basque

March 10, 2008

A research team drawn from the Department of Systems and Automation Engineering of the Polytechnic University School and from the Faculty of Informatics at the Donostia-San Sebastián campus of the University of the Basque Country (UPV/EHU), led by lecturer Miren Karmele Lopez de Ipiña, is developing systems that process and understand spoken language and automatically extract information, particularly from Basque radio and television.

Searching the web for written documents is an easy task: the word is simply entered into the search tool. These searches do not work, however, with the spoken word or with audio archives, unless these have an accompanying written description.

Recognising spoken language and converting it into text is not easy. Words cannot be easily distinguished from one another, intonation has to be taken into account and noise in the physical signal is a further obstacle. Because of all this, there is a huge market for systems that process and understand spoken language, i.e. systems that convert it into written text. Such systems are mainly integrated into telephone services such as appointment booking, product orders, bookings for performances, etc. There are also other applications, for example automatic dictation, i.e. systems that convert speech into written text on the spot. It is on this latter aspect that the research team at the Department of Systems and Automation Engineering at the UPV/EHU is focusing.

To process speech, the system has to be thoroughly trained, i.e. it has to be taught through a training process known as machine learning. First, television or radio audio files are needed, along with certain reference texts from the same media. The research team at the UPV/EHU, for example, frequently uses files from the Gaur Egun and Teleberri slots (Basque Television news, in Basque and Spanish respectively) to train the system. It is not necessary to know what is being said word for word; the system has to be able to produce a summary of what is heard. In the end, the system seeks to learn the relation between the words and the sounds.

Once the training process is complete, the system should be capable of understanding what is heard in any Gaur Egun or Teleberri programme. Although the learning process is very lengthy, once the system has internalised the rules or the information, i.e. suitable reference material, the result (in this case, written text from speech) is obtained rapidly.

Small and big

In reality, the majority of applications of this type on the market are aimed at the “big” languages, above all English. The research team at the Polytechnic University School in Donostia-San Sebastián, however, together with the IXA team, GTTS and the Computational Intelligence team from UPV/EHU, is working with the Basque language, Euskera. The main difference between “small” languages and “big” ones is the amount of reference data. Systems of this type for English have an impressive amount of data, while reference material for Basque is considerably scarcer. Given this, the research team is focusing on developing new techniques to make better use of this limited data and to use it with greater precision.

In order to obtain greater precision, mathematical equations are used. What is involved is locating the most important characteristics that provide useful information about the audio files. Making this selection, distinguishing useful data from unsuitable information, is not easy. Normally, the UPV/EHU research team takes frequency and intonation into consideration in order to classify all the information gathered (for example, to differentiate a question from a statement).
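
To illustrate how frequency and intonation can help separate a question from a statement, here is a minimal sketch in Python. It is not the UPV/EHU team's method, which the article does not describe. It assumes a simple autocorrelation pitch estimate and the rule of thumb that a rising pitch at the end of an utterance suggests a question. All function names, the synthetic test tones and the 10% threshold are hypothetical choices made for this example.

```python
import math

def synth_tone(f_start, f_end, duration=0.4, rate=8000):
    """Synthesise a sine sweep from f_start to f_end (Hz); a stand-in for real speech."""
    n = int(duration * rate)
    samples, phase = [], 0.0
    for i in range(n):
        freq = f_start + (f_end - f_start) * i / n
        phase += 2 * math.pi * freq / rate
        samples.append(math.sin(phase))
    return samples

def estimate_pitch(frame, rate=8000, fmin=80.0, fmax=400.0):
    """Estimate the fundamental frequency of one frame from its autocorrelation peak."""
    lag_min, lag_max = int(rate / fmax), int(rate / fmin)
    best_lag, best_score = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return rate / best_lag

def pitch_contour(samples, rate=8000, frame_len=400):
    """Pitch estimate for each consecutive 50 ms frame (at the default rate)."""
    starts = range(0, len(samples) - frame_len + 1, frame_len)
    return [estimate_pitch(samples[s:s + frame_len], rate) for s in starts]

def looks_like_question(samples, rate=8000, rise=1.1):
    """Assumed heuristic: pitch in the final third at least 10% above the first third."""
    contour = pitch_contour(samples, rate)
    third = max(1, len(contour) // 3)
    start = sum(contour[:third]) / third
    end = sum(contour[-third:]) / third
    return end > start * rise

rising = synth_tone(120, 200)   # pitch climbs, as in many questions
falling = synth_tone(200, 120)  # pitch drops, as in many statements
print(looks_like_question(rising), looks_like_question(falling))
```

Real speech is far noisier than these synthetic tones, and intonation patterns differ between languages, so a working system would learn such distinctions from data rather than rely on one fixed threshold.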

These systems depend heavily on the language, and each language has its own system. The UPV/EHU research team is working not only with Euskera but also with Spanish and French. When studying the Teleberri and Infozazpi programmes, amongst others, they have two goals: on the one hand, to comprehend Spanish and French as well as Basque and, on the other, to detect the similarities within these systems between Euskera and the other two languages, in order to train the systems in Basque even further.

In this regard, the UPV/EHU research team is currently undertaking trials to develop a system that is valid for more than one language. This is precisely the challenge for the future: to develop a system that is capable of understanding Basque, Spanish and French.

Source: Elhuyar Fundazioa
