Microsoft wins applause for tone-preserving translation (w/ Video)

Nov 10, 2012 by Nancy Owano

(Phys.org) -- Speech recognition in computers is an ongoing story, with long stretches of little progress between advances. Even programs such as Siri have inspired derisive tales of their flubs. Microsoft Chief Research Officer Rick Rashid recently presented an overview of where speech recognition at Microsoft stands today. His talk, delivered in October in Tianjin, China, at Microsoft Research Asia's 21st Century Computing event, has captured the attention of technology watchers globally, as it makes the point that progress really is on a roll. Rashid made it clear, through his summary timeline of milestones and a direct demo of text-to-speech capabilities, that the newer signs of progress are substantial and impressive.

Following the overview, he said he wanted to address the audience in Chinese, using a text-to-speech system. He showed "how we take the text that represents my speech and run it through translation. It required a text to speech system that Microsoft researchers built using a few hours' speech of a native Chinese speaker and properties of my own voice taken from about one hour of prerecorded (English) data, in this case recordings of previous speeches I'd made." The speech synthesis software was able to preserve his own cadence. The audience applauded, delighted at how much the translated speech still sounded like the original speaker. Rashid's words were turned into Chinese almost instantly, maintaining his speaking style.

In brief, the demo indicates that the technology world has taken a three-step turn, where (1) spoken English can undergo machine translation and (2) be spoken back in another language, with (3) the second-language speech retaining the speaker's cadence and tone.

This caps the last 60 years or so, during which scientists have been working to build systems that can understand what a person says when they talk. Scientists found it tough going at first because the approach they used, simple pattern matching, was imperfect. The computer would examine the waveforms produced by human speech and try to match them to waveforms associated with particular words. Everyone's voice is different, however, and even the same person can say the same word in different ways.

Another milestone came in the late 1970s, with researchers at Carnegie Mellon focusing on speech recognition using a technique that could make use of training data from many speakers to build statistical speech models. Over the years that followed, speech systems advanced more and more, thanks in part to faster computers and the ability to process more data.

Just over two years ago, he continued, researchers at Microsoft Research and the University of Toronto reported a speech-recognition breakthrough. They used a technique called Deep Neural Networks, which is patterned after the behavior of the human brain and aims to recognize sound the way the brain does. The result has been better recognition rates.

As for translation of text, capabilities have improved for translating web pages from one language to another. In Rashid's demo, he spoke in English, his words were sent through the translation system, and they were played back in Chinese. Two steps were involved. "The first takes my words and finds the Chinese equivalents, and while non-trivial, this is the easy part," he said. "The second reorders the words to be appropriate for Chinese, an important step for correct translation between languages."
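
The two steps Rashid describes can be illustrated with a toy sketch in Python. It is not Microsoft's system: the five-word lexicon, the function names, and the single reordering rule (a trailing location phrase moves in front of the verb, as Chinese word order generally requires) are all assumptions made up for this illustration. Real translation systems handle far larger vocabularies and many more reordering patterns.

```python
# Toy illustration only. The lexicon and the reordering rule below are
# hypothetical and are not taken from Microsoft's translation system.

LEXICON = {
    "i": "我",
    "eat": "吃",
    "apples": "苹果",
    "at": "在",
    "home": "家",
}


def lookup_words(english_words):
    """Step 1: replace each English word with its Chinese equivalent."""
    return [LEXICON[word.lower()] for word in english_words]


def reorder_for_chinese(words):
    """Step 2: move a trailing location phrase (在 + place) before the verb.

    Assumes a simple subject-verb-object sentence, so the subject is the
    first word and everything after it starts with the verb.
    """
    if "在" not in words:
        return words
    start = words.index("在")
    location, rest = words[start:], words[:start]
    return rest[:1] + location + rest[1:]


def translate(sentence):
    """Run both steps and join the result without spaces, as Chinese is written."""
    return "".join(reorder_for_chinese(lookup_words(sentence.split())))


print(translate("I eat apples at home"))
```

Word-for-word lookup alone would keep the English order, with the location phrase at the end. The second step moves "at home" ahead of the verb, which is the kind of reordering Rashid calls an important step for correct translation.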

Rashid said results are still not perfect. Much work remains but the technology is promising enough to raise hopes that systems to break down language barriers are years, not centuries, off.

Rashid is not the first, however, to showcase instant translation technologies. Earlier this year, Microsoft Chief Research and Strategy Officer Craig Mundie captured the imagination of the audience at TechFest 2012 when he presented a bilingual talking head. Called "Monolingual TTS," the Microsoft software was similarly able to translate the user's speech into another language, in a voice that sounded like the original user's.

The tool involved speech recognition, followed by translation, followed by a final text-to-speech output in a different language. The demo used an avatar of Mundie. A synthetic version of Mundie's voice, in English, welcomed the audience to Microsoft Research. Then the voice shifted to the same phrase in Mandarin. The words in Mandarin were reported to be recognizably in Mundie's voice. Mundie said the dream was to be able to sit in an office and send an avatar to meet somebody in Beijing, speaking in English while the avatar speaks in Mandarin, in real time. "We want the computer to be a simultaneous translator."

More information: blogs.technet.com/b/next/archi… gy.aspx#.UJ7uVs3Aerh

User comments : 5

DGBEACH
3.5 / 5 (2) Nov 11, 2012
While I can see the possible uses, I fear that it will make us even lazier as a species, as did the calculator. There is much to be said for actually learning a language. It is a sign of respect towards the one to whom we are speaking. I'd hate to think that my children will only be able to communicate in other languages once they have paid their Microsoft license fee!
Sonhouse
3.7 / 5 (3) Nov 11, 2012
Maybe you think it is laziness but it would allow speech in many languages, far more than the most ambitious linguist could ever hope to do. It will probably become like the sci fi translators that basically fit in your ear like a hearing aid the size of a pencil eraser.

It can also be a language training aid so don't poo poo this development because you think people will be lazy.

It is already getting lazy to not have to read maps now either, with GPS. Do you rail against that also? Or using calculators?

Personally I am happy to have a clerk punch in numbers to give me the correct change every time when I can see the results on the clerk's computer screen so I don't have to worry he can't add up right.
davidkajones
not rated yet Nov 15, 2012
If your audio quality is excellent, you can use the software to recognize the voice. Automatic voice recognition technology is still not mature enough to produce accurate transcription for non-American accents, for people speaking quickly, or for multiple speakers. If your audio quality is only OK, or you need verbatim transcription, it is better to go for manual transcription. For audio files with multiple speakers, it is better to go for manual audio transcription.
http://synergytranscriptionservices.com/Audio-Transcription.aspx
VendicarD
5 / 5 (1) Nov 15, 2012
Sounds like a regular voice synthesis program to me. Nothing like the speaker.
Egleton
1 / 5 (1) Nov 17, 2012
Good. First the speed of communications goes to zero, the world goes to bitcoin and now the language barrier disappears. One world government is inevitable.