Bilingual avatar speaks Mundie language

March 10, 2012 by Nancy Owano, report

This week's Microsoft Big Idea event, TechFest 2012, presented the latest advances from researchers at Microsoft. A bilingual talking head received much of the attention. Called "Monolingual TTS," the Microsoft research effort involves software that can translate the user’s speech into another language, in a voice that sounds like the original user’s. As Microsoft explains, starting from a speaker’s monolingual recording, the system's algorithm can render speech sentences in different languages for building "mixed coded bilingual text to speech (TTS) systems."

According to the team, “We have recordings of 26 languages which are used to build our TTS of corresponding languages. By using the new approach, we can synthesize any mixed language pair out of the 26 languages.”
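As a quick check on scale, the number of language pairs that 26 languages allow can be counted directly. The article does not say whether a pair is ordered (English to Mandarin counted separately from Mandarin to English), so this sketch computes both figures; the variable names are illustrative only.

```python
import math

NUM_LANGUAGES = 26  # figure quoted by the Microsoft team

# Unordered pairs: {English, Mandarin} counted once.
unordered_pairs = math.comb(NUM_LANGUAGES, 2)

# Ordered pairs: English -> Mandarin and Mandarin -> English counted separately.
ordered_pairs = math.perm(NUM_LANGUAGES, 2)

print(unordered_pairs)  # 325
print(ordered_pairs)    # 650
```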

The software does this by first “learning” what the user’s voice sounds like. The tool works by using speech recognition, followed by translation, followed by the final output in a different language. The demo this week used an avatar of Craig Mundie, Microsoft's chief research and strategy officer, to illustrate the system in action.
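The three stages described above (recognition, translation, synthesis in the speaker's voice) can be pictured as a simple chain of function calls. The following is only a toy sketch of that data flow, not Microsoft's implementation: every name in it (`VoiceProfile`, `recognize`, `translate`, `synthesize`, `speak_in`, `TOY_PHRASEBOOK`) is hypothetical, the "audio" is plain text, and the translator is a one-entry lookup table.

```python
from dataclasses import dataclass


@dataclass
class VoiceProfile:
    """Hypothetical stand-in for a voice model learned from a speaker's recordings."""
    speaker: str


@dataclass
class SynthesizedSpeech:
    """Hypothetical result of synthesis: what is said, in which language, in whose voice."""
    text: str
    language: str
    voice: VoiceProfile


# Toy translation table (assumption): a real system would use a translation model.
TOY_PHRASEBOOK = {("en", "zh"): {"welcome": "欢迎"}}


def recognize(audio: str) -> str:
    """Stage 1, speech recognition. Here the 'audio' is already text, so just normalize it."""
    return audio.strip().lower()


def translate(text: str, source: str, target: str) -> str:
    """Stage 2, translation, via the toy lookup table."""
    return TOY_PHRASEBOOK[(source, target)][text]


def synthesize(text: str, language: str, voice: VoiceProfile) -> SynthesizedSpeech:
    """Stage 3, output in the target language, tagged with the original speaker's voice."""
    return SynthesizedSpeech(text=text, language=language, voice=voice)


def speak_in(audio: str, source: str, target: str, voice: VoiceProfile) -> SynthesizedSpeech:
    """Run the three stages in the order the article describes."""
    recognized = recognize(audio)
    translated = translate(recognized, source, target)
    return synthesize(translated, target, voice)


result = speak_in(" Welcome ", "en", "zh", VoiceProfile("Craig Mundie"))
print(result.text, result.language, result.voice.speaker)
```

The point of the sketch is the ordering: the voice profile is learned once and carried through unchanged, while only the text changes language on its way to the synthesizer.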

A synthetic version of Mundie's voice, in English, welcomed the audience to Microsoft Research. Then the voice shifted to the same phrase in Mandarin. The words in Mandarin were reported to be recognizably Mundie’s voice.

Craig Mundie's talking head speaks in English.

Craig Mundie's talking head speaks in Chinese.

Some obvious applications might be in a wide range of service-related activities, from the hospitality and tourism market sectors to government workers making use of the software with communities at home and in their international travels.

"We will be able to do quite a few scenario applications," said Frank Soong, who is a principal researcher in Microsoft’s speech group. Soong helped create the system with his colleagues at Microsoft’s research lab in Beijing.

Microsoft, meanwhile, has had a vision for a while about virtual avatars being used along with this kind of technology. The vision is one where avatars not only look like their users, with photo-realistic effects, but can also successfully mimic their users’ voices and approximate their lip movements to put speech translation into instant, and personalized, action.

Last year, Mundie was on hand at the Microsoft Research Asia facility in Beijing, where he said that the coming together of touch, vision, synthesis and recognition will be an important advancement.

“Another dream we have is that I should be able to sit in my office, send my avatar to meet somebody in Beijing, and I can speak in English and the avatar speaks in Mandarin in real-time," he said. "We want the computer to be a simultaneous translator."


More information: … o-real_talking_head/
via Technology Review


5 / 5 (1) Mar 10, 2012
Really Microsoft, you're behind the times. Valve's had the facial morph tech that approximates visual emotion combined with lip syncing since HL2 was released. Not only that, it's free for anyone that bought the game to play with to their heart's content. The only advance I see there is applying Microsoft's horrible SAPI to synthesize the voice. The language choice is no coincidence either, hides a lot (though not all) of the shortcomings of Microsoft SAPI.
5 / 5 (1) Mar 10, 2012
Nothing new from MS, wait until someone invents something and re-invent a bad version of it.
not rated yet Mar 10, 2012
wow. his mandarin is great. i wonder how accurate it is with greater amounts of language.

these virtual avatars could become lifelong friends and life coaches. Just think of facebook and netflix connected with iphone apps and other smart devices, and with programming to help you find what you want whether it be multimedia content, study materials, news, etc. it can also identify unhealthy habits and counsel you in a way that is effective to your preferences: (computer sees the user is feeling upset, plays beethoven as per user's preferences and says: "bob, today for lunch, i have a plan. eat a banana, a smoked turkey sandwich with mustard and a slice of tomato, and a glass of milk.") anyway, i am just saying it can help monitor people's behavior and provide them with things to make their life, career and love more effective, and do so in a way that feels natural to the person.
not rated yet Mar 10, 2012
avatars? really? You think when I conduct a business meeting I'm gonna want to talk to someones avatar? This stuff looks great in movies but in the real world it's unpractical. What we need is something on the lines of a babel fish. The best way to market this would be through smartphone apps. Record the speaker,translate, then playback through a bluetooth. If successful then maybe you can consider avatars. left foot, right foot, left foot.
1 / 5 (1) Mar 11, 2012
this is good invention as long as nothing gets lost in translation. definitions must be precise and grammar in both or all languages, otherwise misunderstandings may happen that could prevent good partnerships or business.
not rated yet Mar 11, 2012
Think of the bandwidth savings if all you needed to send was the text!
not rated yet Mar 11, 2012
Wow. Now I'll have someone to talk to while my Google car drives itself to work. Oh, wait, that's what cell phones are for, I guess.

@Sanescience: Made me laugh. I was just thinking a day or two ago that the old 300 baud bandwidth admonishments had died along with the BBS world. If only we all had seen the future of streaming video....
wow. his mandarin is great. i wonder how accurate it is with greater amounts of language.

@ bredmond: Greater numbers, bredmond, greater numbers of languages, not greater amounts. Since this is an article about languages, I'll take the opportunity to nit pick on usage where I wouldn't usually do so (just kidding, rest assured). Now, everybody, feel free to pick out the usage errors in *my* post. There are at least two, I think.
5 / 5 (1) Mar 11, 2012
I dont get why you dont think this is not a good invention.. being able to talk in real time with clients of any language would be amazing.
5 / 5 (1) Mar 11, 2012
Interesting that the mandarin version sounded more natural than the english.

The english version sounded more like Stephen Hawking still.
not rated yet Mar 11, 2012
Nothing new from MS, wait until someone invents something and re-invent a bad version of it.
.. but I do still believe, it's quite a nice idea to become a Microsoft owner...;-) It's surprisingly stable and successful company, which actually sells a software, not just Internet ads or overpriced toys.
1 / 5 (1) Mar 11, 2012
big benefit for business, and tourists to ask for the directions if lost. also, diplomatic relations and tribal meetings benefit if translations are valid. I like this invention. it may prove very useful.
not rated yet Mar 11, 2012
I am still waiting for my talking computer, more like what apple did with their Iphone. I want to be able to just tell the computer to open a word document who I want it to go to and just say what needs to be in the letter. Open email, write it and send it all with voice. Then I want them to give me my flying car.
not rated yet Mar 11, 2012
Eh. These are gimmicks. The core features Microsoft is *not* bragging about are translation accuracy and avatar AI.

When Microsoft is ready to show off either or both, we'll be eager to hear about it.
1 / 5 (1) Mar 12, 2012
You can learn the sounds of just one language - the human language - to which, of course, there are about 5000 'parts' - which you label erroneously 'different' 'languages'.

Aren't you glad our hearing and voice are limited in frequency range? Makes learning the sounds of one language - the human language - much easier.

Much harder to learn for us are the languages of life forms utilizing sounds of an unlimited frequency range - we will sound to them, like we are repeating ourselves endlessly.

Which combinations of sounds within our vocal cord range can not be duplicated with our voice?

This reminds everyone of the binary nature of nature. Where the 'amount' of 'zeros' and 'ones' harbors the potential to represent any 'sound' to any arbitrary precision to convey what we have acquire through evolution the label and meaning of the word:
not rated yet Mar 12, 2012
Now we can speak to the Wookies without fear of pronouncing things incorrectly.
not rated yet Mar 14, 2012
@ bredmond: Greater numbers, bredmond, greater numbers of languages, not greater amounts.

what i mean is if he were to talk for a long time, would the translation be correct throughout the whole duration.
