Bilingual avatar speaks Mundie language

Mar 10, 2012 by Nancy Owano

(PhysOrg.com) -- This week's Microsoft Big Idea event, TechFest 2012, showcased the latest advances from Microsoft researchers. A bilingual talking head received much of the attention. Called "Monolingual TTS," the Microsoft research effort involves software that can translate the user’s speech into another language and render it in a voice that sounds like the user’s own. As Microsoft explains, using only a speaker’s monolingual recording, the system's algorithm can render speech sentences in different languages for building "mixed coded bilingual text to speech (TTS) systems."

According to the team, “We have recordings of 26 languages which are used to build our TTS of corresponding languages. By using the new approach, we can synthesize any mixed language pair out of the 26 languages.”
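For a sense of scale, the number of language pairs implied by that statement can be counted directly. The sketch below assumes a "pair" means two distinct languages out of the 26; the language labels are placeholders, since the article does not list the actual languages.

```python
from itertools import combinations
from math import comb

NUM_LANGUAGES = 26  # figure quoted by the team

# Placeholder labels (assumption): the article does not name the languages.
languages = [f"lang_{i:02d}" for i in range(1, NUM_LANGUAGES + 1)]

# Unordered pairs, e.g. English + Mandarin counted once.
unordered_pairs = list(combinations(languages, 2))
assert len(unordered_pairs) == comb(NUM_LANGUAGES, 2)

# Ordered pairs, if the direction of translation matters.
ordered_pair_count = NUM_LANGUAGES * (NUM_LANGUAGES - 1)

print(len(unordered_pairs))  # 325
print(ordered_pair_count)    # 650
```

Under that assumption, 26 monolingual recordings would cover 325 mixed-language combinations, or 650 if each direction is counted separately.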

The software does this by first “learning” what the user’s voice sounds like. It then works in three stages: speech recognition, followed by translation, followed by speech output in a different language. The demo this week used an avatar of Craig Mundie, Microsoft's chief research and strategy officer, to illustrate the system in action.
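The three stages described above can be pictured as a simple chain. The following is only a minimal sketch of that ordering, not Microsoft's implementation: every name in it (VoiceProfile, speech_to_speech, and the toy_* functions) is hypothetical, and the stand-in stages do no real speech processing.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoiceProfile:
    """Hypothetical container for whatever the system learns about a voice."""
    speaker: str


def speech_to_speech(
    audio: bytes,
    profile: VoiceProfile,
    recognize: Callable[[bytes], str],
    translate: Callable[[str], str],
    synthesize: Callable[[str, VoiceProfile], str],
) -> str:
    """Chain the three stages the article names, in order."""
    source_text = recognize(audio)            # 1. speech recognition
    target_text = translate(source_text)      # 2. translation
    return synthesize(target_text, profile)   # 3. output in the speaker's voice


# Toy stand-ins (assumptions) so the sketch runs end to end.
def toy_recognize(audio: bytes) -> str:
    # Pretend the "audio" is just UTF-8 text.
    return audio.decode("utf-8")


TOY_PHRASEBOOK = {"welcome": "欢迎"}


def toy_translate(text: str) -> str:
    # Look the phrase up; fall back to the input if it is unknown.
    return TOY_PHRASEBOOK.get(text.lower(), text)


def toy_synthesize(text: str, profile: VoiceProfile) -> str:
    # A tagged string stands in for synthesized audio.
    return f"[{profile.speaker}'s voice] {text}"


result = speech_to_speech(
    b"Welcome", VoiceProfile("Mundie"),
    toy_recognize, toy_translate, toy_synthesize,
)
print(result)
```

Passing the stages in as functions keeps the ordering explicit while leaving each stage replaceable, which is the only design point the sketch is meant to make.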

A synthetic version of Mundie's voice, in English, welcomed the audience to Microsoft Research. Then the voice shifted to the same phrase in Mandarin. The Mandarin speech was reported to be recognizably in Mundie’s voice.

Craig Mundie's talking head speaks in English.

Craig Mundie's talking head speaks in Chinese.

Obvious applications span a wide range of service-related activities, from the hospitality and tourism sectors to government workers using the software with communities at home and during international travel.

"We will be able to do quite a few scenario applications," said Frank Soong, who is a principal researcher in Microsoft’s speech group. Soong helped create the system with his colleagues at Microsoft’s research lab in Beijing.

Microsoft, meanwhile, has for some time envisioned virtual avatars being used along with this kind of technology. In that vision, avatars not only look like their users, with photo-realistic effects, but also mimic their users’ voices and approximate their lip movements, putting speech translation into instant, personalized action.

Last year, Mundie was on hand at the Microsoft Research Asia facility in Beijing, where he said that the coming together of touch, vision, synthesis and recognition will be an important advancement.

“Another dream we have is that I should be able to sit in my office, send my avatar to meet somebody in Beijing, and I can speak in English and the avatar speaks in Mandarin in real-time,” he said. “We want the computer to be a simultaneous translator.”

More information: research.microsoft.com/en-us/projects/photo-real_talking_head/
via Technology Review

User comments : 16


default_ex
5 / 5 (1) Mar 10, 2012
Really Microsoft, you're behind the times. Valve has had the facial morph tech that approximates visual emotion combined with lip syncing since HL2 was released. Not only that, it's free for anyone who bought the game to play with to their heart's content. The only advance I see there is applying Microsoft's horrible SAPI to synthesize the voice. The language choice is no coincidence either; it hides a lot (though not all) of the shortcomings of Microsoft SAPI.
epsi00
5 / 5 (1) Mar 10, 2012
Nothing new from MS, wait until someone invents something and re-invent a bad version of it.
bredmond
not rated yet Mar 10, 2012
wow. his mandarin is great. i wonder how accurate it is with greater amounts of language.

these virtual avatars could become lifelong friends and life coaches. Just think of facebook and netflix connected with iphone apps and other smart devices, and with programming to help you find what you want whether it be multimedia content, study materials, news, etc. it can also identify unhealthy habits and counsel you in a way that is effective to your preferences: (computer sees the user is feeling upset, plays beethoven as per user's preferences and says: "bob, today for lunch, i have a plan. eat a banana, a smoked turkey sandwich with mustard and a slice of tomato, and a glass of milk.") anyway, i am just saying it can help monitor people's behavior and provide them with things to make their life, career and love more effective, and do so in a way that feels natural to the person.
Xynos21
not rated yet Mar 10, 2012
avatars? really? You think when I conduct a business meeting I'm gonna want to talk to someone's avatar? This stuff looks great in movies but in the real world it's impractical. What we need is something along the lines of a babel fish. The best way to market this would be through smartphone apps. Record the speaker, translate, then play back through a bluetooth. If successful then maybe you can consider avatars. left foot, right foot, left foot.
SiberskiyaGaluboy
1 / 5 (1) Mar 11, 2012
this is good invention as long as nothing gets lost in translation. definitions must be precise and grammar in both or all languages, otherwise misunderstandings may happen that could prevent good partnerships or business.
Sanescience
not rated yet Mar 11, 2012
Think of the bandwidth savings if all you needed to send was the text!
PhotonX
not rated yet Mar 11, 2012
Wow. Now I'll have someone to talk to while my Google car drives itself to work. Oh, wait, that's what cell phones are for, I guess.

@Sanescience: Made me laugh. I was just thinking a day or two ago that the old 300 baud bandwidth admonishments had died along with the BBS world. If only we all had seen the future of streaming video....

"wow. his mandarin is great. i wonder how accurate it is with greater amounts of language."

@ bredmond: Greater numbers, bredmond, greater numbers of languages, not greater amounts. Since this is an article about languages, I'll take the opportunity to nit pick on usage where I wouldn't usually do so (just kidding, rest assured). Now, everybody, feel free to pick out the usage errors in *my* post. There are at least two, I think.
PoponDex
5 / 5 (1) Mar 11, 2012
I don't get why you think this is not a good invention. Being able to talk in real time with clients in any language would be amazing.
Sonhouse
5 / 5 (1) Mar 11, 2012
Interesting that the Mandarin version sounded more natural than the English.

The English version still sounded more like Stephen Hawking.
Callippo
not rated yet Mar 11, 2012
Nothing new from MS, wait until someone invents something and re-invent a bad version of it.
.. but I do still believe, it's quite a nice idea to become a Microsoft owner...;-) It's surprisingly stable and successful company, which actually sells a software, not just Internet ads or overpriced toys.
SiberskiyaGaluboy
1 / 5 (1) Mar 11, 2012
big benefit for business, and tourists to ask for the directions if lost. also, diplomatic relations and tribal meetings benefit if translations are valid. I like this invention. it may prove very useful.
Feldagast
not rated yet Mar 11, 2012
I am still waiting for my talking computer, more like what Apple did with their iPhone. I want to be able to just tell the computer to open a Word document, tell it who I want it to go to, and just say what needs to be in the letter. Open email, write it and send it, all with voice. Then I want them to give me my flying car.
Urgelt
not rated yet Mar 11, 2012
Eh. These are gimmicks. The core features Microsoft is *not* bragging about are translation accuracy and avatar AI.

When Microsoft is ready to show off either or both, we'll be eager to hear about it.
Tausch
1 / 5 (1) Mar 12, 2012
You can learn the sounds of just one language - the human language - to which, of course, there are about 5000 'parts' - which you label erroneously 'different' 'languages'.

Aren't you glad our hearing and voice are limited in frequency range? Makes learning the sounds of one language - the human language - much easier.

Much harder to learn for us are the languages of life forms utilizing sounds of an unlimited frequency range - we will sound to them, like we are repeating ourselves endlessly.

Which combinations of sounds within our vocal cord range can not be duplicated with our voice?

This reminds everyone of the binary nature of nature. Where the 'amount' of 'zeros' and 'ones' harbors the potential to represent any 'sound' to any arbitrary precision to convey what we have acquired through evolution the label and meaning of the word:
'meaning'.
HydraulicsNath
not rated yet Mar 12, 2012
Now we can speak to the Wookies without fear of pronouncing things incorrectly.
bredmond
not rated yet Mar 14, 2012
@ bredmond: Greater numbers, bredmond, greater numbers of languages, not greater amounts.

what i mean is if he were to talk for a long time, would the translation be correct throughout the whole duration.
