A Test in Producing a Visual Capture of Speech

June 14, 2010 By La Monica Everett-Haynes
The top row in the image to the right shows ultrasound inputs. The second row shows tongue contours drawn by a human. The third row shows the automatic system's raw output, and the bottom row shows the contours after post-processing. (Image courtesy of: Diana Archangeli and Ian Fasel)

(PhysOrg.com) -- Diana Archangeli, a UA linguistics professor, is heading up a team using ultrasound and other devices to create a technology that would enable the detection of words without auditory cues.

Simply studying how people speak will not lead to the best understanding of why some individuals have difficulty pronouncing certain words or learning a second language.

To address this challenge, a University of Arizona research team is using ultrasound and other equipment to produce technology that would map the mouth's interior to aid in analyzing exactly how sound is produced.

Diana Archangeli, a UA linguistics professor, is heading up the project, "Arizona Articulatory, Acoustic, and Visual Database." It is meant to improve what is known about how words are formed in the mouth and, thereby, advance what is known about speech.

"Learning a language involves mastering the exquisitely-timed coordination of multiple articulators - lips, tongue, velum and glottis - yet all but the lips are invisible to the language learner, hidden within the mouth," Archangeli said.

"When someone is listening to speech, there is often a lot of other noise going on," she said. "One question we are trying to figure out is what we do when the audio signal isn’t enough?"

The project has implications for different types of language research and for people learning to play certain wind instruments, and may eventually aid individuals who have had their larynx removed.

The team recently earned a $30,000 Arts, Humanities & Social Sciences Grants for Faculty grant, a UA funding program established to aid University researchers in transitioning promising projects from conception to application. Other members include Ian Fasel, an assistant research professor of computer science, and Jeff Berry and Jae Hyun Sung, both graduate students in the department.

For now, Archangeli's team is using the equipment to record lip and tongue movements, capturing video of the mouth, jaw and tongue while simultaneously recording the audio signal and measuring both vocal fold vibration and nasal airflow.

"If we want to understand how people make sound, you want to measure everything," Fasel said.

While other technologies exist to capture such images, including X-ray and magnetic resonance imaging, or MRI, the team opted to use ultrasound - positioned beneath the chin when recording data - because it is non-toxic and portable, suiting both fieldwork and classroom use.

The team's research will be fed into a new database to be called TIMIT-UA, which will expand upon TIMIT, a widely used speech recognition database that has been supported by the work of researchers at Texas Instruments and the Massachusetts Institute of Technology. The UA's contribution is the addition of ultrasound and video data.

Though the project only recently earned grant funding, the team's work already has been accepted for presentation.

Fasel co-authored a paper that has been accepted for presentation during the Computational Neuroscience Meeting to be held in July in San Antonio.

Next month, Archangeli, Fasel and Berry's collaborative research will be presented at Laboratory Phonology in Albuquerque. Also, another paper co-authored by Fasel and Berry is slated to be presented at the International Conference on Pattern Recognition, to be held in Istanbul, Turkey in August.

Paul Cohen, the UA computer science department head, said he expects the team members and their research - particularly with the database - will garner increased attention.

"I am quite pleased that this work is concerned equally with advancing our scientific understanding of human production and perception of language as it is with potential technological applications," said Cohen, who also directs the UA School of Information Sciences, Technology and Arts.

He said the project is a clear example of researchers applying advanced machine learning technologies with broad-based uses to improve knowledge in the social sciences.

For instance, it takes a skilled researcher about 20 minutes to manually trace the tongue contours in a four-second video - a taxing and time-consuming step in improving speech recognition, said Berry, a doctoral student.

But a computer program able to collect and interpret such data quickly and without human intervention - as the team intends - would be a boon, Berry said. The team intends for its finished technology to capture and trace 30 frames per second in real time.
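The team's actual system is not described in detail here, but the basic idea of automatically tracing a tongue contour in each ultrasound frame can be illustrated with a toy sketch. The code below is purely hypothetical - a minimal per-frame tracer that, for each image column, finds the topmost pixel brighter than a threshold; real ultrasound tracking uses far more sophisticated methods.

```python
import numpy as np

def trace_contour(frame, threshold=0.5):
    """Toy contour tracer: for each column of a 2-D frame of
    brightness values in [0, 1], return the row index of the
    topmost pixel at or above the threshold, or -1 if none."""
    contour = np.full(frame.shape[1], -1, dtype=int)
    for col in range(frame.shape[1]):
        bright = np.nonzero(frame[:, col] >= threshold)[0]
        if bright.size:
            contour[col] = bright[0]  # topmost bright pixel
    return contour

# A tiny synthetic "frame": a bright diagonal band on a dark background.
frame = np.zeros((8, 8))
for c in range(8):
    frame[c, c] = 1.0
print(trace_contour(frame).tolist())  # prints [0, 1, 2, 3, 4, 5, 6, 7]
```

At 30 frames per second, a real-time version of such a tracer would need to process each frame in well under 33 milliseconds.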

Fasel and Archangeli said this is the type of application that would be especially helpful in classrooms where educators are teaching language or music.

Archangeli offered as an example the two ways to articulate the "R" sound in spoken English: one in which the tongue slopes upward, the other in which the tongue bunches up, forming a dome just behind the teeth.

This, Archangeli said, may contribute to reasons why “R” is a challenging sound for some learners of English to master.

But if learners are able to see their tongues in motion, as the team's technology allows, this would help improve language acquisition, Fasel said.

"Now, an expert is required," Fasel added. "But if you had a computer system to do this, you can imagine this being in classrooms and helping students learn."

