Researchers develop more comprehensive acoustic scene analysis method

April 10, 2018, Chinese Association of Automation
Sound predictions produced by the improved method developed by an international team of researchers. Credit: IEEE/CAA Journal of Automatica Sinica

Researchers have demonstrated an improved method for machines to analyze the audio of our noisy world. Their approach hinges on combining scalograms and spectrograms, two visual representations of audio, with convolutional neural networks (CNNs), the learning tool machines use to analyze visual images. In this case, the visual images are used to analyze audio so that sounds can be identified and classified more accurately.
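
To make the idea concrete, here is a minimal sketch of the first step: turning raw audio into an image a CNN can consume. This is not the authors' code; it assumes the librosa and Pillow libraries, and "audio.wav" is a placeholder file name.

    import numpy as np
    import librosa
    from PIL import Image

    # Load a mono audio clip ("audio.wav" is a placeholder file name).
    y, sr = librosa.load("audio.wav", sr=22050, mono=True)

    # Compute a mel spectrogram and convert power to decibels (log scale).
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
    mel_db = librosa.power_to_db(mel, ref=np.max)

    # Rescale to 0-255 so the spectrogram can be saved as a grayscale image.
    img = (255 * (mel_db - mel_db.min()) / (mel_db.max() - mel_db.min())).astype(np.uint8)
    Image.fromarray(img[::-1]).save("spectrogram.png")  # flip so low frequencies sit at the bottom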

The team published their results in the journal IEEE/CAA Journal of Automatica Sinica (JAS), a joint publication of the IEEE and the Chinese Association of Automation.

"Machines have made great progress in the analysis of speech and music, but general sound analysis has been lagging a big behind—usually, mostly isolated sound 'events' such as gun shots and the like have been targeted in the past," said Björn Schuller, a professor and chair of Embedded Intelligence for Health Care and Wellbeing at the University of Augsburg in Germany, who led the research. "Real-world audio is usually a highly blended mix of different sound sources—each of which have different states and traits."

Schuller points to the sound of a car as an example. It is not a single audio event; rather, the car's individual parts, its tires interacting with the road, and even its brand and speed all provide their own unique signatures.

"At the same time, there may be music or speech in the car," said Schuller, who is also an associate professor of Machine Learning at Imperial College London, and a visiting professor in the School of Computer Science and Technology at the Harbin Institute of Technology in China. "Once computers can understand all parts of this 'acoustic scene', they will be considerably better at decomposing it into each part and attribute each part as described."

Spectrograms provide a visual representation of acoustic scenes, but they have a fixed time-frequency resolution: a single analysis window sets one trade-off, for the whole signal, between how precisely the representation captures when a sound occurs and at what frequency. Scalograms, which are computed with wavelet transforms over many scales, adapt that trade-off across frequencies and therefore offer a more detailed visual representation of acoustic scenes; overlapping sounds such as the music, speech, and engine noise in a car can be represented more faithfully.
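
The difference can be seen by computing both representations for the same signal. The sketch below is only an illustration: it assumes SciPy and the PyWavelets package, and the Morlet wavelet and scale range are arbitrary choices, not taken from the paper.

    import numpy as np
    import pywt
    from scipy import signal

    sr = 8000  # sample rate in Hz (illustrative)
    t = np.linspace(0, 2, 2 * sr, endpoint=False)
    # Test signal: a steady low tone plus a brief high-frequency chirp at the end.
    x = np.sin(2 * np.pi * 100 * t) + (t > 1.5) * signal.chirp(t, 500, 2, 2000)

    # Spectrogram: one window length fixes the time-frequency trade-off everywhere.
    f, times, Sxx = signal.spectrogram(x, fs=sr, nperseg=256)

    # Scalogram: the continuous wavelet transform analyzes many scales at once,
    # giving fine time resolution at high frequencies and fine frequency
    # resolution at low frequencies.
    scales = np.arange(1, 128)
    coefs, freqs = pywt.cwt(x, scales, "morl", sampling_period=1 / sr)
    scalogram = np.abs(coefs)

    print(Sxx.shape, scalogram.shape)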

"There are usually multiple sounds happening in one scene so... there should be multiple frequencies and they change with time," said Zhao Ren, an author on the paper and a Ph.D. candidate at the University of Augsburg who works with Schuller. "Fortunately, scalograms could solve this problem exactly since it incorporates multiple scales."

"Scalograms can be employed to help spectrograms in extracting features for acoustic scene classification," Ren said, and both spectrograms and scalograms need to be able to learn to continue improving.

"Further, pre-trained build a bridge between [the] image and audio processing."

The pre-trained neural networks the authors used are convolutional neural networks (CNNs). CNNs are inspired by how neurons work in the animal visual cortex, and they can be used to process visual imagery with great success. Such networks are crucial in machine learning; in this case, they extract informative features from the scalogram and spectrogram images.
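
As a rough sketch of that bridge, an image-trained CNN can serve as a fixed feature extractor for scalogram images. The example below assumes PyTorch and torchvision and uses an ImageNet-pretrained VGG16; the article does not say which architecture the authors chose, so this is an illustrative stand-in, and "scalogram.png" is a placeholder file.

    import torch
    import torchvision.models as models
    import torchvision.transforms as T
    from PIL import Image

    # An ImageNet-pretrained VGG16 (one common choice; the paper's network may differ).
    cnn = models.vgg16(pretrained=True)
    cnn.eval()

    # Standard ImageNet preprocessing; the grayscale scalogram is replicated
    # to three channels so the image network can accept it.
    preprocess = T.Compose([
        T.Resize((224, 224)),
        T.Grayscale(num_output_channels=3),
        T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    img = Image.open("scalogram.png")  # placeholder file produced elsewhere
    x = preprocess(img).unsqueeze(0)   # add a batch dimension

    # Use the convolutional trunk as a fixed feature extractor.
    with torch.no_grad():
        feats = torch.flatten(cnn.features(x), start_dim=1)
    print(feats.shape)  # e.g. torch.Size([1, 25088])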

CNNs receive some training before they are applied to a scene, but they mostly learn from exposure. By learning sounds from a combination of different frequencies and scales, the algorithm can better predict the sound sources and, eventually, the implications of an unusual noise, such as a car engine malfunction.
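
One plausible way to combine the two representations for classification, not spelled out in the article, is late fusion: concatenating the deep features extracted from the spectrogram and scalogram images and training a simple classifier on the result. In the sketch below, the random arrays merely stand in for real CNN features, and scikit-learn's linear SVM is an illustrative choice of classifier.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)

    # Stand-ins for CNN features of 200 clips from 5 acoustic scene classes;
    # in practice these vectors would come from a feature extractor as above.
    spec_feats = rng.normal(size=(200, 512))
    scal_feats = rng.normal(size=(200, 512))
    labels = rng.integers(0, 5, size=200)

    # Late fusion: concatenate the two feature vectors for each clip.
    fused = np.hstack([spec_feats, scal_feats])

    X_train, X_test, y_train, y_test = train_test_split(
        fused, labels, test_size=0.25, random_state=0)

    clf = make_pipeline(StandardScaler(), LinearSVC())
    clf.fit(X_train, y_train)
    print("held-out accuracy:", clf.score(X_test, y_test))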

"The ultimate goal is machine hearing/listening in a holistic fashion... across speech, music, and sound just like a human being would," Schuller said, noting that this would combine with the already advanced work in speech analysis to provide a richer and deeper understanding, "to then be able to get 'the whole picture' in the audio."


More information: Zhao Ren et al., "Deep Scalogram Representations for Acoustic Scene Classification," IEEE/CAA Journal of Automatica Sinica (2018). DOI: 10.1109/JAS.2018.7511066

