"Robot audition" is a research area that was proposed by Adjunct Professor Kazuhiro Nakadai of Tokyo Institute of Technology (Tokyo Tech) and Professor Hiroshi G. Okuno of Waseda University in 2000. Until then, robots had not been able to recognize voices unless a microphone was near a person's mouth. Development of "robot ears" began advancing with the idea that robots, like humans, should hear sound with their own ears. The entry barrier for this research area was high, since it involves a combination of signal processing, robotics, and artificial intelligence. However, vigorous activities since its proposal, including the publication of open source software, culminated in its official registration as a research area in 2014 by the IEEE Robotics and Automation Society (RAS), the largest community for robot research.
The three keys for making "robot ears" a reality are:
- sound source localization technology to estimate where a sound is coming from,
- sound source separation technology to extract the sound arriving from a given direction, and
- automatic speech recognition technology to recognize the separated sounds against background noise, similar to how humans can recognize speech from across a noisy room.
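To make the first of these keys concrete, here is a minimal sketch of direction estimation with just two microphones. It is not the method used by the research team or by HARK; it swaps in a basic time-difference-of-arrival estimate found by cross-correlation, under a far-field assumption. The sample rate, microphone spacing, speed of sound, test signal, and all function names are assumptions chosen for illustration.

```python
import math
import random

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at about 20 degrees C
SAMPLE_RATE_HZ = 16_000  # assumed sample rate
MIC_SPACING_M = 0.2      # assumed distance between the two microphones


def best_lag(ref, other, max_lag):
    """Return the lag in samples at which `other` best matches `ref`.

    Both signals must have the same length. A positive lag means the
    sound reached `other` later than `ref`.
    """
    n = len(ref)
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        # Cross-correlation score for this candidate lag.
        score = sum(
            ref[i] * other[i + lag]
            for i in range(n)
            if 0 <= i + lag < n
        )
        if score > best_score:
            best, best_score = lag, score
    return best


def arrival_angle_deg(lag, sample_rate_hz, mic_spacing_m):
    """Convert a lag into an arrival angle measured from broadside."""
    delay_s = lag / sample_rate_hz
    # Clamp so rounding in the lag never pushes asin out of its domain.
    ratio = max(-1.0, min(1.0, delay_s * SPEED_OF_SOUND / mic_spacing_m))
    return math.degrees(math.asin(ratio))


# Synthetic test signal: a noise burst that reaches microphone B
# five samples after it reaches microphone A.
rng = random.Random(0)
source = [rng.uniform(-1.0, 1.0) for _ in range(200)]
TRUE_DELAY = 5
mic_a = source + [0.0] * 20
mic_b = [0.0] * TRUE_DELAY + source + [0.0] * (20 - TRUE_DELAY)

# Largest physically possible lag for this spacing and sample rate.
max_lag = math.ceil(MIC_SPACING_M / SPEED_OF_SOUND * SAMPLE_RATE_HZ)

lag = best_lag(mic_a, mic_b, max_lag)
angle = arrival_angle_deg(lag, SAMPLE_RATE_HZ, MIC_SPACING_M)
print(f"lag = {lag} samples, angle = {angle:.1f} degrees from broadside")
```

Two microphones can only resolve an angle in one plane, and cannot tell front from back. Larger arrays with more microphones are one way to reduce such ambiguities and improve robustness to noise.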
The research team pursued techniques to implement these keys in real environments and in real time. They developed technology that, like the legendary Japanese Prince Shotoku, could distinguish simultaneous speech from multiple people. Among other projects, they have demonstrated simultaneous meal ordering by 11 people and created a robot game show host that can handle multiple contestants answering simultaneously.
A system that can detect voices, mobile device sounds, and other sounds from disaster victims through the background noise of a drone has been developed to assist in faster victim recovery. This technology is the result of extreme audition research led by program manager Satoshi Tadokoro of Tohoku University.
Assistant Professor Taro Suzuki of Waseda University provided the high-accuracy point cloud map data, an outcome of his research on high-performance GPS. The members of the group performing the extreme audition research, Nakadai, Okuno, and Associate Professor Makoto Kumon of Kumamoto University, were central in developing this system, the first of its kind worldwide.
This system is made up of three main technical elements. The first is microphone array technology based on the open source robot audition software HRI-JP Audition for Robots with Kyoto University (HARK). HARK has been updated every year since its 2008 release, and had exceeded 120,000 total downloads as of December 2017. The software was extended to support embedded use while maintaining its noise robustness. Researchers then embedded this version of HARK on a drone to decrease weight and take advantage of high-speed data processing. They realized that microphone array processing could be performed inside a microphone array device attached to the drone, so it is not necessary to send all of the captured signals to a base station wirelessly. As a result, the total data transmission volume was dramatically reduced, to less than 1/100. This made it possible to detect sound sources even through the noise generated by the drone itself.
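The scale of that saving can be checked with simple arithmetic. The sketch below uses the 16-microphone array described in this article and the stated reduction to less than 1/100; the sample rate and bit depth are assumptions, since the article does not give them, and the variable names are illustrative only.

```python
MICROPHONES = 16         # from the article's description of the array
SAMPLE_RATE_HZ = 16_000  # assumption: not stated in the article
BITS_PER_SAMPLE = 16     # assumption: not stated in the article
REDUCTION_FACTOR = 100   # article: volume reduced to less than 1/100

# Bit rate needed to stream every raw channel to a base station.
raw_bits_per_second = MICROPHONES * SAMPLE_RATE_HZ * BITS_PER_SAMPLE

# Upper bound on the bit rate after on-board array processing.
budget_bits_per_second = raw_bits_per_second / REDUCTION_FACTOR

print(f"raw stream:      {raw_bits_per_second / 1e6:.3f} Mbit/s")
print(f"after reduction: under {budget_bits_per_second / 1e3:.2f} kbit/s")
```

Under these assumed settings, the wireless link would carry only the results of the array processing rather than 16 raw audio channels.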
The second element is a three-dimensional sound source location estimation technology with map display. This made it possible to construct an easily understood visual user interface out of invisible sound sources.
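One simple way to picture how an estimated sound direction becomes a point on a map is to cast a ray from the drone along that direction and find where it meets the ground. The sketch below does this for a flat ground plane, which is a deliberate simplification: the system described here uses high-accuracy point cloud map data, and its actual estimation method is not detailed in this article. The coordinate convention (x east, y north, z up, angles in degrees) and the function names are assumptions.

```python
import math


def direction_vector(azimuth_deg, elevation_deg):
    """Unit vector for a direction; negative elevation points downward."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    return (
        math.cos(el) * math.cos(az),
        math.cos(el) * math.sin(az),
        math.sin(el),
    )


def locate_on_ground(drone_pos, azimuth_deg, elevation_deg, ground_z=0.0):
    """Intersect the sound direction ray with a flat ground plane.

    Returns an (x, y, z) point, or None if the ray never reaches the plane.
    """
    x0, y0, z0 = drone_pos
    dx, dy, dz = direction_vector(azimuth_deg, elevation_deg)
    if dz == 0.0:
        return None  # ray is parallel to the ground
    t = (ground_z - z0) / dz
    if t <= 0.0:
        return None  # the plane lies behind the ray's starting point
    return (x0 + t * dx, y0 + t * dy, ground_z)


# Drone hovering 10 m up, sound arriving from 45 degrees below the horizon.
estimate = locate_on_ground((0.0, 0.0, 10.0), azimuth_deg=0.0, elevation_deg=-45.0)
print(estimate)
```

With real terrain, the flat plane would be replaced by a search along the ray for the nearest map points, and estimates from several drone positions could be combined to reduce error.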
The final element is an all-weather microphone array consisting of 16 microphones all connected by one cable for easy installation on a drone. This makes it possible to perform a search and rescue even in adverse weather.
It is generally accepted that survival probability drops drastically for victims who are not rescued within the first 72 hours after a disaster. Establishing technology for swift search and rescue has been a pressing issue.
Most existing technologies that use drones to search for disaster victims rely on cameras or similar devices. These cannot be used when victims are difficult to find or are in areas where cameras are ineffective, such as when victims are buried or in the dark, and this has been a major impediment in search and rescue operations. Since this technology detects sounds made by disaster victims, it may be able to mitigate such problems. It is expected to result in promising tools for rescue teams in the near future, as drones for finding victims needing rescue in disaster areas become widely available.
The research group will continue to improve the system, making it easier to use and more robust, through further demonstrations and experiments in simulated disaster conditions. One goal is to add functionality for classifying sound source types, instead of simply detecting them, so that relevant sound sources from victims can be distinguished from irrelevant sources. Another goal is to develop the system as a package of intelligent sensors that can be connected to various types of drones.