What is Sound Source Localization?
As the name suggests, sound source localization means determining where a sound of interest originates.
Sound source localization can be broken down further depending on the environment in which the sound originates. Imagine someone clapping their hands in an underground garage. After the clapping stops, reflections of the sound waves linger in the room for a short time. These acoustic reflections are characteristic of a reverberant environment, in which the reflections interfere with the direct sound arriving at the listener's ears and distort the spatial cues used for sound localization. The listener may perceive the sound as farther away or closer than it really is, which adds another layer to the problem. Although humans can quickly localize sound sources in moderate reverberation, localization accuracy degrades in strongly reverberant environments. With this in mind, there is a clear need for technology that can localize sound reliably despite reverberation.
Where can we use sound source localization? Why is it important? Imagine being at a game or concert in Madison Square Garden. Everyone around you is yelling at the top of their lungs, except one person. Now, we want to find that one person. Sound source localization will help us isolate this person and determine where they are in the crowd. While this is a trivial example, many applications require sound source localization, including hearing aids, robotics, navigation for ships and self-driving cars, and surveillance.
Previous work in sound source localization has concerned the design of microphone arrays and the use of digital signal processing techniques.
These techniques can be broken up into four groups: time difference of arrival (TDOA) methods, beamforming methods, methods using high-resolution processing, and methods that need a training phase. TDOA is a technique that uses two or more receivers to locate a signal source from the different times at which the signal arrives at each receiver. In our case, the signal is a sound. Popular techniques used to estimate TDOA are the Generalized Cross-Correlation (GCC) and its derivatives, such as the Generalized Cross-Correlation with Phase Transform (GCC-PHAT) and the Cross-Power Spectrum Phase (CSP). However, these methods are defined for an environment without any reverberation, so their accuracy suffers when the sound source is in a reverberant space.
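To make the TDOA idea concrete, here is a minimal GCC-PHAT sketch in Python with NumPy. It is an illustration under stated assumptions, not a reference implementation: the function names `gcc_phat` and `tdoa_to_angle`, the two-microphone far-field geometry, and the speed of sound of 343 m/s are all choices made for this example.

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimate the delay (in seconds) of `sig` relative to `ref`.

    A positive result means `sig` arrives later than `ref`.
    """
    n = len(sig) + len(ref)  # zero-pad so the correlation is not circular
    cross = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    cross /= np.abs(cross) + 1e-12  # PHAT weighting: keep phase, drop magnitude
    cc = np.fft.irfft(cross, n=n)

    max_shift = n // 2
    if max_tau is not None:
        # Restrict the search to physically possible delays.
        max_shift = max(1, min(int(fs * max_tau), max_shift))

    # Reorder so that index `max_shift` corresponds to zero lag.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    lag = int(np.argmax(np.abs(cc))) - max_shift
    return lag / fs

def tdoa_to_angle(tau, mic_distance, c=343.0):
    """Convert a delay to an arrival angle in degrees (0 = broadside).

    Assumes two microphones and a far-field (plane wave) source.
    """
    ratio = np.clip(c * tau / mic_distance, -1.0, 1.0)
    return float(np.degrees(np.arcsin(ratio)))
```

With two microphones, a single delay only constrains the source to a direction (and leaves a front/back ambiguity), which is one reason practical systems use larger arrays.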
Beamforming, or spatial filtering, on the other hand, is a signal processing technique that combines the elements of a sensor array in such a way that signals arriving from particular angles experience constructive interference while signals from other angles experience destructive interference. Using a microphone array, beamforming helps isolate the source of the sound. The best-known beamforming approaches are the Minimum Variance Distortionless Response (MVDR) and the Linearly Constrained Minimum Variance (LCMV) methods. However, when a microphone array is faced with multiple sound sources, the TDOA and beamforming approaches often fail to find the source. Hence, the other two groups of methods were created.
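The constructive and destructive interference described above can be illustrated with the simplest beamformer, delay-and-sum. Note that this is a simpler technique than MVDR or LCMV, swapped in here only because it shows the principle in a few lines. The sketch assumes a linear array, a far-field source, and a speed of sound of 343 m/s; the function name `delay_and_sum_scan` and its parameters are hypothetical.

```python
import numpy as np

def delay_and_sum_scan(frames, mic_x, fs, angles_deg, c=343.0):
    """Scan candidate angles and return the beamformer output power for each.

    frames:     array of shape (n_mics, n_samples), one row per microphone
    mic_x:      microphone positions along the array axis, in metres
    angles_deg: candidate arrival angles in degrees (0 = broadside)
    """
    frames = np.asarray(frames, dtype=float)
    mic_x = np.asarray(mic_x, dtype=float)
    n_mics, n_samples = frames.shape

    spectra = np.fft.rfft(frames, axis=1)
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)

    powers = []
    for theta in np.deg2rad(angles_deg):
        # Delay each microphone would see for a plane wave from angle theta.
        delays = mic_x * np.sin(theta) / c
        # Undo those delays with a per-frequency phase shift, then average.
        steer = np.exp(2j * np.pi * freqs[None, :] * delays[:, None])
        beam = np.sum(spectra * steer, axis=0) / n_mics
        powers.append(np.sum(np.abs(beam) ** 2))
    return np.array(powers)
```

The angle with the highest output power is the one at which the microphone signals line up, so `angles_deg[np.argmax(powers)]` serves as a rough direction estimate for a single dominant source.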
Next, the methods using high-resolution processing, known as subspace localization methods, rely on spectral estimation and perform better than the TDOA and beamforming approaches. Common examples of subspace localization methods are Multiple Signal Classification (MUSIC), Estimation of Signal Parameters via Rotational Invariance Techniques (ESPRIT), and root-MUSIC. For reverberant environments, variants such as Recursively Applied and Projected MUSIC (RAP-MUSIC) and Self-Consistent MUSIC are further options, but they are not widely implemented.
Finally, the last approach is a reasonably recent advancement. A new method, based on the phase information of the MUSIC spectra, for localizing very closely spaced sources with a limited number of sensors has been proposed in a journal paper. However, because of its novelty, there is not much more to report, and more work needs to be conducted before its usefulness can be tested.
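As a rough illustration of how a subspace method works, the sketch below computes a narrowband MUSIC pseudo-spectrum with NumPy. It rests on several simplifying assumptions that are not from the article: a linear array, far-field narrowband sources, complex-valued snapshots, and a known number of sources. The name `music_spectrum` and its parameters are hypothetical.

```python
import numpy as np

def music_spectrum(snapshots, n_sources, mic_x, wavelength, angles_deg):
    """Return the MUSIC pseudo-spectrum over the candidate angles.

    snapshots:  complex array of shape (n_mics, n_snapshots)
    n_sources:  assumed number of sources (must be less than n_mics)
    mic_x:      microphone positions along the array axis, in metres
    wavelength: wavelength of the narrowband signal, in metres
    """
    snapshots = np.asarray(snapshots)
    mic_x = np.asarray(mic_x, dtype=float)
    n_mics, n_snap = snapshots.shape

    # Sample covariance matrix of the array output.
    cov = snapshots @ snapshots.conj().T / n_snap

    # eigh returns eigenvalues in ascending order, so the first
    # n_mics - n_sources eigenvectors span the noise subspace.
    _, eigvecs = np.linalg.eigh(cov)
    noise = eigvecs[:, : n_mics - n_sources]

    spectrum = []
    for theta in np.deg2rad(angles_deg):
        # Steering vector for a plane wave arriving from angle theta.
        a = np.exp(-2j * np.pi * mic_x * np.sin(theta) / wavelength)
        # Steering vectors of true sources are nearly orthogonal to the
        # noise subspace, so the projection is small and the peak is sharp.
        proj = np.linalg.norm(noise.conj().T @ a) ** 2
        spectrum.append(1.0 / max(proj, 1e-12))
    return np.array(spectrum)
```

Peaks in the returned spectrum indicate candidate source directions. Unlike a single cross-correlation peak, this approach can separate several simultaneous sources, provided there are more microphones than sources.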
What to Expect in the Future?
Unlike humans, the machines that use these techniques are not robust in all environments, and they can fail to find the source because they assume that the source is stationary or that the environment is non-reverberant. SONAR and RADAR are extremely useful navigation systems because transmitting to or finding vessels in a setting where reverberation is low, such as underwater, is a simple procedure. However, if SONAR or RADAR were used to find a vessel in a glass room, the results would not be promising. These limitations need to be overcome so that technology can accurately locate the origin of a sound. With the recent rise of personal digital assistants such as Google Assistant and Siri, there has been a lot of development in the field of speech and language processing. This growth is bringing about new methods to solve the source localization problem.
In this decade, machine learning is likely to help alleviate the difficulties of sound source localization in almost all environments. In particular, deep learning, a subset of machine learning, has yielded some exciting results in detecting sources with networks such as SELDnet. At the moment, however, these advancements remain extremely limited.
Written by Akhil Vasvani & Edited by Qilin Guo & Alexander Fleiss