In a far-field setting, HomePod has to overcome echo, reverberation and noise. Apple has tuned the machine learning algorithms behind its virtual assistant Siri to cope with persistent background noise.
Apple researchers explained, “A complete online system, which addresses all of the environmental issues that HomePod can experience, requires a tight integration of various multichannel signal processing technologies.”
By integrating supervised deep learning with online learning algorithms, the Audio Software Engineering team at Apple enabled Siri to filter out the noise. The system selects the optimal audio stream for the speech recognizer by using top-down knowledge from the “Hey Siri” trigger phrase detectors.
Siri is designed to work in challenging environments, whether loud music is playing or the user is far away from the device. Whatever the source of noise, Siri is supposed to recognise the talker’s instructions and carry them out.
A typical smart speaker system focuses primarily on noise suppression and de-reverberation. With deep learning, speech enhancement performance has improved significantly. However, these techniques assume that the full speech utterance is available at runtime, and the system processes all available samples in a batch, which increases latency. For HomePod, batch speech enhancement is unrealistic in trigger phrase detection mode since the acoustic conditions are unpredictable.
Add an extra source of speech, such as a TV in the same room as the talker, and far-field speech recognition becomes really difficult.
Techniques like independent component analysis and clustering do well with batches of synthetic mixtures, but they falter in far-field scenarios.
So, Apple investigated the effect of source separation on voice trigger detection for “Hey Siri”, and avoided extra latency by decoding only the target stream containing the voice command.
Siri Enhancements For HomePod
HomePod’s multichannel signal processing system follows these 2 approaches:
- Mask-based multichannel filtering using deep learning to remove echo and background noise
- Unsupervised learning to separate simultaneous sound sources and trigger-phrase based stream selection to eliminate interfering speech
The device is equipped with 6 microphones and an Apple A8 chip to carry out the multichannel signal processing even when the power levels are low.
The aim of the multichannel signal processing system is to extract one of the speech sources by removing echo, reverberation, noise, and competing talkers to improve intelligibility as illustrated in the figure below.
via Apple’s paper
Echo Cancellation And Suppression
The echo signals may be 30-40 dB louder than the far-field speech signals, resulting in the trigger phrase being undetectable on the microphones during loud music playback. In this algorithm, the Siri speech team have used a set of linear adaptive filters to model the acoustic paths between the loudspeakers and microphones for acoustic de-coupling.
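The linear adaptive filtering idea can be sketched with a single-channel normalized LMS (NLMS) filter. This is only an illustration: the article does not say which adaptation rule Apple uses, and the function name, tap count, step size and 3-tap echo path below are all assumptions made for the example.

```python
import random

def nlms_echo_cancel(playback, mic, taps=8, mu=0.5, eps=1e-8):
    """Subtract a linearly filtered copy of `playback` from `mic` using NLMS."""
    weights = [0.0] * taps   # running estimate of the loudspeaker-to-mic path
    history = [0.0] * taps   # most recent playback samples, newest first
    residual = []
    for x, d in zip(playback, mic):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        e = d - echo_estimate                      # signal left after echo removal
        norm = sum(h * h for h in history) + eps   # input power, avoids divide-by-zero
        weights = [w + mu * e * h / norm for w, h in zip(weights, history)]
        residual.append(e)
    return residual, weights

# Toy check: the microphone hears only an echo through a hypothetical 3-tap path.
random.seed(0)
true_path = [0.6, -0.3, 0.1]
playback = [random.uniform(-1.0, 1.0) for _ in range(4000)]
mic = [
    sum(true_path[k] * playback[n - k] for k in range(len(true_path)) if n - k >= 0)
    for n in range(len(playback))
]
residual, weights = nlms_echo_cancel(playback, mic)
tail_power = sum(e * e for e in residual[-500:]) / 500  # residual echo power after convergence
```

In this noise-free toy case the learned weights approach the true path and the residual power falls toward zero. A real device has several microphones and loudspeakers and a much longer, time-varying acoustic path.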
Because the loudspeakers are strongly nonlinear and the device vibrates mechanically, a linear model cannot capture the full echo signal. A residual echo suppressor (RES) is therefore needed to remove what remains.
A deep neural network takes in multiple input features and outputs an estimate of a speech activity mask that is used as an input for probabilistic determination of the presence of speech.
The mask-based echo suppression approach does well compared to others because the deep neural network is trained on actual echo recordings, so it learns to suppress the residual echo caused by loudspeaker nonlinearities and mechanical vibrations specific to HomePod.
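To make the idea of a mask concrete, here is a simplified single-channel sketch in which each frequency bin of one frame is scaled by a mask value between 0 and 1. The article describes the mask feeding a multichannel filter, so this is a stand-in and not Apple's method. The mask values, the `floor` parameter and the function name are hypothetical.

```python
def apply_mask(spectrum, mask, floor=0.1):
    """Scale each frequency bin by its mask value, clipped to [floor, 1.0]."""
    if len(spectrum) != len(mask):
        raise ValueError("spectrum and mask must have the same length")
    return [max(floor, min(1.0, m)) * s for s, m in zip(spectrum, mask)]

# Hypothetical frame: four complex frequency bins and the mask a network might emit.
frame = [1.0 + 1.0j, 0.5 - 0.2j, 2.0 + 0.0j, -0.3 + 0.4j]
mask = [0.9, 0.05, 1.0, 0.0]   # near 1 = mostly speech, near 0 = mostly echo
enhanced = apply_mask(frame, mask)
```

The gain floor is a common design choice in speech enhancement: a bin is never silenced completely, which trades a little residual echo for fewer audible artifacts.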
Dereverberation With Deep Learning
There are three types of signals a microphone captures in a typical room:
- Speech without any reflection
- Early reflections
- Late reverberation
As the talker moves farther away from the device, reflections from the walls create reverberation tails which degrade the target speech and make it unintelligible.
Siri on HomePod overcomes this challenge of late reverberation by monitoring the room characteristics continuously and preserving the high-quality information contained in the first two signals.
Along with reverberation, other factors contaminate far-field speech signals, such as noise from an air conditioner or the sound of road traffic coming through a glass window.
So, to deal with the constantly changing acoustic environment, an online noise reduction system is needed which can track environmental noise without any delays.
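One common way to track noise online is to update a per-bin noise power estimate on every frame, and to let a speech presence probability decide how strongly each frame counts. The sketch below illustrates that general idea only. The update rule, the `alpha` smoothing constant and the toy values are assumptions, and the article does not describe Apple's estimator at this level of detail.

```python
def update_noise_estimate(noise_psd, frame_power, speech_prob, alpha=0.9):
    """One online update of a per-bin noise power estimate.

    Bins that probably contain speech (speech_prob near 1) barely change the
    estimate; bins that probably contain only noise pull it toward frame_power.
    """
    updated = []
    for n, p, prob in zip(noise_psd, frame_power, speech_prob):
        smoothing = alpha + (1.0 - alpha) * prob   # reaches 1.0 when speech is certain
        updated.append(smoothing * n + (1.0 - smoothing) * p)
    return updated

noise = [1.0, 1.0]
for _ in range(100):
    # Bin 0: noise only, at power 4.0. Bin 1: loud speech, which should be ignored.
    noise = update_noise_estimate(noise, [4.0, 50.0], [0.0, 1.0])
```

After the loop, the first bin has moved close to the true noise power of 4.0, while the second bin keeps its earlier estimate because the loud frames there were attributed to speech.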
Apple’s speech software engineering team has deployed a DNN in addition to the speech probability predictors. These networks are trained on internally collected data using both diffuse and directional noises.
The input features to the network are computed from the de-reverberated signal and the reverberation estimate, and the output features are derived from the speech-plus-noise mixture and the near-end speech.
Blind source separation is an unsupervised technique used to separate multiple audio sources into individual audio streams. Separation alone is not enough, however: choosing which of the separated streams carries the user’s voice command requires top-down knowledge of that command.
So, a competing talker separation approach is required, along with the “Hey Siri” cue, to identify the target stream.
In this approach, the team deployed an unsupervised machine learning model to assist the blind source separation algorithm, which is also computationally lightweight.
The deep learning based stream selection allows the device to differentiate between quiet and noisy backgrounds, or between loud music and competing talker environments. The deep learning model assigns a goodness score to each audio stream whenever the input contains “Hey Siri”, and the stream with the highest score is selected and sent to Siri for speech recognition.
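The selection step itself reduces to picking the stream with the highest score. A minimal sketch follows, assuming the scores have already been produced by a trigger phrase detector. The `threshold` parameter and the example scores are hypothetical and do not come from Apple's system.

```python
def select_stream(scores, threshold=0.5):
    """Return the index of the stream with the highest trigger score, or None."""
    if not scores:
        return None
    best = max(range(len(scores)), key=lambda i: scores[i])
    # Reject the trigger altogether if even the best stream scores too low.
    return best if scores[best] >= threshold else None

# Hypothetical goodness scores for three separated streams.
chosen = select_stream([0.12, 0.91, 0.34])   # -> 1
```

Only the chosen stream would then be decoded by the recognizer, which is how the approach avoids the latency of decoding every stream.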
So far, the HomePod has been tested in difficult acoustic conditions such as podcast playback, rain, a vacuum cleaner, a microwave and many other environments with audio interference.
The results show a 29% improvement in the false rejection rate for “Hey Siri” detection under reverberation, echo and similar conditions.
So, combining multichannel processing with deep learning models has significantly reduced the error rate due to falsely rejected utterances.
Error rate reduction graph via Apple
The relative word error rate (WER) improvements are about 40%, 90%, 74%, and 61% in the four investigated acoustic conditions of reverberant speech only, playback, loud background noise, and competing talker, respectively, with deep learning based signal processing.
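A relative improvement expresses the drop in error rate as a fraction of the baseline, not as a difference in percentage points. The helper below shows the arithmetic. The baseline and improved values in the example are made up for illustration and are not Apple's measurements.

```python
def relative_improvement(baseline, improved):
    """Relative reduction of an error rate, in percent of the baseline."""
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return 100.0 * (baseline - improved) / baseline

# Hypothetical example: WER falls from 20.0% to 12.0%.
gain = relative_improvement(20.0, 12.0)   # -> 40.0
```

The same 8-point drop from a baseline of 10.0% would be an 80% relative improvement, which is why relative figures are only comparable when the baselines are known.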
So, in the near future, irrespective of which room you are in, Siri will be able to take notes, set alarms and play your next favourite song on demand.