
Behind Hey Siri: How Apple’s AI-Powered Personal Assistant Uses DNN


“Hey Siri” is the catchphrase you use on your Apple devices when you feel helpless or bored. Siri is a built-in personal assistant, introduced in 2011 on the iPhone. Siri can help the user with tasks such as getting information from the internet, scheduling events, setting timers and making phone calls, among other things.

Astonishingly, Siri is triggered by a speech recognition unit in the phone that runs in the background all the time. This detector uses a Deep Neural Network (DNN) to convert the acoustic pattern of your voice into a probability distribution over speech sounds. A process called temporal integration then computes a confidence score for whether your voice contained the words “Hey Siri”. If the score is high enough, Siri is activated; otherwise it isn’t. This article gives a brief look into the machine learning aspects behind Siri.

Behind the scenes

How Siri works. Image courtesy: Apple

The ability to invoke Siri without using your hands is a large part of what makes it interesting and popular. As shown in the figure above, the critical components are the voice detection hardware in the phone and the cloud servers, which host the main automatic speech recognition, the natural language interpretation and other information services. The on-device detector works in tandem with these servers, and the detected voice patterns are regularly checked and updated on the server side.

DNN And The Hardware

The microphone in an iPhone or other Apple products such as the iPad, iPod Touch and Apple Watch turns the detected voice into a stream of instantaneous waveform samples, created at a rate of 16,000 per second. A spectrum analysis stage converts this sample stream into a sequence of frames, each describing the sound spectrum of roughly 0.01 seconds of audio. About 20 of these frames at a time (0.2 seconds of audio) are fed to the acoustic model, a Deep Neural Network (DNN), which converts each window of acoustic patterns into a probability distribution over a set of speech sound classes: those used in the “Hey Siri” phrase, among other voice patterns, for a total of close to 20 sound classes categorised by Apple.
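
As a rough illustration of this front end, the sketch below splits a 16 kHz waveform into 0.01-second frames, takes a simple magnitude spectrum of each frame (standing in for Apple's real filter-bank analysis, whose details are not public) and groups the frames into overlapping 20-frame windows. All names and sizes here are illustrative assumptions, not Apple's implementation.

```python
import numpy as np

SAMPLE_RATE = 16_000            # samples per second, as described above
FRAME_LEN = SAMPLE_RATE // 100  # 0.01 s of audio -> 160 samples per frame
WINDOW = 20                     # ~0.2 s of audio fed to the acoustic model at a time

def frame_spectra(samples: np.ndarray) -> np.ndarray:
    """Split a 16 kHz waveform into 10 ms frames and take a magnitude spectrum
    of each frame (a stand-in for the real filter-bank front end)."""
    n_frames = len(samples) // FRAME_LEN
    frames = samples[: n_frames * FRAME_LEN].reshape(n_frames, FRAME_LEN)
    return np.abs(np.fft.rfft(frames, axis=1))  # shape: (n_frames, FRAME_LEN // 2 + 1)

def windows_for_acoustic_model(spectra: np.ndarray):
    """Yield stacks of 20 consecutive frames (~0.2 s) for the DNN."""
    for t in range(WINDOW, spectra.shape[0] + 1):
        yield spectra[t - WINDOW : t]

# one second of (silent) audio -> 100 frames -> 81 overlapping 20-frame windows
audio = np.zeros(SAMPLE_RATE)
spectra = frame_spectra(audio)
print(spectra.shape, sum(1 for _ in windows_for_acoustic_model(spectra)))
```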

The DNN consists mostly of matrix multiplications and logistic nonlinearities. Each hidden layer learns an intermediate representation during training that helps convert the filter bank inputs into sound classes. The final nonlinearity is a Softmax function (also known as a general logistic or normalised exponential), and its outputs are kept as logarithmic rather than linear probabilities, which makes the later score computation easier.
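
Below is a minimal sketch of what such an acoustic model's forward pass might look like, assuming five hidden layers of 128 sigmoid units and roughly 20 output sound classes, as described in the next paragraph. The random weights, input dimension and helper names here are purely illustrative and not Apple's actual parameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def log_softmax(x):
    # normalised exponential in the log domain: log p_k = x_k - log(sum_j exp(x_j))
    x = x - x.max()                      # shift for numerical stability
    return x - np.log(np.exp(x).sum())

def build_random_dnn(n_inputs, n_hidden=128, n_layers=5, n_classes=20, seed=0):
    """Random weights stand in for trained parameters; only the shape of the
    computation (matrix multiplications + logistic nonlinearities) matters here."""
    rng = np.random.default_rng(seed)
    sizes = [n_inputs] + [n_hidden] * n_layers + [n_classes]
    return [(rng.normal(scale=0.1, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(layers, x):
    """Return log-probabilities over the ~20 sound classes for one stacked window."""
    for W, b in layers[:-1]:
        x = sigmoid(x @ W + b)           # hidden layers: matmul + logistic nonlinearity
    W, b = layers[-1]
    return log_softmax(x @ W + b)        # output layer: softmax, kept as log probabilities

layers = build_random_dnn(n_inputs=20 * 81)   # 20 frames of 81 spectral bins, as in the sketch above
window = np.zeros(20 * 81)
print(forward(layers, window).shape)          # (20,) log scores, one per sound class
```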

Neural network structure. Image courtesy: Apple

The networks Apple uses typically have five hidden layers, all of the same size: 32, 128 or 192 units, depending on memory, power and hardware constraints. On an iPhone, there are two networks behind this functionality: one for initial detection and a secondary checker. The DNN outputs are matched against the sequence of phonetic classes that make up the phrase (for example, the ‘s’ sound in “Siri” followed by the vowel ‘i’). To ascertain whether the acoustic pattern matches the “Hey Siri” phrase, the per-frame scores are accumulated over the sequence using the function given below

F_{i,t} = max{ s_i + F_{i,t-1}, m_{i-1} + F_{i-1,t-1} } + q_{i,t}

where

  • F_{i,t} is the accumulated score for state i of the model
  • q_{i,t} is the output of the acoustic model: the log score for the phonetic class associated with the i-th state, given the acoustic pattern around time t
  • s_i is a cost associated with staying in state i
  • m_i is a cost for moving on from state i

The ‘s’ and ‘m’ terms are costs derived from acoustic analysis of the relevant training data.
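
The sketch below implements this recurrence directly. The number of states, the q scores and the s and m costs are placeholder values, since the real ones come from Apple's trained model and internal analysis.

```python
import numpy as np

def accumulate_scores(q, s, m):
    """Temporal integration of per-frame acoustic scores.

    q[t, i] : log score for the phonetic class of state i at time t (from the DNN)
    s[i]    : cost of staying in state i for another frame
    m[i]    : cost of moving on from state i to state i + 1
    Returns F[t, i], the best accumulated score for being in state i at time t.
    """
    T, N = q.shape
    F = np.full((T, N), -np.inf)
    F[0, 0] = q[0, 0]                                       # a path must start in state 0
    for t in range(1, T):
        for i in range(N):
            stay = s[i] + F[t - 1, i]
            move = m[i - 1] + F[t - 1, i - 1] if i > 0 else -np.inf
            F[t, i] = max(stay, move) + q[t, i]
    return F

# toy example: 50 frames, 6 phonetic states, placeholder costs
rng = np.random.default_rng(0)
q = rng.normal(size=(50, 6))
s = np.full(6, -0.1)
m = np.full(6, -0.1)
F = accumulate_scores(q, s, m)
print(F[-1, -1])   # score for having completed the whole phrase by the last frame
```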

This computation runs on the device at high speed, continuously producing a score that indicates how closely the incoming audio matches “Hey Siri”. Apple compares this score against a threshold value to decide whether to trigger. This is how Siri functions on the iPhone.

Siri not only has to be very responsive, but also accurate. This is possible because of the iPhone’s Always On Processor (AOP), a small auxiliary processor with access to the microphone (on iPhone 6S and later). When the detection score crosses its threshold, the AOP alerts the main processor and activates its larger DNN for complete processing of the query, such as fetching information from the Internet or helping with calling and texting, among many other features.
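
This two-stage arrangement could be sketched as below. The threshold values and scoring functions are hypothetical placeholders; in practice the score compared against the threshold would be the accumulated F value from the recurrence above.

```python
def hey_siri_detected(audio_window, score_small, score_large,
                      aop_threshold=-5.0, main_threshold=-3.0):
    """Two-stage check: a cheap always-on pass, then a larger model on the main processor.

    score_small / score_large: functions returning a confidence score for the window
    (e.g. the final accumulated F score from the recurrence above). The thresholds
    used here are arbitrary placeholders, not Apple's real values.
    """
    if score_small(audio_window) < aop_threshold:   # AOP: small network, always running
        return False                                # main processor never wakes up
    # AOP crossed its threshold: wake the main processor and run the larger checker
    return score_large(audio_window) >= main_threshold
```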

Conclusion

Apple has been utilising the multifold benefits of machine learning across its products for years. Beyond Siri, it is also exploring options with other products such as the Apple Watch to make them even better and simpler.

PS: The story was written using a keyboard.