In a bizarre-sounding experiment that walks a tightrope between Orwellian voyeurism and ingenious innovation, researchers at MIT have developed an algorithm that can listen to a voice and guess the speaker's face with decent accuracy.
Picking up information such as gender, race or culture from social cues like speech or song is something humans have done subconsciously throughout their evolutionary past. We can easily recognise a person's voice, whether it comes over a wireless link or from behind a wall. If the voice is familiar, we can picture the speaker's face; if not, we can at least guess from the pitch whether it belongs to a man or a woman.
Now, imagine machines doing the same. It is eerie and exciting at the same time.
The authors of the paper trained a neural network on millions of videos from the internet.
“During training, our model learns voice-face correlations that allow it to produce images that capture various physical attributes of the speakers such as age, gender and ethnicity. This is done in a self-supervised manner, by utilizing the natural co-occurrence of faces and speech in Internet videos, without the need to model attributes explicitly,” wrote the authors in their paper titled Speech2Face: Learning the Face Behind a Voice.
The picture below contains the speaker’s image in the first column followed by the results of the model.
Results illustrating the accuracy of the model (via the Speech2Face paper)
How Does the Speech2Face Model Work?
Regressing from input speech directly to image pixels is harder than it sounds: the model would have to factor out the many irrelevant variations in the data and implicitly extract a meaningful internal representation of faces.
To sidestep these challenges, the researchers train their model to regress to a low-dimensional intermediate representation of the face by utilising the VGG-Face model.
Speech2Face pipeline consists of two main components:
- a voice encoder, which takes a complex spectrogram of speech as input, and predicts a low-dimensional face feature that would correspond to the associated face; and
- a face decoder, which takes the face features as input and produces an image of the face in a canonical form (frontal-facing, with a neutral expression).
During training, the face decoder is kept fixed, and only the voice encoder, which predicts the face features, is trained. The face decoder itself is built on a face normalization model.
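The two-component setup above can be sketched as follows. This is a toy illustration in numpy, not the paper's architecture: a single linear layer stands in for the CNN voice encoder, a frozen random linear map stands in for the pretrained face decoder, and the dimensions and data are made-up placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions; stand-ins for the real spectrogram size and the
# 4096-d VGG-Face feature used in the paper.
SPEC_DIM, FEAT_DIM, IMG_DIM = 64, 16, 32

# Face decoder: pretrained and frozen. A random linear map is a
# placeholder for the real decoder network.
DECODER_W = rng.normal(size=(FEAT_DIM, IMG_DIM))

def face_decoder(feat):
    return feat @ DECODER_W  # fixed: never updated during training

# Self-supervised pairs: spectrograms and the face features of the
# co-occurring face frames (random placeholders in this sketch).
specs = rng.normal(size=(100, SPEC_DIM))
target_feats = rng.normal(size=(100, FEAT_DIM))

# Voice encoder: a single linear layer stands in for the CNN.
W = rng.normal(size=(SPEC_DIM, FEAT_DIM)) * 0.01
lr = 0.001  # learning rate reported in the paper (Adam there; plain SGD here)

for _ in range(500):
    pred = specs @ W
    grad = specs.T @ (pred - target_feats) / len(specs)
    W -= lr * grad  # only the voice encoder's weights are updated

# Inference: speech -> predicted face feature -> canonical face image.
face_img = face_decoder(specs[0] @ W)
```

The key design point survives even in this toy form: the loss is computed in face-feature space, so the encoder never has to learn pixel-level detail, and the frozen decoder guarantees that any predicted feature maps to a plausible canonical face.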
The voice encoder module is a convolutional neural network (CNN) that turns the spectrogram of a short input speech into a pseudo-face feature, which is subsequently fed into the face decoder to reconstruct the face image.
This voice encoder is trained in a self-supervised manner, using the natural co-occurrence of a speaker’s speech and facial images in videos.
Up to 6 seconds of audio is taken from the beginning of each video clip in AVSpeech. If a clip is shorter than 6 seconds, the audio is repeated until it is at least 6 seconds long.
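This preprocessing step is simple enough to sketch directly. The sample rate and the trim-after-repeat behaviour are assumptions for illustration; the paper only states that short clips are repeated to reach at least 6 seconds.

```python
import numpy as np

SR = 16_000        # assumed sample rate (not specified in the text above)
TARGET = 6 * SR    # 6 seconds of samples

def pad_to_six_seconds(waveform):
    """Repeat short clips until at least 6 seconds long, then take the
    first 6 seconds, mirroring the AVSpeech preprocessing described above."""
    if len(waveform) < TARGET:
        reps = -(-TARGET // len(waveform))  # ceiling division
        waveform = np.tile(waveform, reps)
    return waveform[:TARGET]

clip = np.ones(2 * SR)  # a 2-second clip
assert len(pad_to_six_seconds(clip)) == TARGET
```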
The resulting training and test sets contain 1.7 million and 0.15 million spectrogram–face feature pairs, respectively. The whole network is implemented in TensorFlow and optimised with Adam at a learning rate of 0.001.
The results show that the classifications for age and gender are highly correlated between the two sources. For gender, male/female labels assigned to the true images and to the reconstructions from speech agree 94% of the time. For ethnicity, there is good agreement on "white" and "Asian", but less on "Indian" and "black".
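The agreement figure quoted above is just the fraction of samples where the attribute classified from the true image matches the one classified from the speech reconstruction. A minimal sketch, with made-up labels rather than the paper's data:

```python
# Labels classified from the true face images vs. from the faces
# reconstructed from speech (toy values for illustration only).
true_labels = ["male", "female", "female", "male", "male"]
pred_labels = ["male", "female", "male",   "male", "male"]

agreement = sum(t == p for t, p in zip(true_labels, pred_labels)) / len(true_labels)
print(f"{agreement:.0%} agreement")  # 80% for this toy sample
```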
The authors state clearly in their paper that this research is purely an academic investigation. Its implications could range widely, from eavesdropping and identifying speakers in remote locations to giving a voice to those with speech impediments by reverse-engineering their facial features. However, it may be some time before any of this becomes reality.