This might come as a surprise to many AI enthusiasts, but the technology can be used to create fake audio and video that is difficult to distinguish from reality. In a recent feat, scientists at the University of Washington created AI software that could generate highly realistic fake videos of former president Barack Obama using existing audio and video clips of him.
The tool essentially takes audio files, converts them into realistic mouth movements, and then grafts those movements onto existing video. The resultant video shows someone saying something they didn’t.
University of Washington scientists had previously shown that such tools could be used to generate digital doppelgangers of anyone simply by analyzing their images. This could include celebrities such as Tom Hanks and Arnold Schwarzenegger, or political figures like George W. Bush and Barack Obama, whose images are easily available on the internet.
The research was funded by Samsung, Google, Facebook, Intel, and the University of Washington. The findings of this project will be detailed on August 2nd at the SIGGRAPH conference held in Los Angeles. Researchers claimed that it might soon be possible to generate digital models of a person for virtual reality or augmented reality applications.
Why did researchers at the University of Washington choose Obama for the project?
Obama turned out to be the best public figure for this AI-based project, largely because hours of high-definition video of him are available online in the public domain.
The research team saw this as a huge opportunity to test their software. They had a neural net analyze millions of frames of video to determine how elements of Obama’s face, such as his lips, his teeth, and the wrinkles around his mouth and chin, moved as he talked.
How does the AI-based software work?
An artificial neural network consists of components known as artificial neurons, which are fed data and work together to solve a problem such as identifying faces or recognizing speech. After each attempt, the neural net alters the pattern of connections among those neurons to change the way they interact, and then tries to solve the problem again. Over time, the neural net learns which patterns are best at computing solutions. This AI strategy loosely mimics the human brain.
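To make the idea of altering connections concrete, here is a minimal sketch of a single artificial neuron learning the logical OR function. It is only an illustration of the general principle, not the researchers' network; the function names, learning rate, and number of passes are assumptions chosen for this example.

```python
import math
import random


def sigmoid(x):
    """Squash any number into the range 0..1."""
    return 1.0 / (1.0 + math.exp(-x))


def predict(weights, bias, inputs):
    """Output of one artificial neuron for the given inputs."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


def train_neuron(samples, epochs=2000, learning_rate=0.5, seed=0):
    """Repeatedly try the problem and adjust the connection weights."""
    rng = random.Random(seed)
    weights = [rng.uniform(-0.5, 0.5) for _ in samples[0][0]]
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            # Strengthen or weaken each connection in proportion to its input.
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


# Toy problem (an assumption for illustration): learn logical OR.
OR_SAMPLES = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
weights, bias = train_neuron(OR_SAMPLES)
```

After training, calling predict(weights, bias, [0, 1]) should give a value close to 1, while predict(weights, bias, [0, 0]) should give a value close to 0. A real network chains many such neurons together, but the learning loop follows the same try, measure, adjust cycle.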
In the new study, the neural net learned which mouth shapes were linked to various sounds. The researchers took audio clips and dubbed them over the original sound files of a video. In the next step, they took mouth shapes that matched the new audio clips, then grafted and blended them onto the video. They also synthesized videos of Obama in which he lip-synced words he had uttered many years earlier.
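The three steps described here (learn which mouth shape goes with which sound, pick shapes for new audio, blend them onto the video) can be sketched in a few lines. This is a heavily simplified toy, not the researchers' method: a lookup table of averages stands in for the neural net, a single "openness" number stands in for a mouth shape, and all names and values are assumptions made up for illustration.

```python
from collections import defaultdict


def learn_mouth_shapes(paired_examples):
    """Average the observed mouth openness for each sound label."""
    totals = defaultdict(lambda: [0.0, 0])
    for sound, openness in paired_examples:
        totals[sound][0] += openness
        totals[sound][1] += 1
    return {sound: total / count for sound, (total, count) in totals.items()}


def mouth_track_for_audio(sounds, shape_table, neutral=0.0):
    """Look up a mouth openness value for each sound in the new audio."""
    return [shape_table.get(sound, neutral) for sound in sounds]


def blend_mouth(frame, mouth_patch, top, left, alpha=0.8):
    """Alpha-blend a small mouth patch onto a copy of a grayscale frame."""
    out = [row[:] for row in frame]
    for r, patch_row in enumerate(mouth_patch):
        for c, value in enumerate(patch_row):
            original = out[top + r][left + c]
            out[top + r][left + c] = alpha * value + (1 - alpha) * original
    return out


# Hypothetical training pairs: (sound label, observed mouth openness).
training = [("ah", 1.0), ("ah", 0.5), ("mm", 0.0), ("oo", 0.4)]
table = learn_mouth_shapes(training)

# Mouth shapes chosen for a new piece of audio.
track = mouth_track_for_audio(["mm", "ah", "oo"], table)

# Graft a tiny 1x2 "mouth" onto a blank 4x4 frame.
frame = [[0.0] * 4 for _ in range(4)]
blended = blend_mouth(frame, [[1.0, 1.0]], top=2, left=1, alpha=0.5)
```

The blending step works on a copy, so the original frame is left untouched. In a real system the patch would be a rendered mouth texture and the blend would have to match lighting and head pose, which this sketch ignores.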
Previously, the researchers worked on a similar project, which involved filming people saying sentences over and over again to map which mouth shapes were linked to various sounds. That process was expensive, tedious, and time-consuming. The new software, by contrast, can learn from hours of video that already exist on the internet.
One noteworthy aspect is that the link between mouth shapes and utterances may be universal across people to some extent. In other words, a neural network trained on Obama and other public figures could be adapted to work for many other people.
Concerns surrounding the application of the AI-based software
With a tool this capable, there will always be concerns about abuse. Generating misleading video footage becomes simple once you have your hands on such a tool, and it is scary to think about the consequences of this technology falling into the wrong hands.
Keeping this in mind, the researchers at the University of Washington were careful not to generate videos that put words in Obama’s mouth that he had not uttered himself.
How useful can the software be?
Improving videoconferencing could be one potential application of this new technology, according to Ira Kemelmacher-Shlizerman, a co-author of the study. Teleconferencing video feeds often stutter, freeze, or suffer from low resolution, but the audio feeds mostly work.
In the future, videoconferencing may simply transmit audio from people, and this software could be used to reconstruct what they looked like while they talked. Such software could also help people talk with digital copies of a person in virtual reality or augmented reality applications.
That is not all: the technology could also be applied to detecting fake videos. It can be difficult for anyone to tell whether a video is fake or real, as the differences may not be obvious to the human eye. However, an AI-based program that compares the blurriness of the mouth region with the rest of the video could be easily developed.
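One simple way to compare blurriness, shown here as a rough sketch, is to measure the variance of a Laplacian filter: sharp regions produce large, varied responses, while blurred regions produce small ones. This is a common image-sharpness heuristic and an assumption on our part, not a detector described by the researchers. The function names, the toy checkerboard frames, and the mouth box coordinates are all hypothetical.

```python
def laplacian_variance(image, top, left, height, width):
    """Variance of a 4-neighbour Laplacian over a region; lower means blurrier."""
    responses = []
    for r in range(max(top, 1), min(top + height, len(image) - 1)):
        for c in range(max(left, 1), min(left + width, len(image[0]) - 1)):
            responses.append(
                image[r - 1][c] + image[r + 1][c]
                + image[r][c - 1] + image[r][c + 1]
                - 4 * image[r][c]
            )
    mean = sum(responses) / len(responses)
    return sum((v - mean) ** 2 for v in responses) / len(responses)


def mouth_sharpness_ratio(image, mouth_box):
    """Sharpness inside the mouth box divided by sharpness of the whole frame."""
    top, left, height, width = mouth_box
    mouth = laplacian_variance(image, top, left, height, width)
    whole = laplacian_variance(image, 0, 0, len(image), len(image[0]))
    return mouth / whole if whole else 1.0


def checkerboard(size):
    """A sharply textured toy frame made of alternating 0 and 1 pixels."""
    return [[float((r + c) % 2) for c in range(size)] for r in range(size)]


MOUTH_BOX = (4, 4, 4, 4)  # hypothetical (top, left, height, width)

real = checkerboard(12)
fake = checkerboard(12)
for r in range(4, 8):
    for c in range(4, 8):
        fake[r][c] = 0.5  # a flat patch stands in for a blurred mouth region

real_ratio = mouth_sharpness_ratio(real, MOUTH_BOX)
fake_ratio = mouth_sharpness_ratio(fake, MOUTH_BOX)
```

On these toy frames the untouched image gives a ratio of 1, while the frame with the flattened mouth region gives a much smaller ratio. A practical detector would need a face tracker to find the mouth and a threshold tuned on real footage, neither of which is covered here.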
Looking into the future of such technological advances
This project clearly illustrates how training on a large amount of video of the same person, and designing algorithms with the goal of photo-realism in mind, can help in creating believable video from audio with convincing lip sync. As discussed earlier, such projects open up a number of interesting future directions.
It could, however, be difficult to train the system on another person, for instance a non-celebrity, because hours of training data are not easy to obtain. Still, the association between mouth shapes and utterances may be speaker-independent. In other words, we could perhaps retrain the network used in Obama’s case for another person with less additional training data.
Taking the system a notch higher, it might be possible to train a single universal network from videos of many different people, which could then be conditioned on individual speakers. For example, the AI-based system could be fed a small video sample of a new person in order to produce accurate mouth shapes for that person.
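As a toy illustration of adapting a universal model to a new speaker from a small sample, the sketch below fits a per-speaker scale and offset on top of a shared prediction using ordinary least squares. A linear adapter is our own simplifying assumption; the article does not say how such conditioning would actually be done, and the names and numbers are hypothetical.

```python
def fit_speaker_adapter(universal_predictions, observed_shapes):
    """Least-squares fit of observed = scale * universal + offset."""
    n = len(universal_predictions)
    mean_x = sum(universal_predictions) / n
    mean_y = sum(observed_shapes) / n
    var_x = sum((x - mean_x) ** 2 for x in universal_predictions)
    cov_xy = sum((x - mean_x) * (y - mean_y)
                 for x, y in zip(universal_predictions, observed_shapes))
    scale = cov_xy / var_x if var_x else 1.0
    offset = mean_y - scale * mean_x
    return scale, offset


def adapt(universal_prediction, adapter):
    """Adjust a universal mouth-shape prediction for one speaker."""
    scale, offset = adapter
    return scale * universal_prediction + offset


# Hypothetical small sample: what the universal model predicted versus
# the mouth openness actually measured for the new person.
universal = [0.0, 0.5, 1.0]
observed = [0.1, 0.4, 0.7]

adapter = fit_speaker_adapter(universal, observed)
personalised = adapt(0.25, adapter)
```

Only two numbers are learned per speaker, which is why three samples are enough here. A real system would adapt far more parameters, but the idea is the same: reuse what was learned from many people and fit only the small speaker-specific part.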