Netflix shook the foundations of the entertainment industry when it released the interactive film ‘Bandersnatch’, a choose-your-own-adventure story that could very well serve as a blueprint for AI-powered interactive games. In the film, the viewer is given multiple choices as to how the story progresses, and the resulting storylines are later revealed to be a set of narratives contained within a single universe. In effect, viewers are placed inside a “choose-your-own-adventure” novel.
The choices made by users affected the actions of the characters on screen, blurring the boundary between entertainment and interactivity. However, the film was subject to real-world constraints, such as only being able to shoot a finite number of scenes, so the outcomes of the story and its branching paths were restricted.
This article takes a step-by-step look at how something similar to Bandersnatch could be recreated using only Artificial Intelligence techniques. Doing so would also remove the aforementioned real-world constraints, allowing the creation to be truly open-ended.
It is worth noting that each of the building blocks of the movie requires weeks or even months of training to produce optimum output. It is therefore assumed that available computing power is far beyond what it is today, so that a seamless watching experience can be delivered.
NLP-powered Parser: On the movie side of things, the ‘casting’, acting, scriptwriting and speech can be automated by AI. On the viewer’s end, an NLP-powered parser is required to convey the viewer’s input to a “master AI” of sorts. The master AI then directs the other AI programmes on what to do, based on the information the user gives the system through the parser.
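To make the parser's role concrete, here is a minimal sketch of how free-form viewer input might be reduced to a command for the master AI. It uses simple keyword matching as a stand-in for a real NLP model, and the names `ACTION_KEYWORDS`, `Command` and `parse_choice` are hypothetical, invented purely for illustration.

```python
import re
from dataclasses import dataclass

# Hypothetical action vocabulary. A real parser would learn intents from data
# rather than rely on a hand-written keyword table like this one.
ACTION_KEYWORDS = {
    "accept": {"accept", "agree", "yes", "take"},
    "refuse": {"refuse", "decline", "no", "reject"},
    "leave": {"leave", "exit", "walk", "go"},
}


@dataclass
class Command:
    action: str    # normalised action name, or "unknown"
    raw_text: str  # the viewer's original input, kept for the master AI


def parse_choice(text: str) -> Command:
    """Map free-form viewer input onto the best-matching known action."""
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    best_action, best_score = "unknown", 0
    for action, keywords in ACTION_KEYWORDS.items():
        score = len(tokens & keywords)  # number of matching keywords
        if score > best_score:
            best_action, best_score = action, score
    return Command(best_action, text)


if __name__ == "__main__":
    print(parse_choice("I refuse the job offer"))
    print(parse_choice("Dance on the table"))
```

Input that matches no known action comes back as "unknown" with the raw text attached, which is the case where the master AI would have to open an entirely new branch.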
GAN to create faces: The faces of the ‘actors’ involved in the movie can be created through the use of a generative adversarial network or GAN. A GAN is a set of two neural networks, where one functions as a generator and the other as a discriminator. The generator creates candidate results, while the discriminator evaluates whether each one looks real or generated, and the two are trained against each other.
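The generator and discriminator interplay can be illustrated with a deliberately tiny example. The sketch below is not a face generator: it assumes one-dimensional "data" drawn from a normal distribution, a linear generator and a logistic discriminator, with gradients written out by hand. It also uses the common non-saturating generator loss, which is a choice made here rather than anything the article specifies. All names and hyperparameters are illustrative.

```python
import math
import random


def sigmoid(s):
    # Clamp the logit so math.exp cannot overflow.
    s = max(-60.0, min(60.0, s))
    return 1.0 / (1.0 + math.exp(-s))


def train_toy_gan(steps=2000, lr=0.01, seed=0):
    """Alternate discriminator and generator updates on 1-D toy data."""
    rng = random.Random(seed)
    a, b = 1.0, 0.0  # generator:      G(z) = a * z + b
    w, c = 0.1, 0.0  # discriminator:  D(x) = sigmoid(w * x + c)
    for _ in range(steps):
        x_real = rng.gauss(4.0, 1.0)  # a sample of "real" data
        z = rng.gauss(0.0, 1.0)       # random noise fed to the generator
        x_fake = a * z + b            # the generator's attempt

        # Discriminator step: minimise -[log D(x_real) + log(1 - D(x_fake))].
        d_real = sigmoid(w * x_real + c)
        d_fake = sigmoid(w * x_fake + c)
        w -= lr * ((d_real - 1.0) * x_real + d_fake * x_fake)
        c -= lr * ((d_real - 1.0) + d_fake)

        # Generator step: minimise -log D(G(z)) against the updated discriminator.
        d_fake = sigmoid(w * x_fake + c)
        a -= lr * (d_fake - 1.0) * w * z
        b -= lr * (d_fake - 1.0) * w
    return (a, b), (w, c)


if __name__ == "__main__":
    (a, b), (w, c) = train_toy_gan()
    print(f"generator: a={a:.3f}, b={b:.3f}; discriminator: w={w:.3f}, c={c:.3f}")
```

A real face GAN replaces both linear functions with deep convolutional networks and relies on automatic differentiation, but the alternating update pattern is the same.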
This system was used by chipmaker Nvidia in order to demonstrate the generation of photorealistic faces. They utilised a new training method that progressively grows the generator and discriminator. This method involved adding new layers that model fine facial details as the training progresses, thus speeding the training up and “greatly stabilising” it.
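As a rough illustration of progressive growing, the sketch below generates a training schedule in which the image resolution doubles at each phase and the newly added layer is faded in through a blending weight. The function names, the starting and final resolutions and the number of steps per phase are assumptions for illustration, and the stabilisation phases of the actual method are omitted.

```python
def progressive_schedule(start_res=4, final_res=1024, steps_per_phase=4):
    """Yield (resolution, alpha) pairs for a simplified progressive schedule.

    alpha is the blending weight of the newest layer: it ramps up to 1.0 as
    that layer is faded in.
    """
    res = start_res
    yield res, 1.0  # the lowest resolution is trained directly, with no fade-in
    while res < final_res:
        res *= 2  # add a new layer that doubles the resolution
        for step in range(1, steps_per_phase + 1):
            yield res, step / steps_per_phase


def blend(old_upsampled, new_output, alpha):
    """Mix the upsampled output of the old layers with the new layer's output."""
    return [(1 - alpha) * o + alpha * n for o, n in zip(old_upsampled, new_output)]


if __name__ == "__main__":
    for res, alpha in progressive_schedule(final_res=16, steps_per_phase=2):
        print(f"{res}x{res} alpha={alpha}")
```

Because the network first learns coarse structure at low resolution and only later the fine detail, each phase starts from a reasonable state rather than from scratch.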
Deep Generative Models: Once a sufficient set of faces has been generated, the images have to be brought together in a “video” format, with multiple pictures strung together. This can then be run through a system known as Deep Video Portraits, which employs a generative neural network with a space-time architecture. This will allow for the creation of natural-looking faces that can employ facial and lip movements. This is achieved through the prediction of photo-realistic video frames for a target based on synthetic renderings of a parametric face model.
RNN for scriptwriting: As for the scriptwriting, AI systems have been generating text for years. More recently, a short film known as Sunspring was produced, with its science-fiction script written by an AI known as Benjamin. A recurrent neural network can be utilised to generate readable and coherent text after many iterations of training. This can be achieved through character-level language models based on multi-layer long short-term memory (LSTM) units. The RNN is fed a dataset of text and asked to model the probability distribution of the next character given the sequence of characters before it.
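The idea of modelling the next character given the previous ones can be shown without a neural network at all. The sketch below swaps the LSTM for a plain count-based model over a fixed-length context, which is far weaker but exposes the same interface: estimate a distribution over the next character, sample from it, append the result and repeat. The function names and the `order` parameter are illustrative assumptions.

```python
import random
from collections import Counter, defaultdict


def fit_char_model(text, order=3):
    """Count which character follows each context of `order` characters."""
    counts = defaultdict(Counter)
    for i in range(len(text) - order):
        context = text[i:i + order]
        counts[context][text[i + order]] += 1
    return counts


def next_char_distribution(counts, context):
    """Return P(next char | context) as a dict, or {} for an unseen context."""
    seen = counts.get(context)
    if not seen:
        return {}
    total = sum(seen.values())
    return {ch: n / total for ch, n in seen.items()}


def generate(counts, seed_text, length, order=3, rng=None):
    """Extend seed_text one sampled character at a time."""
    rng = rng or random.Random(0)
    out = seed_text
    for _ in range(length):
        dist = next_char_distribution(counts, out[-order:])
        if not dist:
            break  # context never seen in training: stop early
        chars, probs = zip(*dist.items())
        out += rng.choices(chars, weights=probs, k=1)[0]
    return out


if __name__ == "__main__":
    corpus = "the viewer chooses the path and the story follows the viewer. "
    model = fit_char_model(corpus * 3, order=3)
    print(generate(model, "the", 60, order=3))
```

An LSTM replaces the lookup table with a learned hidden state, which lets it condition on much longer histories than a fixed window of characters.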
WaveNet can mimic human speech: The characters’ speech can be simulated with an algorithm that employs a method similar to Google’s WaveNet algorithm. WaveNet is a deep generative model that is trained on raw audio waveforms. It can mimic human speech and can synthesise other audio signals. The neural network models the raw audio waveform directly: at each step, it computes a probability distribution over the next audio sample given all the samples before it. Once trained, the network can be sampled to generate synthetic utterances. A value is drawn from the distribution and fed back into the input to predict the following sample, thus achieving complex and realistic-sounding audio.
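The draw-and-feed-back loop described here can be sketched in a few lines. In the example below, `toy_next_sample_logits` is a hypothetical stand-in for a trained network (it simply favours values close to the previous sample), and the use of 256 quantisation levels with a softmax output is assumed as the setup. Only the sampling loop itself is the point.

```python
import math
import random

NUM_LEVELS = 256  # assumed number of quantised amplitude levels


def toy_next_sample_logits(history):
    """Hypothetical stand-in for a trained model: prefer levels near the last one."""
    last = history[-1] if history else NUM_LEVELS // 2
    return [-abs(level - last) / 4.0 for level in range(NUM_LEVELS)]


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_waveform(num_samples, predict_logits=toy_next_sample_logits, seed=0):
    """Generate audio one sample at a time, feeding each draw back as input."""
    rng = random.Random(seed)
    samples = []
    for _ in range(num_samples):
        probs = softmax(predict_logits(samples))  # distribution over next sample
        value = rng.choices(range(NUM_LEVELS), weights=probs, k=1)[0]
        samples.append(value)  # the drawn value becomes part of the next input
    return samples


if __name__ == "__main__":
    print(sample_waveform(20))
```

Because every sample depends on all previous ones, generation is inherently sequential, which is one reason the article's assumption about future computing power matters for real-time speech.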
The technology exists today to make a complete film by utilising AI. This undertaking, on the other hand, would fundamentally change the nature of entertainment. After the initial training, the algorithms can then be tested on control groups in order to identify the most common viewer responses. While these responses would be the ones that are completely trained and debugged, the existence of an NLP parser means that the viewer can choose any option at will.
The finished product would then provide users with endless choice. When a new branch is opened, the collective AI, directed by the “master AI”, would generate a new pathway, with commands sent to each individual AI to generate its part of the puzzle. This would result in an experience where users have the ultimate freedom of choice.
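As a closing illustration, the coordination role of the master AI might look something like the sketch below. Everything in it is hypothetical: the `MasterAI` class, the stub generators standing in for the script and speech models, and the decision to cache generated branches so that a revisited path stays consistent.

```python
from typing import Callable, Dict, Tuple

Generator = Callable[[str], str]


class MasterAI:
    """Hypothetical coordinator that asks each sub-model for its part of a scene."""

    def __init__(self, generators: Dict[str, Generator]):
        self.generators = generators
        # Generated scenes, keyed by the full sequence of choices that led to them.
        self.branches: Dict[Tuple[str, ...], Dict[str, str]] = {}

    def open_branch(self, path: Tuple[str, ...], choice: str) -> Dict[str, str]:
        key = path + (choice,)
        if key not in self.branches:
            # New branch: command every individual generator to produce its part.
            self.branches[key] = {
                name: generate(choice) for name, generate in self.generators.items()
            }
        return self.branches[key]


# Stub generators standing in for the trained models described in the article.
def stub_script(choice: str) -> str:
    return f"[script for '{choice}']"


def stub_speech(choice: str) -> str:
    return f"[speech for '{choice}']"


if __name__ == "__main__":
    master = MasterAI({"script": stub_script, "speech": stub_speech})
    print(master.open_branch((), "refuse"))
    print(master.open_branch(("refuse",), "leave"))
```

In a full system the `choice` argument would be the command produced by the parser, and each generator would receive the story so far rather than just the latest choice.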