A central objective of generative modelling, a branch of unsupervised learning, is to train an algorithm to generate its own instances of data. The job is not simply to reproduce the training data but to build a model of the underlying class from which that data was drawn. For example, a model should capture not one particular photograph of a horse or a rainbow, but the set of all photographs of horses and rainbows.
Generative models can now produce images with high fidelity, but results on video have been far less impressive. Researchers are now looking to carry that image generation capability over to high-resolution video.
They propose a model called Dual Video Discriminator GAN (DVD-GAN), which scales to longer and higher resolution videos by leveraging a computationally efficient decomposition of its discriminator.
DVD-GAN employs two discriminators for its assessment:
- a Spatial Discriminator D_S and
- a Temporal Discriminator D_T
D_S critiques single-frame content and structure by randomly sampling k full-resolution frames and processing them individually.
The temporal discriminator D_T, in turn, provides the generator G with the learning signal to generate movement.
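The division of labour between the two discriminators can be sketched in a few lines of plain Python. This is only an illustration, not the authors' implementation: the function names are hypothetical, the toy video is random noise, and the 2 × 2 average pooling applied before D_T is an assumption that is not described above; it is included only to show how a decomposition of this kind could reduce the number of pixels each discriminator has to process.

```python
import random


def sample_frame_indices(num_frames, k, rng):
    """Pick k distinct frame indices uniformly at random, kept in temporal order."""
    return sorted(rng.sample(range(num_frames), k))


def average_pool_2x2(frame):
    """Downsample one frame (a list of pixel rows) by averaging 2 x 2 blocks."""
    height, width = len(frame), len(frame[0])
    return [
        [
            (frame[y][x] + frame[y][x + 1] + frame[y + 1][x] + frame[y + 1][x + 1]) / 4.0
            for x in range(0, width - 1, 2)
        ]
        for y in range(0, height - 1, 2)
    ]


def spatial_discriminator_input(video, k, rng):
    """D_S view: k randomly chosen frames, each kept at full resolution."""
    return [video[i] for i in sample_frame_indices(len(video), k, rng)]


def temporal_discriminator_input(video):
    """D_T view (assumed): every frame, but spatially downsampled."""
    return [average_pool_2x2(frame) for frame in video]


def pixel_count(frames):
    """Total number of pixels across a list of frames."""
    return sum(len(row) for frame in frames for row in frame)


# Toy video: 12 frames of 8 x 8 grayscale pixels (values are arbitrary).
rng = random.Random(0)
video = [[[rng.random() for _ in range(8)] for _ in range(8)] for _ in range(12)]

spatial_input = spatial_discriminator_input(video, k=4, rng=rng)
temporal_input = temporal_discriminator_input(video)

print(pixel_count(video))           # 12 * 8 * 8 = 768
print(pixel_count(spatial_input))   # 4 * 8 * 8 = 256
print(pixel_count(temporal_input))  # 12 * 4 * 4 = 192
```

In this toy setting neither discriminator sees all 768 pixels of the video: one looks at a few frames in full detail, the other looks at every frame in reduced detail.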
GANs are known for the duel between generator and discriminator. DVD-GAN adds a dual to that duel: a pair of discriminators that together push the generator to produce high-quality videos seemingly out of thin air.
The model is trained on Kinetics-600, a complex dataset of natural videos.
The videos in the dataset are diverse, and the dataset is large enough to train big models without the overfitting that usually occurs with smaller datasets.
Because video involves far more data than still images, video generation has so far largely been restricted to simple datasets or to settings where strong temporal conditioning information is available.
DVD-GAN, which is built upon the state-of-the-art BigGAN architecture, introduces a number of video-specific modifications, including efficient separable attention and a spatio-temporal decomposition of the discriminator.
Bi-directional Generative Adversarial Networks (BiGANs), a separate line of work from BigGAN, were introduced a couple of years earlier to learn the inverse mapping from data back to the latent space. Their authors demonstrated that the resulting learned feature representation is useful for auxiliary supervised discrimination tasks, competitive with other approaches to unsupervised and self-supervised feature learning. DVD-GAN contains both self-attention and an RNN.
The generator in DVD-GAN contains no explicit priors for foreground, background or motion (optical flow). Optical flow is a mathematical approach to describing the apparent motion of objects between frames. The concept was originally modeled around how animals perceive their surroundings as they move.
An optical flow method estimates the displacement between consecutive frames by comparing pixel intensities and other such attributes. For instance, if a bright patch of pixels appears a few positions to the right in the next frame, it can be inferred that the corresponding object has moved to the right.
DVD-GAN, instead, relies on a high-capacity neural network to learn these properties in a data-driven manner.
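To make the idea of comparing pixel intensities across frames concrete, here is a toy sketch that estimates how far a one-dimensional row of pixels has shifted between two frames. It uses brute-force matching on squared intensity differences, which is a deliberately simple stand-in and not a real optical flow algorithm; the function name and the example rows are hypothetical. DVD-GAN itself does not use anything of this kind.

```python
def estimate_shift(prev_row, next_row, max_shift):
    """Return the integer shift d (in pixels) that minimises the mean squared
    intensity difference between prev_row[x] and next_row[x + d].

    Assumes max_shift is smaller than the row length.
    """
    best_shift, best_error = 0, float("inf")
    n = len(prev_row)
    for d in range(-max_shift, max_shift + 1):
        # Only compare positions where both x and x + d fall inside the row.
        pairs = [(prev_row[x], next_row[x + d]) for x in range(n) if 0 <= x + d < n]
        error = sum((a - b) ** 2 for a, b in pairs) / len(pairs)
        if error < best_error:
            best_shift, best_error = d, error
    return best_shift


# A bright two-pixel object moves two positions to the right between frames.
prev_row = [0, 0, 9, 9, 0, 0, 0, 0]
next_row = [0, 0, 0, 0, 9, 9, 0, 0]

shift = estimate_shift(prev_row, next_row, max_shift=3)
print(shift)  # 2
```

A hand-built prior of this sort encodes an assumption about how pixels move; the point of the passage above is that DVD-GAN leaves such structure to be learned from data.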
Setting A New Benchmark For Video Generation
Figure: Selected frames from videos generated by a DVD-GAN trained on Kinetics-600 at 256 × 256, 128 × 128, and 64 × 64 resolutions (top to bottom).
The resulting model, Dual Video Discriminator GAN (DVD-GAN), is able to generate temporally coherent, high-resolution video.
Each DVD-GAN was trained on slices of TPUv3 pods using between 32 and 512 replicas with an Adam optimizer for up to 300,000 update steps.
Future Video Prediction is the problem of generating a sequence of frames that directly follows from one (or several) initial conditioning frames.
Generating longer and larger videos is a more challenging modeling problem, yet DVD-GAN is able to generate plausible videos at all resolutions, with actions spanning up to 4 seconds (48 frames).
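The setup for future video prediction can be sketched as a simple split of a clip into conditioning frames and the frames to be generated. The helper name, the clip length and the number of conditioning frames below are illustrative assumptions, not values taken from the DVD-GAN experiments.

```python
def split_for_prediction(video, num_conditioning):
    """Split a clip into (conditioning frames, frames to be predicted)."""
    if not 0 < num_conditioning < len(video):
        raise ValueError("num_conditioning must be between 1 and len(video) - 1")
    return video[:num_conditioning], video[num_conditioning:]


# Stand-in clip: strings take the place of real image frames.
video = [f"frame_{t}" for t in range(16)]
conditioning, targets = split_for_prediction(video, num_conditioning=5)

print(conditioning[-1])  # frame_4
print(targets[0])        # frame_5
```

A prediction model is given `conditioning` as input and is judged on how plausible its continuation is compared with `targets`.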
According to the authors, this work makes the following contributions:
- It proposes DVD-GAN, a scalable generative model of natural video that produces high-quality samples at resolutions up to 256 × 256 and lengths up to 48 frames.
- It achieves state-of-the-art results for video synthesis on UCF-101 and for video prediction on Kinetics-600.
- It establishes a new benchmark for generative video modeling.