The guiding principle of a generative model is that it should be able to construct a convincing example of the data it is fed. The more realistic the generated image, the stronger the evidence that the model has grasped the objective.
Generative models offer an appealing alternative to self-supervised tasks in that they are trained to model the full data distribution without requiring any modification of the original data.
The long-standing dream of extracting actionable insights from raw data alone has hardly been realised yet. So far, self-supervision has dominated representation learning, in spite of the success of GANs.
The simplest objective for unsupervised learning is to train an algorithm to generate its own instances of data. The so-called generative models should not simply reproduce the data they are trained on, but rather build a model of the underlying class from which that data was drawn. For example, they should not show a particular photograph of a horse or a rainbow, but the set of all photographs of horses and rainbows; not a specific utterance from a specific speaker, but the general distribution of spoken utterances.
Bidirectional Generative Adversarial Networks (BiGANs) were introduced a couple of years ago to learn the inverse mapping, from data back to the latent space. They demonstrated that the resulting learned feature representation is useful for auxiliary supervised discrimination tasks, performing on par with other unsupervised and self-supervised feature learning approaches.
Intuitively, models trained to predict these semantic latent representations given data may serve as useful feature representations for auxiliary problems where semantics are relevant.
Now, researchers have introduced BigBiGAN, which builds upon the state-of-the-art BigGAN model and extends it to representation learning by adding an encoder and modifying the discriminator.
What BigBiGAN Does To Visual Models
The overall framework of BigBiGAN remains the same as that of BiGAN, with BigGAN serving as the generator. The key change is an improved discriminator structure, which the researchers found leads to better representation learning results without compromising generation.
The above figure shows the structure of the BigBiGAN framework, in which a joint discriminator D is used to compute the loss. Its inputs are data-latent pairs: either (x ∼ P_x, ẑ ∼ E(x)), sampled from the data distribution P_x and the encoder E outputs, or (x̂ ∼ G(z), z ∼ P_z), sampled from the generator G outputs and the latent distribution P_z. The loss includes the unary data term S_x and the unary latent term S_z, as well as the joint term S_xz, which ties the data and latent distributions.
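To make the role of the three score terms concrete, here is a minimal sketch of how a unary data score, a unary latent score, and a joint score could be combined into adversarial losses. It assumes a hinge loss and a particular sign convention (y = +1 for encoder pairs, y = -1 for generator pairs); neither is specified in the text above, and the function names and toy score values are purely illustrative rather than the authors' implementation.

```python
def hinge(t):
    """Hinge function h(t) = max(0, 1 - t). Assumed here, not stated in the article."""
    return max(0.0, 1.0 - t)


def discriminator_loss(s_x, s_z, s_xz, y):
    """Per-pair discriminator loss: one hinge term per score.

    s_x  : unary score computed from the data (image) alone
    s_z  : unary score computed from the latent alone
    s_xz : joint score computed from the (data, latent) pair
    y    : +1 for an encoder pair (x, E(x)), -1 for a generator pair (G(z), z)
           (sign convention is an assumption of this sketch)
    """
    return hinge(y * s_x) + hinge(y * s_z) + hinge(y * s_xz)


def encoder_generator_loss(s_x, s_z, s_xz, y):
    """Per-pair loss for the encoder and generator, which try to fool the
    discriminator by pushing the summed score in the opposite direction."""
    return y * (s_x + s_z + s_xz)


# Toy scores standing in for the outputs of a real discriminator network.
encoder_pair = dict(s_x=2.0, s_z=1.5, s_xz=3.0)      # scores for (x, E(x))
generator_pair = dict(s_x=-0.5, s_z=0.0, s_xz=-2.0)  # scores for (G(z), z)

d_loss = (discriminator_loss(**encoder_pair, y=+1)
          + discriminator_loss(**generator_pair, y=-1))
eg_loss = (encoder_generator_loss(**encoder_pair, y=+1)
           + encoder_generator_loss(**generator_pair, y=-1))

print("discriminator loss:", d_loss)
print("encoder/generator loss:", eg_loss)
```

In this sketch, only the joint score depends on both the image and the latent; the two unary scores each see one side of the pair, which is what separates this setup from a discriminator with a single joint output.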
BigBiGAN is trained on unlabeled ImageNet. Its learned representation is then frozen, and a linear classifier is trained on its outputs, fully supervised using all of the training set labels.
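The frozen-representation protocol can be sketched on toy data as follows. This is only an illustration of the idea: `frozen_encoder` is a hypothetical stand-in for a pretrained encoder (it is a fixed function that is never updated), the data is a handful of made-up 2D points rather than ImageNet, and the linear classifier is a plain logistic regression trained with stochastic gradient descent.

```python
import math


def frozen_encoder(x):
    """Hypothetical stand-in for a pretrained encoder.
    Its 'weights' are fixed: nothing in this function is ever trained."""
    return [x[0] + x[1], x[0] - x[1]]


def train_linear_classifier(features, labels, epochs=200, lr=0.1):
    """Fit a logistic-regression classifier on top of fixed features."""
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for f, y in zip(features, labels):
            logit = sum(wi * fi for wi, fi in zip(w, f)) + b
            p = 1.0 / (1.0 + math.exp(-logit))
            grad = p - y  # gradient of the log loss w.r.t. the logit
            w = [wi - lr * grad * fi for wi, fi in zip(w, f)]
            b -= lr * grad
    return w, b


def predict(w, b, f):
    """Return the predicted class (0 or 1) for one feature vector."""
    return 1 if sum(wi * fi for wi, fi in zip(w, f)) + b > 0 else 0


# Toy labelled data (made up for this sketch).
inputs = [(1.0, 2.0), (2.0, 0.5), (0.5, 0.5),
          (-1.0, -2.0), (-2.0, -0.5), (-0.5, -0.5)]
labels = [1, 1, 1, 0, 0, 0]

# Step 1: extract features once with the frozen encoder.
features = [frozen_encoder(x) for x in inputs]

# Step 2: train only the linear classifier, using the labels.
w, b = train_linear_classifier(features, labels)

accuracy = sum(predict(w, b, f) == y for f, y in zip(features, labels)) / len(labels)
print("training accuracy:", accuracy)
```

Because only the linear layer is trained, the resulting accuracy reflects how linearly separable the classes already are in the encoder's feature space, which is why this kind of evaluation is used as a measure of representation quality.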
In the above figure, the top-row images are real data; the bottom-row images are generated reconstructions of the images directly above them. Unlike most explicit reconstruction costs (e.g., pixel-wise), the reconstruction cost implicitly minimized by a (Big)BiGAN tends to emphasize more semantic, high-level details.
The extent to which these reconstructions tend to retain the high-level semantics of the inputs rather than the low-level details suggests that BigBiGAN training encourages the encoder to model the former more so than the latter.
According to the authors, the following are the key objectives behind this work:
- Show that BigBiGAN (BiGAN with BigGAN generator) matches the state of the art in unsupervised representation learning on ImageNet.
- Propose a more stable version of the joint discriminator for BigBiGAN.
- Perform a thorough empirical analysis and ablation study of model design choices.
- Show that the representation learning objective also helps unconditional image generation.
AI With Intuition: The Final Frontier
The ability to learn about the world without explicit supervision is fundamental to what is regarded as intelligence, and the above results resonate with popular intuitions about the human mind.
Though the reconstructions from the generative models are far from pixel-perfect, they may still provide some intuition for what features the encoder learns to model.
When the input image contains a dog, a person, or a food item, the reconstruction is often a different instance of the same “category” with similar pose, position, and texture: for example, a similar species of dog facing the same direction.
In this work, the researchers introduce BigBiGAN and show that progress in image generation quality translates to substantially improved representation learning performance. This is a new perspective considering how ambiguous the inner workings of deep networks are.