Meet ViLBERT, The Task-Agnostic Model Inspired From BERT For Vision Grounding

Human–computer interaction is one of the driving forces behind the rapid evolution of emerging technologies. In this domain, artificial intelligence and natural language processing (NLP) are helping to bridge the gap between how machines and people communicate. There has been considerable research into systems that mine images and other visual content and are able to demonstrate genuine visual understanding.

Machine learning applications have reached new heights over the past few years. Research into natural language understanding has spawned open-source pre-trained NLP models such as Google's BERT, OpenAI's GPT-2 and ULMFiT. These models support a wide range of NLP applications, including sentiment analysis, chatbots and much more.

Recently, researchers from Georgia Institute of Technology, Facebook AI Research and Oregon State University have developed a model known as ViLBERT, short for Vision-and-Language BERT. This model is built to learn task-agnostic joint representations of image content as well as natural language.

Behind the Model

In one of our earlier articles, we discussed Google's BERT and how it is changing the way natural language processing is done. ViLBERT consists of two parallel BERT-style streams operating over image regions and text segments. Each stream is a series of transformer blocks (TRM) interleaved with novel co-attentional transformer layers (Co-TRM), which are introduced to enable information exchange between the two modalities.
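As a rough illustration of the co-attention idea (a minimal single-head sketch, not the paper's exact implementation), a Co-TRM step computes standard scaled dot-product attention, but takes its queries from one modality and its keys and values from the other, so each stream conditions on the other stream's features:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def co_attention(queries, keys_values, d_k):
    """Single-head co-attention: queries come from one modality,
    keys and values from the other, so each stream attends to
    the other stream's features (the Co-TRM idea, simplified)."""
    scores = queries @ keys_values.T / np.sqrt(d_k)
    return softmax(scores) @ keys_values

# Toy features: 4 text tokens and 3 image regions, hidden size 8.
rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))
regions = rng.normal(size=(3, 8))

text_attended = co_attention(text, regions, d_k=8)  # text attends to image
img_attended = co_attention(regions, text, d_k=8)   # image attends to text
print(text_attended.shape, img_attended.shape)      # (4, 8) (3, 8)
```

Each output keeps the shape of its query stream, which is what lets the two transformer stacks continue processing after exchanging information.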

An architecture built by making only minimal changes to BERT would suffer from several drawbacks. It would treat inputs from both modalities identically, ignoring that they may need different levels of processing due to either their inherent complexity or the initial level of abstraction of their input representations. Moreover, forcing the pre-trained weights to accommodate a large set of additional visual 'tokens' could damage the learned BERT language model. For these reasons, ViLBERT instead uses a two-stream architecture in which each modality is modelled separately, and the streams are fused through a small set of attention-based interactions.

Fig: Architecture of the ViLBERT model.

A Faster R-CNN detector with a ResNet backbone, pre-trained on the Visual Genome dataset, is used to extract region features. In the image above, the green stream denotes visual processing and the purple stream denotes linguistic processing; the two streams run in parallel and interact through the novel co-attentional transformer layers.
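To make the data flow into the visual stream concrete, here is a sketch with a stand-in for the detector. A real Faster R-CNN + ResNet backbone returns bounding boxes and one pooled feature vector per region; the region count, feature dimension and projection size below are common settings assumed for illustration, not exact figures from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

def extract_region_features(image, num_regions=36, feat_dim=2048):
    """Stand-in for a Faster R-CNN + ResNet backbone. A real detector
    would return bounding boxes and a pooled feature vector per
    detected region; here both are faked with random values so the
    downstream data flow is visible."""
    boxes = rng.uniform(0, 1, size=(num_regions, 4))     # (x1, y1, x2, y2), normalised
    features = rng.normal(size=(num_regions, feat_dim))  # pooled backbone features
    return boxes, features

# Project detector features into the visual stream's hidden size
# before they enter the transformer blocks.
hidden = 1024
W_proj = rng.normal(size=(2048, hidden)) * 0.01
boxes, feats = extract_region_features(image=None)
visual_tokens = feats @ W_proj
print(visual_tokens.shape)  # (36, 1024)
```

The projected region vectors then play the same role in the visual stream that WordPiece embeddings play in the linguistic stream.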

Dataset Used

To train the ViLBERT model, the Conceptual Captions dataset has been used. It is a collection of approximately 3.3 million image-caption pairs with weakly associated descriptions, automatically scraped from alt-text-enabled images on the web.

Why Use This Model

BERT is one of the most popular self-supervised pre-trained models for natural language processing and has set benchmarks across state-of-the-art NLP tasks. It uses WordPiece embeddings with a 30,000-token vocabulary for unsupervised pre-training of natural language understanding systems. ViLBERT can be seen as an extension of BERT that learns task-agnostic visual grounding from paired visio-linguistic data.
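For readers unfamiliar with WordPiece, the scheme splits an unknown word into the longest vocabulary pieces it can find, marking word-internal pieces with a `##` prefix. A toy greedy longest-match sketch (with a tiny made-up vocabulary, not BERT's real 30,000 tokens):

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first WordPiece tokenisation, as used by
    BERT. Word-internal subwords carry a '##' prefix in the vocab."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub          # continuation pieces are prefixed
            if sub in vocab:
                piece = sub               # longest match found
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]              # no piece covers this span
        tokens.append(piece)
        start = end
    return tokens

vocab = {"play", "##ing", "##ed", "ground", "##ground"}
print(wordpiece_tokenize("playing", vocab))     # ['play', '##ing']
print(wordpiece_tokenize("playground", vocab))  # ['play', '##ground']
```

This subword splitting is what lets a fixed 30,000-token vocabulary cover an open-ended set of words.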

Outlook

NLP, too, is evolving quickly and becoming more robust. Combining computer vision with natural language processing has set a benchmark among emerging technologies, which is why organisations are incorporating this approach into new intelligent models.

Read the paper here.


Ambika Choudhury

A Technical Journalist who loves writing about Machine Learning and Artificial Intelligence. A lover of music, writing and learning something out of the box.