Meet ViLBERT, The Task-Agnostic Model Inspired From BERT For Vision Grounding

Human–computer interaction is one of the driving forces behind the rapid evolution of emerging technologies. In this domain, artificial intelligence and natural language processing (NLP) are helping to bridge the gap between how machines and people communicate. There has been considerable research into systems that mine images and other visual content and are able to demonstrate genuine visual understanding.

Machine learning applications have reached new heights over the past few years. Research into natural language understanding has spawned open-source pre-trained NLP models such as Google's BERT, OpenAI's GPT-2 and ULMFiT. These models support a wide range of NLP applications, including sentiment analysis, chatbots and much more.

Recently, researchers from Georgia Institute of Technology, Facebook AI Research and Oregon State University have developed a model known as ViLBERT, short for Vision-and-Language BERT. This model is built to learn task-agnostic joint representations of image content as well as natural language.

Behind the Model

In one of our earlier articles, we discussed Google's BERT and how it is changing the way natural language processing is done. ViLBERT consists of two parallel BERT-style streams operating over image regions and text segments. Each stream is a series of transformer blocks (TRM) interleaved with novel co-attentional transformer layers (Co-TRM), which are introduced to enable information exchange between the two modalities.
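As a rough illustration of the co-attention idea (a minimal single-head sketch, not the paper's exact implementation), a Co-TRM step computes standard scaled dot-product attention, but takes its queries from one modality and its keys and values from the other, so each stream conditions on the other stream's features:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def co_attention(queries, keys_values, d_k):
    """Single-head co-attention: queries come from one modality,
    keys and values from the other, so each stream attends to
    the other stream's features (the Co-TRM idea, simplified)."""
    scores = queries @ keys_values.T / np.sqrt(d_k)
    return softmax(scores) @ keys_values

# Toy features: 4 text tokens and 3 image regions, hidden size 8.
rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))
regions = rng.normal(size=(3, 8))

text_attended = co_attention(text, regions, d_k=8)  # text attends to image
img_attended = co_attention(regions, text, d_k=8)   # image attends to text
print(text_attended.shape, img_attended.shape)      # (4, 8) (3, 8)
```

Each output keeps the shape of its query stream, which is what lets the two transformer stacks continue processing after exchanging information.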

An architecture built by making only minimal changes to BERT would suffer from several drawbacks. It would treat inputs from both modalities identically, ignoring that they may need different levels of processing due to either their inherent complexity or the initial level of abstraction of their input representations. Moreover, forcing the pre-trained weights to accommodate a large set of additional visual 'tokens' could damage the learned BERT language model. For these reasons, ViLBERT instead uses a two-stream architecture in which each modality is modelled separately, and the streams are fused through a small set of attention-based interactions.

Fig: Architecture of the ViLBERT model.

A Faster R-CNN detector with a ResNet backbone, pre-trained on the Visual Genome dataset, is used to extract region features. In the image above, the green stream denotes visual processing and the purple stream denotes linguistic processing; the two streams run in parallel and interact through the novel co-attentional transformer layers.
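To make the data flow into the visual stream concrete, here is a sketch with a stand-in for the detector. A real Faster R-CNN + ResNet backbone returns bounding boxes and one pooled feature vector per region; the region count, feature dimension and projection size below are common settings assumed for illustration, not exact figures from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

def extract_region_features(image, num_regions=36, feat_dim=2048):
    """Stand-in for a Faster R-CNN + ResNet backbone. A real detector
    would return bounding boxes and a pooled feature vector per
    detected region; here both are faked with random values so the
    downstream data flow is visible."""
    boxes = rng.uniform(0, 1, size=(num_regions, 4))     # (x1, y1, x2, y2), normalised
    features = rng.normal(size=(num_regions, feat_dim))  # pooled backbone features
    return boxes, features

# Project detector features into the visual stream's hidden size
# before they enter the transformer blocks.
hidden = 1024
W_proj = rng.normal(size=(2048, hidden)) * 0.01
boxes, feats = extract_region_features(image=None)
visual_tokens = feats @ W_proj
print(visual_tokens.shape)  # (36, 1024)
```

The projected region vectors then play the same role in the visual stream that WordPiece embeddings play in the linguistic stream.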

Dataset Used

To train the ViLBERT model, the Conceptual Captions dataset has been used. It is a collection of approximately 3.3 million image-caption pairs with weakly associated descriptions, automatically scraped from alt-text-enabled images on the web.

Why Use This Model

BERT is one of the most popular self-supervised pre-trained models for natural language processing and has set benchmarks across state-of-the-art NLP tasks. It uses WordPiece embeddings with a 30,000-token vocabulary for unsupervised pre-training of natural language understanding systems. ViLBERT can be seen as an extension of BERT that learns task-agnostic visual grounding from paired visio-linguistic data.
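For readers unfamiliar with WordPiece, the scheme splits an unknown word into the longest vocabulary pieces it can find, marking word-internal pieces with a `##` prefix. A toy greedy longest-match sketch (with a tiny made-up vocabulary, not BERT's real 30,000 tokens):

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first WordPiece tokenisation, as used by
    BERT. Word-internal subwords carry a '##' prefix in the vocab."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub          # continuation pieces are prefixed
            if sub in vocab:
                piece = sub               # longest match found
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]              # no piece covers this span
        tokens.append(piece)
        start = end
    return tokens

vocab = {"play", "##ing", "##ed", "ground", "##ground"}
print(wordpiece_tokenize("playing", vocab))     # ['play', '##ing']
print(wordpiece_tokenize("playground", vocab))  # ['play', '##ground']
```

This subword splitting is what lets a fixed 30,000-token vocabulary cover an open-ended set of words.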

Outlook

NLP, too, is evolving quickly and becoming more robust. Combining computer vision with natural language processing has set a benchmark among emerging technologies, which is why organisations are incorporating this approach into new intelligent models.

Read the paper here.


Ambika Choudhury

A Technical Journalist who loves writing about Machine Learning and Artificial Intelligence. A lover of music, writing and learning something out of the box.