
Now A Model That Learns Visual Concepts, Words And Semantic Parsing Of Sentences Without Explicit Supervision


Building a computer system that can answer questions simply by looking at images has been a goal of artificial intelligence research for many years. Researchers continue to work on this problem to make current systems more capable, and fusing visual perception with natural language processing promises models that need less training data and can reason about the objects they see rather than merely recognise them.

In May 2019, researchers from IBM, DeepMind and MIT introduced a deep learning model known as the Neuro-Symbolic Concept Learner (NS-CL), which jointly learns vision and natural language. The model learns visual concepts, words, and the semantic parsing of sentences without explicit supervision; instead, it learns from natural supervision in the form of images paired with questions and answers.

How It Works

The Neuro-Symbolic Concept Learner uses artificial neural networks to extract features from images and represent the scene as symbolic, object-based information. A quasi-symbolic program executor then runs over this scene representation to infer the answer to a question.
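To make the idea concrete, here is a minimal, hypothetical sketch of how a program executor might answer a question by running filter and count operations over an object-based scene representation. All names and the hard-coded attributes are illustrative; NS-CL itself operates on learned, continuous concept embeddings with differentiable ("quasi-symbolic") versions of these operations.

```python
# Hypothetical sketch of quasi-symbolic program execution over a scene
# representation. Attributes are hard-coded symbols purely for illustration;
# NS-CL uses soft, differentiable ops on learned concept embeddings.

scene = [
    {"shape": "cube",     "color": "red",  "size": "large"},
    {"shape": "sphere",   "color": "blue", "size": "small"},
    {"shape": "cylinder", "color": "red",  "size": "small"},
]

def filter_objects(objects, attribute, value):
    """Keep objects whose attribute matches the queried concept."""
    return [obj for obj in objects if obj[attribute] == value]

def count(objects):
    """Count the objects remaining after filtering."""
    return len(objects)

# Question: "How many red objects are there?"
# Parsed program: count(filter(scene, color=red))
program = [("filter", "color", "red"), ("count",)]

result = scene
for op, *args in program:
    if op == "filter":
        result = filter_objects(result, *args)
    elif op == "count":
        result = count(result)

print(result)  # -> 2
```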

For visual perception, the researchers use a pre-trained Mask R-CNN to generate object proposals for every object in the scene, and a ResNet to extract region-based and image-based features. A semantic parsing module then translates natural language questions into executable programs of the kind designed for visual question answering (VQA).
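As a rough illustration of the perception step, the sketch below uses torchvision's off-the-shelf Mask R-CNN and ResNet. The specific backbone, score threshold and RoIAlign pooling here are assumptions for illustration, not the paper's exact configuration.

```python
# Illustrative perception pipeline (assumed configuration, not the paper's):
# Mask R-CNN proposes object boxes, a ResNet backbone provides a feature map,
# and RoIAlign pools a region-based feature for each proposed object.
import torch
import torchvision
from torchvision.ops import roi_align

detector = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT").eval()
backbone = torchvision.models.resnet34(weights="DEFAULT")
backbone = torch.nn.Sequential(*list(backbone.children())[:-2]).eval()  # conv feature map only

image = torch.rand(3, 320, 480)  # placeholder image tensor in [0, 1]

with torch.no_grad():
    proposals = detector([image])[0]                     # boxes, labels, scores, masks
    boxes = proposals["boxes"][proposals["scores"] > 0.5]

    feature_map = backbone(image.unsqueeze(0))           # image-based features
    # Pool a fixed-size, region-based feature for every detected object.
    spatial_scale = feature_map.shape[-1] / image.shape[-1]
    region_feats = roi_align(feature_map, [boxes], output_size=(7, 7),
                             spatial_scale=spatial_scale)

print(boxes.shape, region_feats.shape)  # one pooled feature per object proposal
```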

Behind the Model

The researchers tested the model on CLEVR, a diagnostic dataset for visual question answering (VQA). The model is trained with curriculum learning: it starts by learning representations of individual objects from short questions about simple scenes, which lets it pick up object-based concepts such as colours and shapes. It then learns relational concepts by leveraging these object-based concepts to interpret references to objects.
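The sketch below shows one way such a curriculum could be organised; the stage thresholds and the `num_objects`/`program_length` metadata fields are assumptions for illustration, not the paper's exact schedule.

```python
# Illustrative curriculum schedule (assumed thresholds, not the paper's):
# train first on simple scenes with short programs, then progressively
# admit harder examples.
def build_curriculum(examples):
    """examples: dicts carrying 'num_objects' and 'program_length' metadata."""
    stages = [
        {"max_objects": 3,  "max_program_len": 4},   # object-level concepts
        {"max_objects": 6,  "max_program_len": 8},   # relational concepts
        {"max_objects": 10, "max_program_len": 16},  # full questions
    ]
    for stage in stages:
        yield [ex for ex in examples
               if ex["num_objects"] <= stage["max_objects"]
               and ex["program_length"] <= stage["max_program_len"]]

# for stage_data in build_curriculum(training_examples):
#     train_one_stage(model, stage_data)   # hypothetical training loop
```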

Moreover, the model naturally learns disentangled visual and language concepts, enabling combinatorial generalisation with respect to both visual scenes and semantic programs. The researchers demonstrate four forms of generalisation: first, the model generalises to scenes with more objects and longer semantic programs than those in the training set; secondly, it generalises to new compositions of visual attributes; thirdly, it adapts quickly to novel visual concepts; and finally, the learned visual concepts transfer to new tasks such as image-caption retrieval.

NS-CL consists of the three modules listed below; a minimal sketch of how they fit together follows the list.

  • Neural-Based Perception Module: extracts object-level representations from the scene.
  • Visually-Grounded Semantic Parser: translates questions into executable programs.
  • Symbolic Program Executor: reads out the perceptual representation of objects, classifies their attributes or relations, and executes the program to obtain an answer.
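As a hedged sketch of how these three modules could compose end to end, the class and method names below are hypothetical placeholders rather than the released NS-CL code:

```python
# Hypothetical end-to-end composition of the three NS-CL modules; all names
# are illustrative placeholders, not the authors' released API.
class NeuroSymbolicConceptLearner:
    def __init__(self, perception, parser, executor):
        self.perception = perception   # neural-based perception module
        self.parser = parser           # visually-grounded semantic parser
        self.executor = executor       # symbolic program executor

    def answer(self, image, question):
        objects = self.perception(image)        # object-level representations
        program = self.parser(question)         # question -> executable program
        return self.executor(program, objects)  # run the program on the scene

# answer = model.answer(image, "What is the color of the large cube?")
```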

Advantages of NS-CL

  • The model learns visual concepts with high accuracy; the researchers report a classification accuracy of nearly 99% for all object properties.
  • This model allows data-efficient visual reasoning on the CLEVR dataset.
  • NS-CL generalises well to new attributes, new visual compositions and new domain-specific languages.
  • This model can be directly applied to visual question answering (VQA).

Outlook

Training deep learning models usually requires large amounts of data, and collecting and working with such data is a complex task. Although organisations gather data every day, they are typically able to use only a small fraction of it, and this data inefficiency is one of the main motivations behind the model. For future work, the researchers would like to extend the framework to other domains such as video understanding and robotic manipulation.

You can read the full paper here.
