
Using A Unique Neural Network Framework For Visual Question Answering


Over the last few years, niche areas of artificial intelligence such as computer vision (CV) and natural language processing (NLP) have seen tremendous growth, largely because the quality of research in these fields has improved greatly. Although AI research draws insights from many disciplines, CV and NLP have had relatively few methods that reason about images and text together (as in image captioning). Apart from this, AI implementations need a standard metric for monitoring progress, which remains a tough challenge.

In this article, we will explore Visual Question Answering (VQA), in which a system produces answers to questions about images, and how it can be improved with a unique neural network framework known as the End-to-End Module Network.

Information From Images

In a VQA system, images and natural language questions are provided as input, and the system gives a natural language answer as output. A large amount of data has to be run through the system before it can answer reliably. The questions in a VQA system usually draw on a variety of capabilities, such as detecting objects and identifying activities using common sense and background knowledge.

Deriving questions from pictures (Image courtesy: VQA, research paper by Stanislaw Antol et al.)
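At a high level, a VQA model fuses an image representation with a question representation and predicts an answer over a fixed vocabulary. The snippet below is a minimal, illustrative sketch of that generic interface in PyTorch (a small CNN stand-in, an LSTM question encoder, and element-wise fusion); it is not the architecture proposed in the paper, and all names and dimensions are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SimpleVQA(nn.Module):
    """Generic VQA interface: image + question tokens -> answer scores."""
    def __init__(self, vocab_size, num_answers, embed_dim=300, hidden_dim=512):
        super().__init__()
        # Stand-in image encoder (a pretrained CNN would normally be used here).
        self.img_encoder = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, hidden_dim))
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.q_encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_answers)

    def forward(self, image, question_tokens):
        img_feat = self.img_encoder(image)                      # (B, hidden_dim)
        _, (h, _) = self.q_encoder(self.embed(question_tokens))
        q_feat = h[-1]                                          # (B, hidden_dim)
        fused = img_feat * q_feat                               # element-wise fusion
        return self.classifier(fused)                           # answer scores
```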

Lately, research on improving VQA has been garnering a lot of attention. From mining information in large datasets to the use of recurrent neural networks (RNNs) and convolutional neural networks (CNNs), VQA has seen many refinements. Now, researchers at the University of California, Berkeley, in collaboration with Facebook and Boston University, have proposed a novel neural network framework called the End-to-End Module Network, which is intended to improve VQA.

End-to-End Module Networks

According to the researchers, these networks aim to solve VQA tasks with a class of models that predict a modular network architecture from the question text and then apply that network to the image under consideration. In addition, they use a parser on the question text to help construct candidate neural network layouts.

Their neural network model has two components. The first is a set of modules called 'co-attentive neural modules', parameterised functions that each solve a sub-task. The second is a layout policy that predicts a question-specific network layout, from which a network is assembled to answer the question posed to the VQA system.

The co-attentive neural modules are assembled into a neural network. Each module takes tensors as input, drawing on features from the image and the text, and produces a tensor as output. In the study, every input tensor is an image attention map over a convolutional feature grid, and the output tensor is either another attention map or a probability distribution over the candidate answers. In total, nine such modules are used to extract and combine text and image features.
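To make the module abstraction concrete, here is a hedged sketch of one attention-producing module in the spirit of the 'find'-style modules described above: it takes convolutional image features and a text vector and returns a normalised attention map over the feature grid. The class name, dimensions, and fusion scheme are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class FindModule(nn.Module):
    """Illustrative attention-producing module: (image features, text vector) -> attention map."""
    def __init__(self, img_dim=512, txt_dim=512, map_dim=256):
        super().__init__()
        self.img_proj = nn.Conv2d(img_dim, map_dim, kernel_size=1)
        self.txt_proj = nn.Linear(txt_dim, map_dim)
        self.score = nn.Conv2d(map_dim, 1, kernel_size=1)

    def forward(self, img_feat, txt_vec):
        # img_feat: (B, img_dim, H, W), txt_vec: (B, txt_dim)
        joint = self.img_proj(img_feat) * self.txt_proj(txt_vec)[:, :, None, None]
        logits = self.score(torch.relu(joint))              # (B, 1, H, W)
        B, _, H, W = logits.shape
        att = torch.softmax(logits.view(B, -1), dim=1)      # normalise over the grid
        return att.view(B, 1, H, W)                         # attention-map tensor
```

Answer-producing modules work analogously, except that the attention-weighted image features are mapped to a score over the candidate answers instead of another attention map.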

To provide the most appropriate reasoning for each question, the layout is determined by a layout policy implemented as a sequence-to-sequence recurrent neural network. The policy outputs a probability distribution over layouts, from which a layout is drawn. Finally, a neural network is assembled by composing the neural modules according to that layout.
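As a rough illustration of how a predicted layout might be turned into a network, the sketch below assumes the layout is a postfix (reverse-Polish) sequence of module names and executes it with a stack, so that each module consumes the outputs of its children and the final, answer-producing module returns answer scores. The helper name, the num_attention_inputs attribute, and the module call signature are hypothetical placeholders, not the paper's API.

```python
def execute_layout(layout, modules, img_feat, txt_vecs):
    """Assemble and run a network from a postfix layout of module names.

    layout:   list of module names in postfix order (hypothetical format)
    modules:  dict mapping each name to a module with a num_attention_inputs attribute
    img_feat: convolutional image features shared by all modules
    txt_vecs: one text vector per layout step
    """
    stack = []
    for step, name in enumerate(layout):
        module = modules[name]
        n_inputs = module.num_attention_inputs            # 0, 1, or 2 in this sketch
        inputs = [stack.pop() for _ in range(n_inputs)]   # children's attention maps
        stack.append(module(img_feat, txt_vecs[step], *inputs))
    return stack.pop()   # answer scores from the final module
```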

However, the assembled network has to be trained to produce meaningful output, so a training step follows this assembly. Training also has to account for the layout policy, which is done by minimising a loss function over sampled layouts. The loss function, as described by the researchers, is given below:

“Let θ be all the parameters in our model. Suppose we obtain a layout l sampled from p(l|q; θ) and receive a final question answering loss L˜(θ, l; q, I) on question q and image I after predicting an answer using the network assembled with l. Our training loss function L(θ) is as follows.

L(θ) = E_{l∼p(l|q;θ)} [L̃(θ, l; q, I)]

where we use the softmax loss over the output answer scores as L˜(θ, l; q, I) in our implementation.”

Because the layout is a discrete choice, the gradient of this loss has to be estimated from sampled layouts, and a variable called a baseline is subtracted from the loss in that estimate, which significantly reduces its variance. The researchers point out that optimising this loss is a challenge and requires the layout policy and module parameters to be learned continually within the VQA model.
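The sketch below illustrates the general idea, assuming a REINFORCE-style update: the differentiable softmax (cross-entropy) loss trains the modules, while the layout policy receives a gradient weighted by the loss minus a baseline, which lowers the variance of the estimate. Function and variable names are illustrative and not taken from the paper's code.

```python
import torch
import torch.nn.functional as F

def policy_gradient_loss(answer_scores, answer_labels, layout_log_probs, baseline):
    # Differentiable softmax loss on the predicted answers; its gradient
    # flows into the assembled modules.
    answer_loss = F.cross_entropy(answer_scores, answer_labels)
    # REINFORCE-style term for the discrete layout choice:
    # (loss - baseline) * log p(layout | question), with the loss detached
    # so that only the layout policy parameters receive this gradient.
    reinforce_term = (answer_loss.detach() - baseline) * layout_log_probs.mean()
    return answer_loss + reinforce_term

# A simple running-average baseline could be updated after each batch, e.g.
# baseline = 0.9 * baseline + 0.1 * answer_loss.item()
```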

The fully built model was tested on three datasets: first on a small dataset known as SHAPES, and then on the larger CLEVR and VQA datasets. The model's performance on all of these was found to be very satisfactory, with close to 90 percent of the questions reasoned about accurately.

Conclusion

Although the model has achieved considerable success, it has yet to emerge as a standard way to address visual and textual data simultaneously in AI systems. Nonetheless, with advancements like these, VQA systems will soon have neural networks like these powering them towards being fully AI-capable.

PS: The story was written using a keyboard.