Last updated May 28, 2018

Using A Unique Neural Network Framework For Visual Question Answering

Published on May 28, 2018

by Abhishek Sharma

Over the last few years, areas of artificial intelligence such as computer vision (CV) and natural language processing (NLP) have seen tremendous growth, driven largely by improvements in the quality of research in these fields. Although AI research draws on insights from various disciplines, there have been few methods that handle images and text together, as in tasks such as image captioning. AI also needs a standard metric for monitoring progress, which remains a tough challenge.

In this article, we explore Visual Question Answering (VQA), a task in which a system answers natural language questions about images, and how it is improved by a neural network framework known as the End-to-End Module Network.

Information From Images

In a VQA system, an image and a natural language question about it are provided as input, and the system gives a natural language answer as output. Training and testing such a system requires a large amount of data. Answering the questions draws on a variety of capabilities, such as detecting objects, identifying activities, and applying common sense and background knowledge.

*Deriving questions from pictures (Image courtesy: VQA, research paper by Stanislaw Antol et al.)*

Lately, research on improving VQA has been garnering a lot of attention. From gathering information from large datasets to the use of recurrent neural networks (RNNs) and convolutional neural networks (CNNs), VQA has gone through many modifications. Now, researchers at the University of California, Berkeley, in collaboration with Facebook and Boston University, have proposed a novel neural network framework called the End-to-End Module Network, which is intended to improve VQA.

End-to-End Module Networks

According to the researchers, End-to-End Module Networks are a class of models that predict a modular network architecture directly from the text of the question and then apply it to the image. Earlier module networks relied on an off-the-shelf parser to turn the question into a network layout. Here the layout is learned, and expert layouts, where available, serve only as demonstrations during training.

Their model has two components. The first is a set of ‘co-attentive neural modules’, parameterised functions that each solve a sub-task. The second is a layout policy, which predicts a question-specific network layout so that the assembled network can answer the question posed to the VQA system.

The ‘co-attentive neural modules’ are the building blocks from which a neural network is assembled. Each module takes tensors as input, draws on features from the image and the question text, and produces a single tensor as output. In the study, every input tensor is an image attention map over a convolutional feature grid, and the output tensor is either another attention map or a probability distribution over the possible answers. In total, nine modules are studied.
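To make the idea of modules that pass attention maps to one another more concrete, here is a minimal sketch in plain Python. It is not the researchers' implementation: the toy feature grid, the feature meanings, the module names `find`, `and_` and `describe`, the element-wise minimum used for combining maps and the answer weights are all assumptions chosen for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy stand-in for a convolutional feature grid: 4 cells, 2 features each.
# The feature meanings ("red", "round") are made up for this example.
IMAGE = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]


def find(image, text_vec):
    """Text + image features -> attention map over the grid cells."""
    scores = [sum(f * t for f, t in zip(cell, text_vec)) for cell in image]
    return softmax(scores)


def and_(att_a, att_b):
    """Two attention maps -> one attention map (element-wise minimum, an assumption)."""
    return [min(a, b) for a, b in zip(att_a, att_b)]


def describe(image, att, answer_weights):
    """Attention map -> probability distribution over candidate answers."""
    dim = len(image[0])
    # Attention-weighted sum of the cell features.
    attended = [sum(a * cell[d] for a, cell in zip(att, image)) for d in range(dim)]
    answers = list(answer_weights)
    scores = [sum(w * x for w, x in zip(answer_weights[ans], attended))
              for ans in answers]
    return dict(zip(answers, softmax(scores)))


# "Is there something red and round?" as a hand-wired chain of modules.
red = find(IMAGE, [1.0, 0.0])
round_ = find(IMAGE, [0.0, 1.0])
both = and_(red, round_)
answer = describe(IMAGE, both, {"yes": [1.0, 1.0], "no": [-1.0, -1.0]})
```

The point of the sketch is the typing of the pieces: `find` and `and_` return attention maps that can be fed to further modules, while `describe` ends the chain with a distribution over answers.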

To choose the most appropriate reasoning structure for each question, the layout policy is implemented as a sequence-to-sequence recurrent neural network. The policy outputs a probability distribution over layouts, from which a layout is selected. Finally, a neural network is built by assembling the neural modules according to that layout.
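One way to picture how a sequence model can emit a network layout is to write the layout as a flat sequence of module tokens and rebuild the tree from it. The sketch below assumes a postfix (Reverse Polish) ordering and a small hypothetical module inventory, and scores a layout as the product of per-step token probabilities. The names `ARITY`, `assemble` and `layout_log_prob` are invented for this example.

```python
import math

# Hypothetical module inventory: name -> number of attention-map inputs.
ARITY = {"find": 0, "relocate": 1, "and": 2, "describe": 1}


def assemble(tokens):
    """Turn a postfix token sequence into a nested (module, children) tree."""
    stack = []
    for tok in tokens:
        n = ARITY[tok]
        if len(stack) < n:
            raise ValueError(f"not enough inputs for {tok!r}")
        # Pop the n most recent sub-layouts, restoring left-to-right order.
        children = [stack.pop() for _ in range(n)][::-1]
        stack.append((tok, children))
    if len(stack) != 1:
        raise ValueError("layout must reduce to a single root module")
    return stack[0]


def layout_log_prob(step_probs, tokens):
    """log p(layout | question) as the sum of per-step token log-probabilities."""
    return sum(math.log(probs[tok]) for probs, tok in zip(step_probs, tokens))


tokens = ["find", "find", "and", "describe"]
tree = assemble(tokens)

# Made-up per-step distributions, standing in for the decoder's softmax outputs.
step_probs = [
    {"find": 0.9, "relocate": 0.1},
    {"find": 0.8, "and": 0.2},
    {"and": 0.5, "describe": 0.5},
    {"describe": 1.0},
]
log_p = layout_log_prob(step_probs, tokens)
```

In a real system the per-step distributions would come from the recurrent decoder conditioned on the question. Here they are hard-coded so that the example stays self-contained.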

However, the networks must be trained before they give meaningful output, so a training step is included in the process. Training minimises a loss function that takes the layout policy into account. The loss function, as described by the researchers, is given below:

“Let $\theta$ be all the parameters in our model. Suppose we obtain a layout $l$ sampled from $p(l \mid q; \theta)$ and receive a final question answering loss $\tilde{L}(\theta, l; q, I)$ on question $q$ and image $I$ after predicting an answer using the network assembled with $l$. Our training loss function $L(\theta)$ is as follows.

$$L(\theta) = \mathbb{E}_{l \sim p(l \mid q; \theta)}\left[\tilde{L}(\theta, l; q, I)\right]$$

where we use the softmax loss over the output answer scores as $\tilde{L}(\theta, l; q, I)$ in our implementation.”
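As a sketch of why this loss needs special treatment, note that the layout $l$ is a discrete sample, so the expectation cannot simply be differentiated through the sampling step. Writing the per-layout loss as $\tilde{L}$ and applying the standard score-function (policy gradient) identity, which is a general technique and assumed here rather than quoted from the researchers, gives:

$$\nabla_\theta L(\theta) = \sum_{l} \nabla_\theta \left[ p(l \mid q; \theta)\, \tilde{L}(\theta, l; q, I) \right] = \mathbb{E}_{l \sim p(l \mid q; \theta)}\left[ \tilde{L}(\theta, l; q, I)\, \nabla_\theta \log p(l \mid q; \theta) + \nabla_\theta \tilde{L}(\theta, l; q, I) \right]$$

The second step uses $\nabla_\theta p = p\, \nabla_\theta \log p$. In practice the expectation would be approximated by averaging over a handful of sampled layouts.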

Because a layout is a discrete choice, this loss is not fully differentiable, and its gradient has to be estimated from sampled layouts. To reduce the variance of that estimate, the researchers subtract a baseline from the loss inside the gradient term. They point out that optimising this loss is a challenge, since the layout policy and the module parameters have to be learned together.
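The effect of a baseline can be seen in a toy setting. The sketch below assumes a policy that chooses between just two layouts through a single logit `theta`, fixed made-up losses for each layout, and a baseline kept as an exponential moving average of past losses. None of these values come from the study. It compares the sampled gradient terms with and without the baseline.

```python
import math
import random

# Hypothetical answer losses for two candidate layouts.
LOSSES = {0: 2.0, 1: 1.0}


def prob_layout_1(theta):
    """Probability of picking layout 1 under a one-logit policy."""
    return 1.0 / (1.0 + math.exp(-theta))


def grad_log_prob(layout, theta):
    """d/d(theta) of log p(layout) for the one-logit policy."""
    p1 = prob_layout_1(theta)
    return (1.0 - p1) if layout == 1 else -p1


def sample_layouts(theta, n, seed=0):
    """Draw n layouts from the policy with a fixed seed for repeatability."""
    rng = random.Random(seed)
    p1 = prob_layout_1(theta)
    return [1 if rng.random() < p1 else 0 for _ in range(n)]


def gradient_samples(layouts, theta, decay=None):
    """Per-sample gradient terms (loss - baseline) * grad log p.

    With decay=None no baseline is used. Otherwise the baseline is an
    exponential moving average of previously seen losses.
    """
    baseline = 0.0
    grads = []
    for layout in layouts:
        loss = LOSSES[layout]
        b = baseline if decay is not None else 0.0
        grads.append((loss - b) * grad_log_prob(layout, theta))
        if decay is not None:
            # Update only after use, so the baseline never sees the current sample.
            baseline = decay * baseline + (1.0 - decay) * loss
    return grads


def mean(xs):
    return sum(xs) / len(xs)


def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


layouts = sample_layouts(theta=0.0, n=2000)
plain = gradient_samples(layouts, theta=0.0)
with_baseline = gradient_samples(layouts, theta=0.0, decay=0.9)
```

Both lists estimate the same gradient on average, because the baseline does not depend on the layout being scored, but the terms computed with the baseline are spread far less widely around that average.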

The fully built model was tested on three datasets: first on a small dataset known as SHAPES, and then on the larger CLEVR and VQA datasets. Its performance on all of these was found to be very satisfactory (close to 90 percent of the questions answered correctly).

Conclusion

Although the model has achieved considerable success, it has yet to emerge as a standard way of handling visual and text data together in AI systems. Nonetheless, with advancements such as this one, VQA systems may soon be powered by neural networks of this kind.

PS: The story was written using a keyboard.

Abhishek Sharma

I research and cover latest happenings in data science. My fervent interests are in latest technology and humor/comedy (an odd combination!). When I'm not busy reading on these subjects, you'll find me watching movies or playing badminton.