Researchers at the University of California, Berkeley and Boston University have dug deep into how deep neural networks generate ‘hallucinations’ while trying to create captions for images. By hallucination we mean that a captioning model describes objects that are not actually present in the image. The researchers are now working on preventing this kind of hallucination, which would pave the way to building better and more robust artificial intelligence systems.
Image captioning methods have improved greatly in performance. The main drawback of the standard evaluation techniques is that they only measure similarity with the reference captions and offer no mechanism for bringing in further context about the image. That is why the researchers have proposed a new image relevance metric that evaluates the models against “veridical visual labels” and measures the rate of object hallucination. The researchers use the MSCOCO dataset for this purpose and benchmark several models on it. They report a number of interesting insights into the workings of these models.
Image Captioning And Hallucinations
There has been major research work in the field of image captioning. Neural Baby Talk (NBT), one such captioning model, incorrectly generates the object “bench” for many images. This issue is referred to as object hallucination. One of the important applications of image captioning is helping visually impaired and blind people understand the world and the images around them. Several studies have found that visually impaired people value the correctness of a caption over its coverage of the image. Object hallucination is therefore a serious concern and has the potential to cause harm to visually impaired people.
Hallucination also points to a deeper problem: models that hallucinate tend to build incorrect internal representations of the image. The researchers set out to answer three questions:
- Which models are more prone to hallucination?
- What are the likely causes of hallucination?
- How well do the standard metrics capture hallucination?
To answer these questions, the researchers analyse many captioning models spanning a range of neural architectures. They propose a new metric, CHAIR (Caption Hallucination Assessment with Image Relevance), which measures the image relevance of the generated captions. Knowing that there might be many reasons for hallucinations, the researchers also propose image and language model consistency scores. These scores probe how far the errors can be traced to the language model rather than to the image. The researchers also underline that many of the metrics the field depends upon do not capture the hallucination phenomenon.
Caption Hallucination Assessment
As mentioned above, the researchers created a new metric known as CHAIR (Caption Hallucination Assessment with Image Relevance). It calculates what proportion of the object words produced by a model are actually in the image, according to the ground truth sentences and object segmentations. There are two flavours of the metric: per-instance (what fraction of object instances are hallucinated) and per-sentence (what fraction of sentences include a hallucinated object).
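Written out, the two flavours could look as follows. This is a sketch based only on the descriptions above; the subscripts i and s are used here simply as labels for the per-instance and per-sentence variants.

```latex
\mathrm{CHAIR}_i = \frac{|\{\text{hallucinated objects}\}|}{|\{\text{all objects mentioned}\}|},
\qquad
\mathrm{CHAIR}_s = \frac{|\{\text{sentences with a hallucinated object}\}|}{|\{\text{all sentences}\}|}
```

Both values lie between 0 and 1, and lower is better: a model that never mentions an object missing from the image scores 0 on both.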
The researchers work with the 80 MSCOCO object categories. They tokenise each generated sentence and singularise each word so that it can be matched against these objects. The researchers said in the paper, “For each ground truth sentence, we determine a list of MSCOCO objects in the same way. The MSCOCO segmentation annotations are used by simply relying on the provided object labels.” The researchers found that the source of annotation is very important.
They also used the sentence annotations as ground truth, which let them take the human biases in the annotations into account. They found that using only the segmentation labels or only the reference captions leads to higher measured hallucination. The researchers also introduce a notion of image and language consistency. Image consistency describes how “consistent errors from captioning model are with a model which predicts objects based on an image alone.” Language consistency describes “how consistent errors from captioning model are with a model which predicts words based only on previously generated words”.
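The matching procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the researchers' implementation: the object list is a hypothetical five-item subset of the MSCOCO categories, the singularisation rule is deliberately naive, synonyms are ignored, and each distinct object is counted once per caption. The function names are made up for this example.

```python
import re

# Hypothetical subset of the 80 MSCOCO object categories (illustration only).
MSCOCO_OBJECTS = {"bench", "dog", "person", "frisbee", "car"}


def singularise(word):
    """Naive singularisation (an assumption): drop 'es' or 's' if that yields a known object."""
    for suffix in ("es", "s"):
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and stem in MSCOCO_OBJECTS:
            return stem
    return word


def mentioned_objects(sentence):
    """Tokenise a sentence and return the set of known objects it mentions."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return {singularise(t) for t in tokens} & MSCOCO_OBJECTS


def chair(captions, objects_in_images):
    """Return (per-instance, per-sentence) hallucination rates.

    captions: one generated caption per image.
    objects_in_images: one set of ground-truth objects per image.
    """
    mentioned_total = hallucinated_total = hallucinated_sentences = 0
    for caption, truth in zip(captions, objects_in_images):
        mentioned = mentioned_objects(caption)
        hallucinated = mentioned - truth  # mentioned but not in the image
        mentioned_total += len(mentioned)
        hallucinated_total += len(hallucinated)
        hallucinated_sentences += bool(hallucinated)
    per_instance = hallucinated_total / mentioned_total if mentioned_total else 0.0
    per_sentence = hallucinated_sentences / len(captions) if captions else 0.0
    return per_instance, per_sentence


# Toy data: the first caption hallucinates a bench.
captions = ["A dog sits on a bench.", "A person throws a frisbee to two dogs."]
objects_in_images = [{"dog"}, {"person", "frisbee", "dog"}]
chair_i, chair_s = chair(captions, objects_in_images)
print(chair_i, chair_s)
```

On this toy data, one of the five mentioned objects is hallucinated and one of the two captions contains a hallucination, so the script prints 0.2 and 0.5. In practice the ground-truth set for each image would be built from the segmentation labels together with the objects found in the reference captions.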
The researchers established many baseline models covering a range of architectures, both with and without attention mechanisms. The baseline models use LSTM RNNs to output text. Most of the models are trained with the standard cross-entropy (CE) loss as well as with the self-critical (SC) loss. The researchers evaluate the captioning models on two MSCOCO splits.
The table below shows the results of the baseline experiments, which were run on the Karpathy Test set. There are clear cases where performance and hallucination are not linked. For example, the NBT model does not perform as well as the top-down-BB model on standard captioning metrics but has lower hallucination. The researchers ran similar experiments on the Robust Test set. Please refer to the paper for more information.
In conclusion, the researchers identified various reasons for hallucination and found that hallucination does not always agree with the results of standard captioning metrics. One of the important results was that neural networks with attention mechanisms hallucinate less, but more of the credit goes to the convolutional features these models have access to than to the attention itself. They also found that a strong visual representation is important for reducing hallucination. The researchers suggest, “We argue that the design and training of captioning models should be guided not only by cross-entropy loss or standard sentence metrics but also by image relevance”.