Convolutional Neural Networks (CNNs) have become the leading method for image classification and segmentation. In many cases, researchers also focus the network's attention on a particular region of the image. This region, known as the Region of Interest (RoI), is supplied to the network as a binary map. Region of interest pooling, better known as RoI pooling, is widely used in CNN-based object detection.
According to DeepSense.ai, RoI pooling is used for tasks such as detecting multiple cars and pedestrians in a single image. Its purpose is to perform max pooling on inputs of non-uniform sizes to obtain fixed-size feature maps. Two of the major tasks in computer vision are object classification and object detection. In the first case, the system is supposed to correctly label the dominant object in an image. In the second case, it should provide correct labels and locations for all objects in an image.
Deep Learning Methods Based On CNN
A researcher who is developing an algorithm for self-driving cars and wants to use a camera to detect other cars, motorists and pedestrians has to draw a box around every significant object and assign a class to it. This task is more challenging than typical classification tasks on MNIST or CIFAR. For example, each video frame may contain multiple objects, some of which are not clearly visible. Speed has also been cited as an issue for these algorithms: in autonomous driving in particular, tens of frames have to be processed per second.
RoI pooling is a neural network layer used for object detection tasks. It was first proposed by Ross Girshick in April 2015 and speeds up both training and testing while maintaining high detection accuracy.
In This Step, The Layer Takes Two Inputs:
a) A fixed-size feature map generated by a deep CNN with several convolutional and max-pooling layers.
b) An N x 5 matrix representing a list of regions of interest, where N is the number of RoIs. The first column holds the image index, and the remaining four columns hold the coordinates of the top-left and bottom-right corners of the region.
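To make the two inputs concrete, here is a minimal sketch of RoI max pooling in plain Python. It is an illustration under stated assumptions, not the layer's reference implementation: the feature map has a single channel, the RoI corners are inclusive and already expressed in feature-map coordinates, and the function name roi_max_pool and the 2 x 2 output size are chosen here for the example.

```python
def ceil_div(a, b):
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)


def roi_max_pool(feature_maps, rois, out_h=2, out_w=2):
    """Max-pool each RoI into a fixed out_h x out_w grid.

    feature_maps: list of 2D lists, one single-channel map per image (assumption).
    rois: rows of (image_index, x1, y1, x2, y2) with inclusive corners.
    """
    pooled = []
    for img_idx, x1, y1, x2, y2 in rois:
        fmap = feature_maps[img_idx]
        roi_h = y2 - y1 + 1
        roi_w = x2 - x1 + 1
        grid = []
        for i in range(out_h):
            # Row range of bin i: floor at the start, ceiling at the end,
            # so a bin is never empty even when the RoI does not divide evenly.
            r0 = y1 + (i * roi_h) // out_h
            r1 = y1 + ceil_div((i + 1) * roi_h, out_h)
            row = []
            for j in range(out_w):
                c0 = x1 + (j * roi_w) // out_w
                c1 = x1 + ceil_div((j + 1) * roi_w, out_w)
                row.append(max(fmap[r][c]
                               for r in range(r0, r1)
                               for c in range(c0, c1)))
            grid.append(row)
        pooled.append(grid)
    return pooled


# One 4 x 6 feature map and two RoIs of different sizes (an N x 5 list, N = 2).
feature_maps = [[
    [1, 2, 3, 4, 5, 6],
    [7, 8, 9, 10, 11, 12],
    [13, 14, 15, 16, 17, 18],
    [19, 20, 21, 22, 23, 24],
]]
rois = [
    (0, 1, 0, 5, 2),  # 5 wide, 3 high
    (0, 0, 0, 1, 3),  # 2 wide, 4 high
]
pooled = roi_max_pool(feature_maps, rois)
```

Both regions come out as 2 x 2 grids even though their input sizes differ, which is the property that lets later fixed-size layers consume them.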
Here’s How It Works:
- When a CNN is given an image together with its RoI map as input, it searches for different kinds of features in the region.
- The CNN then extracts features from the image by convolving it with a set of different filters, creating one feature map per filter.
- Researchers can therefore apply a different set of filters to the background and to the RoI regions.
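One possible reading of the steps above can be sketched in a few lines of plain Python. This is only an illustration: the helper names conv2d_same and masked_feature_map, the zero padding, and the choice to select per pixel between two filter responses using the binary RoI map are all assumptions made for the example, not details taken from a specific network.

```python
def conv2d_same(image, kernel):
    """Slide a small filter over the image (zero padding, same output size)."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    rr, cc = r + i - ph, c + j - pw
                    if 0 <= rr < h and 0 <= cc < w:
                        acc += image[rr][cc] * kernel[i][j]
            out[r][c] = acc
    return out


def masked_feature_map(image, roi_mask, roi_kernel, bg_kernel):
    """Use roi_kernel's response inside the RoI and bg_kernel's outside it."""
    roi_resp = conv2d_same(image, roi_kernel)
    bg_resp = conv2d_same(image, bg_kernel)
    return [[roi_resp[r][c] if roi_mask[r][c] else bg_resp[r][c]
             for c in range(len(image[0]))]
            for r in range(len(image))]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
roi_mask = [[0, 0, 0],   # binary RoI map: 1 marks the region of interest
            [0, 1, 1],
            [0, 1, 1]]
# Two toy filters: one doubles the pixel, the other negates it.
roi_kernel = [[0, 0, 0], [0, 2, 0], [0, 0, 0]]
bg_kernel = [[0, 0, 0], [0, -1, 0], [0, 0, 0]]
features = masked_feature_map(image, roi_mask, roi_kernel, bg_kernel)
```

With these toy filters, pixels inside the RoI are doubled and background pixels are negated, so the effect of the mask is easy to see in the result.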
A New Approach To Saliency Maps
A recent research paper discusses a new technique to overcome the reliance on saliency maps, which are widely used to highlight the region of an image that most influences the classifier's decision. The paper, from the Montréal Institute for Learning Algorithms, describes a new approach as an alternative to deep neural networks, which are more expensive to train.
The paper uses the guided backpropagation and SmoothGrad algorithms to obtain its saliency maps. To improve the visual coherence of the maps, the researchers suggest smoothing them by iterating a combination of convolutions and thresholds.
Saliency maps, also known as activation maps, are used to understand the influence of an individual pixel on the final classification. These maps can be generated by computing the gradient of the maximally responding output unit with respect to the input. Since saliency maps based on these raw gradients are visually noisy, the researchers prefer to combine a guided backpropagation algorithm with SmoothGrad, resulting in a more visually coherent map.
The researchers designed a convolution-based algorithm that iteratively computes a score (the mean over a neighbourhood of pixels) and applies a threshold to those values. If a pixel has the same value as its neighbours, it remains unchanged; otherwise, it takes the value of its neighbours. To smooth all the points, this procedure needs to be repeated several times, depending on the filter and image size.
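The gradient-based idea can be sketched on a toy model. The code below is an assumption-laden illustration rather than the paper's method: it uses a tiny hand-written ReLU network (weights w1 and w2 are made up), computes the plain gradient of the maximally responding output unit with respect to the input, and then averages that gradient over noisy copies of the input in the spirit of SmoothGrad. Guided backpropagation is not implemented here.

```python
import random


def forward(x, w1, w2):
    """Tiny network: hidden = relu(w1 @ x), logits = w2 @ hidden."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w1]
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in w2]
    return hidden, logits


def top_unit(x, w1, w2):
    """Index of the maximally responding output unit."""
    _, logits = forward(x, w1, w2)
    return max(range(len(logits)), key=logits.__getitem__)


def input_gradient(x, w1, w2, unit):
    """Gradient of logits[unit] with respect to each input value."""
    hidden, _ = forward(x, w1, w2)
    # ReLU passes gradient only through hidden units that are active.
    grad_hidden = [w2[unit][j] if hidden[j] > 0 else 0.0
                   for j in range(len(hidden))]
    return [sum(grad_hidden[j] * w1[j][i] for j in range(len(w1)))
            for i in range(len(x))]


def saliency(x, w1, w2):
    """Raw-gradient saliency: absolute gradient of the top output unit."""
    unit = top_unit(x, w1, w2)
    return [abs(g) for g in input_gradient(x, w1, w2, unit)]


def smoothgrad(x, w1, w2, n_samples=50, sigma=0.1, seed=0):
    """Average the gradient over Gaussian-perturbed copies of the input."""
    rng = random.Random(seed)
    unit = top_unit(x, w1, w2)  # keep the unit chosen on the clean input
    total = [0.0] * len(x)
    for _ in range(n_samples):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        grad = input_gradient(noisy, w1, w2, unit)
        total = [t + g for t, g in zip(total, grad)]
    return [abs(t / n_samples) for t in total]


w1 = [[1.0, 0.0, -1.0],
      [0.0, 2.0, 0.0]]
w2 = [[1.0, 0.0],
      [0.0, 1.0]]
x = [2.0, 0.5, 0.0]
raw_map = saliency(x, w1, w2)
smooth_map = smoothgrad(x, w1, w2)
```

For a real image classifier the gradient would come from an automatic differentiation framework instead of the hand-derived input_gradient above; the averaging loop stays the same.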
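A rough sketch of this kind of iterative smoothing is shown below. It is not the paper's exact algorithm: the 3 x 3 neighbourhood (which includes the pixel itself), the 0.5 threshold, the binary input map, and the function name smooth_map_iteratively are assumptions made for the example.

```python
def smooth_map_iteratively(mask, iterations=3, threshold=0.5):
    """Repeatedly replace each pixel by the thresholded mean of its neighbourhood.

    mask: 2D list of 0/1 values (a binarised saliency map, by assumption).
    """
    h, w = len(mask), len(mask[0])
    current = [row[:] for row in mask]
    for _ in range(iterations):
        nxt = [[0] * w for _ in range(h)]
        for r in range(h):
            for c in range(w):
                # Mean over the 3 x 3 window, clipped at the image border.
                window = [current[rr][cc]
                          for rr in range(max(0, r - 1), min(h, r + 2))
                          for cc in range(max(0, c - 1), min(w, c + 2))]
                score = sum(window) / len(window)
                # A pixel that agrees with its neighbours keeps its value;
                # an outlier is pulled to the neighbours' majority value.
                nxt[r][c] = 1 if score > threshold else 0
        current = nxt
    return current


# A filled region with one hole, and an empty map with one stray pixel.
with_hole = [[1] * 5 for _ in range(5)]
with_hole[2][2] = 0
stray = [[0] * 5 for _ in range(5)]
stray[2][2] = 1

filled = smooth_map_iteratively(with_hole)
cleared = smooth_map_iteratively(stray)
```

In this toy case the hole is filled and the stray pixel is removed after the first pass; larger maps or larger filters would need more iterations, as the text notes.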