Object detection systems typically hypothesize bounding boxes, resample pixels or features for each box, and then apply a high-quality classifier. This pipeline is computationally heavy and too slow for many real-time applications.
A single shot detector, unlike these detectors, does not resample pixels or features for bounding box hypotheses.
Object Detection With SSD
By eliminating bounding box proposals and the resampling stage that follows them, SSD (Single Shot Detector) substantially speeds up detection.
SSD uses small convolutional filters to predict object categories and box offsets, and applies these filters to several feature maps so that detection happens at multiple scales.
This yields high accuracy even with relatively low-resolution input images.
Specifically, the small convolutional filters are applied to the feature maps to predict category scores and box offsets for a fixed set of default bounding boxes.
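As a rough illustration of how many values these small filters produce, the sketch below counts the predictor outputs for one feature map. It assumes k default boxes per location, each with one score per class plus 4 box offsets; the function name, the choice of k = 4 and the 21 classes (20 object classes plus background) are illustrative assumptions, not values taken from this article.

```python
def predictor_outputs(m, n, k, num_classes):
    """Count the values predicted on one m x n feature map.

    Each of the k default boxes at a location gets num_classes scores
    plus 4 box offsets, with one small convolutional filter per value.
    """
    filters = k * (num_classes + 4)
    return filters, filters * m * n


filters, outputs = predictor_outputs(m=8, n=8, k=4, num_classes=21)
print(filters, outputs)  # 100 6400
```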
The figure above illustrates how SSD works. The two animals in the picture, a cat on the left and a dog on the right, are marked with blue and red bounding boxes, which are the ground truth boxes for each object. A small set of default boxes is then evaluated at each location, in a convolutional fashion, on feature maps of different scales (8 x 8 and 4 x 4).
For every default box, the network predicts shape offsets and confidences (conf) for all object categories.
The model's loss is a weighted sum of a localisation loss (such as Smooth L1) and a confidence loss (such as Softmax).
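A minimal sketch of such a weighted loss is shown below, assuming Smooth L1 for localisation and softmax cross-entropy for confidence. The function names, the use of label 0 for background, the weight alpha = 1.0 and the normalisation by the number of matched boxes are assumptions made for illustration.

```python
import math


def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear for large errors."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def softmax_cross_entropy(logits, target):
    """Negative log-probability of the target class under a softmax."""
    mx = max(logits)  # subtract the max for numerical stability
    log_sum = mx + math.log(sum(math.exp(v - mx) for v in logits))
    return log_sum - logits[target]


def ssd_loss(pred_offsets, true_offsets, class_logits, labels, alpha=1.0):
    """Weighted sum of confidence and localisation loss over default boxes.

    labels[i] is the class of default box i, with 0 meaning background
    (an assumption of this sketch). Localisation loss is only counted
    for matched (non-background) boxes.
    """
    num_pos = sum(1 for y in labels if y != 0)
    if num_pos == 0:
        return 0.0
    conf = 0.0
    loc = 0.0
    for p, t, logits, y in zip(pred_offsets, true_offsets, class_logits, labels):
        conf += softmax_cross_entropy(logits, y)
        if y != 0:
            loc += sum(smooth_l1(pi - ti) for pi, ti in zip(p, t))
    return (conf + alpha * loc) / num_pos
```

Raising alpha makes box regression errors count for more relative to classification errors.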
When training SSD, the ground truth information needs to be assigned to specific outputs in the fixed set of detector outputs, unlike in detectors where region proposals are generated before a final classifier.
Each ground truth box is first matched to the default box with the best Jaccard overlap, which ensures that each ground truth box has at least one matched default box. Default boxes are then also matched to any ground truth box with which their Jaccard overlap exceeds a threshold (0.5 in the SSD paper).
This allows the network to predict high confidences for multiple overlapping default boxes (black dotted lines in the above figure) instead of picking only the one with maximum overlap.
The training objective is derived from that of MultiBox, but is extended to handle multiple object categories.
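The matching step can be sketched roughly as follows. Boxes are assumed to be (xmin, ymin, xmax, ymax) tuples, and the function names and the 0.5 threshold are illustrative choices. If two ground truth boxes share the same best default box, this simplified version lets the later one win.

```python
def jaccard(a, b):
    """Jaccard overlap (intersection over union) of two boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def match(ground_truth, defaults, threshold=0.5):
    """Map each default box index to a ground truth index, or None.

    Assumes `defaults` is non-empty.
    """
    assigned = [None] * len(defaults)
    # Match every default box whose best overlap exceeds the threshold.
    for d, dbox in enumerate(defaults):
        best_iou, best_g = 0.0, None
        for g, gbox in enumerate(ground_truth):
            overlap = jaccard(gbox, dbox)
            if overlap > best_iou:
                best_iou, best_g = overlap, g
        if best_iou > threshold:
            assigned[d] = best_g
    # Guarantee each ground truth box its best-overlapping default box.
    for g, gbox in enumerate(ground_truth):
        best_d = max(range(len(defaults)), key=lambda d: jaccard(gbox, defaults[d]))
        assigned[best_d] = g
    return assigned
```

Default boxes left as None are the ones treated as negatives during training.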
Feature maps, be it 8 x 8 or 4 x 4, have different receptive field sizes. With SSD, default boxes do not need to correspond to the actual receptive fields of each layer. Instead, the default boxes are designed so that specific feature maps learn to be responsive to particular scales of objects.
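To make the link between feature maps and object scales concrete, here is a small sketch that tiles default boxes over a square feature map. It follows the convention described in the SSD paper, where scales are spaced evenly between a minimum and a maximum and each box is centred on a feature map cell. The 0.2 and 0.9 bounds, the aspect ratios and the function names are assumptions for illustration, and the extra box the paper adds for aspect ratio 1 is left out.

```python
import math


def scale_for_layer(k, m, s_min=0.2, s_max=0.9):
    """Scale of the default boxes on the k-th of m feature maps (k = 1..m, m >= 2)."""
    return s_min + (s_max - s_min) * (k - 1) / (m - 1)


def default_boxes(fmap_size, scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """Return (cx, cy, w, h) boxes in relative image coordinates."""
    boxes = []
    for i in range(fmap_size):
        for j in range(fmap_size):
            # Centre each box on the middle of its feature map cell.
            cx = (j + 0.5) / fmap_size
            cy = (i + 0.5) / fmap_size
            for ar in aspect_ratios:
                w = scale * math.sqrt(ar)
                h = scale / math.sqrt(ar)
                boxes.append((cx, cy, w, h))
    return boxes


# A coarse 4 x 4 map with a large scale covers big objects,
# while a finer 8 x 8 map with a small scale covers small ones.
coarse = default_boxes(4, scale=0.5)
fine = default_boxes(8, scale=0.2)
```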
Suppose an object, say the dog in this example, is matched to a default box in the 4 x 4 feature map but not to any default box in the 8 x 8 feature map, because those boxes have different scales. The unmatched default boxes are treated as negatives during training.
Most default boxes end up as negatives, which creates a significant imbalance between positive and negative training examples.
To balance this, the negatives are sorted by their confidence loss and only the top ones are kept, so that the ratio between negatives and positives is at most 3:1.
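This selection of negatives (often called hard negative mining) can be sketched as below. The function name and the input layout, one confidence loss and one positive/negative flag per default box, are assumptions for illustration.

```python
def hard_negative_mining(conf_losses, is_positive, max_ratio=3):
    """Keep all positives and only the hardest negatives.

    Negatives are ranked by confidence loss (highest first) and cut off
    so that there are at most max_ratio negatives per positive.
    """
    positives = [i for i, pos in enumerate(is_positive) if pos]
    negatives = [i for i, pos in enumerate(is_positive) if not pos]
    negatives.sort(key=lambda i: conf_losses[i], reverse=True)
    kept_negatives = negatives[: max_ratio * len(positives)]
    return positives, kept_negatives


losses = [0.1, 2.0, 0.5, 0.05, 1.5, 0.9, 0.3]
flags = [True, False, False, False, False, False, False]
positives, kept_negatives = hard_negative_mining(losses, flags)
```

With one positive box, only the three negatives with the largest confidence loss contribute to training.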
Modelling With SSD
Step-by-step procedure:
- A feedforward CNN produces a fixed-size collection of bounding boxes, along with scores for the presence of object class instances in those boxes.
- A non-maximum suppression step is performed to produce the final detections.
- An auxiliary structure is added to the base network to detect objects at multiple scales. This structure includes multi-scale feature maps for detection, convolutional predictors, and default boxes with different aspect ratios.
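The non-maximum suppression step in the list above can be sketched as a simple greedy procedure. Boxes are assumed to be (xmin, ymin, xmax, ymax) tuples belonging to a single class, and the 0.45 overlap threshold and the function names are illustrative assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def non_max_suppression(boxes, scores, iou_threshold=0.45):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
```

Here the second box overlaps heavily with the higher-scoring first box and is suppressed, while the third box is far away and survives.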
SSD is sensitive to the size of the objects being detected. Its performance drops as objects get smaller, and it does much better on larger objects. SSD's default boxes are similar to the anchor boxes used by the region proposal network (RPN) in Faster R-CNN. However, SSD predicts a score for each object category in each box.
Given the same VGG-16 base architecture, SSD compares favourably with other object detectors such as YOLO and Faster R-CNN in both speed and accuracy.