In conventional object detectors such as R-CNN, a set of candidate object locations is generated first, and a CNN then classifies each location as foreground (one of the object classes) or background. This is how a two-stage detector works. One-stage detectors such as SSD instead classify a regular, dense sampling of object locations, scales and aspect ratios in a single pass; they are faster and simpler, but have historically trailed two-stage detectors in accuracy.
One-stage detectors evaluate a large set of candidate locations that densely cover the whole image, while only a few of those candidates contain an object. This creates a class imbalance: the easily classified negatives vastly outnumber the positives and dominate training, so objects present at some locations go undetected.
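To get a feel for the scale of this imbalance, here is a rough back-of-the-envelope sketch. The pyramid strides (8 to 128), the 9 anchors per location, the 640 x 640 image size and the object count are all illustrative assumptions, not figures taken from the text above.

```python
def count_anchors(image_size, strides=(8, 16, 32, 64, 128), anchors_per_location=9):
    """Count anchor boxes on a square image across all pyramid levels.

    All default values are assumptions chosen for illustration.
    """
    total = 0
    for stride in strides:
        # One grid cell per `stride` pixels; ceiling division handles
        # image sizes that are not a multiple of the stride.
        cells_per_side = -(-image_size // stride)
        total += cells_per_side * cells_per_side * anchors_per_location
    return total


num_anchors = count_anchors(640)
num_objects = 5  # hypothetical number of annotated objects in the image

print(num_anchors)                 # 76725
print(num_anchors // num_objects)  # 15345 candidate boxes per object
```

Even with generous matching, only a small fraction of these candidates can be positives, which is why the negatives dominate.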
RetinaNet was introduced by Facebook AI Research to tackle the dense detection problem.
Under The Hood Of RetinaNet
RetinaNet was introduced to address the shortcomings of single-shot object detectors such as YOLO and SSD when dealing with extreme foreground-background class imbalance.
RetinaNet is designed around Focal Loss, a loss function that prevents the large number of easy negatives from overwhelming the detector during training.
The classification subnet predicts the probability of an object being present at each spatial location, for each anchor box and object class.
The subnet is a small fully convolutional network (FCN) attached to each feature pyramid network (FPN) level.
It takes an input feature map from a given pyramid level and applies four 3 x 3 convolutional layers, each followed by a ReLU activation, and then one final 3 x 3 convolutional layer that produces the predictions.
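The layer stack described above can be written down as a small, framework-free specification. This is only a sketch: the channel width (256), anchors per location (9), class count (80) and the sigmoid on the final layer are assumptions for illustration and are not stated in the text.

```python
def classification_head_layers(channels=256, num_anchors=9, num_classes=80):
    """Return (in_channels, out_channels, kernel_size, activation) per layer.

    Default sizes are assumed values, not taken from the article.
    """
    # Four 3x3 convolutions that keep the channel width, each followed by ReLU.
    layers = [(channels, channels, 3, "relu") for _ in range(4)]
    # Final 3x3 convolution: one score per (anchor, class) pair per location.
    layers.append((channels, num_anchors * num_classes, 3, "sigmoid"))
    return layers


def count_parameters(layers):
    """Weights plus biases for a stack of square-kernel conv layers."""
    return sum(c_in * c_out * k * k + c_out for c_in, c_out, k, _ in layers)


head = classification_head_layers()
print(len(head))               # 5 convolutional layers
print(head[-1][1])             # 720 outputs per spatial location
print(count_parameters(head))  # total learnable parameters in the head
```

In a sketch like this the same head would be applied to every pyramid level, so the parameter count does not grow with the number of levels.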
Alongside the classification subnet, a box regression subnet is attached to each pyramid level to regress the offset from each anchor box to a nearby ground-truth object, if one exists.
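The text does not say how these offsets are encoded. One common choice, assumed here purely for illustration, is the R-CNN-style parameterisation: centre shifts scaled by the anchor size, and log ratios for width and height. The helper names `encode_box` and `decode_box` are hypothetical.

```python
import math


def encode_box(anchor, target):
    """Offsets from an anchor to a target box, both given as (cx, cy, w, h)."""
    ax, ay, aw, ah = anchor
    tx, ty, tw, th = target
    return (
        (tx - ax) / aw,      # horizontal shift, in units of anchor width
        (ty - ay) / ah,      # vertical shift, in units of anchor height
        math.log(tw / aw),   # log width ratio
        math.log(th / ah),   # log height ratio
    )


def decode_box(anchor, offsets):
    """Invert encode_box: apply predicted offsets to an anchor."""
    ax, ay, aw, ah = anchor
    dx, dy, dw, dh = offsets
    return (ax + dx * aw, ay + dy * ah, aw * math.exp(dw), ah * math.exp(dh))


anchor = (50.0, 50.0, 20.0, 40.0)
ground_truth = (55.0, 60.0, 20.0, 40.0)
offsets = encode_box(anchor, ground_truth)
print(offsets)  # (0.25, 0.25, 0.0, 0.0)
```

The regression subnet would be trained to predict `offsets`, and `decode_box` turns a prediction back into image coordinates.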
The classification target for a negative (background) anchor is a vector containing only zeros, whereas the target for a positive (foreground) anchor is a one-hot vector. If the prediction is a vector of all zeros but the target is a one-hot vector (in other words, a false negative), the focal loss evaluates to a large value for that anchor box.
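The behaviour described above can be checked numerically. The sketch below uses the usual form of the focal loss, FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), with gamma = 2 and alpha = 0.25; the formula and these values are not given in the text, so treat them as assumptions.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-7):
    """Binary focal loss for one predicted probability p and target y in {0, 1}.

    gamma and alpha are assumed defaults for illustration.
    """
    p = min(max(p, eps), 1.0 - eps)             # clamp to avoid log(0)
    p_t = p if y == 1 else 1.0 - p              # probability of the true label
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def anchor_loss(pred, target):
    """Sum the per-class focal loss for a single anchor box."""
    return sum(focal_loss(p, y) for p, y in zip(pred, target))


near_zero = [0.01, 0.01, 0.01]

false_negative = anchor_loss(near_zero, [0, 1, 0])         # missed object
easy_negative = anchor_loss(near_zero, [0, 0, 0])          # correct background
confident_hit = anchor_loss([0.01, 0.95, 0.01], [0, 1, 0]) # correct object

print(false_negative, confident_hit, easy_negative)
```

With these inputs the false negative produces a loss several orders of magnitude larger than the correctly classified background anchor.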
Enhancement With Focal Loss
The loss used here is the focal loss, applied to the output of the classification subnet. It is applied to all the anchors in each sampled image.
The total focal loss of an image is the sum of the focal loss over all the anchors, normalised by the number of anchors assigned to a ground-truth box. The normalisation uses the assigned anchors rather than the total anchors because the vast majority of anchors are easy negatives that contribute negligible loss under the focal loss.
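A minimal sketch of this summation and normalisation follows. The focal loss form, gamma = 2, alpha = 0.25, and the guard against images with no assigned anchors are assumptions added for the example.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-7):
    """Binary focal loss for one probability p and target y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def image_focal_loss(probs, targets):
    """Sum focal loss over all anchors, divided by the number of assigned anchors.

    probs and targets hold one list of per-class values per anchor.
    """
    total = 0.0
    num_assigned = 0
    for anchor_probs, anchor_targets in zip(probs, targets):
        total += sum(focal_loss(p, y) for p, y in zip(anchor_probs, anchor_targets))
        if any(anchor_targets):  # one-hot target means a foreground anchor
            num_assigned += 1
    # Assumed guard: avoid dividing by zero when no anchor is assigned.
    return total / max(num_assigned, 1)


# One foreground anchor plus 1000 easy background anchors, two classes.
probs = [[0.6, 0.1]] + [[0.01, 0.01]] * 1000
targets = [[1, 0]] + [[0, 0]] * 1000

loss = image_focal_loss(probs, targets)
print(loss)
```

Dividing by all 1001 anchors instead would shrink the loss by roughly three orders of magnitude, even though the 1000 negatives contribute almost nothing to the sum.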
At the time of publication, RetinaNet with focal loss outperformed the existing one-stage and two-stage detectors in the speed versus accuracy trade-off, setting aside the low-accuracy regime.
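Initialization of RetinaNet uses a prior probability (about 0.01) for the anchor boxes rather than a threshold. This prior sets the initial bias of the last convolutional layer of the classification subnet, so that at the start of training every anchor is predicted as foreground with a confidence of roughly 0.01. The low value reflects how rare positives are compared with negatives and keeps the large number of background anchors from producing a destabilising loss in the first iterations. Hence this value is very significant.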
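One way to turn the 0.01 value into an initial bias is to invert the sigmoid, b = -log((1 - pi) / pi). The sketch below assumes the classification outputs pass through a sigmoid and that the final layer's weights start near zero, neither of which is stated in the text.

```python
import math


def prior_bias(pi=0.01):
    """Bias that makes sigmoid(bias) equal the prior probability pi."""
    return -math.log((1.0 - pi) / pi)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


b = prior_bias(0.01)
print(round(b, 3))           # -4.595
print(round(sigmoid(b), 4))  # 0.01
```

With this bias, an untrained network rates every anchor as about 1% likely to be foreground, instead of the 50% that a zero bias would give.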
Using the focal loss in RetinaNet reduces the weight that easy negatives carry in the training loss, so training focuses on the hard examples. As a result, the background is more clearly distinguished from the foreground objects.
RetinaNet improved substantially upon single-shot detection with its new training approach. There are now a few variants of RetinaNet in which researchers introduce an adaptive loss function along with instance mask prediction during training.