The facial recognition market is projected to be worth $7.6 billion by 2022. There is a huge opportunity waiting for whoever builds great proprietary technology using fewer computational resources. As of now, Apple and Amazon seem to be winning the race to build fast and efficient facial recognition systems. However, Chinese researchers Sheng Chen, Yang Liu, Xiang Gao, and Zhen Han have now come up with a lightweight facial recognition network called MobileFaceNet.
Facial verification is also a very important identity authentication technology. It is being used in more and more mobile phones and applications, such as device unlocking and mobile payments. To achieve maximum user-friendliness with limited computation resources, facial verification models deployed locally on mobile devices are expected to be not only accurate but also small and fast.
Performance And Size
MobileFaceNet is a neural network that achieves up to 99.28 percent accuracy on the Labelled Faces in the Wild (LFW) dataset, and 93.05 percent accuracy on recognising faces in the AgeDB dataset. The network uses around a million parameters and takes only 24 milliseconds to produce results on a Qualcomm Snapdragon processor. Compare this with accuracies of 98.70 percent and 89.27 percent for ShuffleNet, which has many more parameters and takes a little longer to execute on the CPU.
The researchers replace the global average pooling layer in the CNN with a global depthwise convolution layer, which improves performance on facial recognition. This development matters because the artificial intelligence world is searching for efficient models that run within the limited compute budgets of today's mobile phones.
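To make the difference concrete, here is a minimal NumPy sketch (not the authors' code) contrasting the two reductions: global average pooling weights every spatial position equally, while a global depthwise convolution learns a separate weight for each spatial position in each channel.

```python
import numpy as np

def global_avg_pool(x):
    # x: (channels, H, W) feature map -> (channels,) by uniform averaging
    return x.mean(axis=(1, 2))

def global_depthwise_conv(x, w):
    # w: (channels, H, W) learned per-channel spatial weights;
    # each output channel is a weighted sum over its own spatial positions
    return (x * w).sum(axis=(1, 2))

c, h, wid = 512, 7, 7  # illustrative feature-map size, an assumption
x = np.random.rand(c, h, wid)

# with uniform weights, GDConv reduces exactly to global average pooling;
# training instead lets the network emphasise informative positions (e.g. eyes)
w_uniform = np.full((c, h, wid), 1.0 / (h * wid))
assert np.allclose(global_depthwise_conv(x, w_uniform), global_avg_pool(x))
```

The point of the analysis is that different positions in the final feature map carry different amounts of identity information, so a learned per-position weighting can outperform a uniform average.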
Another approach to obtaining lightweight facial verification models is compressing pretrained networks by knowledge distillation. Such approaches have achieved 97.32 percent facial verification accuracy on LFW with a 4.0 MB model size. What is remarkable is that MobileFaceNets achieve comparable accuracy on a very small budget.
Most state-of-the-art mobile neural networks for visual recognition tasks involve global average pooling layers. A pooling layer continuously reduces the spatial size of the representation, cutting the number of parameters and the amount of computation in the network. For example, MobileNetV1, ShuffleNet, and MobileNetV2, some of the most successful facial verification and recognition approaches, all have a global pooling layer.
Researchers have also observed that CNNs with global average pooling layers are less accurate than those without, though there has been no theoretical proof or analysis of this phenomenon. The MobileFaceNet researchers offer a simple analysis: a typical deep facial verification pipeline includes preprocessing facial images, extracting facial features with a trained deep model, and matching two faces by the similarity or distance of their features.
Thus, the global average pooling layer can be replaced with a fully connected layer to project a compact face feature vector. But this approach adds a large number of parameters to the model, which is undesirable when the main pursuit is a model with the fewest possible parameters.
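A quick back-of-the-envelope comparison shows why the fully connected projection is expensive. The feature-map and embedding sizes below are illustrative assumptions, not figures from the paper:

```python
# hypothetical sizes: 7x7 final feature map, 512 channels, 128-d embedding
h, w, c, d = 7, 7, 512, 128

fc_params = h * w * c * d   # fully connected layer over the flattened map
gdconv_params = h * w * c   # one 7x7 kernel per channel, no cross-channel mixing

print(fc_params)      # 3,211,264 parameters
print(gdconv_params)  # 25,088 parameters
```

Under these assumed sizes, the fully connected projection costs over three million parameters, while a global depthwise convolution achieves a learned spatial weighting for roughly 25 thousand, a 128x saving.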
The MobileFaceNet architecture is partly inspired by MobileNetV2. The residual bottlenecks proposed in MobileNetV2 are used as its main building blocks. The researchers use PReLU as the non-linearity, which is better suited to facial verification than ReLU. They also use a fast downsampling strategy at the beginning of the network, and a linear 1×1 convolution layer following a linear global depthwise convolution layer as the feature output layer. The detailed architecture is given in the table below:
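As a quick illustration of the PReLU choice: unlike ReLU, which zeroes out negative activations, PReLU passes them through with a learned slope. In a minimal NumPy sketch (the slope `a` is a single scalar here for brevity; in practice it is a learned, typically per-channel, parameter):

```python
import numpy as np

def prelu(x, a=0.25):
    # PReLU: identity for positive inputs, learned slope `a` for negative ones;
    # with a = 0 this degenerates to plain ReLU
    return np.where(x > 0, x, a * x)

x = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
print(prelu(x))  # [-0.5, -0.125, 0., 1., 3.]
```

Keeping a small gradient on negative inputs avoids "dead" units, which is one common motivation for preferring PReLU in compact networks.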
The primary MobileFaceNet network uses 0.99 million parameters. To reduce computational cost, the researchers changed the input resolution from 112×112 to 112×96 or 96×96. Removing the linear 1×1 convolution layer after the linear GDConv layer from MobileFaceNet yields a further variant, called MobileFaceNet-M.
The researchers used MobileNetV1, ShuffleNet, and MobileNetV2 as baseline models. For a fair performance comparison, all MobileFaceNet models and baseline models were trained from scratch on the CASIA-WebFace dataset with the ArcFace loss. Training finishes at 60K iterations.
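For intuition about the ArcFace loss used in training: it adds an angular margin to the target class before computing the scaled softmax logit, forcing the network to separate identities by a margin in angle space. Below is a minimal NumPy sketch; the margin and scale values are commonly used defaults and an assumption here, not taken from the article:

```python
import numpy as np

def arcface_logit(cos_theta, margin=0.5, scale=64.0, target=True):
    # additive angular margin: penalise the target class by adding `margin`
    # (in radians) to the angle between feature and class weight vectors
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    if target:
        theta = theta + margin
    return scale * np.cos(theta)

# the margin makes the target-class logit strictly harder (smaller) than the
# plain logit, so the network must learn tighter, better-separated clusters
plain = arcface_logit(0.8, target=False)
with_margin = arcface_logit(0.8, target=True)
assert with_margin < plain
```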
To pursue even better performance, MobileFaceNet, MobileFaceNet (112×96), and MobileFaceNet (96×96) were also trained on the cleaned training set of the MS-Celeb-1M database, with 3.8 million images from 85,000 subjects. This boosts the accuracy of the primary MobileFaceNet to 99.55 percent on LFW and 96.07 percent on AgeDB-30.