Deep convolutional networks have led to remarkable breakthroughs in image classification. Building on them, the residual network (ResNet) was introduced in 2015 by Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun in a paper titled *Deep Residual Learning for Image Recognition*. The paper describes the model with which the team won first place in the ImageNet challenges across classification, detection, and localisation.

## ResNet: What Does ‘Residual’ Mean?

A residual is simply the error: the difference between the actual value and the predicted value.

Let us say that you are asked to predict the weight of ten apples, just by looking at them. If the actual weight of those apples is 3 kilos, and you have predicted it as 4 kilos, then the residual is -1 (kilos). Or, if you predicted the weight of the apples as 1 kilo, then the residual in this case would be +2. Therefore, the residual is the amount or number by which you have to change your prediction to meet the actual value.

This can be represented in a small flowchart:

Here, X is our prediction, and we want it to equal the Actual value. When X is off by some margin, the residual function residual() computes the amount needed to bring the predicted value in line with the Actual value. When X = Actual, residual(X) is zero. The identity function just copies the value X.
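The apple example can be written as a few lines of Python. This is only a toy sketch: the function names `identity`, `residual` and `corrected` are made up for illustration, and `residual` is handed the actual value directly, whereas a real network has to learn the residual from data.

```python
def identity(x):
    """Copy the prediction through unchanged."""
    return x


def residual(x, actual):
    """Amount by which the prediction x must change to reach the actual value."""
    return actual - x


def corrected(x, actual):
    """Identity path plus residual path: the prediction after correction."""
    return identity(x) + residual(x, actual)


actual_weight = 3  # kilos, as in the apple example

for prediction in (4, 1, 3):
    print(
        "prediction:", prediction,
        "residual:", residual(prediction, actual_weight),
        "corrected:", corrected(prediction, actual_weight),
    )
```

A prediction of 4 kilos gives a residual of -1, a prediction of 1 kilo gives +2, and a correct prediction gives a residual of zero, so only the identity path contributes.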

## What Does ResNet Do?

The main goal of the residual network is to make it possible to build a deeper neural network. Two ideas follow from this:

- As more and more layers are added, accuracy should not degrade (that is, the error rate should not go up). Identity mapping handles this, because an extra layer can always fall back to copying its input forward.
- The layers keep learning the residuals needed to match the predicted value with the actual value.

These are the two functions of a residual network.

Here is what the paper says about the identity function and how it is implemented.

$$y = F(x, \{W_i\}) + x \tag{1}$$

Here $x$ and $y$ are the input and output vectors of the layers considered. The function $F(x, \{W_i\})$ represents the residual mapping to be learned. For a building block with two layers, $F = W_2 \sigma(W_1 x)$, in which $\sigma$ denotes ReLU and the biases are omitted to simplify the notation.

$$y = F(x, \{W_i\}) + W_s x \tag{2}$$

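Eqn.(2) applies a linear projection $W_s$ on the shortcut so that the dimensions of $x$ and $F$ match when they differ. A square matrix $W_s$ could also be used in Eqn.(1), but the paper shows by experiments that the identity mapping is sufficient for addressing the degradation problem and is economical, so $W_s$ is only used when matching dimensions.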
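The two equations can be sketched in plain Python for small vectors. This is a minimal illustration under stated assumptions, not the paper's implementation: the helper names (`matvec`, `residual_fn`, `block_identity`, `block_projection`) and the example weights are invented here, the layers are fully connected rather than convolutional, biases are left out, and the ReLU that the paper applies after the addition is omitted so that the code matches the equations as written.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def add(u, v):
    """Element-wise sum; the two paths of a block must have equal dimensions."""
    if len(u) != len(v):
        raise ValueError("shortcut and residual must have the same dimension")
    return [a + b for a, b in zip(u, v)]


def residual_fn(x, W1, W2):
    """F(x, {W_i}) = W2 * relu(W1 * x), with biases omitted."""
    return matvec(W2, relu(matvec(W1, x)))


def block_identity(x, W1, W2):
    """Eqn. (1): y = F(x, {W_i}) + x (identity shortcut)."""
    return add(residual_fn(x, W1, W2), x)


def block_projection(x, W1, W2, Ws):
    """Eqn. (2): y = F(x, {W_i}) + Ws * x (projection shortcut)."""
    return add(residual_fn(x, W1, W2), matvec(Ws, x))


# Hypothetical example values, chosen only to keep the arithmetic easy to follow.
x = [1.0, -2.0]
W1 = [[1.0, 0.0], [0.0, 1.0]]

# Same input and output dimension: the identity shortcut is enough.
W2 = [[0.5, 0.0], [0.0, 0.5]]
y_identity = block_identity(x, W1, W2)

# Output dimension grows from 2 to 3: Ws projects x to the new dimension.
W2_up = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Ws = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
y_projection = block_projection(x, W1, W2_up, Ws)

print(y_identity)
print(y_projection)
```

Calling `block_identity` with `W2_up` raises a `ValueError`, which is the dimension mismatch that Eqn.(2) exists to resolve.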

### What Is A Plain Neural Network

A plain network is a stack of convolutional layers without any shortcut connections. In the paper's plain baselines, the convolutional layers mostly have 3×3 filters and follow two simple design rules:

- For the same output feature map size, the layers have the same number of filters.
- If the feature map size is halved, the number of filters is doubled so as to preserve the time complexity per layer.

Downsampling is performed directly by convolutional layers that have a stride of 2. The network ends with a global average pooling layer and a 1,000-way fully-connected layer with softmax.
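The second design rule can be checked with a little arithmetic. The sketch below counts multiply-accumulate operations for one 3×3 convolutional layer; the helper `conv_macs` is a name made up for this example, and the stage sizes (56×56 with 64 filters down to 7×7 with 512 filters) are assumed here for illustration, with each layer taking as many input channels as it has filters.

```python
def conv_macs(height, width, in_channels, out_channels, kernel=3):
    """Multiply-accumulates for one conv layer with a height x width output map."""
    return height * width * in_channels * out_channels * kernel * kernel


# (feature map side, number of filters) per stage: side halves, filters double.
stages = [(56, 64), (28, 128), (14, 256), (7, 512)]

costs = [conv_macs(side, side, filters, filters) for side, filters in stages]

for (side, filters), cost in zip(stages, costs):
    print(f"{side}x{side} map, {filters} filters: {cost} multiply-accumulates")
```

Halving the side of the feature map divides the number of output positions by four, while doubling both the input and output channels multiplies the work per position by four, so the cost per layer stays the same in every stage.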

### Residual Neural Network

The identity shortcuts (Eqn.(1)) can be used directly when the input and output are of the same dimensions. When the dimensions increase, there are two options:

- The shortcut still performs identity mapping, with extra zero entries padded for the increased dimensions. This option introduces no extra parameters.
- The projection shortcut in Eqn.(2) is used to match dimensions (done by 1×1 convolutions).

For both options, when the shortcuts go across feature maps of two sizes, they are performed with a stride of 2.
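The two shortcut options above can be sketched on a tiny feature map stored as nested lists indexed `[channel][row][column]`. This is an illustrative sketch only: the function names `subsample`, `zero_pad_shortcut` and `projection_shortcut`, the example feature map and the weights `Ws` are all assumptions made for this example. A 1×1 convolution with stride 2 is written here as subsampling followed by a per-pixel mix of channels, which computes the same thing.

```python
def subsample(fmap, stride=2):
    """Keep every `stride`-th row and column of each channel."""
    return [[row[::stride] for row in channel[::stride]] for channel in fmap]


def zero_pad_shortcut(fmap, out_channels, stride=2):
    """Zero-padding option: strided identity plus all-zero channels, no parameters."""
    small = subsample(fmap, stride)
    extra = out_channels - len(small)
    if extra < 0:
        raise ValueError("out_channels must not be smaller than the input channels")
    h, w = len(small[0]), len(small[0][0])
    zeros = [[[0.0] * w for _ in range(h)] for _ in range(extra)]
    return small + zeros


def projection_shortcut(fmap, Ws, stride=2):
    """Projection option: 1x1 convolution with the given stride.

    Ws is indexed [out_channel][in_channel] and holds the learned parameters.
    """
    small = subsample(fmap, stride)
    h, w = len(small[0]), len(small[0][0])
    return [
        [[sum(Ws[o][c] * small[c][i][j] for c in range(len(small)))
          for j in range(w)]
         for i in range(h)]
        for o in range(len(Ws))
    ]


# Hypothetical input: 2 channels of size 4x4.
fmap = [
    [[float(i * 4 + j) for j in range(4)] for i in range(4)],  # values 0..15
    [[1.0] * 4 for _ in range(4)],                             # all ones
]

padded = zero_pad_shortcut(fmap, out_channels=4)

# Hypothetical projection weights mapping 2 channels to 4 (8 parameters).
Ws = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
projected = projection_shortcut(fmap, Ws)

print(len(padded), len(padded[0]), len(padded[0][0]))
print(len(projected), len(projected[0]), len(projected[0][0]))
```

Both shortcuts turn the 2×4×4 input into a 4×2×2 output, so either can be added to a residual branch that halves the feature map and doubles the channels. The difference is that the zero-padding version has nothing to learn, while the projection adds one weight per pair of input and output channels.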

During training, if an identity mapping is close to optimal for a block, the solver can drive the weights of the residual function towards zero. The output of the residual function then approaches 0 and the block simply passes X on to the following layers, so no further correction is needed from that block. This is how the identity function helps in building a deeper network: extra blocks can fall back to copying their input instead of hurting accuracy. Where a correction is still needed, the residual function learns weights and biases that adjust the identity output towards the actual value.
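A scalar toy model makes the contrast with a plain network concrete. In this sketch each layer has a single weight, and the names `plain_layer`, `residual_block`, `run_plain` and `run_residual` are invented for the example. With every weight set to zero, a stack of plain layers wipes out the signal, while a stack of residual blocks passes a non-negative input through unchanged.

```python
def relu(a):
    return max(0.0, a)


def plain_layer(x, w):
    """One plain layer: the output depends entirely on the weight."""
    return relu(w * x)


def residual_block(x, w1, w2):
    """One residual block: relu(F(x) + x) with F(x) = w2 * relu(w1 * x)."""
    return relu(w2 * relu(w1 * x) + x)


def run_plain(x, weights):
    for w in weights:
        x = plain_layer(x, w)
    return x


def run_residual(x, weights):
    for w1, w2 in weights:
        x = residual_block(x, w1, w2)
    return x


depth = 20
x = 2.5  # non-negative, as it would be after an earlier ReLU

plain_out = run_plain(x, [0.0] * depth)
residual_out = run_residual(x, [(0.0, 0.0)] * depth)

print("plain stack with zero weights:   ", plain_out)
print("residual stack with zero weights:", residual_out)
```

For the plain stack to behave like the identity, every layer would have to learn a weight of exactly 1. For the residual stack, the identity is what is left when the residual weights shrink to zero, which is the easier target the paragraph above describes.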

### Implementation

The ImageNet 2012 classification dataset consists of 1,000 classes. The models are trained on the 1.28 million training images, evaluated on the 50,000 validation images, and finally tested on the 100,000 test images.

#### Experiments on ImageNet

**Plain Networks**: The 18-layer plain net is evaluated first, followed by the 34-layer plain net; the two have a similar form. The results show that the deeper 34-layer plain net has a higher validation error than the shallower 18-layer plain net. To understand why, the training and validation errors during training are compared: the 34-layer plain net has a higher training error throughout the whole training procedure, even though the solution space of the 18-layer plain network is a subspace of that of the 34-layer one.

The authors argue that this optimisation difficulty is unlikely to be caused by vanishing gradients. The plain networks are trained with Batch Normalization (BN), which ensures that forward-propagated signals have non-zero variances. The authors also verify that the backward-propagated gradients exhibit healthy norms with BN, so neither forward nor backward signals vanish. In fact, the 34-layer plain net is still able to achieve competitive accuracy, which suggests that the solver works to some extent.

**Residual Networks**: Next, the 18-layer and 34-layer ResNets are evaluated. The baseline architectures are the same as the plain nets above, except that a shortcut connection is added to each pair of 3×3 filters. In the first comparison, identity mapping is used for all shortcuts and zero-padding for increasing dimensions, so the ResNets have no extra parameters compared to their plain counterparts.

There are three major observations. First, the situation is reversed with residual learning: the 34-layer ResNet is better than the 18-layer ResNet (by 2.8%). More importantly, the 34-layer ResNet exhibits considerably lower training error and is generalisable to the validation data. This indicates that the degradation problem is well addressed in this setting and that accuracy gains are obtained from increased depth. Second, compared to its plain counterpart, the 34-layer ResNet reduces the top-1 error by 3.5%, a result of the successfully reduced training error. Third, the 18-layer plain and residual nets are comparably accurate, but the 18-layer ResNet converges faster.

These results show that parameter-free identity shortcuts help with training. To investigate projection shortcuts, three options are compared: (A) zero-padding shortcuts are used for increasing dimensions, and all shortcuts are parameter-free; (B) projection shortcuts are used for increasing dimensions, and the other shortcuts are identity; and (C) all shortcuts are projections.

## Conclusion

ResNets are now used as a building block in many state-of-the-art systems. The principle behind them is to make much deeper networks trainable than plain networks allow: shortcut connections let each block learn a residual on top of an identity mapping, which addresses the degradation problem that appears as layers are added. ResNeXt, a later variant, has also been evaluated on the CIFAR-10 dataset with remarkable results, and on ImageNet its architecture gave a top-5 error rate of 3.03%, winning second position in the ILSVRC 2016 classification competition.