Resnet - Deep Residual Learning for Image Recognition

Introduction

The background for proposing this neural network model was the challenge of training deeper CNNs to achieve better classification performance. Deeper CNNs had consistently delivered improved performance across a variety of computer vision tasks such as recognition, detection, and segmentation. On the other hand, as networks get much deeper, problems such as vanishing/exploding gradients become more significant, with symptoms such as growing training error and accuracy degradation.

In their 2015 paper, “Deep Residual Learning for Image Recognition”, Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun proposed a model which enables the training of very deep networks - the Resnet. The paper demonstrates Resnet-18, Resnet-34, Resnet-50, Resnet-101 and Resnet-152, which deploy 18, 34, 50, 101 and 152 parametric layers respectively.

Resnet Functional Description

The Resnet building block is called a Residual Block. It contains a skip connection, aka a short-cut, as depicted in Figure 1 and described next.

Figure 1: Resnet Residual Block


As shown in Figure 1, the skip is added to the plain connection just before the ReLU module, thus applying the nonlinearity to the sum. Note that each conv element is followed by a batch normalization module and a ReLU activation module. Note - the values of the stride s and the down-scale parameter d are discussed later.
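As an illustration (a minimal PyTorch sketch, not the authors' reference code), here is the basic Residual Block for the identity case, i.e. stride s=1 and unchanged dimensions; the general down-scaling case is covered in the Dimensions Matching section below.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic Residual Block: two 3x3 convs, each followed by batch
    normalization; the skip is added just before the final ReLU."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                                  # the skip (short-cut) path
        out = self.relu(self.bn1(self.conv1(x)))      # conv -> BN -> ReLU
        out = self.bn2(self.conv2(out))               # conv -> BN (ReLU deferred)
        return self.relu(out + identity)              # add the skip, then ReLU

x = torch.randn(1, 64, 56, 56)       # e.g. a conv2.x feature map
print(ResidualBlock(64)(x).shape)    # torch.Size([1, 64, 56, 56])
```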

A Resnet network stacks Residual Blocks back to back, as illustrated by Figure 2.

Figure 2: Resnet-18


Figure 2 depicts a Resnet-18 CNN, named so for its 18 parametric layers. The last layer is an FC (fully connected) layer, which receives the flattened array of data and outputs N recognition classes through a softmax module. Resnet was evaluated with the ImageNet 2012 classification dataset, which consists of 1000 classes.
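To make the classification head concrete, here is a minimal sketch (names are illustrative), assuming the 512-channel, 7x7 output of the last conv stage of Resnet-18/34. The softmax is shown explicitly to match the figure; in training it is usually folded into the loss.

```python
import torch
import torch.nn as nn

# Global average pooling collapses each 7x7 feature map to a single value,
# the flattened vector feeds the FC layer, and softmax turns the
# N logits into class probabilities (N = 1000 for ImageNet 2012).
head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),   # (B, 512, 7, 7) -> (B, 512, 1, 1)
    nn.Flatten(),              # (B, 512, 1, 1) -> (B, 512)
    nn.Linear(512, 1000),      # FC layer producing 1000 class scores
    nn.Softmax(dim=1),         # class probabilities
)

features = torch.randn(8, 512, 7, 7)   # dummy last-stage output for a batch of 8
probs = head(features)                  # shape (8, 1000)
```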

Bottleneck Resnet

Obviously, the deeper the Resnet is, the more computation operations are needed - those are counted as FLOPs, where each FLOP is a multiply/add operation. To cope with that, the deeper networks, from Resnet50 and up, replace the Resnet Residual Block with a Bottleneck Resnet Block, as depicted in Figure 3.

Figure 3: Bottleneck Resnet Block


The bottleneck block consists of 3 conv blocks: a single conv 3 x 3 wrapped between two conv 1 x 1 blocks. Compared to the basic Residual Block, which uses two conv 3 x 3 modules, this one uses a single conv 3 x 3 module. Note that a conv 3 x 3 requires 9 times more FLOPs than a conv 1 x 1 (for the same input and output channel counts). The conv 3 x 3 is indeed named the bottleneck. The front conv 1 x 1 module scales the channel dimension down, to offload the bottleneck, while the rear conv 1 x 1 scales it back up.
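Here is a minimal PyTorch sketch of the bottleneck block for the identity case (stride 1, equal input and output widths), with an assumed reduction factor of 4 - e.g. 256 -> 64 -> 256 channels, as in conv2.x of Resnet50.

```python
import torch.nn as nn

class BottleneckBlock(nn.Module):
    """Bottleneck block: a 1x1 conv scales the channel count down, a single
    3x3 conv (the 'bottleneck') works on the reduced width, and a second
    1x1 conv scales the width back up before the skip is added."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        mid = channels // reduction
        self.reduce = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=1, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True))
        self.conv3x3 = nn.Sequential(
            nn.Conv2d(mid, mid, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True))
        self.expand = nn.Sequential(
            nn.Conv2d(mid, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.expand(self.conv3x3(self.reduce(x)))
        return self.relu(out + x)   # skip added before the final ReLU
```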

Resnet Architectures Table

The table below presents a summarized architecture description of the 5 Resnets presented in the Resnet paper.

The input image size is taken as 224x224. In any case, considering the overall downsizing by a factor of 32, both height and width should be a multiple of 32.

The table is copied from the Resnet paper almost as is.

| layer name | output size | stride | Resnet18 | Resnet34 | Resnet50 | Resnet101 | Resnet152 |
|---|---|---|---|---|---|---|---|
| conv1 | 112x112 | 2 | 7x7, 64 | 7x7, 64 | 7x7, 64 | 7x7, 64 | 7x7, 64 |
| conv2.x | 56x56 | 2 | 3x3 max pool (all variants), followed by the blocks below | | | | |
| | | 1 or 2, see note (1) | [3x3, 64; 3x3, 64] x2 | [3x3, 64; 3x3, 64] x3 | [1x1, 64; 3x3, 64; 1x1, 256] x3 | [1x1, 64; 3x3, 64; 1x1, 256] x3 | [1x1, 64; 3x3, 64; 1x1, 256] x3 |
| conv3.x | 28x28 | 1 or 2, see note (1) | [3x3, 128; 3x3, 128] x2 | [3x3, 128; 3x3, 128] x4 | [1x1, 128; 3x3, 128; 1x1, 512] x4 | [1x1, 128; 3x3, 128; 1x1, 512] x4 | [1x1, 128; 3x3, 128; 1x1, 512] x8 |
| conv4.x | 14x14 | 1 or 2, see note (1) | [3x3, 256; 3x3, 256] x2 | [3x3, 256; 3x3, 256] x6 | [1x1, 256; 3x3, 256; 1x1, 1024] x6 | [1x1, 256; 3x3, 256; 1x1, 1024] x23 | [1x1, 256; 3x3, 256; 1x1, 1024] x36 |
| conv5.x | 7x7 | 1 or 2, see note (1) | [3x3, 512; 3x3, 512] x2 | [3x3, 512; 3x3, 512] x3 | [1x1, 512; 3x3, 512; 1x1, 2048] x3 | [1x1, 512; 3x3, 512; 1x1, 2048] x3 | [1x1, 512; 3x3, 512; 1x1, 2048] x3 |
| FC | 1x1 | | global average pool, FC to 1000 classes, softmax (same for all variants) | | | | |
| FLOPs | | | 1.8x10^9 | 3.6x10^9 | 3.8x10^9 | 7.6x10^9 | 11.3x10^9 |
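The per-stage repeat counts can be captured compactly in code. The following Python dictionary (my own naming, not from the paper) summarizes the table above, and the comment shows how the layer count in each network's name is obtained.

```python
# Per-stage repeat counts for conv2.x .. conv5.x, taken from the table above.
# The basic block (two 3x3 convs) is used up to Resnet34; the bottleneck
# block (1x1, 3x3, 1x1) is used from Resnet50 and up.
RESNET_STAGES = {
    # name      : (block type,   repeats per stage)
    "Resnet18"  : ("basic",      [2, 2, 2, 2]),
    "Resnet34"  : ("basic",      [3, 4, 6, 3]),
    "Resnet50"  : ("bottleneck", [3, 4, 6, 3]),
    "Resnet101" : ("bottleneck", [3, 4, 23, 3]),
    "Resnet152" : ("bottleneck", [3, 8, 36, 3]),
}

# Counting parametric layers, e.g. for Resnet50:
# 1 (conv1) + 3 * (3 + 4 + 6 + 3) (three convs per bottleneck block) + 1 (FC) = 50
```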

The figure below illustrates a Resnet-34 block diagram.

Figure 4: Resnet-34 block diagram


Dimensions Matching

The skip connection described above bypasses the processing block directly; a block with such a skip is known as an Identity Block. However, this skip connection may need a dimension adaptation to match the main path's dimensions, if the main path has modified them (e.g. by a stride-2 conv or a change in the number of channels). Two methods were proposed to solve that:

  1. A Zero Padding Block - pad the skip data with extra zero channels.
  2. A Convolutional Block - apply a projection 1 x 1 kernel.

The figures below illustrate the proposed solutions on both the Residual Block and the Bottleneck Block.

Figure 5a: Residual Block Skip Dimensions Fix with Zero Padding


Figure 5b: Residual Block Skip Dimensions Fix with Projection


Figure 5c: Bottleneck Skip Dimensions Fix with Zero Padding


Figure 5d: Bottleneck Skip Dimensions Fix with Projection


Kaiming He et al. showed that the projection shortcuts gave slightly better results in terms of top-1 and top-5 error rates (the top-5 error rate is the fraction of test images for which none of the five most probable predictions was the correct label). On the other hand, zero padding needs fewer computations and no extra parameters. The small difference in results led the authors to use the padded skips in their comparison experiments, to reduce computation and memory requirements.
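As a concrete illustration (a sketch, not the paper's implementation, and the names are my own), here are the two options in PyTorch for a skip path that must go from 64 channels at 56x56 to 128 channels at 28x28, e.g. the first block of conv3.x in Resnet34:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Option 2 - projection shortcut: a learned 1x1 conv with stride 2.
projection = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=1, stride=2, bias=False),
    nn.BatchNorm2d(128),
)

# Option 1 - zero padding shortcut: parameter-free; subsample spatially and
# append zero channels.
def padded_shortcut(x):
    x = x[:, :, ::2, ::2]                 # stride-2 spatial subsampling
    return F.pad(x, (0, 0, 0, 0, 0, 64))  # append 64 zero channels: 64 -> 128

x = torch.randn(1, 64, 56, 56)
print(projection(x).shape)       # torch.Size([1, 128, 28, 28])
print(padded_shortcut(x).shape)  # torch.Size([1, 128, 28, 28])
```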

Considering dimensions matching, here’s a modified Resnet-34 block diagram (compare with Figure 4): now the Residual Blocks are indicated as either Identity Blocks or Projection Blocks (which could alternatively be Zero Padding Blocks).

Figure 6: Resnet-34 block diagram - Indication of Residual blocks types


Notes on the Vanishing Gradient Problem

A brief reminder of the vanishing gradient problem: during backpropagation, the network’s weight parameters are gradually updated to minimize the loss function, according to the Gradient Descent algorithm - or any of its variants.

Here’s the Gradient Descent update equation for updating the weights of the kth layer at time \(i+1\):

$w_{i+1}^k=w_i^k-\alpha \frac{dL}{dw_i^k}$

Where:

  • \(w_{i+1}^k\) expresses the kth layer’s weights at time \(i+1\).
  • \(\alpha\) is the learning rate.
  • \(\frac{dL}{dw_i^k}\) is the gradient of the loss with respect to the kth layer’s weights at time \(i\).

The gradients are calculated using the derivative chain rule, where the calculation is executed in a back-propagating manner, starting from the last layer k=L and going back to k=1.
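For concreteness, here is a tiny numeric sketch of a single update step for one layer's weights (the numbers are made up for illustration):

```python
import numpy as np

alpha = 0.1                           # learning rate
w_k = np.array([0.50, -0.30, 0.80])   # w_i^k: layer k weights at time i
grad = np.array([0.20, 0.00, -0.40])  # dL/dw_i^k, computed by backpropagation

w_k_next = w_k - alpha * grad         # w_{i+1}^k = w_i^k - alpha * dL/dw_i^k
print(w_k_next)                       # -> [0.48, -0.3, 0.84]
```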

Let’s illustrate the backpropagation on a residual block, i.e. compute the loss derivative with respect to \(a^{(k)}\), given \(\frac{dL}{da^{(k+2)}}\).

See the diagram below, which is followed by the chain rule derivative expression.

Resnet Residual Block

According to chain rule:

\(\frac{dL}{da^{(k)}} = \frac{dL}{da^{(k+2)}}\frac{da^{(k+2)}}{da^{(k)}}\)

Consider that:

\(a^{(k+2)} = g(F(a^{(k)}) + a^{(k)})\)

where \(g(x)\) is the activation function, ReLU in this example, and \(F(a^{(k)})\) is the residual mapping computed by the block’s conv layers. ReLU is linear for positive arguments and zero otherwise, so let’s concentrate on the non-zero (linear) case.

In that case, the derivative of the skipped component \(a^{(k)}\) with respect to itself is 1, so this component protects against the vanishing gradient problem.
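Spelling this out: substituting \(a^{(k+2)} = g(F(a^{(k)}) + a^{(k)})\) into the chain rule, and treating the ReLU \(g\) as the identity in its linear region, gives

\(\frac{dL}{da^{(k)}} = \frac{dL}{da^{(k+2)}}\left(\frac{dF(a^{(k)})}{da^{(k)}} + 1\right)\)

Even if \(\frac{dF}{da^{(k)}}\) becomes very small, the “+1” contributed by the skip passes the gradient back essentially unattenuated.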

Besides mitigating the vanishing gradient issue, the skip connections carry information from earlier layers, which corresponds to lower-level features. Adding this information to the higher-level, more abstract information extracted by the layers which follow contributes to better performance.

Note: see also “Understanding and Coding a ResNet in Keras”: https://towardsdatascience.com/understanding-and-coding-a-resnet-in-keras-446d7ff84d33