A Guide To YOLOv3



Introduction to Object Detection

The task of a CNN object detection model is twofold: it both classifies objects within an image according to the dataset labels, and estimates the locations of the objects’ bounding boxes. The diagram below illustrates an input image on the left, and the classification and bounding box annotation results on the right.

Animation: Image Class and Bounding Box Annotations

alt text

A detection model normally outputs 2 vectors per detected object:

  • A classification output vector, holding the estimated probability of each dataset label. The vector length is \(N_{classes}\), i.e. the number of classes in the dataset. The decision is usually taken by applying a softmax operator to the vector.
  • A vector with the predicted location of a bounding box which encloses the object. The location can be represented in various formats, as illustrated in the diagram below.

Representation Formats: (1) Bbox Vertices. (2) Bbox Center + Dimensions:

alt text
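To make the two formats concrete, here is a small illustrative sketch (the helper names are chosen for this guide only) converting between the vertex format (x1, y1, x2, y2) and the center + dimensions format (cx, cy, w, h):

    import numpy as np

    def corners_to_center(box):
        # (x1, y1, x2, y2) -> (cx, cy, w, h)
        x1, y1, x2, y2 = box
        return np.array([(x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1])

    def center_to_corners(box):
        # (cx, cy, w, h) -> (x1, y1, x2, y2)
        cx, cy, w, h = box
        return np.array([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])

    # Example: a 112 x 64 box centered at (104, 144)
    print(center_to_corners(np.array([104, 144, 112, 64])))   # [ 48. 112. 160. 176.]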

Object Detection Models

YOLO was an innovative breakthrough in the field of CNN object detection algorithms. This section briefly reviews 3 object detection approaches which preceded YOLO.

Plain CNN Model

This is a conventional CNN classification model, but the classification output stage is enhanced by a regression predictor for the prediction of a bounding box. Implementation is simple; however, it is limited to the detection of a single object.

The illustrative diagram which follows presents an image with 3 shape objects. The model, in the best case, will detect only one of the object shapes. Such detection models, with a detection capability of a single object, are often referred to as Object Localization models.

Figure: Plain CNN Model

alt text

Sliding Window Model

To address the single-object detection limitation, the CNN is repeatedly activated inside the boundaries of a window, as it slides along the image, as illustrated in the animation diagram below.

To fit various object sizes, multiple window sizes should be used, as illustrated in the animation. Alternatively (or in addition), the sliding window should run over multiple scales of the image.

Location can be determined by window’s region, and the offset of the bounding box within the sliding window position.

Figure: Sliding Window Animation

alt text

There are drawbacks to this model: the repeated deployment of the CNN model per window position imposes a heavy computation load. Moreover, since the convolution span regions are limited by the window’s size and position, which may be uncorrelated with the positions and sizes of the image’s regions of interest, objects may be cropped or missed by the model.

R-CNN

R-CNN (by Ross Girshick et al, UC Berkeley, 2014), which stands for Regions with CNN features, addresses the drawbacks of the Sliding Window model. The idea of R-CNN is in essence a 3-step process:

  1. Extract region proposals - 2000 regions were stated in the original paper. Further processing is limited to the proposed regions. There are a number of algorithms which can produce region proposals; the authors used Selective Search (J. Uijlings, K. van de Sande, T. Gevers, and A. Smeulders, Selective search for object recognition, IJCV, 2013).
  2. Deploy a CNN with bounding box regression over each proposed region.
  3. Classify each region - originally using a Linear SVM; in later variants of the model, e.g. Fast R-CNN, Softmax was deployed.

Figure: Region Proposals

![alt text](/assets/images/images/yolo/image-classification-rcnn.jpg)

This is just a brief description of the algorithm, which at the time contributed to a dramatic improvement in the performance of CNN detection models. R-CNN was later followed by improved variants such as Fast R-CNN (Girshick, Ross, "Fast R-CNN", Proceedings of the IEEE International Conference on Computer Vision, 2015) and Faster R-CNN (Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun, 2016). These models aimed to address R-CNN's problems, amongst them real-time performance issues, the long training time over the 2000 regions, and the region selection process.

A Brief Introduction to YOLOv3

This article is about YOLO (You Only Look Once), and specifically its 3rd version, YOLOv3 (YOLOv3: An Incremental Improvement, Joseph Redmon, Ali Farhadi, 2018).

As presented above, the common practice in various algorithms prior to YOLO was to run a CNN over many regions. The computation cost of this practice is high.

YOLO segments the image as well. However, YOLO’s way of segmenting the image is entirely different: instead of running a CNN over thousands of regions separately, it runs the CNN only once over the entire image. This is a huge difference, which makes YOLO so much faster - YOLOv3 is about 1000 times faster than R-CNN.

So how can YOLO be so fast, while still segmenting the images?

Answer: YOLO functionally segments the image into a grid of cells. But rather than running a CNN separately on each cell, it runs the CNN once.

YOLO’s CNN predicts a detection descriptor for each grid cell. A descriptor combines 2 vectors:

  1. A classification result vector which holds a classification probability for each of the dataset’s classes.
  2. The bounding box location $x, y, w, h$, along with an Objectness prediction which indicates the probability that an object resides in the Bbox.

The animation below illustrates the bounding box parameters, which consist of the center location \(c_x, c_y\), and the width and height.

Gridded Image Animation: Center location \(c_x, c_y\), Width and Height

alt text

The following diagram illustrates the detection descriptor which YOLO predicts per detected object. The descriptor consists of 5 words - 4 describing the bounding box location plus the Objectness probability - followed by N class probabilities.

Detection Descriptor

alt text

As a matter of fact, YOLOv3 supports not just a single detection per cell, but 3 detections per cell. Accordingly, YOLO’s predicted descriptor per cell is as illustrated in the diagram below.

Detection 3 Descriptors

alt text

So that was the output for a single grid cell. But CNN assigns such a detection descriptor to each of the grid cells. Considering a 13x13 grid, the CNN output looks like this:

YOLOv3 - CNN Output

alt text

Grid Construction

So, how is the grid constructed?

The grid is constructed by passing the image through a CNN with a downsampling stride. The image size is 416 x 416 x 3, so assuming a total downsampling stride of 32 (which is actually the case), the output dimensions would be a 13 x 13 x N box, where:

$N = 3 \times (5 + N_{classes})$

That output structure is the box diagram illustrated above.

To enhance detection performance for smaller objects, the YOLOv3 CNN generates output at 3 grid scales simultaneously: a 13 x 13 grid (as depicted above), and also 26 x 26 and 52 x 52 grids.
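As a quick sanity check of these dimensions (assuming the COCO dataset with 80 classes; the snippet is illustrative only):

    # Output depth per grid cell: 3 boxes, each with (x, y, w, h, objectness) + class probabilities
    n_classes = 80                       # e.g. COCO
    depth = 3 * (5 + n_classes)          # = 255

    for stride in (32, 16, 8):           # coarse, medium and fine paths
        grid = 416 // stride             # 13, 26, 52
        print(f"stride {stride}: {grid} x {grid} x {depth}")
    # stride 32: 13 x 13 x 255
    # stride 16: 26 x 26 x 255
    # stride 8: 52 x 52 x 255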

YOLOv3 Block Diagrams

The block diagrams below describe the YOLOv3 Forwarding and Training operation. The following chapters of this article present a detailed description of the 2 operation modes.

YOLOv3 Block Diagram: Forwarding

alt text

YOLOv3 Block Diagram: Training

alt text

The next 2 chapters detail the functionality of the Training and Forwarding modes, following the block diagrams presented above.

YOLOv3 Training Functionality

This section details the YOLOv3 Training functionality, following the Training block diagram presented above:

  1. Training Dataset
  2. Pre-Process Image
  3. CNN Model and Decode
  4. Loss Calculation
  5. Gradient Descent Update

1. Training Dataset

The training dataset consists of both image examples and their related metadata. The metadata is created for each of the 3 scales, and is arranged to have the same structure as the detection descriptor presented in the YOLOv3 introduction section.

For convenience, the diagram is posted here again:

Training Arranged Metadata

alt text

Let’s illustrate the generation of that metadata with an example:

Below is a training image example:

Training Image Example with Bounding Box Annotations

alt text

The table below presents the 4 objects’ metadata:

| # | x | y | w | h | Objectness | Class |
|---|---|---|---|---|---|---|
| 1 | 104 | 144 | 112 | 64 | 1 | Trapezoid |
| 2 | 250 | 180 | 98 | 104 | 1 | Circle |
| 3 | 120 | 272 | 108 | 77 | 1 | Hexagon |
| 4 | 278 | 336 | 115 | 83 | 1 | Ellipse |

To construct the training label records, some data arrangements should be taken:

Class Data - should be arranged as a list of $N_{class}$ probabilities. Here $N_{class}$ is 6, since the dataset consists of 6 classes: Trapezoid, Circle, Hexagon, Ellipse, Square and Triangle.

So the representation should be in a one-hot format, like so:

Trapezoid: 1, 0, 0, 0, 0, 0
Circle: 0, 1, 0, 0, 0, 0
Hexagon: 0, 0, 1, 0, 0, 0
Ellipse: 0, 0, 0, 1, 0, 0

Still, to improve performance, we apply Label Smoothing, as proposed by Szegedy et al in Rethinking the Inception Architecture for Computer Vision.

With Label Smoothing, the one-hot probability of y given x, marked by \(p(y|x) = \delta_{y,x}\), is smoothed according to the formula below:

Label Smoothing Formula

\(p_{smoothed}(y|x)=(1-\epsilon)\delta_{y,x}+\epsilon \cdot u(y)\)

Where:

  • \(\delta_{y,x}\) is the original one-hot probability
  • \(\epsilon\) is the smoothing parameter, taken as \(\epsilon=0.01\)
  • \(u(y)\) is the distribution over labels, here assumed uniform, i.e. \(u(y)=\frac{1}{6}\).

Plugging the above into the Label Smoothing formula gives:

Objects Class Label Smoothed Probabilities

| | p(Trapezoid) | p(Circle) | p(Hexagon) | p(Ellipse) | p(Square) | p(Triangle) |
|---|---|---|---|---|---|---|
| Trapezoid | 0.991667 | 1.667e-3 | 1.667e-3 | 1.667e-3 | 1.667e-3 | 1.667e-3 |
| Circle | 1.667e-3 | 0.991667 | 1.667e-3 | 1.667e-3 | 1.667e-3 | 1.667e-3 |
| Hexagon | 1.667e-3 | 1.667e-3 | 0.991667 | 1.667e-3 | 1.667e-3 | 1.667e-3 |
| Ellipse | 1.667e-3 | 1.667e-3 | 1.667e-3 | 0.991667 | 1.667e-3 | 1.667e-3 |
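A minimal sketch of this computation (assuming numpy and the 6-class shapes dataset of this example):

    import numpy as np

    def smooth_one_hot(class_index, n_classes=6, epsilon=0.01):
        # p_smoothed(y|x) = (1 - eps) * delta_{y,x} + eps * u(y), with u(y) uniform
        one_hot = np.zeros(n_classes)
        one_hot[class_index] = 1.0
        return (1.0 - epsilon) * one_hot + epsilon / n_classes

    print(smooth_one_hot(0))   # Trapezoid: [0.99166667 0.00166667 ... 0.00166667]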

The training data is used for loss function computation at the 3 grid scales. To make the training data ready for these loss computations, we pack the training labels into 3 label arrays, each related to a grid scale.

The diagram below shows a 13x13 grid over the image:

Training Image with Annotations and a 13x13 Grid

alt text

The coarse 13x13 grid diagram shows that the 4 objects are located at cells (3, 4), (7, 5), (3, 8) and (8, 10).

The table below presents the objects’ related cells for each of the 3 grids (cell index = int(coordinate / stride)).

| Grid Size | Object #1 Cell | Object #2 Cell | Object #3 Cell | Object #4 Cell |
|---|---|---|---|---|
| 13x13 | 3, 4 | 7, 5 | 3, 8 | 8, 10 |
| 26x26 | 6, 9 | 15, 11 | 7, 17 | 17, 21 |
| 52x52 | 13, 18 | 31, 22 | 15, 34 | 34, 42 |

To make the training data ready for loss computations, we pack it into 3 label arrays, with shape:

\(\text{labels.shape} = \text{Batch} \times \text{Grid Size} \times N_{boxes} \times (5+N_{classes})\)

In our example:

\(\text{coarse labels.shape} = \text{Batch} \times 13 \times 13 \times 3 \times 11\)

\(\text{medium labels.shape} = \text{Batch} \times 26 \times 26 \times 3 \times 11\)

\(\text{fine labels.shape} = \text{Batch} \times 52 \times 52 \times 3 \times 11\)

Now let’s fill the data into the label arrays:

coarse-grid labels

The network path to the coarse grid output has a total stride of 32, so the related grid cell indices are:

index_x1, index_y1 = int(104/32), int(144/32)
index_x2, index_y2 = int(250/32), int(180/32)
index_x3, index_y3 = int(120/32), int(272/32)
index_x4, index_y4 = int(278/32), int(336/32)
index_x1, index_y1 = 3, 4
index_x2, index_y2 = 7, 5
index_x3, index_y3 = 3, 8
index_x4, index_y4 = 8, 10

medium-grid labels

The network path to the medium grid output has a total stride of 16, so the related grid cell indices are:

index_x1, index_y1 = 6, 9
index_x2, index_y2 = 15, 11
index_x3, index_y3 = 7, 17
index_x4, index_y4 = 17, 21

fine-grid labels

The network path to the fine grid output has a total stride of 8, so the related grid cell indices are:

index_x1, index_y1 = 13, 18
index_x2, index_y2 = 31, 22
index_x3, index_y3 = 15, 34
index_x4, index_y4 = 34, 42

Let Batch=0:

coarse-grid-labels[0,3,4,0,:]   = (104,144,112,64,1,0.991667,1.667e-3,1.667e-3,1.667e-3,1.667e-3,1.667e-3)
coarse-grid-labels[0,7,5,0,:]   = (250,180,98,104,1,1.667e-3,0.991667,1.667e-3,1.667e-3,1.667e-3,1.667e-3)
coarse-grid-labels[0,3,8,0,:]   = (120,272,108,77,1,1.667e-3,1.667e-3,0.991667,1.667e-3,1.667e-3,1.667e-3)
coarse-grid-labels[0,8,10,0,:]  = (278,336,115,83,1,1.667e-3,1.667e-3,1.667e-3,0.991667,1.667e-3,1.667e-3)

medium-grid-labels[0,6,9,0,:]   = (104,144,112,64,1,0.991667,1.667e-3,1.667e-3,1.667e-3,1.667e-3,1.667e-3)
medium-grid-labels[0,15,11,0,:] = (250,180,98,104,1,1.667e-3,0.991667,1.667e-3,1.667e-3,1.667e-3,1.667e-3)
medium-grid-labels[0,7,17,0,:]  = (120,272,108,77,1,1.667e-3,1.667e-3,0.991667,1.667e-3,1.667e-3,1.667e-3)
medium-grid-labels[0,17,21,0,:] = (278,336,115,83,1,1.667e-3,1.667e-3,1.667e-3,0.991667,1.667e-3,1.667e-3)

fine-grid-labels[0,13,18,0,:]   = (104,144,112,64,1,0.991667,1.667e-3,1.667e-3,1.667e-3,1.667e-3,1.667e-3)
fine-grid-labels[0,31,22,0,:]   = (250,180,98,104,1,1.667e-3,0.991667,1.667e-3,1.667e-3,1.667e-3,1.667e-3)
fine-grid-labels[0,15,34,0,:]   = (120,272,108,77,1,1.667e-3,1.667e-3,0.991667,1.667e-3,1.667e-3,1.667e-3)
fine-grid-labels[0,34,42,0,:]   = (278,336,115,83,1,1.667e-3,1.667e-3,1.667e-3,0.991667,1.667e-3,1.667e-3)
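A minimal numpy sketch of this packing step, under the assumptions of this example (batch size 1, 6 classes, and anchor slot 0 used for every object; variable names are illustrative):

    import numpy as np

    n_classes, strides = 6, (32, 16, 8)                   # coarse, medium, fine
    grids = [416 // s for s in strides]                   # 13, 26, 52
    labels = [np.zeros((1, g, g, 3, 5 + n_classes)) for g in grids]

    # (x, y, w, h, class_index) per object, as in the metadata table above
    objects = [(104, 144, 112, 64, 0), (250, 180, 98, 104, 1),
               (120, 272, 108, 77, 2), (278, 336, 115, 83, 3)]

    for x, y, w, h, cls in objects:
        smoothed = np.full(n_classes, 0.01 / n_classes)   # label smoothing, eps = 0.01
        smoothed[cls] = (1 - 0.01) + 0.01 / n_classes
        for label, stride in zip(labels, strides):
            ix, iy = int(x / stride), int(y / stride)     # grid cell indices
            label[0, ix, iy, 0, :] = np.concatenate(([x, y, w, h, 1.0], smoothed))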

2. Pre-Process Image

The input images should be resized to 416 x 416 x 3, while preserving the original aspect ratio.

Here is pseudo code for the image resize, followed by an illustrative example.

Note: tf.image.resize has a preserve_aspect_ratio argument, so one could consider using it instead.


    import numpy as np
    import cv2

    yolo_h, yolo_w = 416, 416
    orig_h, orig_w, _ = image.shape

    # Scale factor which preserves the original aspect ratio
    scale = min(yolo_w / orig_w, yolo_h / orig_h)
    scaled_w, scaled_h = int(scale * orig_w), int(scale * orig_h)
    resized_image = cv2.resize(image, (scaled_w, scaled_h))

    # Pad with a constant gray value up to 416 x 416
    padded_image = np.full(shape=[yolo_h, yolo_w, 3], fill_value=128.0)
    d_w, d_h = (yolo_w - scaled_w) // 2, (yolo_h - scaled_h) // 2
    padded_image[d_h:scaled_h + d_h, d_w:scaled_w + d_w, :] = resized_image

Example

Here’s an illustration of the above pseudo code.

Input Image

alt text

orig_h, orig_w = 200, 300

scale = min(416/300, 416/200)
scale = 1.386666667

scaled_w, scaled_h  = int(1.386666667 * 300), int(1.386666667 * 200)
scaled_w, scaled_h  = 416, 277

d_w, d_h = 0, 69

Illustrating Animation

alt text

3. CNN Model and Decode

This section presents the YOLOv3 CNN, along with its output decoding part; both are parts of the model’s graph.

The YOLOv3 CNN is an FCN - a Fully Convolutional Network - as it is composed of convolution modules only, without any fully connected component.

The CNN uses the Darknet-53 network as its backbone.

Below is a high-level block scheme of the YOLOv3 CNN, followed by a more detailed diagram of the same network.

YOLOv3 CNN High Level Block Diagram

alt text

YOLOv3 CNN Detailed Block Diagram

alt text

Looking at the above diagrams, one can observe 3 sub-module types:

  1. Darknet-53, the CNN’s backbone.
  2. Three CNN paths, one per grid scale.
  3. Decode modules, which post-process the CNN output before the loss function computation.

The next section drills into these modules, providing detailed insights into the architecture.

Darknet-53

Take a look at the Darknet-53 part in the above block diagram and note that:

  • Darknet-53 is structured as a cascade of ConvBlocks and ResBlocks.
  • The x1, x2, x8, x8, x4 notations on top of the ResNet blocks in the diagram above indicate how many times the same module is repeated.
  • Each of the 5 ConvBlocks downsamples by 2 (stride=2), for a total stride of 32 at the top stage, and 16 and 8 at the stages before it. Those stages feed the coarse, medium and fine scale grids respectively.

Below is a block diagram of the ConvBlock; the ResBlock follows after it.

ConvBlock

alt text

  • As depicted by the diagram, a ConvBlock is a structure which combines a Conv2D module, a Batch Normalization module and a ReLU activation at the top - except for the output stages, where activation is not applied. (A hedged code sketch follows this list.)

  • ConvBlocks come in 2 flavors: with or without downsampling. The downsampling (stride=2) flavor is used only within the Darknet-53 block.

  • The ConvBlocks’ kernel size is 3 inside Darknet-53, while after that, the kernel size alternates between 1 x 1 with N=512 filters and 3 x 3 with N=1024 filters.
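As an illustration only (not the reference implementation), a ConvBlock along these lines could be sketched in TensorFlow/Keras as follows; the downsample and activate flags are assumptions made for this sketch:

    import tensorflow as tf

    def conv_block(x, filters, kernel_size=3, downsample=False, activate=True):
        # Conv2D -> Batch Normalization -> ReLU (activation skipped at the output stages)
        strides = 2 if downsample else 1
        x = tf.keras.layers.Conv2D(filters, kernel_size, strides=strides,
                                   padding='same', use_bias=False)(x)
        x = tf.keras.layers.BatchNormalization()(x)
        if activate:
            x = tf.keras.layers.ReLU()(x)
        return x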

ResBlock

alt text

The ResBlock sums the output of 2 ConvBlocks with the input data x, aka the skipped input. (If the ResBlock looks familiar - it is indeed a reuse of the structure presented in the famous ResNet model.)

The ResBlock structure provides 2 contributions (a code sketch follows this list):

  1. It helps solve the vanishing gradient problem. (Vanishing gradient reminder: during training back-propagation, gradients are calculated using the chain rule, so a gradient is a multiplication product of previous gradients with its conv block's partial derivative. When the network is deep, gradient values become smaller in the layers closer to the top of the network.) Since the ResBlock's skip connection gradient is unity, the vanishing gradient is prevented.
  2. Mixing the skip layer with the convolutional layers refines the feature extraction, which benefits both from the finer details carried by the skipped data and from the features provided by the convolutional layers' output.
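Continuing the same illustrative sketch, a ResBlock could then be written as:

    def res_block(x, filters):
        # Sum the output of 2 ConvBlocks with the skipped input x
        shortcut = x
        x = conv_block(x, filters)
        x = conv_block(x, filters)
        return tf.keras.layers.Add()([shortcut, x])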

Why named Darknet-53?

Darknet-53 is so named because it has 53 layers. However, counting all the Conv2D elements in Darknet-53 (considering 2 Conv2D elements per ResBlock) gives 1+1+2+1+4+1+16+1+16+1+8=52, not 53.

It does, however, have 53 layers: the 53rd is the fully connected output layer at the top, which is omitted when Darknet-53 is deployed in YOLOv3.

3 Scale Paths

YOLOv3's most noticeable improvement with respect to earlier YOLO versions was detection at 3 scales. This improvement enhanced small-object detection performance, which was a weakness of previous YOLO versions.

As depicted by the CNN diagram above, the 3 scale paths are similarly structured, each with 7 convolution blocks. However, from one scale to the next, the input data stride and the ConvBlocks' number of filters are each reduced by a factor of 2.

The Concatenation Block

The medium and fine grained paths have an extra Concatenation Block (see the diagram's 3rd and 4th rows from the top). It concatenates data sourced from Darknet-53's intermediate stages with upsampled data from the preceding scale level path.

alt text

Why concatenate? The concatenation module's contribution is quite similar to the ResBlock's effect on feature detection refinement. Still, concatenation is applied rather than summation, since the 2 data sources come from different network stages, so there is no point in summing them.

Decode Module

The Decode module is applied to the CNN output, namely x, y, w, h, Objectness and the Class Probabilities, preparing it for the Loss Function computation.

The table below summarizes the Decode functionality over the data. It is followed by a detailed description of each Decode step.

| Input | Operation | Output |
|---|---|---|
| x, y | Limit values to 0 <= x, y <= 1 by applying Sigmoid, then upscale the coordinates to the 416x416 scale | x, y in 416x416 scale |
| w, h | Decode the w, h values (using anchors), then upscale to the 416x416 scale | w, h in 416x416 scale |
| Objectness probability | Limit the value to 0 <= Obj <= 1 by applying Sigmoid | 0 <= Obj <= 1 |
| Class probabilities | Limit the values to 0 <= P <= 1 by applying Sigmoid | 0 <= P <= 1 |

Decode x,y Coordinates

Let x, y be the CNN's predicted values for the location of a bounding box center. Given x and y, the bounding box center is computed as presented in the diagram below and in the expressions which follow.

alt text

$x_c = c_x + \sigma(x)$

$y_c = c_y + \sigma(y)$

Decode w,h Coordinates

To improve the performance of bounding box prediction, YOLOv3 uses anchors for the width and height prediction. An anchor consists of a width and a height parameter. YOLOv3 is provisioned with 9 anchors, 3 per scale.

The 9 anchor boxes are generated by performing k-means clustering on the dimensions of the training data boxes. After that, the 9 anchors are distributed among the 3 scale paths in descending order - the 3 largest to the coarse scale path and the 3 smallest to the fine scale path.

Accordingly, the CNN does not compute the width and height directly, but only the parameters of the formulas listed below:

  1. $w=exp(w)*\textrm{anchor_w}$

  2. $h=exp(h)*\textrm{anchor_h}$

Note that amongst all decoded parameters, only w and h are not activated by a Sigmoid, since their values are not restricted to be at most 1.

Decode Objectness

Objectness holds the probability that an object resides within the cell. Decode applies a Sigmoid to this parameter, thus constraining its value to the range 0 <= Objectness <= 1:

$objectness = sigmoid(objectness)$

Decode Class Probability

The class probabilities are also activated by a Sigmoid. Alternatively, a Softmax could be applied, which would require a different loss function than the one we use here.

$\textrm{class prob} = sigmoid(\textrm{class prob} )$
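Putting the decode steps together, here is a hedged sketch (the raw CNN outputs are assumed to be already split into the named tensors; grid_xy holds the per-cell offsets \(c_x, c_y\), and anchors and stride are per scale):

    import tensorflow as tf

    def decode(raw_xy, raw_wh, raw_obj, raw_cls, grid_xy, anchors, stride):
        # Bounding box center: cell offset plus sigmoid of the raw prediction, scaled to 416x416
        box_xy = (grid_xy + tf.sigmoid(raw_xy)) * stride
        # Width / height: exponent of the raw prediction times the anchor dimensions
        box_wh = tf.exp(raw_wh) * anchors
        # Objectness and class probabilities are squashed into [0, 1]
        objectness = tf.sigmoid(raw_obj)
        class_prob = tf.sigmoid(raw_cls)
        return box_xy, box_wh, objectness, class_prob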

4. Loss Calculation

The Loss Function measures the difference between the expected results and the predicted output; the final objective is to minimize this difference. The minimization is performed by an optimization algorithm, which is the next block, covered in the next section.

The overall loss is a sum of all prediction losses, i.e.:

  • Bounding Box Prediction Loss
  • Objectness Prediction Loss
  • Class Prediction Loss

The subsections which follow detail each of the 3.

Bounding Box Prediction Loss

The Bounding Box Prediction Loss depends on the losses of the 4 predicted parameters, i.e. x, y, w, h. There are various candidates for the loss function; here we will use IOU.

Alternatively, the loss could be taken directly as a sum of x,y,w,h prediction errors like so:

$x_{loss} = \sum_{i=0}^{N_{bbox}}(x^i_{true} - x^i_{predicted})^2$

$y_{loss} = \sum_{i=0}^{N_{bbox}}(y^i_{true} - y^i_{predicted})^2$

$w_{loss} = \sum_{i=0}^{N_{bbox}}(w^i_{true} - w^i_{predicted})^2$

$h_{loss} = \sum_{i=0}^{N_{bbox}}(h^i_{true} - h^i_{predicted})^2$

Where number of bounding boxes is:

$N_{bbox} = BatchSize * GridSize * BoxesInGridCell$

(e.g. for the coarse scale grid, where BatchSize=10, GridSize=13x13=169, BoxesInGridCell=3: $N_{bbox}=10 \times 169 \times 3=5070$)

But as noted, here we use IOU for Bounding Box Loss calculation.

IOU In Brief: IOU (Intersection over Union) is a term used to describe the extent of overlap of two boxes. The greater the region of overlap, the greater the IOU

IOU is a metric used in object detection benchmarks to evaluate how close the predicted and ground truth bounding boxes are. IOU, as its name indicates, is expressed as the ratio between the intersection area and the union area of the 2 boxes. IOU expression is listed below, followed by an illustrative diagram.

$IOU=\frac{S_{true}\cap S_{pred}}{S_{true} \cup S_{pred}}$

alt text

IOU Loss Function

$iou_{loss} = 1 - iou$

Now that IOU is clear, here's an IOU drawback: it is indifferent to how far apart non-overlapping boxes are - whenever the intersection is zero, the IOU is 0, as illustrated in the diagram below.

IOU Zero Intersection

alt text

To overcome this drawback, we introduce GIoU - a modified IOU algorithm ([Hamid Rezatofighi et al, Generalized Intersection over Union: A Metric and A Loss for Bounding Box Regression](https://arxiv.org/abs/1902.09630)).

GIOU

GIoU stands for Generalized IoU. It adds a closeness criterion for the bounding boxes, by considering the difference between the minimal area which encloses both boxes and the boxes' union area:

$GIoU=IoU - \frac{S_{enclosed} - {(S_{true} \cup S_{pred})}}{S_{enclosed}}$

The animated diagrams below illustrate $S_{enclosed}$ as a function of the boxes' closeness.

alt text

The final expression used for the GIoU Loss adds the consideration of whether there is indeed an object in the cell. This is expressed by $Objectness_{true}$, the ground truth objectness, whose value is True if an object indeed exists in the cell, and False otherwise.

$giou_{loss} = Objectness_{true} * (1-giou) $
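For reference, a plain-Python sketch of IoU and GIoU for two boxes in (x1, y1, x2, y2) corner format (the helper name and example values are illustrative):

    def iou_and_giou(box_a, box_b):
        # Areas of the two boxes, each given as (x1, y1, x2, y2)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])

        # Intersection rectangle (zero if the boxes do not overlap)
        inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
        inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
        intersection = inter_w * inter_h
        union = area_a + area_b - intersection
        iou = intersection / union

        # Smallest box enclosing both boxes
        enclose_w = max(box_a[2], box_b[2]) - min(box_a[0], box_b[0])
        enclose_h = max(box_a[3], box_b[3]) - min(box_a[1], box_b[1])
        enclosed = enclose_w * enclose_h
        giou = iou - (enclosed - union) / enclosed
        return iou, giou

    iou, giou = iou_and_giou((48, 112, 160, 176), (60, 120, 170, 180))
    giou_loss = 1.0 - giou   # multiplied by the ground-truth objectness in the full loss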


GIoU Loss - Array Shape

The GIoU Loss, like the 2 other loss functions, is calculated per bounding box. Accordingly, the loss shape is:

  • Batch x 13 x 13 x 3 for the coarse grid
  • Batch x 26 x 26 x 3 for the medium grid
  • Batch x 52 x 52 x 3 for the fine grid

Objectness Prediction Loss

Objectness expresses the probability of an object existing in the cell. The Objectness CNN output passes through a Sigmoid activation block, as detailed in the Decode section of this article.

Consequently, the loss function can be computed with tf.nn.sigmoid_cross_entropy_with_logits():

$\textbf{sigmoid_cross_entropy_with_logits(labels=objectness_true, logits=objectness_pred)}$

The Objectness Loss is ignored when there is no object in the cell but maxIoU (the maximal IoU computed for this cell) is above the threshold, since this may still indicate the presence of an object.

Table below lists the various states:

| Objectness Ground Truth | maxIoU < IoULossThresh | Objectness Loss |
|---|---|---|
| True | True | Valid |
| True | False | Valid |
| False | True | Valid |
| False | False | Ignore |

Final expression for Objectness Loss:

\(objectness_{loss} = (objectness_{true} +(1.0 - objectness_{true}) * ( maxIoU < IoULossThresh) )* \text{sigmoid_cross_entropy_with_logits}(objectness_{true}, objectness_{pred})\)
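A hedged TensorFlow sketch of this expression (tensor names are illustrative; objectness_true is 1.0 where an object exists and 0.0 otherwise, and the 0.5 threshold is an assumption):

    import tensorflow as tf

    def objectness_loss(objectness_true, objectness_pred_logits, max_iou, iou_loss_thresh=0.5):
        # Cells with an object always contribute; empty cells contribute only if their best IoU
        # is below the threshold (otherwise the loss is ignored, as in the table above)
        low_iou_mask = tf.cast(max_iou < iou_loss_thresh, tf.float32)
        respond = objectness_true + (1.0 - objectness_true) * low_iou_mask
        ce = tf.nn.sigmoid_cross_entropy_with_logits(labels=objectness_true,
                                                     logits=objectness_pred_logits)
        return respond * ce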

Class Prediction Loss

Similarly to the Objectness prediction, the classification predictions are also passed through a Sigmoid activation, as detailed in the Decode section.

Consequently, tf.nn.sigmoid_cross_entropy_with_logits() is used here too as the loss function.

The Classification Loss is ignored if there is no object in the cell, i.e. when $objectness_{true} = False$. The class loss is considered only when the Objectness ground truth value is True, i.e. there is an object in the cell.

So the final expression for the Classification Loss is:

$class_{loss} = obj_{true} * \text{sigmoid_cross_entropy_with_logits}(class_{true}, class_{pred})$

Classification Loss Shape

The Classification Loss, like the 2 other loss functions, is calculated per bounding box, with one value per class. Accordingly, its shape is:

$Classification_{loss}.shape$ = Batch x grid_size x grid_size x 3 x num_of_classes

Where:

  • grid_size is 13, 26 or 52 for the coarse, medium and fine grids respectively.
  • num_of_classes is e.g. 80 for the COCO dataset.

Total Loss Function

The total loss is the sum of all 3 losses over all grid cells, in all 3 grid scales:

total_loss = giou_loss + conf_loss + prob_loss

5. Gradient Descent Update

alt text

Running the training step under tf.GradientTape ensures that all trainable variables are watched. Given the recorded tape, the gradient computation and the model update which follows are straightforward.

    with tf.GradientTape() as tape:
        pred_result = model(image_data, training=True)
        total_loss = compute_loss(pred_result, labels)

    # Compute the gradients of the total loss w.r.t. all trainable variables, then update the model
    gradients = tape.gradient(total_loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))

YOLOv3 Forwarding Functionality

YOLOv3 forwarding was depicted by the YOLOv3 Forwarding block diagram above.

Most of the functional blocks are the same as those detailed above for Training mode. However, there is no loss computation and no gradient descent learning mechanism; instead, the weights are either loaded from a file or were pre-trained beforehand.

The Post Process, marked by #5 in the diagram, is unique to the forwarding process. Let's drill into it.

Forwarding Post Process

alt text

Bounding Boxes Candidates

The Post Processing block selects the final detections amongst all the detection candidates it receives. More precisely, it selects the relevant detections amongst the 10647 bounding box candidates generated by the CNN.

10647 bounding boxes? Let’s show that:

The CNN generates 13x13x3 detection descriptor records on the coarse scale path, along with 26x26x3 and 52x52x3 records generated by the medium and fine scale paths respectively.

The total number of input records is then:

13x13x3 + 26x26x3 + 52x52x3 = 10647

The structure of each detection descriptor record is illustrated (again) by the diagram below:

alt text

Next, these 10647 bounding boxes pass through 2 selection filters:

  • Low Probability Candidates Mask
  • NMS

These are discussed next.

Rescale Bounding Boxes Coordinates

The bounding box coordinates are rescaled to fit the original image dimensions.

Let's illustrate that rescaling with an example.

Below is the CNN's 416x416 output image, with a bounding box annotation around the ellipse object:

alt text

The original dimensions are 200 x 300 (height x width).

The computation of the offset shift and resize ratio is listed below. An animated illustration follows.


\(h_{original}, w_{original} = 200, 300\)

\(\text{resize_ratio} = min(\frac{416}{w_{original}}, \frac{416}{h_{original}})\)

\(\text{resize_ratio} = min(\frac{416}{300}, \frac{416}{200}) = 1.386666667\)

\(d_h = int((416 - \text{resize_ratio} \cdot h_{original})/2) = int((416 - 1.386666667 \cdot 200)/2) = 69\)

Bbox Rescale Illustration

alt text
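A minimal sketch of mapping a predicted box from the 416x416 padded frame back to the original image frame (the inverse of the pre-processing step; function and variable names are illustrative):

    def rescale_to_original(box_416, orig_h, orig_w, yolo_size=416):
        # Undo the resize-and-pad pre-processing: remove the padding offset, then divide by the ratio
        resize_ratio = min(yolo_size / orig_w, yolo_size / orig_h)
        d_w = (yolo_size - resize_ratio * orig_w) / 2
        d_h = (yolo_size - resize_ratio * orig_h) / 2
        x1, y1, x2, y2 = box_416
        return ((x1 - d_w) / resize_ratio, (y1 - d_h) / resize_ratio,
                (x2 - d_w) / resize_ratio, (y2 - d_h) / resize_ratio)

    # Example with the 200 x 300 image above
    print(rescale_to_original((100, 100, 300, 200), orig_h=200, orig_w=300))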

Low Probability Candidates Filter

This module filters out bbox candidates with low probability. The score of a box is the product of its objectness confidence (pred_conf) and the probability of its predicted class; boxes whose score falls below a threshold are discarded:

    # Per-box score: objectness confidence times the probability of the predicted class
    scores = pred_conf * pred_prob[np.arange(len(pred_coor)), classes]
    # Keep only boxes whose score exceeds the threshold
    score_mask = scores > score_threshold

NMS

Non-Max Suppression (NMS) aims to remove bounding box overlaps. NMS selects the bounding boxes with the highest predicted class probability, and discards all bounding boxes of the same class category which have a high overlap with the selected boxes. The amount of overlap is measured by IOU.

The algorithm is described in the pseudo code flow chart:

alt text
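Since the flow chart above is an image, here is a hedged numpy sketch of greedy NMS for boxes of a single class, in (x1, y1, x2, y2) format (the 0.45 IoU threshold is an assumption):

    import numpy as np

    def nms(boxes, scores, iou_threshold=0.45):
        # Greedy NMS: keep the highest-scoring box, drop boxes that overlap it too much, repeat
        x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
        areas = (x2 - x1) * (y2 - y1)
        order = np.argsort(scores)[::-1]                  # box indices, best score first
        keep = []
        while order.size > 0:
            best, rest = order[0], order[1:]
            keep.append(best)
            # IoU of the best box against all remaining boxes
            iw = np.maximum(0.0, np.minimum(x2[best], x2[rest]) - np.maximum(x1[best], x1[rest]))
            ih = np.maximum(0.0, np.minimum(y2[best], y2[rest]) - np.maximum(y1[best], y1[rest]))
            inter = iw * ih
            iou = inter / (areas[best] + areas[rest] - inter)
            order = rest[iou < iou_threshold]             # discard overlapping boxes
        return keep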
