This is the third post in a series of 3 deep learning intro posts.
This post presents in detail the development of the set of equations used to find the network’s optimized set of coefficients. These are the equations used during the Training stage, aka Fitting. The algorithms used are Gradient Descent and Back Propagation. The post begins with short recaps of the Feed Forward process, Cost Functions and Gradient Descent. After that, we will dive into the development of the Back Propagation equations.
In this section we recap the Feed Forward equations, which are needed as background for the current Back Propagation topic. Figure 1, which is re-posted here, presents the journey of the input data through the network’s L layers (in that example L=5), until the predicted value \(\hat{y}\) is generated. Each layer consists of 2 cascaded operators: the first calculates the sum of the weighted input and the bias, and the second is a non-linear activation function. Eq. 1 presents these 2 functions for any layer l.

\(\bar{Z}^{[l]}=\bar{w}^{[l]}\bar{A}^{[l-1]}+\bar{b}^{[l]}\)
\(\bar{A}^{[l]}=g^{[l]}(\bar{Z}^{[l]})\)
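To make the recap concrete, here is a minimal NumPy sketch of one layer’s Feed Forward step (Eq. 1); the sigmoid activation and the function and variable names are illustrative assumptions, not part of the post’s notation:

```python
import numpy as np

def sigmoid(Z):
    # Element-wise sigmoid activation (an assumed example of g)
    return 1.0 / (1.0 + np.exp(-Z))

def forward_layer(A_prev, W, b, g=sigmoid):
    # Eq. 1: Z[l] = w[l] A[l-1] + b[l],  A[l] = g[l](Z[l])
    # A_prev: (n_prev, m) activations of layer l-1, m examples in columns
    # W: (n_l, n_prev) weight matrix; b: (n_l, 1) bias, broadcast over the m columns
    Z = W @ A_prev + b
    A = g(Z)
    return Z, A
```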
Recap ends here. Let’s proceed - we will return to the above equations.
Given the input-output set (x,y) and the prediction result \(\hat{y}\), a cost function measures the difference between the true values and the model’s prediction results. The optimized set of parameters we search for is the one that minimizes the cost function.
We already met 3 types of Cost functions as shown in Eq. 2:
\(C(w,b)=\frac{1}{m}\sum_{i=1}^{m}\left | y^{(i)}-\hat{y}^{(i)} \right |^2\)
\(C(w,b)=\frac{1}{m}\sum_{i=1}^{m}\left | y^{(i)}-\hat{y}^{(i)} \right | \)
\(C(w,b) =-\frac{1}{m}\sum_{i=1}^{m}[y^{(i)}\log(\hat{y}^{(i)})+(1-y^{(i)})\log(1-\hat{y}^{(i)})]\)
Just a side note to prevent confusion between Loss and Cost functions - the Loss function measures the difference between the true value (y) and the model’s prediction result (\(\hat{y}\)) for a single example, while the Cost function is the average over a batch of m Losses, e.g. the average over the entire training set or over a partial batch of it.
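For illustration, the three Cost functions of Eq. 2 could be sketched in NumPy as follows; the array shapes (m examples along the last axis) and the small epsilon guard against log(0) are assumptions added here:

```python
import numpy as np

def mse_cost(Y, Y_hat):
    # Eq. 2a: mean of squared differences over m examples
    m = Y.shape[-1]
    return np.sum((Y - Y_hat) ** 2) / m

def mae_cost(Y, Y_hat):
    # Eq. 2b: mean of absolute differences over m examples
    m = Y.shape[-1]
    return np.sum(np.abs(Y - Y_hat)) / m

def cross_entropy_cost(Y, Y_hat, eps=1e-12):
    # Eq. 2c: binary cross-entropy, averaged over m examples
    m = Y.shape[-1]
    Y_hat = np.clip(Y_hat, eps, 1 - eps)   # avoid log(0)
    return -np.sum(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat)) / m
```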
With a selected cost function at hand, we will use the Gradient Descent algorithm to find the optimized set of parameters.
If you’re not familiar with Gradient Descent, it is suggested you read the posts on Gradient Descent before continuing.
The Gradient Descent equations are presented by Eq. 3, where the superscript [l], l=[1,L], denotes the layer, and \(\alpha\) is the learning rate.
\(w^{[l]}=w^{[l]}-\alpha\frac{\partial C}{\partial w^{[l]}}\)
\(b^{[l]}=b^{[l]}-\alpha\frac{\partial C}{\partial b^{[l]}}\)
Where 1<=l<=L
As shown by Eq. 3, solving these iterative update equations requires the Cost function’s derivatives with respect to all layers’ coefficients.
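A minimal sketch of the Eq. 3 update step, assuming the coefficients and their gradients are stored in per-layer dictionaries (the dictionary layout and key names are illustrative):

```python
def gradient_descent_step(params, grads, alpha, L):
    # Eq. 3: w[l] <- w[l] - alpha * dC/dw[l],  b[l] <- b[l] - alpha * dC/db[l]
    for l in range(1, L + 1):
        params["W" + str(l)] -= alpha * grads["dW" + str(l)]
        params["b" + str(l)] -= alpha * grads["db" + str(l)]
    return params
```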
Note that besides Gradient Descent, there are other optional optimization algorithms, such as its commonly used variants Stochastic Gradient Descent and ADAM. We will cover those algorithms in a dedicated post.
In this section, we start the journey of finding the optimized network’s parameters.
Our objective is to find the partial derivatives of the Cost function with respect to all layers’ coefficients, i.e. find \(\frac{\partial C}{\partial w^{[l]}}\) and \(\frac{\partial C}{\partial b^{[l]}}\) for all 1<=l<=L. To accomplish that, we will use the derivative chain rule, while propagating backwards through the network’s layers, starting from the output layer, towards the input layer. Eventually, we will develop 4 equations, which fit all layers, one after the other, while propagating back from output to input.
Note: The equations developed in this post are for a batch of m training examples, aka ‘vectorized equations’. You may find Back Propagation resources which present the equations for a single input data example. The latter equations involve vectors rather than matrices, which makes them simpler. However, the training is actually calculated over a batch of m examples, which is more computationally efficient.
We start with an examination of the output layer, i.e. layer L, and then generalize the resultant equations to all layers.
Figure 2 illustrates the Feed Forward in the output layer. This illustration is similar to the output layer of Figure 1, except it now illustrates vectorized components, with m columns of input data. The figure depicts the output layer’s Feed Forward chain of 3 equations.

Following Figure 2’s equations chain, using the derivatives chain rule, we get the following 4 derivative equations, Eq. 4a-Eq. 4d. Note that Eq. 4b and Eq. 4c are basically the equations we are after! We now just need to generalize the equations to all layers, and translate the derivatives to a more detailed form.
\(\delta^{[L]}=\frac{\mathrm{d} C}{\mathrm{d} Z^{[L]}}=\frac{\mathrm{d} C}{\mathrm{d} A^{[L]}} \cdot \frac{\mathrm{d} A^{[L]}}{\mathrm{d} Z^{[L]}}\)
\(\frac{\mathrm{d} C}{\mathrm{d} w^{[L]}}=\delta^{[L]} \cdot \frac{\mathrm{d} Z^{[L]}}{\mathrm{d} w^{[L]}}\)
\(\frac{\mathrm{d} C}{\mathrm{d} b^{[L]}}=\delta^{[L]} \cdot \frac{\mathrm{d} Z^{[L]}}{\mathrm{d} b^{[L]}}\)
\(\frac{\mathrm{d} C}{\mathrm{d} A^{[L-1]}}=\frac{\mathrm{d} Z^{[L]}}{\mathrm{d} A^{[L-1]}}\cdot\delta^{[L]}\)
A note before we continue with the derivation.
Eq. 5a presents the cost function, expressing the output of the last activation in terms of the network’s weight matrices and activations (the bias coefficients are omitted for the sake of simplicity). Eq. 5b is the Cost’s gradient expression with respect to the input data. Note that in the gradient the matrices are transposed and the order of multiplication is reversed. That explains the transposed matrices in the coming expressions.
### Eq. 5: Cost Function and Cost Gradient
#### Eq. 5a: Cost Function with respect to \(A^{[0]}\)
\(C=C(y,A^{[L]}),\quad A^{[L]}=g^{[L]}(w^{[L]}g^{[L-1]}(w^{[L-1]}g^{[L-2]}(\dots g^{[1]}(w^{[1]}A^{[0]}))))\)
#### Eq. 5b: Gradient of Cost Function with respect to \(A^{[0]}\)
\(\nabla _{A^{[0]}}C=W^{[1]T} g'^{[1]}\dots W^{[L-1]T} g'^{[L-1]}W^{[L]T} g'^{[L]}\nabla _{A^{[L]}}C \)
In this section, we rearrange Eq. 4 in order to produce simpler expressions.
Differentiating Eq. 1a with respect to \(w^{[L]}\) gives:
\(\frac{\mathrm{d} Z^{[L]}}{\mathrm{d} w^{[L]}}=A^{[L-1]}\)
Differentiating Eq. 1a with respect to \(b^{[L]}\) gives:
\(\frac{\mathrm{d} Z^{[L]}}{\mathrm{d} b^{[L]}}=\begin{bmatrix} 1\\ 1\\ \vdots\\ 1 \end{bmatrix}\)
The derivative result is an all 1s m x 1 column.
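As a quick sanity check (a toy example, not from the original post), multiplying \(\delta^{[L]}\) by this all-1s column is exactly the same as summing \(\delta^{[L]}\) over its m columns - the operation that will appear below as np.sum along the examples axis:

```python
import numpy as np

# Toy shapes: n_L = 3 output units, m = 5 examples
rng = np.random.default_rng(0)
delta = rng.standard_normal((3, 5))
ones = np.ones((5, 1))                      # the all-1s m x 1 column

# delta @ ones equals the sum of delta over its m columns
print(np.allclose(delta @ ones, np.sum(delta, axis=1, keepdims=True)))  # True
```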
Differentiating Eq. 1a with respect to \(A^{[L-1]}\) gives:
\(\frac{\mathrm{d} Z^{[L]}}{\mathrm{d} A^{[L-1]}}=W^{[L]}\)
Denote the activation derivative by \(g'^{[L]}\):
\(g'^{[L]}=\frac{\mathrm{d} A^{[L]}}{\mathrm{d} Z^{[L]}}\)
Where g() is the activation function.
(We will detail the derivative of the various activation functions in a following post).
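For example, assuming sigmoid and ReLU activations (the full derivations are left for the follow-up post), \(g'^{[L]}\) evaluated at \(Z^{[L]}\) could be computed as:

```python
import numpy as np

def sigmoid_derivative(Z):
    # g'(Z) for a sigmoid activation: s(Z) * (1 - s(Z))
    s = 1.0 / (1.0 + np.exp(-Z))
    return s * (1.0 - s)

def relu_derivative(Z):
    # g'(Z) for a ReLU activation: 1 where Z > 0, 0 elsewhere
    return (Z > 0).astype(Z.dtype)
```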
Plug Eq. 6 - Eq. 9 into Eq. 4 to get a simpler set of Back Propagation equations:
\(\delta^{[L]}=\frac{\mathrm{d} C}{\mathrm{d}{A^{[L]}}} \odot g'^{[L]}\)
Where \(\odot\) denotes element wise multiplication.
\(\frac{\mathrm{d} C}{\mathrm{d} w^{[L]}}=\frac{1}{m}\delta^{[L]}\cdot{A^{[L-1]T}}\)
To get normalized results, independent of the number of examples m, the summation over the m examples is scaled down by m.
\(\frac{\mathrm{d} C}{\mathrm{d} b^{[L]}}=\frac{1}{m}np.sum(\delta^{[L]},axis=1,keepdims=True) \)
The gradient with respect to \(b^{[L]}\) is squashed to a single column by averaging over the m columns - summing and dividing by m. The reason: the gradient descent update runs once per batch cycle, over the averaged gradient of \(b^{[L]}\).
\(\frac{\mathrm{d} C}{\mathrm{d} A^{[L-1]}}=W^{[L]T}\delta^{[L]}\)
Note that Eq. 10a is just a pre-computation which prepares \(\delta^{[L]}\) for the other equations. Eq. 10b and Eq. 10c are the ones which produce the required derivatives for this layer. Eq. 10d prepares the input needed by layer L-1, the next layer in the Back Propagation process.
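The following sketch translates Eq. 10a-Eq. 10d to NumPy for the output layer; dA_L (the Cost derivative with respect to \(A^{[L]}\)) and g_prime_L (the activation derivative at \(Z^{[L]}\)) are assumed to be supplied by the chosen Cost and activation functions, and all names are illustrative:

```python
import numpy as np

def backward_output_layer(dA_L, g_prime_L, A_prev, W_L):
    # Eq. 10a: delta[L] = dC/dA[L] (element-wise *) g'[L], shape (n_L, m)
    delta_L = dA_L * g_prime_L
    m = delta_L.shape[1]
    # Eq. 10b: dC/dw[L] = (1/m) delta[L] . A[L-1]^T
    dW_L = (delta_L @ A_prev.T) / m
    # Eq. 10c: dC/db[L] = (1/m) sum of delta[L] over the m examples (columns)
    db_L = np.sum(delta_L, axis=1, keepdims=True) / m
    # Eq. 10d: dC/dA[L-1] = W[L]^T . delta[L], handed back to layer L-1
    dA_prev = W_L.T @ delta_L
    return dW_L, db_L, dA_prev
```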
Mission accomplished for layer L: using Eq. 10a-Eq. 10c we can find the required derivatives!
Looking at the symmetry of Eq. 1, it’s easy to see that Eq. 10 is valid for any l. So let’s re-write the equations again, substituting L by l, where 1<=l<=L.
\(\mathbf{\delta^{[l]}=\frac{\mathrm{d} C}{\mathrm{d}{A^{[l]}}} \odot g'^{[l]}}\)
\(\mathbf{\frac{\mathrm{d} C}{\mathrm{d} w^{[l]}}=\frac{1}{m}\delta^{[l]}\cdot{A^{[l-1]T}}}\)
\(\mathbf{\frac{\mathrm{d} C}{\mathrm{d} b^{[l]}}=\frac{1}{m}np.sum(\delta^{[l]},\text{axis=1, keepdims=True})}\)
\(\mathbf{\frac{\mathrm{d} C}{\mathrm{d} A^{[l-1]}}=W^{[l]T}\cdot\delta^{[l]}}\)
A conventional modification in the equations’ notation: Cost function derivatives are conventionally denoted in a compact notation which better suits the program code, so \(\frac{\mathrm{d} C}{\mathrm{d}{A^{[l]}}}\) is denoted by \(dA^{[l]}\), \(\frac{\mathrm{d} C}{\mathrm{d} w^{[l]}}\) by \(dw^{[l]}\) and so forth, as presented by Eq. 12.
\(\mathbf{\delta^{[l]}=dA^{[l]} \odot g'^{[l]}}\)
\(dw^{[l]}=\frac{1}{m}\delta^{[l]}\cdot{A^{[l-1]T}}\)
\(db^{[l]}=\frac{1}{m}np.sum(\delta^{[l]},\text{axis=1, keepdims=True})\)
\(dA^{[l-1]}=W^{[l]T}\cdot\delta^{[l]}\)
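Putting Eq. 12 together, here is a minimal sketch of the complete backward pass; the caches of \(Z^{[l]}\) and \(A^{[l]}\) from the Feed Forward pass, the per-layer activation-derivative functions, and all the names used are assumptions made for this sketch:

```python
import numpy as np

def backward_pass(dA_L, caches, params, g_primes, L):
    # caches["A0"] is the input X; caches["Al"] and caches["Zl"] come from Feed Forward
    # g_primes[l] is a function returning g'[l] evaluated at Z[l]
    grads = {}
    dA = dA_L                                # dC/dA[L], supplied by the Cost function
    m = dA.shape[1]
    for l in range(L, 0, -1):
        # Eq. 12a: delta[l] = dA[l] (element-wise *) g'[l]
        delta = dA * g_primes[l](caches["Z" + str(l)])
        # Eq. 12b: dw[l] = (1/m) delta[l] . A[l-1]^T
        grads["dW" + str(l)] = (delta @ caches["A" + str(l - 1)].T) / m
        # Eq. 12c: db[l] = (1/m) sum of delta[l] over the m examples
        grads["db" + str(l)] = np.sum(delta, axis=1, keepdims=True) / m
        # Eq. 12d: dA[l-1] = W[l]^T . delta[l]
        dA = params["W" + str(l)].T @ delta
    return grads
```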
Figure 3 illustrates Forward Propagation and Back Propagation. Some notes on the diagram:

Figure 4 depicts the complete training process, whose main blocks are the Feed Forward and Back Propagation modules, and the Gradient Descent module. Here’s the process: Feed Forward propagates the input through the network and produces the prediction \(\hat{y}\), the Cost and its derivative are computed, Back Propagation computes the coefficients’ gradients layer by layer, and Gradient Descent updates the coefficients; the cycle repeats until the Cost converges. A high-level sketch of that loop is shown below.
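In this sketch, the Feed Forward, Cost-derivative, Back Propagation and Gradient Descent steps are passed in as functions; all names and signatures are illustrative assumptions, roughly in the spirit of the sketches above:

```python
def train(X, Y, params, alpha, num_iterations, forward_pass, cost_derivative,
          backward_pass, gradient_descent_step):
    # X: input data A[0] with m examples in columns; Y: the true values
    for i in range(num_iterations):
        # Feed Forward: propagate X through the L layers, caching Z[l] and A[l]
        Y_hat, caches = forward_pass(X, params)
        # Cost derivative with respect to A[L] (depends on the selected Cost function)
        dA_L = cost_derivative(Y, Y_hat)
        # Back Propagation: compute dW[l] and db[l] for every layer (Eq. 12)
        grads = backward_pass(dA_L, caches, params)
        # Gradient Descent: update every layer's coefficients (Eq. 3)
        params = gradient_descent_step(params, grads, alpha)
    return params
```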
