This is the first in a series of 3 deep learning intro posts:
This set of posts introduces Deep Learning in the context of Supervised Machine Learning. Starting top down, Figure 1 depicts a diagram which describes the 3 main modules of a deep learning system.

Before speaking about Deep Learning, I have to recommend the book “Deep Learning (Adaptive Computation and Machine Learning)” by Goodfellow, Bengio and Courville, which is my go-to reference. DLAC&ML briefly reviews the history of deep learning technology and research, which dates back to the 1940s, though the names and popularity of this field have changed through the years. Deep Learning is quite a recent name. One of its ancestors is Artificial Neural Networks (ANNs), whose research tried to model the biological brain. The modern term Deep Learning has less pretension to mimic the brain or understand its operation, but it is still inspired by neuroscience and the brain-like model of many simple computational elements interacting together, using a single algorithm, to create an intelligent unit which solves many kinds of different problems. We will see in this post that the Deep Learning network, AKA Neural Network, is based on many instances of a simple element named the Neuron.
Figure 1 depicts a Neural Network. The input data is on the left side and is forwarded through the network’s layers up to the Output Layer. The network’s elements, called Neurons, are simple computational elements, as depicted in Figure 2.
Following Figure 1, here are some commonly used terms:
By examining Figure 2, let’s drill into the structure and elements which constitute the Neural Network. Figure 2 depicts a Neural Network with fewer layers and neurons than Figure 1, to make it easier to present in more detail. It is followed by the corresponding explanations and notations.
Some explanations and notations:
Now that the Neurons’ interconnections are introduced, let’s drill into the Neuron itself.
Figure 3 presents a Neuron.
Following the scheme from left to right, we see the following:
Now that we’re familiar with the Neuron, Figure 4 illustrates the conventions for assigning the parameters’ indices, by focusing on Neuron 2 of Layer 2 of Figure 2. Following the scheme from left to right, the reader can verify that the indices follow the conventional index scheme.
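To make the Neuron’s computation concrete, here is a minimal NumPy sketch of a single Neuron: a weighted sum of the inputs plus a bias, followed by an activation function g. The variable names, shapes and the Sigmoid placeholder are illustrative assumptions, not taken from the figures.

```python
import numpy as np

def sigmoid(z):
    # Sigmoid activation: squashes z into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b, g=sigmoid):
    # Weighted sum of the inputs plus a bias term
    z = np.dot(w, x) + b
    # Non-linear activation applied to the weighted sum
    return g(z)

# Illustrative values: a single neuron with 3 inputs
x = np.array([0.5, -1.2, 3.0])   # input vector
w = np.array([0.1, 0.4, -0.2])   # weights, one per input
b = 0.3                          # bias
print(neuron(x, w, b))
```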
The previous section presented the Neuron, with its non-linear activation function g(z), but g(z) itself was not introduced. To complete the introduction of the network’s building blocks, 4 commonly used activation functions are presented below:
Maybe here’s the right place to comment on how essential the non-linear activations are: in the absence of non-linear activation functions, the Neural Network would be a cascade of linear functions, which could be replaced by a single Neuron with a linear function. There would be no benefit over a single Neuron.
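To see why, take two consecutive layers with no activation function, writing the layer index as a bracketed superscript (the notation here may differ slightly from the figures). Composing the two linear (affine) steps gives just another linear step:

\[
W^{[2]}\left(W^{[1]}x + b^{[1]}\right) + b^{[2]} = \left(W^{[2]}W^{[1]}\right)x + \left(W^{[2]}b^{[1]} + b^{[2]}\right) = W'x + b'
\]

The same argument applies to any number of layers, so a purely linear network of any depth collapses to a single linear layer.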
Sigmoid was introduced in the Logistic Regression post. With a decision threshold at 0.5, a range of [0,1], and a steep slope, Sigmoid is suitable as a binary decision function, and indeed it is very commonly used for binary classification. Still, the Sigmoid flattens at large absolute values of z. This causes the “Vanishing Gradient” problem, with which optimization algorithms such as Gradient Descent will not converge, or will converge very slowly.
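To see where the Vanishing Gradient comes from, recall the Sigmoid’s derivative:

\[
\sigma'(z) = \sigma(z)\,\bigl(1 - \sigma(z)\bigr)
\]

For large \(|z|\), \(\sigma(z)\) saturates near 0 or 1, so \(\sigma'(z)\) approaches 0 and the Gradient Descent updates become vanishingly small.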
It’s easy to see that tanh is a scaled Sigmoid, by multiplying the numerator and denominator by \(e^{-x}\), as shown in Eq. 3. Figure 6 also shows that tanh is a scaled Sigmoid, centered around 0 instead of 0.5, with values in [-1,1].
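Written out in full, the manipulation gives the exact scaling relation (with \(\sigma\) denoting the Sigmoid):

\[
\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} = \frac{1 - e^{-2x}}{1 + e^{-2x}} = \frac{2}{1 + e^{-2x}} - 1 = 2\,\sigma(2x) - 1
\]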
Tanh usually works better than Sigmoid for hidden layers. Actually, Sigmoid is rarely used for hidden layers; it is mostly used for output layers, where the output is expected to be 0 or 1.
relu(x)=max(0,x)
Relu solves the “Vanishing Gradient” problem: its derivative is 1 for positive values. The derivative at x=0 is not defined, but that’s not an issue in practice, and it can be set to either 0 or 1. The Relu implementation is simpler and computationally cheaper than the other activation functions. Relu is commonly used; actually, in most cases it is the default activation function. A problem with Relu is the 0 gradient for negative values, so all units with a negative value will slow down learning. Still, this is not considered a critical issue, as about half of the hidden units are still expected to have values greater than 0. Leaky Relu solves the 0 gradient issue anyway.
### Leaky Relu
leaky_relu(x)=max(0.01*x, x)
Leaky Relu adds a slope to the negative values, preventing the 0 gradient issue. The slope here is set to 0.01.
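For reference, here is a minimal NumPy sketch of the four activation functions presented above. This is my own illustrative code, vectorized over an input array; the 0.01 Leaky Relu slope follows the value quoted above.

```python
import numpy as np

def sigmoid(z):
    # Maps any real value into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # Scaled Sigmoid: centered around 0, with values in (-1, 1)
    return np.tanh(z)

def relu(z):
    # Zero for negative inputs, identity for positive inputs
    return np.maximum(0.0, z)

def leaky_relu(z, slope=0.01):
    # Like Relu, but keeps a small slope for negative inputs
    return np.where(z > 0, z, slope * z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for g in (sigmoid, tanh, relu, leaky_relu):
    print(g.__name__, g(z))
```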
Figure 9 depicts a diagram of Deep Learning’s 2 operation modes: Prediction and Training. In Prediction mode, depicted in Figure 9a, the system computes a predicted output value or output decision for the input data. In this phase, the data is forwarded through the nodes of the network’s layers, in a process called Feed Forward. In Training mode, depicted in Figure 9b, which precedes the Prediction mode, the Gradient Descent optimization algorithm calculates the set of parameters which minimizes a pre-determined cost function. Gradient Descent works in conjunction with the Back Propagation algorithm, which, while striding backwards through the network’s layers, calculates the gradients needed by the Gradient Descent equations.
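To give a first taste of the Prediction (Feed Forward) mode before the detailed posts, here is a minimal sketch. It assumes each layer is stored as a (weight matrix, bias vector) pair and that all hidden layers share one activation; the layer sizes and random parameters are purely illustrative.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feed_forward(x, parameters, hidden_activation=relu, output_activation=sigmoid):
    # parameters is a list of (W, b) tuples, one per layer
    a = x
    for i, (W, b) in enumerate(parameters):
        z = W @ a + b                              # linear step of the layer
        is_output = (i == len(parameters) - 1)
        a = output_activation(z) if is_output else hidden_activation(z)
    return a

# Illustrative network: 3 inputs -> 4 hidden units -> 1 output
rng = np.random.default_rng(0)
parameters = [
    (rng.standard_normal((4, 3)), np.zeros(4)),
    (rng.standard_normal((1, 4)), np.zeros(1)),
]
x = np.array([0.5, -1.2, 3.0])
print(feed_forward(x, parameters))
```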
Feed Forward and Back Propagation are detailed in the next posts.