The previous post introduced the Linear Prediction model. It is recommended to read that post first, unless you are familiar with that material. It presented the Analytical Solution for calculating the model’s coefficents. The Analytical Solution is fairly simple solution to find coefficients of a Linear Predictor, except it requires an \(n X n\) matrix inversion, where n is the dimension of the input data vector. In many cases, that’s a non issue, unless the number of features is large, say >1000. Anyway, it’s a matter of computational cost and availability. The current post presents an alternative algorithm, using an optimized iterative solution, named Gradient Descent. Gradient Descent is a simple iterative solution, and commonly used for solving prediction models such as the Logistic Regression predictor, for which the analyitcal solution does not fit. BTW, Logistic Regression will be presented in a next post.
So we need to find the Linear Predictor’s set of coefficients \(b, w_j\), where j=1:n. (b stands for bias, and w for weight - see Eq. 1). Optimization methods need a Cost Function which measures the difference between the true values and the model’s prediction results during the Training phase, and an optimization algorithm, which will determine the set of coefficents, aka weights, which minimize the cost.
That’s what this post is about: Determining a cost function, and finding the coefficents which minimize it, using the Gradient Descent algorithm. So let’s start!
The Linear Prediction formula is listed in Eq. 1. Following that, Eq. 2 presents a Cost functiom, denoted by \(J(w,b)\). The Cost is expressed as the mean squared error between the real and modeled values. The Cost is a sum of that difference over m examples. Alternatively, Eq. 3 presnets another type of Cost function, which is the sume of absolute distances between the real and predicted values. The Cost function expressed in Eq. 2, is commonly used. It has some benefits: It is dependent on squared distances, so error increases faster than the eucleadean distance. In addition, the squared error is convex, which is sigificant for fininding the extreme point (minima or maxima), needed for our calculations. (What is a convex function? It’s a function that the line between 2 points on the graph are alays above the values of the points between those 2 points - see Figure 2.)
Figure 1 depicts graphically the euclidiean distance between the actual value y, and it’s corresponding model predicted value \(d=\hat{y}-y \) in a 1D domain.
Note that the the superscript i, in Eq 2 and throughout all the equations, denotes the ith example, where the labeled data sequence, AKA training seqence consists of m examples. Note that the \(frac{1}{m} \) factor is just a normalizing factor, but has no significance.
Figure 1 depicts graphically the euclidiean distance between the actual value y, and it’s corresponding model predicted value \(d=\hat{y}-y \).
Note that the \(frac{1}{m} \) factor is just a normalizing factor, but has no significance.


As stated, the cost function expressed by Eq. 2 is convex, since it is a quadratic equation. Number of variables are n+1. It is possible to illustrate the Cost function graphically, only for the cases where n+1<2. Accordingly, the Cost equation for n=1 is expressed by Eq. 4, and an example plot is shown in Figure 3.

Having the Cost function formula in hand, we can continure to our next goal: finding the set of coefficents which minimize it. Obviously, the point at which the Cost function is minimal, is the point where first derivative is 0. Gradient Descent algorithm will calculate the coeffcients at that point! Here’s the algorithm’s formula:
a: For \(w_j\) j=1:n: \(w_j:=w_j-\alpha \frac{\partial J(b,w) }{\partial w_j}\)
b. For the constant coefficient ((b\): \(b:=b-\alpha \frac{\partial J(b,w) }{\partial b}\)
Eq. 5 describes an iterative algorithm, which should run in paralel for all the n+1 variables, i.e. b and \(w_j\) j=1:n.
Before we formulate the algorithm procedure, plugging Eq. 6 into Eq. 2, resultant in Eq.7, it’s easy to get the partial derivatives, as shown in Eq. 8
\(\hat{y}^{(i)}=b+\sum_{j=1}^{n}w_j{x}^{(i)}_j\)
\(J(w,b)=\frac{1}{m}\sum_{i=1}^{m}(b+\sum_{j=1}^{n}w_j{x}^{(i)}_j-y^{(i)})^2\)
\(\frac{\partial J(w,b)}{\partial b} = \frac{1}{m}\sum_{i=1}^{m}(\hat{y}^{(i)}-y^{(i)}) = \frac{1}{m}\sum_{i=1}^{m}(b+\sum_{j=1}^{n}w_j{x}^{(i)}_j-y^{(i)})\)
\(\frac{\partial J(w,b)}{\partial w_j} = \frac{1}{m}\sum_{i=1}^{m}(\hat{y}^{(i)}-y^{(i)})*x^{(i)}=\frac{1}{m}\sum_{i=1}^{m}(b+\sum_{j=1}^{n}w_j{x}^{(i)}_j-y^{(i)})*x^{(i)}\)
Figures 4a-4c, depict an illustration of a cost function, projected on one of the coefficent’s domain. The crosses markea on the Cost fuction represent the Gradient Descent iterations: the iteration’s coeffcients are calculated, and their corresponding Cost value is marked on the graph. Note that as the gradients decrease while striding towards the minima, the distancees between the iterations decrease accordingly, as implied by Eq. 5. At the Cost’s minima, where the gradient is 0, the Gradient Descent converges. Figure 4a depict the stride after 1 iteration. Firgure 4b presents 3 more strides, where the decrease of stride size can be easuly seen. Figure 4c presents more strides, till convergence at the minima.
a. After One Iteration

b. After Four Iterations

c. After Fourteen Iterationss

The convergence shown should be achieved simoultansouly on all n+1 dimensions. Figure 5 depicts Gradient Descent with 2 fimensional variables: b and \(w_1\). The graph presents points on the Cost’s graph, which relate to the calculated coefficents after each iteration.

There is no definete criteria for termination condtion. Here are 3 common termination criterias: a. Counting number of iterations - easy to implement, but convergence is not verified. b. Comparing the the gradient magnitute against a threshold i.e.
\[\left \| \bigtriangledown J(b,w) \right \| < \epsilon\]Where:
\[\left \| \bigtriangledown J(b,w) \right \|= \left \|\frac{\partial J(b,w)}{\partial b},\frac{\partial J(b,w)}{\partial w_1}....\frac{\partial J(b,w)}{\partial w_n} \right \|\]c. Check the calculated value and see it is not changing i.e. converged, i.e:
\[\left \| w_j{(i+1))}- w_j{(i))} \right \| < \epsilon\]Note that there is no general rule for the \(\epsilon\), in both options b and c.
The algorithm steps in the opposite direction of the gradient until convergence, as the gradient, multiplied by a constant \(\alpha\), AKA :earning Rate, is decremented from previous value.
One might assume that the larger \(/alpha\) is, the fastest convergence will be. Wrong assumption! A too large step-size may result with oscilations. Look at Figure 9.

Conclusion - step size should be carefully selected, and may be needed to tune during caluclations. Typical values are in the range of 0.001-1, which is quite a wide range. Normally you will make some iterations before selecting your learning rate.
Figure 10 presents an animation of Linear Regression modeling for a 1 dimenssional input data, using Gradient descent. The input data consists of 1024 points, generated by the equation: \(y=wx+b+noise\), where w=2 and b=1.

todo: Link: https://medium.com/@evertongomede/gradient-descent-a-fundamental-optimization-algorithm-95227f320f9c