Supervised Machine Learning can be splitted into 2 main branches: Classification and Regression. In Classification, the data is grouped into a descrete set of classes, e.g. classify a picture to one of [cat, dog, lion], or decide if to buy or not to buy a stock. In Regression, the output is a continious value, e.g. predict a trip time, a weight of a baby, a price of a house, etc.
The supervised model, runs through 3 phases,as depicted in Figure 1, a-c.
Figure 1 presents a system with n-dimensional input (AKA n features), denoted by\(x_1-x_n\), and the predictor’s corresponding n+1 coefficients, denoted by \(b\) - the bias coefficient, and \(w_1-w_n\).
Here’s a brief description of the 3 phases:
Training Phase - During the Training phase, the predictor’s coefficients are calculated by the model fitter, as depicted by Figure 1-a. The Training data consists of m examples, denoted by superscripts 1-m. The fitter calculates the predictor’s coefficents. The common way to do it is by a cost function minimization: That cost function expresses the error between the ground truth label value \(y\) and the predictor model value \(\hat{y}\), and the an optimization algorithm searches for the coefficents with which the minimum cost is achieved. The otimization algorithm is activated iteratively with a fixed set of labeled examples while the weights vary after each iteration. Alternatively, in case of a Linear Predictor, it is possible to analytically find a the coefficients. And indeed, this post presnets the analytical model fitting. The optimizer model fitting, using the Gradient Descent optimization algorithm, is presented in a post which follows.
Validation Phase - When fitting is done, having the predictor’s coefficients, the predictor’s performance is validated, as depicted in Figure 1-b. Unlike Training phase, during this phase, the weights are fixed, and the input data varies - k input labeled examples, which were not included in the Training data set. The resultant output is validated using a cost function. In case results are not satisfactory, the Training phase should be revisited. As noted, the validation’s k labeled examples input data set should not be a part of the training dataset, otherwise, problems such as overfitting would might notbe observed. We will detail overfitting and underfitting phenomenas in one of the next posts.
Normal Data Phase - During the Data phase, the system predicts the output for the input data, using the fitted model.
This post is part of an intro to Regression Supervised Machine Learning. It presents the Linear Prediction model for solving Reression problems, and the Analytical Solution for determining the predictor’s coefficents.
**The Training input Dataset consists of m labeled examples, each of n features.**
**The Validation input Dataset consists of k labeled examples, each of n features.**
The predictions \(\hat{y}=b+\sum_{j=1}^{n}w_jx_j\) are validated by checking the cost function.
**The Normal Data input Dataset consists n features.**
\(\hat{y}=b+\sum_{j=1}^{n}w_jx_j \)
The first step towards solving the problem, should be the selection of the prediction model. This post presents the Linear Predictor aka Linear Regression
At the name Linear Predictor implies, the predicted output is modeled by a linear combination of the input dataset, as shown in Eq. 1. According to our notations, \(\hat{y}\) represents the the estimated value of \(y\), \({x_j}\) represent the input data set, where j=1:n, and \({b, w_j}\), i=1:n are the predictor’s coeffcients. The estimation residual, AKA error is denoted by e.
Where b stands for bias, and w for weight.
Let’s return to our homes prices prediction example, but for our convinience, simplify it for now, reduce the number of input features to n=1. This reduction will simplify the comutation illustration and graphical presentation of the problem, but we will not use generality and develope thesolution for any n.
With that in mind, watch Table 1, which holds a set of 5 example data points, each relates to an appartment which is charectarized by its floor, and labeled by its price. Based on those examples, we should set a model which will predict apartments price, given its floor. (Obviously the floor by itself is not sufficient for a valid apartment price prediction, but still, we take that for the sake of example’s simplicity, but without losing generality - we will show the solution for any n. Besides the too-small number of features,the number of examples, m=5, is obviously too small for a good prediction, but again, it’s all for the sake of the example’s simplicity.
| Floor | Price (M $) |
|---|---|
| 0 | 0.4 |
| 1 | 0.7 |
| 2 | 1.1 |
| 3 | 1.4 |
| 4 | 1.65 |
For Table 1’s single feature data, the predictor equation of Eq. 1, reduces to the equation specified in Eq 2A. Pluging the Table 1’s data to Eq. 2a, results in Eq. 2b.
Inserting table 1’s data to Eq. 2a gives the following set of equations:
\(b+w_1*0+\epsilon^{(1)}=0.4\)
\[b+w_1*1+\epsilon^{(2)}=0.7\] \[b+w_1*2+\epsilon^{(3)}=1.1\] \[b+w_1*3+\epsilon^{(4)}=1.4\] \[b+w_1*4+\epsilon^{(5)}=1.65\]From the above equations the coefficients \(b,w_1\) can be found. We will get to solving the equations later in this post.
Figure 2 presents a graph of Table 1’s 5 data points, along with a line regression estimator graph.

According to the graph sketched in Figure 2, the linear model indeed seems to fit in as an estimator for the data points. To prove a validitiy of a model though, one needs much more than 5 example data points. Needles to repeat here on the reasons for our seletion.
Have not yet explained how the fitting line was calculated, butlet’s examine a higher order estimator, with n=2. So accordingly, Let’s have now 2 features - floor number and number of bedrooms, as shown in Table 2.
| Floor | Bedrooms | Price (M $) |
|---|---|---|
| 0 | 5 | 1.5 |
| 1 | 4 | 0.5 |
| 2 | 5 | 1 |
| 3 | 6 | 1.2 |
| 4 | 6 | 1.5 |
| 5 | 4 | 0.5 |
| 6 | 5 | 0.6 |
| 7 | 3 | 0.9 |
| 8 | 4 | 0.7 |
| 9 | 3 | 0.7 |
| 10 | 5 | 1.3 |
Still now, these 2 features are not enough for a valid prediction of apartments pricess, nor is the number of the example points. But that doesn’t matter at this point. Anyway, with n<=2, it is still possible to present the datapoints and the predictor graphically, as shown in Figure 3. Figure 3 presents the listed above dataset examples, (see x marker points), and based on those points, it presents a linear predictor points, which where calculated with the predictor expression of Eq. 3, which is same as Eq. 1, but with n=2.
As before, i=1:m, where m=11 is the number of examples. Nore that the Linear Predictor’s calculated \(\hat{y^{(i)}}\) points, are located on the colored presented surface of Fig 2.
Let’s plug table 2’s data to Eq. 3, replacing \(y^{(i)}\) and \(x^{(i)}\) by the actual data observations, and get following set of equations:
\(b+w{_1}*0+w{_2}*5+\epsilon^{(1)}=1.5\) \(b+w{_1}*1+w{_2}*4+\epsilon^{(2)}=0.5\) \(b+w{_1}*2+w{_2}*5+\epsilon^{(3)}=1.0\) \(b+w{_1}*3+w{_2}*6+\epsilon^{(4)}=1.2\) \(b+w{_1}*4+w{_2}*6+\epsilon^{(5)}=1.5\) \(b+w{_1}*5+w{_2}*4+\epsilon^{(6)}=0.5\) \(b+w{_1}*6+w{_2}*5+\epsilon^{(7)}=0.6\) \(b+w{_1}*7+w{_2}*3+\epsilon^{(8)}=0.9\) \(b+w{_1}*8+w{_2}*4+\epsilon^{(9)}=0.7\) \(b+w{_1}*9+w{_2}*3+\epsilon^{(10)}=0.7\) \(b+w{_1}*10+w{_2}*5+\epsilon^{(11)}=1.3\)
From the above 11 equations, it is possible to calculate the coefficients \( {b,w_1, w_2} \). We will get to solving the equations in the next paragraph of this post.

As already stated above, the common way for model fitting, i.e. finding the predictor’s coeffcients, is by a cost function minimization, e.g. by the Gradient Descent algorithm. This will be reviewed in a post which follows. This post presnets the analytical model fitting. This chapter presents the mathematical equation development of this solution. The solution is illustrated by Figure 4.
So, we start the equation development by re-writing Eq. 1 in a matrix form, as shown in Eq. 3. x and y are presented in Eq. 4a and Eq. 4b respectively. The unknown vector w is presented by Eq. 4c. Eq.5 is the result of plugging Eq. 4a-Eq. 4c to Eq. 3.
x matrix holds the m input set examples, where each inpt set consists of n features
\[x=\begin{bmatrix} 1 & x_1^{(1)} & x_2^{(1)} & x_3^{(1)} & . & . & x_n^{(1)}\\\\\ 1 & x_1^{(2)} & x_2^{(2)} & x_3^{(2)} & . & . & x_n^{(2)}\\\\\ .& .& .& .& \\\\\ .& .& .& .& \\\\\ .& .& .& .& \\\\\ 1 & x_1^{(m)} & x_2^{(m)} & x_3^{(m)} & . & . & x_n^{(m)} \end{bmatrix}\]y vector holds the m training examples observed output, AKA labels
\[y=\begin{bmatrix} y^{(1)} \\\\ y^{(2)} \\\\ . \\\\ . \\\\ . \\\\ y^{(m)} \end{bmatrix}\]w vector is the unknown coefficients vector we aim to find. The first entry of the vector is b, the bias entry.
\[b=\begin{bmatrix} b\\\\\ w_1\\\\\ .\\\\ .\\\\\ . \\\\\ w_n \end{bmatrix}\]Eq. 5 is listed here to clarify the all Eq. 3 components. Having that, let’s solve Eq. 3, but first lets drop the prediction error from the equation, as this error exists, but we consider it as random independent additive noise.
Now let’s do some basic well-known Linear Algebra gimnastics:
Matrix X dimensions are mXn, where m » n, i.e. m, the number of examples, should be be much greater than n, the input’s dimensions. Accordingly, considering X is not square, it is not invertible. Still, if X is Full Rank, i.e. it’s columns are linear independent, then \(X^TX\) is invertible. So multiply each side of Eq. 6 by \(X^T\):
Multiply each side of Eq. 7 by \((X^TX)^{-1}\) :
Since \((X^TX)^{-1}X^TX\)=I Eq. 8 reduces to:
To sum up the solution, we Sample some training data and run it through the network to make predictions. Measure the loss between the predictions and the true values. Finally, adjust the weights in a direction that makes the loss smaller.
Let’s illustrate Eq. 8 on the 1D predictor listed above. Let’s plug Eq.2 into the components of Eq. 9, as shown in Eq. 10.
Inverse of a 2X2 matrix is given by:
\[\begin{bmatrix} a &b \\\\\ c & d \end{bmatrix}^{-1}=\frac{1}{\begin{vmatrix} d & -b \\\\\ -c & a \end{vmatrix}}*\begin{bmatrix} d &-b \\\\\ -c & a \end{bmatrix}\]The denominator is a determinant:
\[\begin{vmatrix} d & -b \\\\\ -c & a \end{vmatrix} = da-bc\]Combine Eq. 13 and Eq. 12:
We have now all the building block to complete the calculation:
The predictor’s coefficients are \(b=0.41\) and \(w_1=0.32\), which define the regression line: \(\hat{y}=0.41+0.32x\).
The line predictor sketched in Figure 1.
To conclude, Eq. 9 provides the analytical solution, which determines the prediction’s coefficients \W=( b, w_j…w_n \), based on the set of labeled data \( X=({x_i}^{(2)}…{x_n}^{(i)}), Y=y^{(i)} \) where i is the example’s index, running from 1 to m.
Here’s Eq. 9 again: \((X^TX)^{-1}X^TY=W\) where \(W = \begin{bmatrix} b\\\\\ w_1\\\\\ w_2\\\\\ .\\\\\ .\\\\\ w_n \end{bmatrix}\)
The drawbacks of the analyical solution is the need to inverse natrix \(X^TX\), which dimension are (n+1)*(n+1). So when the number of features is large, say overa few hundrerds or so, depending available computation power, the analytical solution might be too expensive. Alternatively, the solution may be calculated using Gradient Descent, an optimized iterative solution. Next post introduces the Gradient Descent algorithm, and illustrates calculation Linear Predictor’s coefficents.