Multiple Linear Regression

Since we have just begun studying machine learning, things are still fairly simple at this level. In the previous article, we understood how simple linear regression works. Multiple linear regression works in much the same way; the difference lies in when it is used.

In simple linear regression, we had only one independent feature and one dependent variable. In multiple linear regression, we still have one dependent variable, but there are multiple independent features.

Understanding Multiple Linear Regression

Multiple Linear Regression is a supervised machine learning algorithm that models the relationship between multiple independent features and a single dependent variable by fitting a linear equation to the data.

Let's understand this with an example.
In the article on simple linear regression, we saw that ice cream sales depended on temperature. However, several other factors might also play an important role in determining sales: location, demographics, marketing and promotion by the ice cream shop, and so on.

To capture the relationship between sales and the various factors affecting them, we move from a simple linear regression problem to a multiple linear regression problem.

Therefore, the equation defining such a relationship would be,

$$y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_n X_n$$

Here,

  • y is the dependent variable

  • β0, β1, ..., βn are the coefficients of the features

  • X1, X2, ..., Xn are the independent features

  • n denotes the number of features we have

If we have two independent features and one dependent variable, we can visualize the data in 3-dimensional space. The fitted model is then a plane that helps us predict the value.

For a higher-dimensional feature space, you will get a hyperplane instead.
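As a quick illustration, here is a tiny NumPy sketch that evaluates this equation for the ice cream example with two hypothetical features, temperature and daily foot traffic; the coefficient values are made up purely for demonstration.

import numpy as np

# made-up coefficients: beta_0 (intercept), beta_1 (temperature), beta_2 (foot traffic)
betas = np.array([5.0, 2.5, 0.03])

# one data point: temperature = 30 degrees, foot traffic = 400 people
x = np.array([30.0, 400.0])

# y = beta_0 + beta_1*X_1 + beta_2*X_2
y_pred = betas[0] + np.dot(betas[1:], x)
print(y_pred)  # 5.0 + 2.5*30 + 0.03*400 = 92.0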

The cost function in this case is again the mean squared error, written in terms of the coefficients:

$$J(\beta_0, \beta_1, \ldots, \beta_n) = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2$$

where N is the number of training examples.

The cost function measures how far the predicted values are from the actual values, i.e. how much error our model is making. We therefore try to minimize this cost function to reduce the error made by our model.
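As a small sketch, this is how you could compute that cost with NumPy, assuming y_actual and y_pred are arrays of the same length:

import numpy as np

def mse_cost(y_actual, y_pred):
    # mean of the squared differences between predicted and actual values
    return np.mean((y_pred - y_actual) ** 2)

# toy example with made-up values
y_actual = np.array([10.0, 12.0, 15.0])
y_pred = np.array([11.0, 11.5, 14.0])
print(mse_cost(y_actual, y_pred))  # (1 + 0.25 + 1) / 3 = 0.75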

There are two methods to minimize this cost function:

  • Ordinary Least Squares (OLS), a direct closed-form method

  • Gradient Descent (we'll study this in detail in upcoming articles)

We will use OLS here and derive a formula that gives us the values of all the coefficients.

Mathematics Behind It

Let's dive straight into the maths:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_n X_n$$

Writing this out for every data point, each point has its own predicted output and its own equation. Say we have N data points, and let X_{ij} denote the value of the j-th feature for the i-th data point. Then,

$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} = \begin{bmatrix} \beta_0 +\beta_1X_{11} + \beta_2X_{12} + \ldots + \beta_nX_{1n} \\ \beta_0 + \beta_1X_{21} + \beta_2X_{22} + \ldots + \beta_nX_{2n} \\ \vdots \\ \beta_0 + \beta_1X_{N1} + \beta_2X_{N2} + \ldots + \beta_nX_{Nn} \end{bmatrix}$$

$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} = \begin{bmatrix} 1 & X_{11} & X_{12} & \ldots & X_{1n} \\ 1 & X_{21} & X_{22} & \ldots & X_{2n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & X_{N1} & X_{N2} & \ldots & X_{Nn} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_n \end{bmatrix}$$

Therefore, we can write the above matrix equation as,

$$Y = XB$$
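To make the matrix form concrete, here is a short sketch (with made-up numbers) that builds the matrix X by prepending a column of ones for the intercept and then computes Y = XB in one shot:

import numpy as np

# 3 data points, 2 features (made-up values)
X_raw = np.array([[30.0, 400.0],
                  [25.0, 350.0],
                  [35.0, 500.0]])

# prepend a column of ones so that beta_0 acts as the intercept
X = np.insert(X_raw, 0, 1, axis=1)

# coefficient vector B = [beta_0, beta_1, beta_2]
B = np.array([5.0, 2.5, 0.03])

# predicted outputs for all data points at once
Y = X.dot(B)
print(Y)  # [ 92.   78.  107.5]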

We know that the error is

$$e = \begin{bmatrix} y_1-\hat{y}_1 \\ y_2-\hat{y}_2 \\ \vdots \\ y_N-\hat{y}_N \end{bmatrix}$$

Since we take the square of the error term, we can write the total squared error as,

$$E = e^Te = \begin{bmatrix} y_1-\hat{y}_1 & y_2-\hat{y}_2 & \cdots & y_N-\hat{y}_N \end{bmatrix} \begin{bmatrix} y_1-\hat{y}_1 \\ y_2-\hat{y}_2 \\ \vdots \\ y_N-\hat{y}_N \end{bmatrix}$$

which is simplified as,

$$\begin{align} E &= (Y - \hat{Y})^T(Y - \hat{Y}) \\ &= (Y^T - \hat{Y}^T)(Y - \hat{Y}) \\ &= [Y^T - {(XB)}^T](Y - XB)\\ &= Y^TY - Y^TXB - (XB)^TY + (XB)^TXB \end{align}$$

In the above equation,

$$Y^TXB = (XB)^TY$$

You can convince yourself of this: both sides are 1×1 matrices, i.e. scalars, and a scalar is equal to its own transpose, so $(Y^TXB)^T = B^TX^TY = (XB)^TY$.

Therefore, since the two terms are equal, we get,

$$\begin{align} E &= Y^TY-Y^TXB-(XB)^TY+(XB)^TXB\\ &=Y^TY-2Y^TXB+B^TX^TXB \end{align}$$

Now, differentiate E with respect to B,

$$\begin{align} \frac{\partial E}{\partial B} &= \frac{\partial (Y^T Y - 2Y^T X B + B^T X^T X B)}{\partial B} \\ &= 0 - 2Y^TX + 2B^TX^TX \\&= -2Y^TX + 2B^TX^TX \end{align}$$
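This step uses two standard matrix-calculus identities, written here in the row-vector (numerator) layout so that the gradient has the same shape as $B^T$:

$$\frac{\partial (Y^TXB)}{\partial B} = Y^TX, \qquad \frac{\partial (B^TX^TXB)}{\partial B} = 2B^TX^TX$$

The second identity holds because $X^TX$ is symmetric. If you prefer the column-vector layout instead, the results are simply the transposes, $X^TY$ and $2X^TXB$, and you arrive at the same final formula for B.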

Now, set the derivative equal to zero to find the minimum,

$$\begin{align} \frac{\partial E}{\partial B} &= 0 \\ -2Y^TX+2B^TX^TX &= 0 \\ 2B^TX^TX &= 2Y^TX \\ B^TX^TX &= Y^TX \\ B^T &= Y^TX(X^TX)^{-1} \end{align}$$

Let's transpose both sides to get the B matrix,

$$\begin{align} (B^T)^T &= [Y^TX(X^TX)^{-1}]^T \\\\ B &= [(X^TX)^{-1}]^T (Y^TX)^T\\ &= (X^TX)^{-1}X^TY \end{align}$$

Here we used the fact that $X^TX$ is symmetric, so its inverse is also symmetric and $[(X^TX)^{-1}]^T = (X^TX)^{-1}$, and that $(Y^TX)^T = X^TY$.

Thus, we have the formula to calculate the B matrix, which contains the intercept and the coefficients of all the independent features.

You might wonder: why don't we always use OLS instead of gradient descent?
In the equation for the B matrix, you can see there is an inverse term. The time complexity of inverting a matrix is O(n³), where n is the size of the matrix (here, the number of features plus one). So as the number of features grows large, computing the inverse takes a long time. The gradient descent technique helps us converge to the solution in less time in such cases.
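As a side note (a practical detail, not part of the derivation above): instead of explicitly computing the inverse, numerical code usually solves the linear system $X^TXB = X^TY$ directly, which tends to be faster and more stable. A minimal sketch, assuming X_train and y_train are NumPy arrays of shape (N, n) and (N,):

import numpy as np

def fit_ols(X_train, y_train):
    # prepend a column of ones so the first coefficient acts as the intercept
    X = np.insert(X_train, 0, 1, axis=1)
    # solve (X^T X) B = X^T y instead of forming the explicit inverse
    return np.linalg.solve(X.T.dot(X), X.T.dot(y_train))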

Code From Scratch

This is how a multiple linear regression class might look if you code it from scratch.

import numpy as np

class MeraLR:

    def __init__(self):
        # coef_ holds beta_1 ... beta_n, intercept_ holds beta_0
        self.coef_ = None
        self.intercept_ = None

    def fit(self, X_train, y_train):
        # prepend a column of ones so the first coefficient is the intercept
        X_train = np.insert(X_train, 0, 1, axis=1)

        # calculate the coefficients using the OLS formula B = (X^T X)^(-1) X^T y
        betas = np.linalg.inv(np.dot(X_train.T, X_train)).dot(X_train.T).dot(y_train)
        self.intercept_ = betas[0]
        self.coef_ = betas[1:]

    def predict(self, X_test):
        # y_hat = X_test . coef_ + intercept_
        y_pred = np.dot(X_test, self.coef_) + self.intercept_
        return y_pred
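Here is a quick usage sketch of the class above on made-up data, just to see it in action. The targets are generated from a known linear relationship, so the fitted coefficients should recover it almost exactly.

import numpy as np

# made-up training data: 5 samples, 2 features
X_train = np.array([[30.0, 400.0],
                    [25.0, 350.0],
                    [35.0, 500.0],
                    [20.0, 300.0],
                    [28.0, 420.0]])
# targets from a known linear relationship: y = 5 + 2.5*x1 + 0.03*x2
y_train = 5.0 + 2.5 * X_train[:, 0] + 0.03 * X_train[:, 1]

lr = MeraLR()
lr.fit(X_train, y_train)
print(lr.intercept_, lr.coef_)   # should be close to 5.0 and [2.5, 0.03]

X_test = np.array([[32.0, 450.0]])
print(lr.predict(X_test))        # close to 5 + 2.5*32 + 0.03*450 = 98.5

If you have scikit-learn installed, you can also fit sklearn.linear_model.LinearRegression on the same data and check that its intercept_ and coef_ match.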

Try the code out in a notebook of your own and see how it works.

When you study Kaggle notebooks, you can keep looking for more such code examples and then try creating your own. Find an appropriate dataset on Kaggle, understand it, and train the model on that dataset. Practice and learn how things work!

Conclusion

Kudos to you! You have studied another algorithm, and I hope this was enough to get you started exploring and experimenting. Always remember: the more you work with new datasets, the more deeply you will understand the topics. As you practice, you will come across many challenges that will help you learn and grow. Feel free to comment if you have any doubts or simply want to add a note on this topic. Thanks for reading!
