Introduction

When I did the coursework for my Business Analytics program, the statistics course included a lecture on solving Linear Regression using just Excel and matrix multiplication from linear algebra, which solves for ordinary least squares (OLS).

Since then I had forgotten how to solve it using matrix multiplication, and I wanted to relearn how it is done as well as demonstrate it to others.

How to use this notebook:

  1. Watch this YouTube video, which gives the intuition for how this is solved. I have followed the equations and solutions given in that video and solved them using numpy
  2. Fork this notebook and follow along with each step until you understand it

Sorry for not using LaTeX for the math equations; I had my notes in OneNote, so it was easier to paste screenshots into this notebook.

Sections:

  1. Math behind solving Linear Regression using Matrix Multiplication
  2. Solving simple linear regression
  3. Multiple linear regression
  4. Calculation for r-square value
  5. Conclusion
  6. References

Please upvote this notebook if you found it useful and/or informative!

Part 1: Math behind solving Linear Regression using Matrix Multiplication:

Find the best line through a set of data points: image.png

What does best mean?

image.png

Best means that we have to minimize the error (ε), which is the vertical distance between the regression line and the data point.

image.png

OLS gives us a closed-form solution in the form of the normal equations. Minimizing this sum of squared deviations is why the problem is called the least squares problem.
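In case the screenshot above does not render, the quantity being minimized for the one-variable case is the sum of squared residuals:

$$\min_{\beta_0, \beta_1} \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right)^2$$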

To formulate this as a matrix problem, we can write it as:

image.png

Such that: image.png

Consider the linear regression equation: image.png

Multiplying both sides by the transpose of the X matrix: image.png

image.png

image.png

So, with the above equations we can calculate the intercept and coefficients with the equation: image.png
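In case the screenshots above do not render, the derivation they follow is the standard one:

$$y = X\beta + \varepsilon$$

$$X^T y = X^T X \hat{\beta}$$

$$\hat{\beta} = (X^T X)^{-1} X^T y$$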

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import warnings
warnings.filterwarnings('ignore')

Part 2: Solving a simple linear regression:

Let us find the best line using least square for the set of data points: (1, 1), (2, 3), (3, 3), (4, 5)

Here the X = [1, 2, 3, 4] and y = [1, 3, 3, 5]

But we have to convert X into a matrix with a leading column of 1's so that we can estimate β0, the intercept: image.png

In [2]:
X = np.matrix([[1, 1],
               [1, 2],
               [1, 3],
               [1, 4]])
X
Out[2]:
matrix([[1, 1],
        [1, 2],
        [1, 3],
        [1, 4]])
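
The same design matrix can also be built programmatically instead of typing it out. A minimal sketch using plain numpy arrays (np.matrix as above works too, but regular arrays are generally preferred); X_alt here is only an illustration and is not used in the later cells:

x = np.array([1, 2, 3, 4])                     # predictor values
X_alt = np.column_stack([np.ones_like(x), x])  # prepend a column of 1's for the intercept
print(X_alt)
# [[1 1]
#  [1 2]
#  [1 3]
#  [1 4]]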

image.png

In [3]:
XT = np.matrix.transpose(X)
XT
Out[3]:
matrix([[1, 1, 1, 1],
        [1, 2, 3, 4]])

image.png

In [4]:
y = np.matrix([[1],
               [3],
               [3],
               [5]])
y
Out[4]:
matrix([[1],
        [3],
        [3],
        [5]])

image.png

In [5]:
XT_X = np.matmul(XT, X)
XT_X
Out[5]:
matrix([[ 4, 10],
        [10, 30]])

image.png

In [6]:
XT_y = np.matmul(XT, y)
XT_y
Out[6]:
matrix([[12],
        [36]])

image.png

In [7]:
betas = np.matmul(np.linalg.inv(XT_X), XT_y)
betas
Out[7]:
matrix([[0. ],
        [1.2]])

image.png

The best line is given by the equation y = 0 + 1.2 X
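
As a quick sanity check, we can compute the fitted values and residuals directly from these betas (reusing X, y and betas from the cells above):

y_hat = np.matmul(X, betas)          # fitted values: 1.2, 2.4, 3.6, 4.8
residuals = y - y_hat                # -0.2, 0.6, -0.6, 0.2
print(np.sum(np.square(residuals)))  # sum of squared errors, about 0.8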

Of course, we can verify this by fitting a model as below.

In [8]:
from sklearn.linear_model import LinearRegression

regressor = LinearRegression().fit(X = np.array([1, 2, 3, 4]).reshape(-1, 1), y = [1, 3, 3, 5])
print("The intercept is: ", str(regressor.intercept_), ". Which is almost 0.")
print("The coefficient is: ", str(regressor.coef_))
The intercept is:  -8.881784197001252e-16 . Which is almost 0.
The coefficient is:  [1.2]

Part 3: Multiple linear regression

Multiple linear regression works just like the simple case we have already seen, but with more X variables, and hence we will have as many β coefficients as there are X variables (plus the intercept).

The procedure for calculating β is (a compact numpy sketch of these steps follows the list):

  1. Add a first column with all 1's for the intercept
  2. Take the transpose of the X matrix
  3. Multiply the X transpose and X matrices
  4. Find the inverse of this matrix
  5. Multiply the X transpose and y matrices
  6. Multiply the results of steps 4 and 5 to find the intercept and the coefficients
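
The six steps above can be collapsed into one small helper. This is only a sketch of the same calculation (assuming numpy has been imported as np and that X already contains the column of 1's):

def ols_betas(X, y):
    """Solve the normal equations (X'X) beta = X'y by explicit inversion."""
    XT = X.T                        # step 2: transpose of X
    XT_X = XT @ X                   # step 3: X'X
    XT_X_inv = np.linalg.inv(XT_X)  # step 4: inverse of X'X
    XT_y = XT @ y                   # step 5: X'y
    return XT_X_inv @ XT_y          # step 6: intercept and coefficients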

First, I did all the steps in an Excel worksheet. The workbook is available on Google Drive here.

Please download the Google Sheet, since there are some formatting issues on Google Drive that prevent the formulas from displaying correctly.

I have imported the data and taken only the first 300 rows for this analysis. This works with any number of rows, but since I did the calculation in Excel first before converting it into a Python program, I felt more comfortable working with 300 records in the spreadsheet.

Step 1: Adding a column with intercept

In [9]:
data_vw = pd.read_csv("/kaggle/input/used-car-dataset-ford-and-mercedes/vw.csv")
data_vw = data_vw[:300]
data_vw["Intercept"] = 1
data_vw = data_vw[["Intercept", "year", "mileage", "tax", "mpg", "engineSize", "price"]]
print(data_vw.shape)
data_vw.head()
(300, 7)
Out[9]:
   Intercept  year  mileage  tax   mpg  engineSize  price
0          1  2019    13904  145  49.6         2.0  25000
1          1  2019     4562  145  49.6         2.0  26883
2          1  2019     7414  145  50.4         2.0  20000
3          1  2019     4825  145  32.5         2.0  33492
4          1  2019     6500  150  39.8         1.5  22900
In [10]:
np.set_printoptions(formatter={'float_kind':'{:f}'.format})
cross_tab = np.matmul(np.matrix.transpose(data_vw.values), data_vw.values)
cross_tab
Out[10]:
array([[300.000000, 605633.000000, 2372687.000000, 43600.000000,
        14266.700000, 480.900000, 6736778.000000],
       [605633.000000, 1222637903.000000, 4789020011.000000,
        88018900.000000, 28800572.500000, 970857.100000,
        13600574239.000000],
       [2372687.000000, 4789020011.000000, 33034149683.000000,
        343313895.000000, 118674409.400000, 3536406.900000,
        48812233067.000000],
       [43600.000000, 88018900.000000, 343313895.000000, 6358750.000000,
        2070838.000000, 69961.000000, 980155565.000000],
       [14266.700000, 28800572.500000, 118674409.400000, 2070838.000000,
        691626.230000, 22686.260000, 313923137.200000],
       [480.900000, 970857.100000, 3536406.900000, 69961.000000,
        22686.260000, 806.090000, 11069969.700000],
       [6736778.000000, 13600574239.000000, 48812233067.000000,
        980155565.000000, 313923137.200000, 11069969.700000,
        156862347854.000000]])

Let us interpret what that matrix is:

image.png
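
The screenshot may not render here, but the matrix can be read off directly: because the first column of data_vw is all 1's, the first row (and first column) holds the column sums, the top-left entry is n, the diagonal holds the sums of squares, and every other entry is the cross product of two columns. A couple of quick checks (assuming the cells above have been run):

print(cross_tab[0, 0])                # 300.0, the number of rows n
print(data_vw["year"].sum())          # 605633, matches cross_tab[0, 1]
print((data_vw["price"] ** 2).sum())  # matches cross_tab[-1, -1], the sum of squared prices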

In [11]:
X = data_vw[["Intercept", "year", "mileage", "tax", "mpg", "engineSize"]].values
y = data_vw[["price"]].values

Step 2: Taking the transpose of X matrix

In [12]:
XT = np.matrix.transpose(X)

Step 3: Multiply the X transpose and X matrices

What we get from Excel is: image.png

In [13]:
XT_X = np.matmul(XT, X)
XT_X
Out[13]:
array([[300.000000, 605633.000000, 2372687.000000, 43600.000000,
        14266.700000, 480.900000],
       [605633.000000, 1222637903.000000, 4789020011.000000,
        88018900.000000, 28800572.500000, 970857.100000],
       [2372687.000000, 4789020011.000000, 33034149683.000000,
        343313895.000000, 118674409.400000, 3536406.900000],
       [43600.000000, 88018900.000000, 343313895.000000, 6358750.000000,
        2070838.000000, 69961.000000],
       [14266.700000, 28800572.500000, 118674409.400000, 2070838.000000,
        691626.230000, 22686.260000],
       [480.900000, 970857.100000, 3536406.900000, 69961.000000,
        22686.260000, 806.090000]])

Step 4: Find the inverse of this matrix

In [14]:
XT_X_inv = np.linalg.inv(XT_X)
XT_X_inv
Out[14]:
array([[64195.239751, -31.785379, -0.001468, 0.097988, -0.925057,
        8.549464],
       [-31.785379, 0.015738, 0.000001, -0.000052, 0.000455, -0.004264],
       [-0.001468, 0.000001, 0.000000, -0.000000, -0.000000, 0.000000],
       [0.097988, -0.000052, -0.000000, 0.000046, 0.000007, -0.000030],
       [-0.925057, 0.000455, -0.000000, 0.000007, 0.000109, 0.000100],
       [8.549464, -0.004264, 0.000000, -0.000030, 0.000100, 0.034864]])

Step 5: Multiply the X transpose and y matrices

In [15]:
XT_y = np.matmul(XT, y)
XT_y
Out[15]:
array([[6736778.000000],
       [13600574239.000000],
       [48812233067.000000],
       [980155565.000000],
       [313923137.200000],
       [11069969.700000]])

Step 6: Multiply the results of steps 4 and 5 to find the intercept and the coefficients

In [16]:
betas = np.matmul(XT_X_inv, XT_y)
betas
Out[16]:
array([[-1686494.715373],
       [852.845668],
       [-0.015301],
       [-20.821057],
       [-371.356381],
       [5023.590832]])
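
Explicitly inverting XᵀX works here, but it is numerically fragile when the columns are strongly correlated (note the large condition number reported by statsmodels below). A sketch of two more stable ways to get the same coefficients with numpy:

# Solve the normal equations without forming the inverse explicitly
betas_solve = np.linalg.solve(XT_X, XT_y)

# Or let numpy's least-squares routine work on X and y directly,
# which avoids squaring the condition number of X
betas_lstsq, _, _, _ = np.linalg.lstsq(X, y, rcond=None)

print(betas_solve.ravel())
print(betas_lstsq.ravel())  # both should be very close to the betas above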

We can verify this with statsmodels OLS below:

In [17]:
import statsmodels.api as sm

regressor = sm.OLS(y, X).fit()
print(regressor.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.761
Model:                            OLS   Adj. R-squared:                  0.757
Method:                 Least Squares   F-statistic:                     187.4
Date:                Fri, 31 Jul 2020   Prob (F-statistic):           3.45e-89
Time:                        08:08:45   Log-Likelihood:                -2721.7
No. Observations:                 300   AIC:                             5455.
Df Residuals:                     294   BIC:                             5478.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const      -1.686e+06    5.4e+05     -3.126      0.002   -2.75e+06   -6.25e+05
x1           852.8457    267.143      3.192      0.002     327.090    1378.601
x2            -0.0153      0.024     -0.636      0.525      -0.063       0.032
x3           -20.8211     14.493     -1.437      0.152     -49.344       7.702
x4          -371.3564     22.232    -16.704      0.000    -415.110    -327.602
x5          5023.5908    397.606     12.635      0.000    4241.075    5806.106
==============================================================================
Omnibus:                        8.051   Durbin-Watson:                   1.846
Prob(Omnibus):                  0.018   Jarque-Bera (JB):                7.983
Skew:                           0.352   Prob(JB):                       0.0185
Kurtosis:                       3.377   Cond. No.                     4.65e+07
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.65e+07. This might indicate that there are
strong multicollinearity or other numerical problems.

Part 4: Calculation for r-square value:

If you have followed along until here, what follows will also be quite simple.

For any regression problem, finding just the intercept and the coefficients is never good enough; we also need to know how good the fit is.

The multiple coefficient of determination R2 measures the proportion of the variation in the dependent variable that is explained by the combination of the independent variables in the multiple regression model:

image.png
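
In case the screenshot does not render, the formula is:

$$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$

where SSR is the regression (explained) sum of squares, SSE the residual sum of squares, and SST the total sum of squares.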

The calculation for SST is: image.png
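
Written out, this is what the cell below computes:

$$SST = \sum_{i=1}^{n} (y_i - \bar{y})^2 = y^T y - \frac{\left( \sum_{i=1}^{n} y_i \right)^2}{n}$$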

We can take these values from the cross_tab variable above

In [18]:
yT_y = cross_tab[-1:, -1:]                     # y'y, the sum of squared prices
n = cross_tab[:1, :1]                          # number of rows (300)
y_bar_square = np.square(cross_tab[:1, -1:])   # (sum of y) squared

SST = yT_y - (y_bar_square / n)                # y'y - (sum y)^2 / n
SST
Out[18]:
array([[5581755116.386658]])

SSR is given by: image.png
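
In matrix form, matching the cell below:

$$SSR = \hat{\beta}^T X^T y - \frac{\left( \sum_{i=1}^{n} y_i \right)^2}{n}$$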

In [19]:
n = cross_tab[:1, :1]                          # number of rows (300)
y_bar_square = np.square(cross_tab[:1, -1:])   # (sum of y) squared

SSR = np.sum(np.multiply(betas, XT_y)) - (y_bar_square / n)  # beta' X'y - (sum y)^2 / n
SSR
Out[19]:
array([[4248620339.097961]])
In [20]:
r_square = SSR / SST
r_square
Out[20]:
array([[0.761162]])

This is the same value that we get out of the sm.OLS() method.
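
The same value can also be obtained without the cross-product shortcuts by working with the residuals directly (reusing X, y and betas from the cells above):

y_hat = np.matmul(X, betas)                   # fitted prices
SSE = np.sum(np.square(y - y_hat))            # residual sum of squares
SST_direct = np.sum(np.square(y - y.mean()))  # total sum of squares
print(1 - SSE / SST_direct)                   # about 0.761, same as above and as statsmodels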


Conclusion

In this notebook, I have demonstrated how to solve simple and multiple linear regression using just numpy and linear algebra. Though most people will never need to do this given the availability of sophisticated packages, it is always cool to solve something from scratch to learn the intuition behind the algorithms.

Link to the Medium article for this notebook.

Please upvote this notebook if you found it useful or informative!