When I did my coursework for the Business Analytics course, the statistics lectures included one showing how to solve linear regression using just Excel and matrix multiplication from linear algebra, which solves for ordinary least squares (OLS).
Since then I have forgotten how to solve it using matrix multiplication, and I wanted to learn how it is done as well as demonstrate it to others.
Sorry for not using LaTeX for the math equations; I had my notes in OneNote, so it was easier for me to put screenshots in this notebook.
Please upvote this notebook if you found it useful and/or informative!
Find the best line through a set of data points:
By "best" we mean the line that minimizes the error (ε), the vertical distance between the regression line and each data point.
OLS gives us the closed-form solution in the form of the normal equations. Minimizing this sum of squared deviations is why the problem is called the least squares problem.
To formulate this as a matrix problem, we can write it as:
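$$y = X\beta + \varepsilon$$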
Such that y is the vector of observed values, X is the matrix of X values with a leading column of ones (so that the first coefficient, β0, is the intercept), β is the vector of coefficients, and ε is the vector of errors.
Consider linear regression equation:
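$$X\beta = y$$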
Multiplying by the transpose of the X matrix on both sides:
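$$X^{T}X\beta = X^{T}y$$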
So, with the above equations we can calculate the intercept and coefficients with the equation:
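$$\hat{\beta} = (X^{T}X)^{-1}X^{T}y$$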
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import warnings
warnings.filterwarnings('ignore')
Let us find the best line using least squares for the set of data points: (1, 1), (2, 3), (3, 3), (4, 5)
Here X = [1, 2, 3, 4] and y = [1, 3, 3, 5].
But we have to add a column of ones to X, making it a matrix, so that we also get β0, which is the intercept.
X = np.matrix([[1, 1],
[1, 2],
[1, 3],
[1, 4]])
X
XT = np.matrix.transpose(X)
XT
y = np.matrix([[1],
[3],
[3],
[5]])
y
XT_X = np.matmul(XT, X)
XT_X
XT_y = np.matmul(XT, y)
XT_y
betas = np.matmul(np.linalg.inv(XT_X), XT_y)
betas
The best line is given by the equation y = 0 + 1.2 X
Of course, we can verify this by fitting a model, as below.
from sklearn.linear_model import LinearRegression
regressor = LinearRegression().fit(X = np.array([1, 2, 3, 4]).reshape(-1, 1), y = [1, 3, 3, 5])
print("The intercept is: ", str(regressor.intercept_), ". Which is almost 0.")
print("The coefficient is: ", str(regressor.coef_))
As we have seen in the simple linear regression part, multiple linear regression works the same way but with more X variables, and hence we will have as many β coefficients as there are X variables (plus the intercept β0).
The procedure for calculating β is:
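It is the same normal-equation solution as in the simple case, now with one column in X for each predictor plus the column of ones for the intercept:
$$\hat{\beta} = (X^{T}X)^{-1}X^{T}y$$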
Please download the Google Sheet, since there are some formatting issues on Google Drive that prevent the formulas from displaying correctly.
I have imported the data and taken only the first 300 rows for this analysis. The approach works with any number of rows, but since I did this first in Excel worksheets before converting it into a Python program, I felt comfortable working with 300 records in Excel.
data_vw = pd.read_csv("/kaggle/input/used-car-dataset-ford-and-mercedes/vw.csv")
data_vw = data_vw[:300]
data_vw["Intercept"] = 1
data_vw = data_vw[["Intercept", "year", "mileage", "tax", "mpg", "engineSize", "price"]]
print(data_vw.shape)
data_vw.head()
np.set_printoptions(formatter={'float_kind':'{:f}'.format})
# Cross-product matrix of [Intercept, predictors, price]: the first row holds the column sums
# (including n and the sum of price), and the bottom-right entry is the sum of squared prices
cross_tab = np.matmul(np.matrix.transpose(data_vw.values), data_vw.values)
cross_tab
X = data_vw[["Intercept", "year", "mileage", "tax", "mpg", "engineSize"]].values
y = data_vw[["price"]].values
XT = np.matrix.transpose(X)
What we get from Excel is:
XT_X = np.matmul(XT, X)
XT_X
XT_X_inv = np.linalg.inv(XT_X)
XT_X_inv
XT_y = np.matmul(XT, y)
XT_y
betas = np.matmul(XT_X_inv, XT_y)
betas
We can verify this with statsmodels OLS below:
import statsmodels.api as sm
regressor = sm.OLS(y, X).fit()
print(regressor.summary())
If you have followed along until here, what follows will also be quite simple.
For any regression problem, finding just the intercept and the coefficients is never good enough; we also need to know how good the fit is.
The multiple coefficient of determination R2 measures the proportion of the variation in the dependent variable that is explained by the combination of the independent variables in the multiple regression model:
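$$R^{2} = \frac{SSR}{SST}$$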
The calculation for SST is:
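$$SST = y^{T}y - \frac{\left(\sum y\right)^{2}}{n}$$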
We can take these values from the cross_tab variable above: its first row/column corresponds to the Intercept column of ones and its last row/column corresponds to price, so the pieces we need sit in its corners.
yT_y = cross_tab[-1:, -1:]                    # y'y: the sum of squared prices
n = cross_tab[:1, :1]                         # 1'1: the number of observations
y_bar_square = np.square(cross_tab[:1, -1:])  # (sum of y) squared: square of the sum of prices
SST = yT_y - (y_bar_square / n)
SST
SSR is given by:
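$$SSR = \hat{\beta}^{T}X^{T}y - \frac{\left(\sum y\right)^{2}}{n}$$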
n = cross_tab[:1, :1]                         # same n as above
y_bar_square = np.square(cross_tab[:1, -1:])  # same (sum of y) squared as above
SSR = np.sum(np.multiply(betas, XT_y)) - (y_bar_square / n)  # beta' X'y - (sum of y)^2 / n
SSR
r_square = SSR / SST
r_square
This is the same value that we get out of the sm.OLS() method above.
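As a quick check, we can print both values side by side, using the regressor fitted with sm.OLS above:
print("R-squared from the matrix calculation:", r_square.item())
print("R-squared from statsmodels:", regressor.rsquared)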
In this notebook, I have demonstrated how to solve simple and multiple linear regression using just NumPy and linear algebra. Though most of us will never need to do this given the availability of sophisticated packages, it is always cool to solve something from scratch to learn the intuition behind the algorithms.
Link to the Medium article for this notebook.
Please upvote this notebook if you found it useful or informative!