When I did my coursework for the Business Analytics course, the statistics lectures included one showing how to solve linear regression using just Excel and matrix multiplication from linear algebra, which solves for ordinary least squares (OLS).
Since then I have forgotten how to solve it using matrix multiplication, and I wanted to learn how it is done as well as demonstrate it to others.
Sorry for not using LaTeX for the math equations; I had my notes in OneNote, so it was easier for me to put screenshots in this notebook.
Please upvote this notebook if you found it useful and/or informative!
Find the best line through a set of data points:
By "best" we mean the line that minimizes the error (ε), the vertical distance between the regression line and each data point.
OLS gives us the closed-form solution in the form of the normal equations. Minimizing this sum of squared deviations is why the problem is called the least squares problem.
To formulate this as a matrix problem, we can write it as:
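$$y = X\beta + \varepsilon$$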
Such that y is the vector of observed values, X is the matrix of X values with a leading column of ones (so that the first coefficient, β0, is the intercept), β is the vector of coefficients, and ε is the vector of errors.
Consider linear regression equation:
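$$X\beta = y$$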
Multiplying by the transpose of the X matrix on both sides:
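$$X^{T}X\beta = X^{T}y$$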
So, with the above equations we can calculate the intercept and coefficients with the equation:
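$$\hat{\beta} = (X^{T}X)^{-1}X^{T}y$$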
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import warnings
warnings.filterwarnings('ignore')
Let us find the best line using least squares for the set of data points: (1, 1), (2, 3), (3, 3), (4, 5)
Here X = [1, 2, 3, 4] and y = [1, 3, 3, 5].
But we have to add a column of ones to X, making it a matrix, so that we also get β0, which is the intercept.
X = np.matrix([[1, 1],
[1, 2],
[1, 3],
[1, 4]])
X
XT = np.matrix.transpose(X)
XT
y = np.matrix([[1],
[3],
[3],
[5]])
y
XT_X = np.matmul(XT, X)
XT_X
XT_y = np.matmul(XT, y)
XT_y
betas = np.matmul(np.linalg.inv(XT_X), XT_y)
betas
The best line is given by the equation y = 0 + 1.2 X
Of course, we can verify this by fitting a model, as below.
from sklearn.linear_model import LinearRegression
regressor = LinearRegression().fit(X = np.array([1, 2, 3, 4]).reshape(-1, 1), y = [1, 3, 3, 5])
print("The intercept is: ", str(regressor.intercept_), ". Which is almost 0.")
print("The coefficient is: ", str(regressor.coef_))
As we have seen in the simple linear regression part, multiple linear regression works the same way but with more X variables, and hence we will have as many β coefficients as there are X variables (plus the intercept β0).
The procedure for calculating β is:
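It is the same normal-equation solution as in the simple case, now with one column in X for each predictor plus the column of ones for the intercept:
$$\hat{\beta} = (X^{T}X)^{-1}X^{T}y$$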
Please download the Google Sheet, since there are some formatting issues on Google Drive that prevent the formulas from displaying correctly.
I have imported the data and taken only the first 300 rows for this analysis. The approach works with any number of rows, but since I did this first in Excel worksheets before converting it into a Python program, I felt comfortable working with 300 records in Excel.
data_vw = pd.read_csv("/kaggle/input/used-car-dataset-ford-and-mercedes/vw.csv")
data_vw = data_vw[:300]
data_vw["Intercept"] = 1
data_vw = data_vw[["Intercept", "year", "mileage", "tax", "mpg", "engineSize", "price"]]
print(data_vw.shape)
data_vw.head()
np.set_printoptions(formatter={'float_kind':'{:f}'.format})
# Cross-product matrix of [Intercept, predictors, price]: the first row holds the column sums
# (including n and the sum of price), and the bottom-right entry is the sum of squared prices
cross_tab = np.matmul(np.matrix.transpose(data_vw.values), data_vw.values)
cross_tab
X = data_vw[["Intercept", "year", "mileage", "tax", "mpg", "engineSize"]].values
y = data_vw[["price"]].values
XT = np.matrix.transpose(X)
What we get from Excel is:
XT_X = np.matmul(XT, X)
XT_X
XT_X_inv = np.linalg.inv(XT_X)
XT_X_inv
XT_y = np.matmul(XT, y)
XT_y
betas = np.matmul(XT_X_inv, XT_y)
betas
We can verify this with statsmodels OLS below:
import statsmodels.api as sm
regressor = sm.OLS(y, X).fit()
print(regressor.summary())
If you have followed along until here, what follows will also be quite simple.
For any regression problem, finding just the intercept and the coefficients is never good enough; we also need to know how good the fit is.
The multiple coefficient of determination R2 measures the proportion of the variation in the dependent variable that is explained by the combination of the independent variables in the multiple regression model:
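$$R^{2} = \frac{SSR}{SST}$$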
The calculation for SST is:
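$$SST = y^{T}y - \frac{\left(\sum y\right)^{2}}{n}$$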
We can take these values from the cross_tab variable above: its first row/column corresponds to the Intercept column of ones and its last row/column corresponds to price, so the pieces we need sit in its corners.
yT_y = cross_tab[-1:, -1:]                    # y'y: the sum of squared prices
n = cross_tab[:1, :1]                         # 1'1: the number of observations
y_bar_square = np.square(cross_tab[:1, -1:])  # (sum of y) squared: square of the sum of prices
SST = yT_y - (y_bar_square / n)
SST
SSR is given by:
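$$SSR = \hat{\beta}^{T}X^{T}y - \frac{\left(\sum y\right)^{2}}{n}$$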
n = cross_tab[:1, :1]                         # same n as above
y_bar_square = np.square(cross_tab[:1, -1:])  # same (sum of y) squared as above
SSR = np.sum(np.multiply(betas, XT_y)) - (y_bar_square / n)  # beta' X'y - (sum of y)^2 / n
SSR
r_square = SSR / SST
r_square
This is the same value that we get out of the sm.OLS() method above.
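As a quick check, we can print both values side by side, using the regressor fitted with sm.OLS above:
print("R-squared from the matrix calculation:", r_square.item())
print("R-squared from statsmodels:", regressor.rsquared)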
In this notebook, I have demonstrated how to solve simple and multiple linear regression using just NumPy and linear algebra. Though most of us will never need to do this given the availability of sophisticated packages, it is always cool to solve something from scratch to learn the intuition behind the algorithms.
Link to the Medium article for this notebook.
Please upvote this notebook if you found it useful or informative!