Introduction:

VW cars

In this analysis, I do a basic EDA of the features, select the k best features from both the linear and the polynomial feature sets, and apply regression on top of them to find the maximum r_squared value I can achieve from the data.

  1. Introduction
  2. Importing dataset and exploration
  3. Exploratory data analysis
  4. Pre-processing for modeling
  5. Modeling
  6. Backward selection for variable selection on linear regression
  7. Polynomial features for modeling
  8. Conclusion

Importing the packages needed for the analysis. I like to import packages in alphabetical order so that the list is easy to review later.

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVR

import statsmodels.api as sm

import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns', None)

Importing dataset and exploration

There is one file in the input folder for each car brand. We will import the file with VW in its name.

In [2]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
/kaggle/input/used-car-dataset-ford-and-mercedes/vw.csv
/kaggle/input/used-car-dataset-ford-and-mercedes/bmw.csv
/kaggle/input/used-car-dataset-ford-and-mercedes/merc.csv
/kaggle/input/used-car-dataset-ford-and-mercedes/cclass.csv
/kaggle/input/used-car-dataset-ford-and-mercedes/ford.csv
/kaggle/input/used-car-dataset-ford-and-mercedes/unclean cclass.csv
/kaggle/input/used-car-dataset-ford-and-mercedes/hyundi.csv
/kaggle/input/used-car-dataset-ford-and-mercedes/vauxhall.csv
/kaggle/input/used-car-dataset-ford-and-mercedes/audi.csv
/kaggle/input/used-car-dataset-ford-and-mercedes/skoda.csv
/kaggle/input/used-car-dataset-ford-and-mercedes/toyota.csv
/kaggle/input/used-car-dataset-ford-and-mercedes/focus.csv
/kaggle/input/used-car-dataset-ford-and-mercedes/unclean focus.csv
In [3]:
data_vw = pd.read_csv("/kaggle/input/used-car-dataset-ford-and-mercedes/vw.csv")
print(data_vw.shape)
data_vw.head()
(15157, 9)
Out[3]:
model year price transmission mileage fuelType tax mpg engineSize
0 T-Roc 2019 25000 Automatic 13904 Diesel 145 49.6 2.0
1 T-Roc 2019 26883 Automatic 4562 Diesel 145 49.6 2.0
2 T-Roc 2019 20000 Manual 7414 Diesel 145 50.4 2.0
3 T-Roc 2019 33492 Automatic 4825 Petrol 145 32.5 2.0
4 T-Roc 2019 22900 Semi-Auto 6500 Petrol 150 39.8 1.5

Checking whether there are any missing values in the records

In [4]:
data_vw.isnull().sum()
Out[4]:
model           0
year            0
price           0
transmission    0
mileage         0
fuelType        0
tax             0
mpg             0
engineSize      0
dtype: int64

Nice :) The data is clean with no missing values, a very good dataset to work with!

In [5]:
data_vw.describe()
Out[5]:
year price mileage tax mpg engineSize
count 15157.000000 15157.000000 15157.000000 15157.000000 15157.000000 15157.000000
mean 2017.255789 16838.952365 22092.785644 112.744277 53.753355 1.600693
std 2.053059 7755.015206 21148.941635 63.482617 13.642182 0.461695
min 2000.000000 899.000000 1.000000 0.000000 0.300000 0.000000
25% 2016.000000 10990.000000 5962.000000 30.000000 46.300000 1.200000
50% 2017.000000 15497.000000 16393.000000 145.000000 53.300000 1.600000
75% 2019.000000 20998.000000 31824.000000 145.000000 60.100000 2.000000
max 2020.000000 69994.000000 212000.000000 580.000000 188.300000 3.200000

Exploratory data analysis

In [6]:
sns.countplot(data_vw["transmission"])
Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb9d7d2a6d0>

Most of the cars in the dataset have a manual transmission, with far fewer automatic and semi-automatic cars.

In [7]:
print(data_vw["model"].value_counts() / len(data_vw))
sns.countplot(y = data_vw["model"])
 Golf               0.320842
 Polo               0.216863
 Tiguan             0.116448
 Passat             0.060368
 Up                 0.058323
 T-Roc              0.048360
 Touareg            0.023949
 Touran             0.023224
 T-Cross            0.019793
 Golf SV            0.017682
 Sharan             0.017154
 Arteon             0.016362
 Scirocco           0.015966
 Amarok             0.007323
 Caravelle          0.006664
 CC                 0.006268
 Tiguan Allspace    0.006004
 Beetle             0.005476
 Shuttle            0.004025
 Caddy Maxi Life    0.003893
 Jetta              0.002111
 California         0.000990
 Caddy Life         0.000528
 Eos                0.000462
 Caddy              0.000396
 Fox                0.000264
 Caddy Maxi         0.000264
Name: model, dtype: float64
Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb9d7871c90>

The top 3 models in the dataset, Golf, Polo and Tiguan, constitute about 65% of all the VW cars, with all other models contributing the remaining 35%.

In [8]:
sns.countplot(data_vw["fuelType"])
Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb9d5734a90>
In [9]:
sns.countplot(y = data_vw["year"])
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb9d565d750>
In [10]:
plt.figure(figsize=(15,5),facecolor='w') 
sns.barplot(x = data_vw["year"], y = data_vw["price"])
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb9d55e16d0>

The recently manufactured cars (year = 2018, 2019) sell for a higher average price than cars manufactured earlier.

In [11]:
sns.barplot(x = data_vw["transmission"], y = data_vw["price"])
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb9d558e890>
In [12]:
plt.figure(figsize=(15,10),facecolor='w') 
sns.scatterplot(data_vw["mileage"], data_vw["price"], hue = data_vw["year"])
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb9d54a5050>
In [13]:
plt.figure(figsize=(15,5),facecolor='w') 
sns.scatterplot(data_vw["mileage"], data_vw["price"], hue = data_vw["fuelType"])
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb9d54e1fd0>
In [14]:
sns.pairplot(data_vw)
Out[14]:
<seaborn.axisgrid.PairGrid at 0x7fb9d535be90>

Now I am computing an age_of_car field by subtracting the year field from 2020, and then dropping the year field.

In [15]:
data_vw["age_of_car"] = 2020 - data_vw["year"]
data_vw = data_vw.drop(columns = ["year"])
data_vw.sample(10)
Out[15]:
model price transmission mileage fuelType tax mpg engineSize age_of_car
2888 Golf 10995 Semi-Auto 32741 Diesel 20 67.3 1.6 6
7801 Polo 10495 Manual 31697 Petrol 145 60.1 1.2 3
900 Golf 25990 Semi-Auto 6705 Petrol 145 37.7 2.0 1
14745 Amarok 39999 Automatic 2451 Diesel 260 33.6 3.0 0
6400 Passat 9795 Manual 97060 Diesel 20 67.3 2.0 4
3913 Golf 18299 Manual 7562 Petrol 145 45.6 1.5 1
15144 Caddy Maxi 9949 Automatic 93113 Diesel 160 52.3 1.6 5
7484 Polo 12999 Semi-Auto 31243 Petrol 20 61.4 1.4 3
4625 Golf 10393 Manual 44906 Diesel 0 74.3 1.6 4
5013 Golf 18490 Automatic 6289 Petrol 145 44.1 1.5 1

Pre-processing for modeling

I like to use pd.get_dummies over sklearn's OneHotEncoder to one-hot encode the categorical variables. It keeps the dataset tidy and preserves the column names.

In [16]:
data_vw_expanded = pd.get_dummies(data_vw)
data_vw_expanded.head()
Out[16]:
price mileage tax mpg engineSize age_of_car model_ Amarok model_ Arteon model_ Beetle model_ CC model_ Caddy model_ Caddy Life model_ Caddy Maxi model_ Caddy Maxi Life model_ California model_ Caravelle model_ Eos model_ Fox model_ Golf model_ Golf SV model_ Jetta model_ Passat model_ Polo model_ Scirocco model_ Sharan model_ Shuttle model_ T-Cross model_ T-Roc model_ Tiguan model_ Tiguan Allspace model_ Touareg model_ Touran model_ Up transmission_Automatic transmission_Manual transmission_Semi-Auto fuelType_Diesel fuelType_Hybrid fuelType_Other fuelType_Petrol
0 25000 13904 145 49.6 2.0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0
1 26883 4562 145 49.6 2.0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0
2 20000 7414 145 50.4 2.0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0 0
3 33492 4825 145 32.5 2.0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1
4 22900 6500 150 39.8 1.5 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1
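For contrast, here is a minimal sketch of the equivalent with sklearn's OneHotEncoder (not used in this notebook; the exact method names depend on the sklearn version):

from sklearn.preprocessing import OneHotEncoder

cat_cols = ["model", "transmission", "fuelType"]
ohe = OneHotEncoder(sparse=False)               # newer sklearn: sparse_output=False
encoded = ohe.fit_transform(data_vw[cat_cols])
encoded_df = pd.DataFrame(
    encoded,
    columns=ohe.get_feature_names(cat_cols))    # newer sklearn: get_feature_names_out()
encoded_df.head()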

Applying StandardScaler to standardize all the variables in the dataset.

In [17]:
std = StandardScaler()
data_vw_expanded_std = std.fit_transform(data_vw_expanded)
data_vw_expanded_std = pd.DataFrame(data_vw_expanded_std, columns = data_vw_expanded.columns)
print(data_vw_expanded_std.shape)
data_vw_expanded_std.head()
(15157, 40)
Out[17]:
price mileage tax mpg engineSize age_of_car model_ Amarok model_ Arteon model_ Beetle model_ CC model_ Caddy model_ Caddy Life model_ Caddy Maxi model_ Caddy Maxi Life model_ California model_ Caravelle model_ Eos model_ Fox model_ Golf model_ Golf SV model_ Jetta model_ Passat model_ Polo model_ Scirocco model_ Sharan model_ Shuttle model_ T-Cross model_ T-Roc model_ Tiguan model_ Tiguan Allspace model_ Touareg model_ Touran model_ Up transmission_Automatic transmission_Manual transmission_Semi-Auto fuelType_Diesel fuelType_Hybrid fuelType_Other fuelType_Petrol
0 1.052392 -0.387209 0.508120 -0.304459 0.864902 -0.849595 -0.085892 -0.128974 -0.074204 -0.079418 -0.0199 -0.02298 -0.016247 -0.062512 -0.031474 -0.081904 -0.021495 -0.016247 -0.687322 -0.134164 -0.045997 -0.253469 -0.526229 -0.127378 -0.13211 -0.063567 -0.1421 4.435993 -0.363036 -0.077718 -0.156643 -0.154194 -0.248868 2.594834 -1.280856 -0.576411 1.174175 -0.09828 -0.075981 -1.138035
1 1.295211 -0.828948 0.508120 -0.304459 0.864902 -0.849595 -0.085892 -0.128974 -0.074204 -0.079418 -0.0199 -0.02298 -0.016247 -0.062512 -0.031474 -0.081904 -0.021495 -0.016247 -0.687322 -0.134164 -0.045997 -0.253469 -0.526229 -0.127378 -0.13211 -0.063567 -0.1421 4.435993 -0.363036 -0.077718 -0.156643 -0.154194 -0.248868 2.594834 -1.280856 -0.576411 1.174175 -0.09828 -0.075981 -1.138035
2 0.407627 -0.694090 0.508120 -0.245816 0.864902 -0.849595 -0.085892 -0.128974 -0.074204 -0.079418 -0.0199 -0.02298 -0.016247 -0.062512 -0.031474 -0.081904 -0.021495 -0.016247 -0.687322 -0.134164 -0.045997 -0.253469 -0.526229 -0.127378 -0.13211 -0.063567 -0.1421 4.435993 -0.363036 -0.077718 -0.156643 -0.154194 -0.248868 -0.385381 0.780728 -0.576411 1.174175 -0.09828 -0.075981 -1.138035
3 2.147462 -0.816512 0.508120 -1.557966 0.864902 -0.849595 -0.085892 -0.128974 -0.074204 -0.079418 -0.0199 -0.02298 -0.016247 -0.062512 -0.031474 -0.081904 -0.021495 -0.016247 -0.687322 -0.134164 -0.045997 -0.253469 -0.526229 -0.127378 -0.13211 -0.063567 -0.1421 4.435993 -0.363036 -0.077718 -0.156643 -0.154194 -0.248868 2.594834 -1.280856 -0.576411 -0.851661 -0.09828 -0.075981 0.878707
4 0.781591 -0.737309 0.586884 -1.022843 -0.218101 -0.849595 -0.085892 -0.128974 -0.074204 -0.079418 -0.0199 -0.02298 -0.016247 -0.062512 -0.031474 -0.081904 -0.021495 -0.016247 -0.687322 -0.134164 -0.045997 -0.253469 -0.526229 -0.127378 -0.13211 -0.063567 -0.1421 4.435993 -0.363036 -0.077718 -0.156643 -0.154194 -0.248868 -0.385381 -1.280856 1.734874 -0.851661 -0.09828 -0.075981 0.878707
In [18]:
X_train, X_test, y_train, y_test = train_test_split(data_vw_expanded_std.drop(columns = ['price']), data_vw_expanded_std[['price']])
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
(11367, 39)
(3790, 39)
(11367, 1)
(3790, 1)
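One caveat worth flagging: the scaler above was fit on the full dataset before splitting, so the test rows influence the scaling statistics. A minimal sketch of the stricter protocol (an alternative, not what was run above), fitting the scaler on the training split only:

X_tr, X_te, y_tr, y_te = train_test_split(
    data_vw_expanded.drop(columns=["price"]), data_vw_expanded[["price"]])

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)   # learn mean/std from training rows only
X_te_scaled = scaler.transform(X_te)       # reuse the training statistics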

Modeling

Selecting best features for model

Since there are 40 variables in the dataset after the one-hot encoding (39 predictors once price is set aside), I am using SelectKBest from sklearn to select the best features for the regression.

For this, I run SelectKBest() with f_regression for k from 3 to 39 variables (in steps of 2) to see where we get the best score.

In [19]:
column_names = data_vw_expanded.drop(columns = ['price']).columns

no_of_features = []
r_squared_train = []
r_squared_test = []

for k in range(3, 40, 2):
    selector = SelectKBest(f_regression, k = k)
    X_train_transformed = selector.fit_transform(X_train, y_train)
    X_test_transformed = selector.transform(X_test)
    regressor = LinearRegression()
    regressor.fit(X_train_transformed, y_train)
    no_of_features.append(k)
    r_squared_train.append(regressor.score(X_train_transformed, y_train))
    r_squared_test.append(regressor.score(X_test_transformed, y_test))
    
sns.lineplot(x = no_of_features, y = r_squared_train, legend = 'full')
sns.lineplot(x = no_of_features, y = r_squared_test, legend = 'full')
Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb9ce617f10>

The score reaches about 0.88 at around 23 variables, before the curve stabilizes. Hence I keep k = 23 and select the 23 best variables from the dataset.

In [20]:
selector = SelectKBest(f_regression, k = 23)
X_train_transformed = selector.fit_transform(X_train, y_train)
X_test_transformed = selector.transform(X_test)
column_names[selector.get_support()]
Out[20]:
Index(['mileage', 'tax', 'mpg', 'engineSize', 'age_of_car', 'model_ Amarok',
       'model_ Arteon', 'model_ California', 'model_ Caravelle', 'model_ Polo',
       'model_ Sharan', 'model_ Shuttle', 'model_ T-Roc', 'model_ Tiguan',
       'model_ Tiguan Allspace', 'model_ Touareg', 'model_ Up',
       'transmission_Automatic', 'transmission_Manual',
       'transmission_Semi-Auto', 'fuelType_Diesel', 'fuelType_Hybrid',
       'fuelType_Petrol'],
      dtype='object')
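To see why these 23 columns win, we can inspect the univariate F-scores the fitted selector computed:

f_scores = pd.Series(selector.scores_, index=column_names)
f_scores.sort_values(ascending=False).head(23)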
In [21]:
def regression_model(model):
    """
    Will fit the regression model passed and will return the regressor object and the score
    """
    regressor = model
    regressor.fit(X_train_transformed, y_train)
    score = regressor.score(X_test_transformed, y_test)
    return regressor, score
In [22]:
model_performance = pd.DataFrame(columns = ["Features", "Model", "Score"])

models_to_evaluate = [LinearRegression(), Ridge(), Lasso(), SVR(), RandomForestRegressor(), MLPRegressor()]

for model in models_to_evaluate:
    regressor, score = regression_model(model)
    model_performance = model_performance.append({"Features": "Linear","Model": model, "Score": score}, ignore_index=True)

model_performance
Out[22]:
Features Model Score
0 Linear LinearRegression() 0.883684
1 Linear Ridge() 0.883686
2 Linear Lasso() -0.001168
3 Linear SVR() 0.938504
4 Linear (DecisionTreeRegressor(max_features='auto', ra... 0.952169
5 Linear MLPRegressor() 0.942873

The best score we get is from RandomForestRegressor(), at about 0.95 (the exact value varies slightly between runs).
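Since a single train/test split can be noisy, the cross_val_score helper (already imported above) gives a steadier estimate; a minimal sketch for the best model:

cv_scores = cross_val_score(RandomForestRegressor(), X_train_transformed,
                            y_train.values.ravel(), cv=5, scoring="r2")
print(cv_scores.mean(), cv_scores.std())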

Backward selection for variable selection on linear regression

Fitting a linear regression model and checking the model parameters

In [23]:
regressor = sm.OLS(y_train, X_train).fit()
print(regressor.summary())

X_train_dropped = X_train.copy()
                                 OLS Regression Results                                
=======================================================================================
Dep. Variable:                  price   R-squared (uncentered):                   0.889
Model:                            OLS   Adj. R-squared (uncentered):              0.888
Method:                 Least Squares   F-statistic:                              2513.
Date:                Fri, 31 Jul 2020   Prob (F-statistic):                        0.00
Time:                        06:55:00   Log-Likelihood:                         -3711.5
No. Observations:               11367   AIC:                                      7495.
Df Residuals:                   11331   BIC:                                      7759.
Df Model:                          36                                                  
Covariance Type:            nonrobust                                                  
==========================================================================================
                             coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------
mileage                   -0.2089      0.005    -39.838      0.000      -0.219      -0.199
tax                       -0.0772      0.004    -18.385      0.000      -0.085      -0.069
mpg                       -0.1549      0.006    -27.087      0.000      -0.166      -0.144
engineSize                 0.4255      0.007     59.650      0.000       0.412       0.439
age_of_car                -0.3410      0.005    -64.543      0.000      -0.351      -0.331
model_ Amarok              0.0180      0.003      5.136      0.000       0.011       0.025
model_ Arteon              0.0490      0.003     15.623      0.000       0.043       0.055
model_ Beetle             -0.0091      0.003     -2.892      0.004      -0.015      -0.003
model_ CC                 -0.0249      0.003     -7.944      0.000      -0.031      -0.019
model_ Caddy              -0.0006      0.003     -0.174      0.862      -0.007       0.006
model_ Caddy Life         -0.0031      0.003     -0.893      0.372      -0.010       0.004
model_ Caddy Maxi          0.0034      0.003      1.063      0.288      -0.003       0.010
model_ Caddy Maxi Life    -0.0172      0.003     -4.918      0.000      -0.024      -0.010
model_ California          0.1397      0.003     47.263      0.000       0.134       0.145
model_ Caravelle           0.1905      0.003     58.630      0.000       0.184       0.197
model_ Eos                 0.0013      0.003      0.450      0.653      -0.004       0.007
model_ Fox                 0.0070      0.003      2.218      0.027       0.001       0.013
model_ Golf               -0.0274      0.002    -11.246      0.000      -0.032      -0.023
model_ Golf SV            -0.0334      0.003    -10.654      0.000      -0.040      -0.027
model_ Jetta              -0.0204      0.003     -6.607      0.000      -0.027      -0.014
model_ Passat             -0.0114      0.003     -3.651      0.000      -0.018      -0.005
model_ Polo               -0.1247      0.004    -34.840      0.000      -0.132      -0.118
model_ Scirocco           -0.0333      0.003    -10.390      0.000      -0.040      -0.027
model_ Sharan              0.0284      0.003      8.828      0.000       0.022       0.035
model_ Shuttle             0.0295      0.003      9.404      0.000       0.023       0.036
model_ T-Cross             0.0328      0.003     10.116      0.000       0.026       0.039
model_ T-Roc               0.0659      0.003     21.683      0.000       0.060       0.072
model_ Tiguan              0.1023      0.003     32.386      0.000       0.096       0.108
model_ Tiguan Allspace     0.0548      0.003     17.367      0.000       0.049       0.061
model_ Touareg             0.1045      0.004     25.636      0.000       0.097       0.113
model_ Touran              0.0405      0.003     13.234      0.000       0.034       0.046
model_ Up                 -0.1325      0.004    -37.508      0.000      -0.139      -0.126
transmission_Automatic     0.0329      0.003     12.109      0.000       0.028       0.038
transmission_Manual       -0.0570      0.002    -26.830      0.000      -0.061      -0.053
transmission_Semi-Auto     0.0384      0.002     17.116      0.000       0.034       0.043
fuelType_Diesel           -0.0828      0.003    -30.395      0.000      -0.088      -0.077
fuelType_Hybrid            0.1564      0.004     37.095      0.000       0.148       0.165
fuelType_Other             0.0182      0.003      5.890      0.000       0.012       0.024
fuelType_Petrol            0.0490      0.003     16.813      0.000       0.043       0.055
==============================================================================
Omnibus:                     2551.082   Durbin-Watson:                   2.050
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            15150.663
Skew:                           0.945   Prob(JB):                         0.00
Kurtosis:                       8.330   Cond. No.                     1.21e+16
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 2.89e-28. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
In [24]:
while True:
    if max(regressor.pvalues) > 0.05:
        drop_variable = regressor.pvalues[regressor.pvalues == max(regressor.pvalues)]
        print("Dropping " + drop_variable.index[0] + " and running regression again because pvalue is: " + str(drop_variable[0]))
        X_train_dropped = X_train_dropped.drop(columns = [drop_variable.index[0]])
        regressor = sm.OLS(y_train, X_train_dropped).fit()
    else:
        print("All p values less than 0.05")
        break
Dropping model_ Caddy and running regression again because pvalue is: 0.8615422166753985
Dropping model_ Passat and running regression again because pvalue is: 0.9125329548360342
Dropping model_ Caddy Life and running regression again because pvalue is: 0.5666150517208496
Dropping model_ Golf and running regression again because pvalue is: 0.47965914705962986
Dropping model_ Eos and running regression again because pvalue is: 0.38949430132000473
Dropping model_ Caddy Maxi and running regression again because pvalue is: 0.17901474935694434
Dropping model_ Beetle and running regression again because pvalue is: 0.12094753413132125
All p values less than 0.05

7 variables are dropped because their p-values are higher than our alpha level of 0.05. We fit the model with the remaining variables and review the summary below.

We see a slight improvement over the sklearn linear regression fit in our earlier step, which yielded an r_squared value of about 0.88; here we get an r_squared value of about 0.89. (Note that sm.OLS does not add an intercept by default, so the R-squared reported here is the uncentered one.)

In [25]:
print(regressor.summary())
                                 OLS Regression Results                                
=======================================================================================
Dep. Variable:                  price   R-squared (uncentered):                   0.889
Model:                            OLS   Adj. R-squared (uncentered):              0.888
Method:                 Least Squares   F-statistic:                              3016.
Date:                Fri, 31 Jul 2020   Prob (F-statistic):                        0.00
Time:                        06:55:00   Log-Likelihood:                         -3714.4
No. Observations:               11367   AIC:                                      7489.
Df Residuals:                   11337   BIC:                                      7709.
Df Model:                          30                                                  
Covariance Type:            nonrobust                                                  
==========================================================================================
                             coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------
mileage                   -0.2078      0.005    -40.043      0.000      -0.218      -0.198
tax                       -0.0773      0.004    -18.507      0.000      -0.086      -0.069
mpg                       -0.1552      0.006    -27.231      0.000      -0.166      -0.144
engineSize                 0.4256      0.007     59.786      0.000       0.412       0.440
age_of_car                -0.3421      0.005    -65.669      0.000      -0.352      -0.332
model_ Amarok              0.0227      0.004      6.446      0.000       0.016       0.030
model_ Arteon              0.0562      0.003     17.443      0.000       0.050       0.063
model_ CC                 -0.0204      0.003     -6.449      0.000      -0.027      -0.014
model_ Caddy Maxi Life    -0.0137      0.004     -3.885      0.000      -0.021      -0.007
model_ California          0.1415      0.003     47.723      0.000       0.136       0.147
model_ Caravelle           0.1950      0.003     59.183      0.000       0.189       0.201
model_ Fox                 0.0080      0.003      2.529      0.011       0.002       0.014
model_ Golf SV            -0.0258      0.003     -7.918      0.000      -0.032      -0.019
model_ Jetta              -0.0178      0.003     -5.735      0.000      -0.024      -0.012
model_ Polo               -0.1003      0.005    -22.283      0.000      -0.109      -0.091
model_ Scirocco           -0.0261      0.003     -7.940      0.000      -0.033      -0.020
model_ Sharan              0.0358      0.003     10.750      0.000       0.029       0.042
model_ Shuttle             0.0331      0.003     10.452      0.000       0.027       0.039
model_ T-Cross             0.0409      0.003     11.969      0.000       0.034       0.048
model_ T-Roc               0.0784      0.003     23.261      0.000       0.072       0.085
model_ Tiguan              0.1206      0.004     32.378      0.000       0.113       0.128
model_ Tiguan Allspace     0.0592      0.003     18.515      0.000       0.053       0.065
model_ Touareg             0.1130      0.004     27.600      0.000       0.105       0.121
model_ Touran              0.0491      0.003     15.284      0.000       0.043       0.055
model_ Up                 -0.1185      0.004    -30.371      0.000      -0.126      -0.111
transmission_Automatic     0.0331      0.003     12.185      0.000       0.028       0.038
transmission_Manual       -0.0573      0.002    -27.001      0.000      -0.061      -0.053
transmission_Semi-Auto     0.0385      0.002     17.190      0.000       0.034       0.043
fuelType_Diesel           -0.0825      0.003    -30.633      0.000      -0.088      -0.077
fuelType_Hybrid            0.1568      0.004     37.424      0.000       0.149       0.165
fuelType_Other             0.0182      0.003      5.887      0.000       0.012       0.024
fuelType_Petrol            0.0485      0.003     16.926      0.000       0.043       0.054
==============================================================================
Omnibus:                     2547.532   Durbin-Watson:                   2.049
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            15119.832
Skew:                           0.944   Prob(JB):                         0.00
Kurtosis:                       8.325   Cond. No.                     4.30e+15
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 2.3e-27. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
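As an aside, the missing intercept (together with the complete dummy sets) is what drives the uncentered R-squared and the enormous condition number above. A minimal sketch of a fit with an explicit constant, assuming drop_first=True one-hot encoding to avoid the dummy trap:

data_vw_df = pd.get_dummies(data_vw, drop_first=True)   # drop one level per categorical
X_const = sm.add_constant(data_vw_df.drop(columns=["price"]))
ols_const = sm.OLS(data_vw_df["price"], X_const).fit()
print(ols_const.rsquared)                               # centered R-squared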

Fitting on polynomial features

I would like to explore the dataset a bit further to see whether a model built on polynomial features performs better with the same estimators.

I am using PolynomialFeatures() to engineer polynomial features from the dataset. With 39 input features, degree-2 PolynomialFeatures() produces 820 columns (1 bias + 39 linear + 39 squared + 741 pairwise terms), so I again use SelectKBest to find the optimum feature set size.

In [26]:
poly = PolynomialFeatures()
X_train_transformed_poly = poly.fit_transform(X_train)
X_test_transformed_poly = poly.transform(X_test)

print(X_train_transformed_poly.shape)

no_of_features = []
r_squared = []

for k in range(10, 277, 5):
    selector = SelectKBest(f_regression, k = k)
    X_train_transformed = selector.fit_transform(X_train_transformed_poly, y_train)
    regressor = LinearRegression()
    regressor.fit(X_train_transformed, y_train)
    no_of_features.append(k)
    r_squared.append(regressor.score(X_train_transformed, y_train))
    
sns.lineplot(x = no_of_features, y = r_squared)
(11367, 820)
Out[26]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb9ce678690>

From the graph above we can see the training score reaching about 0.93 at around 110 features.
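As a sanity check on what those 820 columns actually are, the fitted poly object can report their names (the method name depends on the sklearn version):

feature_names = poly.get_feature_names(X_train.columns)  # newer sklearn: get_feature_names_out()
print(len(feature_names))    # 820 = 1 bias + 39 linear + 780 quadratic terms
print(feature_names[:5])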

In [27]:
selector = SelectKBest(f_regression, k = 110)
X_train_transformed = selector.fit_transform(X_train_transformed_poly, y_train)
X_test_transformed = selector.transform(X_test_transformed_poly)
In [28]:
models_to_evaluate = [LinearRegression(), Ridge(), Lasso(), SVR(), RandomForestRegressor(), MLPRegressor()]

for model in models_to_evaluate:
    regressor, score = regression_model(model)
    model_performance = model_performance.append({"Features": "Polynomial","Model": model, "Score": score}, ignore_index=True)

model_performance
Out[28]:
Features Model Score
0 Linear LinearRegression() 0.883684
1 Linear Ridge() 0.883686
2 Linear Lasso() -0.001168
3 Linear SVR() 0.938504
4 Linear (DecisionTreeRegressor(max_features='auto', ra... 0.952169
5 Linear MLPRegressor() 0.942873
6 Polynomial LinearRegression() -2.665967
7 Polynomial Ridge() 0.926233
8 Polynomial Lasso() 0.127168
9 Polynomial SVR() 0.944961
10 Polynomial (DecisionTreeRegressor(max_features='auto', ra... 0.957020
11 Polynomial MLPRegressor() 0.919230

Conclusion:

I got a maximum r^2 score of about 0.957 for the polynomial features on the RandomForestRegressor. Interestingly, plain LinearRegression collapses on the polynomial features (score of -2.67), while Ridge regularization keeps it stable at about 0.93.

As next steps, I could concentrate on individual features and apply transformations, such as log transforms, to make the model perform even better.
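For instance, a minimal sketch of one such transform, compressing the right-skewed price target with a log (purely illustrative, not run above):

data_vw["log_price"] = np.log1p(data_vw["price"])   # log(1 + price) tames the right skew
plt.hist(data_vw["log_price"], bins=50)
plt.xlabel("log(1 + price)")
plt.show()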

Please upvote the notebook if you liked it, and leave feedback if you think something could have been better.