In this analysis, I do a basic EDA of the features, select the k best features from both the linear and the polynomial feature sets, and apply regression on top of them to find the maximum r_squared value I am able to achieve from the data.
Importing the packages needed for the analysis. I usually like to import packages in alphabetical order, so they are easy to review if needed.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import SVR
import statsmodels.api as sm
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
There are many files in the input folder, one for each car brand. We will import the file with VW in its name.
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
data_vw = pd.read_csv("/kaggle/input/used-car-dataset-ford-and-mercedes/vw.csv")
print(data_vw.shape)
data_vw.head()
Seeing if there are any missing values in the records
data_vw.isnull().sum()
Nice :) the data is clean with no missing values, a very good one to work with!
data_vw.describe()
sns.countplot(x = data_vw["transmission"])
Most of the cars in the dataset have manual transmission, with very few automatic and semi-automatic cars.
print(data_vw["model"].value_counts() / len(data_vw))
sns.countplot(y = data_vw["model"])
The top 3 models in the dataset, Golf, Polo, and Tiguan, constitute 64% of all the VW cars, with all other models contributing the remaining 36%.
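A quick sanity check of that 64% figure (an extra snippet, not in the original flow): the cumulative share of the three most common models.
top3_share = data_vw["model"].value_counts(normalize=True).head(3).sum()
print(f"Top 3 models account for {top3_share:.0%} of the cars")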
sns.countplot(x = data_vw["fuelType"])
sns.countplot(y = data_vw["year"])
plt.figure(figsize=(15,5),facecolor='w')
sns.barplot(x = data_vw["year"], y = data_vw["price"])
The recently manufactured cars (year = 2018, 2019) sell for a higher average price than cars manufactured earlier.
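An optional numeric companion to the bar chart above (not in the original notebook): the mean price per manufacturing year.
data_vw.groupby("year")["price"].mean().sort_index()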
sns.barplot(x = data_vw["transmission"], y = data_vw["price"])
plt.figure(figsize=(15,10),facecolor='w')
sns.scatterplot(x = data_vw["mileage"], y = data_vw["price"], hue = data_vw["year"])
plt.figure(figsize=(15,5),facecolor='w')
sns.scatterplot(x = data_vw["mileage"], y = data_vw["price"], hue = data_vw["fuelType"])
sns.pairplot(data_vw)
Now I am computing an age field by subtracting the year field from 2020, and dropping the year field.
data_vw["age_of_car"] = 2020 - data_vw["year"]
data_vw = data_vw.drop(columns = ["year"])
data_vw.sample(10)
I like to use pd.get_dummies over the OneHotEncoder in sklearn to one-hot encode the categorical variables. It keeps the dataset tidy and preserves the column names.
data_vw_expanded = pd.get_dummies(data_vw)
data_vw_expanded.head()
Applying StandardScaler to standardize all the variables in the dataset.
std = StandardScaler()
data_vw_expanded_std = std.fit_transform(data_vw_expanded)
data_vw_expanded_std = pd.DataFrame(data_vw_expanded_std, columns = data_vw_expanded.columns)
print(data_vw_expanded_std.shape)
data_vw_expanded_std.head()
X_train, X_test, y_train, y_test = train_test_split(data_vw_expanded_std.drop(columns = ['price']), data_vw_expanded_std[['price']])
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
Since there are 40 variables in the dataset after one-hot encoding, I am using the SelectKBest option from sklearn to select the best features for the regression.
For this, I am running SelectKBest() with f_regression for feature counts from 3 to 40 to see where we get the best score.
column_names = data_vw_expanded.drop(columns = ['price']).columns
no_of_features = []
r_squared_train = []
r_squared_test = []
for k in range(3, 40, 2):
    selector = SelectKBest(f_regression, k = k)
    X_train_transformed = selector.fit_transform(X_train, y_train)
    X_test_transformed = selector.transform(X_test)
    regressor = LinearRegression()
    regressor.fit(X_train_transformed, y_train)
    no_of_features.append(k)
    r_squared_train.append(regressor.score(X_train_transformed, y_train))
    r_squared_test.append(regressor.score(X_test_transformed, y_test))
sns.lineplot(x = no_of_features, y = r_squared_train, legend = 'full')
sns.lineplot(x = no_of_features, y = r_squared_test, legend = 'full')
We reach a score of about 0.88 around 23 variables, before the curve stabilizes. Hence, keeping k at 23 and selecting the 23 best variables from the dataset.
selector = SelectKBest(f_regression, k = 23)
X_train_transformed = selector.fit_transform(X_train, y_train)
X_test_transformed = selector.transform(X_test)
column_names[selector.get_support()]
def regression_model(model):
    """
    Will fit the regression model passed and will return the regressor object and the score
    """
    # Uses the X_train_transformed / X_test_transformed (selected features) defined above
    regressor = model
    regressor.fit(X_train_transformed, y_train)
    score = regressor.score(X_test_transformed, y_test)
    return regressor, score
model_performance = pd.DataFrame(columns = ["Features", "Model", "Score"])
models_to_evaluate = [LinearRegression(), Ridge(), Lasso(), SVR(), RandomForestRegressor(), MLPRegressor()]
for model in models_to_evaluate:
    regressor, score = regression_model(model)
    model_performance = model_performance.append({"Features": "Linear", "Model": model, "Score": score}, ignore_index=True)
model_performance
The best result is from RandomForestRegressor(), with a score of 0.9513.
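As an optional sanity check (a sketch, not part of the original flow), we can cross-validate the best model on the 23 selected features, so the score does not depend on a single train/test split; cross_val_score is already imported above.
# 5-fold cross-validated r^2 for the RandomForest on the selected features
cv_scores = cross_val_score(RandomForestRegressor(), X_train_transformed, y_train.values.ravel(), cv=5, scoring="r2")
print(cv_scores.mean(), cv_scores.std())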
Fitting a linear regression model and checking the model parameters
regressor = sm.OLS(y_train, X_train).fit()
print(regressor.summary())
X_train_dropped = X_train.copy()
while True:
    if max(regressor.pvalues) > 0.05:
        drop_variable = regressor.pvalues[regressor.pvalues == max(regressor.pvalues)]
        print("Dropping " + drop_variable.index[0] + " and running regression again because pvalue is: " + str(drop_variable[0]))
        X_train_dropped = X_train_dropped.drop(columns = [drop_variable.index[0]])
        regressor = sm.OLS(y_train, X_train_dropped).fit()
    else:
        print("All p values less than 0.05")
        break
8 variables are dropped because their p-values are higher than our alpha level of 0.05. We fit the model with the remaining variables and see the summary below.
We see a slight improvement over the sklearn linear regression fit in the earlier step, which yielded an r_squared value of 0.87; this gives us an r_squared value of 0.89.
print(regressor.summary())
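For reference, a quick way (an added snippet, not in the original notebook) to list which columns the backward-elimination loop removed:
dropped_columns = set(X_train.columns) - set(X_train_dropped.columns)
print(dropped_columns)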
I would like to explore the dataset a bit further to see whether a model with polynomial features performs better with the same set of models.
I am using PolynomialFeatures() to engineer polynomial features from the dataset. This gives around 820 features, so I am again using SelectKBest to find the optimal feature set size.
poly = PolynomialFeatures()
X_train_transformed_poly = poly.fit_transform(X_train)
X_test_transformed_poly = poly.transform(X_test)
print(X_train_transformed_poly.shape)
no_of_features = []
r_squared = []
for k in range(10, 277, 5):
    selector = SelectKBest(f_regression, k = k)
    X_train_transformed = selector.fit_transform(X_train_transformed_poly, y_train)
    regressor = LinearRegression()
    regressor.fit(X_train_transformed, y_train)
    no_of_features.append(k)
    r_squared.append(regressor.score(X_train_transformed, y_train))
sns.lineplot(x = no_of_features, y = r_squared)
From the graph above, we can see that we hit a score of about 0.93 at around 110 features.
selector = SelectKBest(f_regression, k = 110)
X_train_transformed = selector.fit_transform(X_train_transformed_poly, y_train)
X_test_transformed = selector.transform(X_test_transformed_poly)
models_to_evaluate = [LinearRegression(), Ridge(), Lasso(), SVR(), RandomForestRegressor(), MLPRegressor()]
for model in models_to_evaluate:
    regressor, score = regression_model(model)
    model_performance = model_performance.append({"Features": "Polynomial", "Model": model, "Score": score}, ignore_index=True)
model_performance
I got a maximum r^2 score of 0.955 for the polynomial features with the RandomForest regressor.
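To see which of the selected polynomial features drive that score, here is a small sketch (assuming the poly, selector, and column_names objects above are still in scope, and sklearn >= 1.0 for get_feature_names_out); it refits a RandomForest on the selected features and prints the top importances. The random_state is my own choice for reproducibility.
rf = RandomForestRegressor(random_state=42)
rf.fit(X_train_transformed, y_train.values.ravel())
poly_feature_names = poly.get_feature_names_out(column_names)  # names of all polynomial features
selected_names = poly_feature_names[selector.get_support()]    # the 110 kept by SelectKBest
importances = pd.Series(rf.feature_importances_, index=selected_names)
print(importances.sort_values(ascending=False).head(10))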
As next steps, I can concentrate on individual features and apply transformations, such as log transforms, to make the model perform even better.
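A minimal sketch of the log-transform idea (illustrative only; np.log1p is used here because mileage is non-negative and right-skewed):
data_vw_log = data_vw.copy()
data_vw_log["log_mileage"] = np.log1p(data_vw_log["mileage"])
data_vw_log = data_vw_log.drop(columns=["mileage"])
data_vw_log[["log_mileage"]].hist(bins=30)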
Please upvote the notebook if you liked it, and leave me a feedback if you think something could have been better.