Introduction:¶

This is my first Kernel on Kaggle, I wanted to build foundation with a simple classification model.

I choose this dataset because it is clean and simple, with less number of variables and observations, an ideal dataset for me to work on.

I have structured the notebook into the following tasks:

Importing and exploring the dataset
EDA on the dataset
Defining classification labels
Modelling
Conclusion
References

Importing and exploring the datasets¶

For importing the packages, I just like to put them in alphabetical order of the package, so that it is easy to manage and review if needed

#Importing the necessary packages

import collections

import matplotlib.pyplot as plt

import numpy as np
import pandas as pd

import scipy
import seaborn as sns

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc, plot_roc_curve, accuracy_score, mean_squared_error
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

import warnings
warnings.filterwarnings('ignore')

#Reading the datasets
data = pd.read_csv("/kaggle/input/graduate-admissions/Admission_Predict_Ver1.1.csv")
data.head()

data.describe()

data.isnull().sum()

Serial No.           0
GRE Score            0
TOEFL Score          0
University Rating    0
SOP                  0
LOR                  0
CGPA                 0
Research             0
Chance of Admit      0
dtype: int64

Good that there are no missing values in the dataset. It makes our data pre-processing very much easier.

Now let's check the correlation between the variables.

corr = data.corr()
corr.style.background_gradient(cmap='coolwarm')

#Removing the serial number column as it adds no correlation to any columns
data = data.drop(columns = ["Serial No."])

#The column "Chance of Admit" has a trailing space which is removed
data = data.rename(columns={"Chance of Admit ": "Chance of Admit"})

data.head()

Exploratory data analysis¶

The main EDA that I performed on this dataset is to see how the variables are distributed, to check if the variables are distributed normally. For that the pair plot is used to check the histogram of the variables as well as for the scatter plot to see how the variables are corelated to each other.

plt.hist(data["Chance of Admit"])
plt.xlabel("Chance of Admit")
plt.ylabel("Count")
plt.show()

sns.pairplot(data)

<seaborn.axisgrid.PairGrid at 0x7fc32f15f810>

sns.kdeplot(data["Chance of Admit"], data["GRE Score"], cmap="Blues", shade=True, shade_lowest=False)

<matplotlib.axes._subplots.AxesSubplot at 0x7fc32cee1590>

sns.kdeplot(data["Chance of Admit"], data["University Rating"], cmap="Blues", shade=True, shade_lowest=False)

<matplotlib.axes._subplots.AxesSubplot at 0x7fc32ac8f9d0>

sns.kdeplot(data["GRE Score"], data["University Rating"], cmap="Blues", shade=True, shade_lowest=False)

<matplotlib.axes._subplots.AxesSubplot at 0x7fc32ac14610>

sns.scatterplot(data["GRE Score"], data["University Rating"])

<matplotlib.axes._subplots.AxesSubplot at 0x7fc32ab96450>

Defining the class labels for classification¶

For ease of working with the classifier, it will be nice to have a 50/50 split on the data.

For class balance, let us assume that the bottom 50% of the observations fall in class 0 (no or less chance of admit), and the top 50% of the observations fall in class 1.

Binning the Chance of Admit variable and seeing where the 50% lies

collections.Counter([i-i%0.1+0.1 for i in data["Chance of Admit"]])

Counter({1.0: 61,
         0.8: 132,
         0.9: 94,
         0.7000000000000001: 116,
         0.5: 31,
         0.6: 58,
         0.4: 8})

data['Label'] = np.where(data["Chance of Admit"] <= 0.72, 0, 1)
print(data['Label'].value_counts())
data.sample(10)

0    252
1    248
Name: Label, dtype: int64

We now have 252 observations in class 0 and 248 observations in class 1, which is good enough balance that we are expecting

Checking variable importance¶

Let us now check what variables are important for out labels. For checking variable importance, we will use a basic decision tree classifier and then check what is the variable importance within the classifier

#Checking feature importance with DTree classifier
# define the model
model = DecisionTreeClassifier()

x = data.drop(columns = ['Chance of Admit', 'Label'])
y = data['Label']

# fit the model
model.fit(x, y)

# get importance
importance = model.feature_importances_

# summarize feature importance
for i,v in enumerate(importance):
    print('Feature: %0d, Score: %.5f' % (i,v))

feat_importances = pd.Series(model.feature_importances_, index=x.columns)
feat_importances.nsmallest(7).plot(kind='barh')

Feature: 0, Score: 0.12840
Feature: 1, Score: 0.05670
Feature: 2, Score: 0.01696
Feature: 3, Score: 0.05105
Feature: 4, Score: 0.03277
Feature: 5, Score: 0.66132
Feature: 6, Score: 0.05280

<matplotlib.axes._subplots.AxesSubplot at 0x7fc32aae5510>

Modeling¶

Splitting the dataset into train and test and seeing the size

x_train, x_test, y_train, y_test = x[:400], x[400:], y[:400], y[400:]
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

(400, 7)
(400,)
(100, 7)
(100,)

def plot_roc(false_positive_rate, true_positive_rate, roc_auc):
    plt.title('Receiver Operating Characteristic')
    plt.plot(false_positive_rate,true_positive_rate, color='red',label = 'AUC = %0.2f' % roc_auc)
    plt.legend(loc = 'lower right')
    plt.plot([0, 1], [0, 1],linestyle='--')
    plt.axis('tight')
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    plt.show()

Model 1: Logistic regression¶

parameters = [
    {
        'penalty' : ['l1', 'l2', 'elasticnet'],
        'C' : [0.1, 0.4, 0.5],
        'random_state' : [0]
    }
]

gscv = GridSearchCV(LogisticRegression(),parameters,scoring='accuracy')
gscv.fit(x_train, y_train)

print('Best parameters set:')
print(gscv.best_params_)
print()

print("*"*50)
print("Train classification report: ")
print("*"*50)
print(classification_report(gscv.predict(x_train), y_train))
print(confusion_matrix(gscv.predict(x_train), y_train))

print()
print("*"*50)
print("Test classification report: ")
print("*"*50)
print(classification_report(gscv.predict(x_test), y_test))
print(confusion_matrix(gscv.predict(x_test), y_test))

#Crossvalidation:
cvs = cross_val_score(estimator = LogisticRegression(), 
                      X = x_train, y = y_train, cv = 12)

print()
print("*"*50)
print(cvs.mean())
print(cvs.std())

Best parameters set:
{'C': 0.4, 'penalty': 'l2', 'random_state': 0}

**************************************************
Train classification report: 
**************************************************
              precision    recall  f1-score   support

           0       0.83      0.84      0.84       193
           1       0.85      0.84      0.85       207

    accuracy                           0.84       400
   macro avg       0.84      0.84      0.84       400
weighted avg       0.84      0.84      0.84       400

[[163  30]
 [ 33 174]]

**************************************************
Test classification report: 
**************************************************
              precision    recall  f1-score   support

           0       0.79      0.92      0.85        48
           1       0.91      0.77      0.83        52

    accuracy                           0.84       100
   macro avg       0.85      0.84      0.84       100
weighted avg       0.85      0.84      0.84       100

[[44  4]
 [12 40]]

**************************************************
0.8526440879382055
0.09280462777913863

lr = LogisticRegression(C= 0.1, penalty= 'l2', random_state= 0)
lr.fit(x_train,y_train)

y_pred = lr.predict(x_test)
y_proba=lr.predict_proba(x_test)

false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_proba[:,1])
roc_auc = auc(false_positive_rate, true_positive_rate)

plot_roc(false_positive_rate, true_positive_rate, roc_auc)

print('Accurancy Score :',accuracy_score(y_test, y_pred))

cm=confusion_matrix(y_test,y_pred)
print(cm)

Accurancy Score : 0.89
[[50  6]
 [ 5 39]]

Model 2: Decision tree¶

parameters = [
    {
        'criterion' : ['gini', 'entropy'],
        'max_depth' : [3, 4, 5],
        'min_samples_split' : [10, 20, 5],
        'random_state': [0],
        
    }
]

gscv = GridSearchCV(DecisionTreeClassifier(),parameters,scoring='accuracy')
gscv.fit(x_train, y_train)

print('Best parameters set:')
print(gscv.best_params_)
print()

print("*"*50)
print("Train classification report: ")
print("*"*50)
print(classification_report(gscv.predict(x_train), y_train))
print(confusion_matrix(gscv.predict(x_train), y_train))

print()
print("*"*50)
print("Test classification report: ")
print("*"*50)
print(classification_report(gscv.predict(x_test), y_test))
print(confusion_matrix(gscv.predict(x_test), y_test))

#Crossvalidation:
cvs = cross_val_score(estimator = DecisionTreeClassifier(), 
                      X = x_train, y = y_train, cv = 12)

print()
print("*"*50)
print(cvs.mean())
print(cvs.std())

Best parameters set:
{'criterion': 'gini', 'max_depth': 4, 'min_samples_split': 20, 'random_state': 0}

**************************************************
Train classification report: 
**************************************************
              precision    recall  f1-score   support

           0       0.89      0.89      0.89       196
           1       0.90      0.90      0.90       204

    accuracy                           0.90       400
   macro avg       0.89      0.89      0.89       400
weighted avg       0.90      0.90      0.90       400

[[175  21]
 [ 21 183]]

**************************************************
Test classification report: 
**************************************************
              precision    recall  f1-score   support

           0       0.89      0.93      0.91        54
           1       0.91      0.87      0.89        46

    accuracy                           0.90       100
   macro avg       0.90      0.90      0.90       100
weighted avg       0.90      0.90      0.90       100

[[50  4]
 [ 6 40]]

**************************************************
0.7824569221628045
0.04057699462747589

dt = DecisionTreeClassifier(criterion= 'gini', max_depth= 3, min_samples_split= 10, 
                            random_state= 0)
dt.fit(x_train,y_train)

y_pred = dt.predict(x_test)
y_proba=dt.predict_proba(x_test)

false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_proba[:,1])
roc_auc = auc(false_positive_rate, true_positive_rate)

plot_roc(false_positive_rate, true_positive_rate, roc_auc)

print('Accurancy Score :',accuracy_score(y_test, y_pred))

cm=confusion_matrix(y_test,y_pred)
print(cm)

Accurancy Score : 0.88
[[48  8]
 [ 4 40]]

Model 3: Random forest¶

parameters = [
    {
        'n_estimators': np.arange(10, 40, 5),
        'criterion' : ['gini', 'entropy'],
        'max_depth' : [3, 4, 5],
        'min_samples_split' : [10, 20, 5],
        'random_state': [0],
        
    }
]

gscv = GridSearchCV(RandomForestClassifier(),parameters,scoring='accuracy')
gscv.fit(x_train, y_train)

print('Best parameters set:')
print(gscv.best_params_)
print()

print("*"*50)
print("Train classification report: ")
print("*"*50)
print(classification_report(gscv.predict(x_train), y_train))
print(confusion_matrix(gscv.predict(x_train), y_train))

print()
print("*"*50)
print("Test classification report: ")
print("*"*50)
print(classification_report(gscv.predict(x_test), y_test))
print(confusion_matrix(gscv.predict(x_test), y_test))

#Crossvalidation:
cvs = cross_val_score(estimator = RandomForestClassifier(), 
                      X = x_train, y = y_train, cv = 12)

print()
print("*"*50)
print(cvs.mean())
print(cvs.std())

Best parameters set:
{'criterion': 'entropy', 'max_depth': 5, 'min_samples_split': 10, 'n_estimators': 15, 'random_state': 0}

**************************************************
Train classification report: 
**************************************************
              precision    recall  f1-score   support

           0       0.91      0.91      0.91       197
           1       0.91      0.92      0.91       203

    accuracy                           0.91       400
   macro avg       0.91      0.91      0.91       400
weighted avg       0.91      0.91      0.91       400

[[179  18]
 [ 17 186]]

**************************************************
Test classification report: 
**************************************************
              precision    recall  f1-score   support

           0       0.93      0.95      0.94        55
           1       0.93      0.91      0.92        45

    accuracy                           0.93       100
   macro avg       0.93      0.93      0.93       100
weighted avg       0.93      0.93      0.93       100

[[52  3]
 [ 4 41]]

**************************************************
0.8449197860962565
0.05592883956989427

rf = RandomForestClassifier(criterion= 'gini', max_depth= 5, 
                            min_samples_split= 10, n_estimators= 15, 
                            random_state= 0)
rf.fit(x_train,y_train)

y_pred = rf.predict(x_test)
y_proba=rf.predict_proba(x_test)

false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_proba[:,1])
roc_auc = auc(false_positive_rate, true_positive_rate)

plot_roc(false_positive_rate, true_positive_rate, roc_auc)

print('Accurancy Score :',accuracy_score(y_test, y_pred))

cm=confusion_matrix(y_test,y_pred)
print(cm)

Accurancy Score : 0.91
[[50  6]
 [ 3 41]]

Model 4: Gradient boost classifier¶

parameters = [
    {
        'learning_rate': [0.01, 0.02, 0.002],
        'n_estimators' : np.arange(10, 100, 5),
        'max_depth' : [3, 4, 5],
        'min_samples_split' : [10, 20, 5],
        'random_state': [0],
        
    }
]

gscv = GridSearchCV(GradientBoostingClassifier(),parameters,scoring='accuracy')
gscv.fit(x_train, y_train)

print('Best parameters set:')
print(gscv.best_params_)
print()

print("*"*50)
print("Train classification report: ")
print("*"*50)
print(classification_report(gscv.predict(x_train), y_train))
print(confusion_matrix(gscv.predict(x_train), y_train))

print()
print("*"*50)
print("Test classification report: ")
print("*"*50)
print(classification_report(gscv.predict(x_test), y_test))
print(confusion_matrix(gscv.predict(x_test), y_test))

#Crossvalidation:
cvs = cross_val_score(estimator = GradientBoostingClassifier(), 
                      X = x_train, y = y_train, cv = 12)

print()
print("*"*50)
print(cvs.mean())
print(cvs.std())

Best parameters set:
{'learning_rate': 0.01, 'max_depth': 4, 'min_samples_split': 20, 'n_estimators': 60, 'random_state': 0}

**************************************************
Train classification report: 
**************************************************
              precision    recall  f1-score   support

           0       0.91      0.90      0.91       197
           1       0.91      0.91      0.91       203

    accuracy                           0.91       400
   macro avg       0.91      0.91      0.91       400
weighted avg       0.91      0.91      0.91       400

[[178  19]
 [ 18 185]]

**************************************************
Test classification report: 
**************************************************
              precision    recall  f1-score   support

           0       0.91      0.94      0.93        54
           1       0.93      0.89      0.91        46

    accuracy                           0.92       100
   macro avg       0.92      0.92      0.92       100
weighted avg       0.92      0.92      0.92       100

[[51  3]
 [ 5 41]]

**************************************************
0.8349673202614379
0.05831019979499152

gbm = GradientBoostingClassifier(learning_rate= 0.02, max_depth= 3, 
                                 min_samples_split= 10, n_estimators= 80, 
                                 random_state= 0)
gbm.fit(x_train,y_train)

y_pred = gbm.predict(x_test)
y_proba = gbm.predict_proba(x_test)

false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_proba[:,1])
roc_auc = auc(false_positive_rate, true_positive_rate)

plot_roc(false_positive_rate, true_positive_rate, roc_auc)

print('Accurancy Score :',accuracy_score(y_test, y_pred))

cm=confusion_matrix(y_test,y_pred)
print(cm)

Accurancy Score : 0.92
[[52  4]
 [ 4 40]]

Conclusion:¶

In this kernel I have learnt and demonstrated how a simple two class binary classification is performed with this dataset.

Please upvote the kernel if you like it, and to motivate me!

Hopefully, this is first of many of my kernels on Kaggle!

References:¶

I refered to lot of other kernels and notebooks as well as lot of stack overflow for the coding doubts, here are the prominent ones that I refered to. Thanks to all the contributers!

#for submission using the random forest
y_proba=rf.predict(x_test)
np.sqrt(mean_squared_error(y_proba, y_test))

0.3

	Serial No.	GRE Score	TOEFL Score	University Rating	SOP	LOR	CGPA	Research	Chance of Admit
0	1	337	118	4	4.5	4.5	9.65	1	0.92
1	2	324	107	4	4.0	4.5	8.87	1	0.76
2	3	316	104	3	3.0	3.5	8.00	1	0.72
3	4	322	110	3	3.5	2.5	8.67	1	0.80
4	5	314	103	2	2.0	3.0	8.21	0	0.65

	Serial No.	GRE Score	TOEFL Score	University Rating	SOP	LOR	CGPA	Research	Chance of Admit
count	500.000000	500.000000	500.000000	500.000000	500.000000	500.00000	500.000000	500.000000	500.00000
mean	250.500000	316.472000	107.192000	3.114000	3.374000	3.48400	8.576440	0.560000	0.72174
std	144.481833	11.295148	6.081868	1.143512	0.991004	0.92545	0.604813	0.496884	0.14114
min	1.000000	290.000000	92.000000	1.000000	1.000000	1.00000	6.800000	0.000000	0.34000
25%	125.750000	308.000000	103.000000	2.000000	2.500000	3.00000	8.127500	0.000000	0.63000
50%	250.500000	317.000000	107.000000	3.000000	3.500000	3.50000	8.560000	1.000000	0.72000
75%	375.250000	325.000000	112.000000	4.000000	4.000000	4.00000	9.040000	1.000000	0.82000
max	500.000000	340.000000	120.000000	5.000000	5.000000	5.00000	9.920000	1.000000	0.97000

	Serial No.	GRE Score	TOEFL Score	University Rating	SOP	LOR	CGPA	Research	Chance of Admit
Serial No.	1.000000	-0.103839	-0.141696	-0.067641	-0.137352	-0.003694	-0.074289	-0.005332	0.008505
GRE Score	-0.103839	1.000000	0.827200	0.635376	0.613498	0.524679	0.825878	0.563398	0.810351
TOEFL Score	-0.141696	0.827200	1.000000	0.649799	0.644410	0.541563	0.810574	0.467012	0.792228
University Rating	-0.067641	0.635376	0.649799	1.000000	0.728024	0.608651	0.705254	0.427047	0.690132
SOP	-0.137352	0.613498	0.644410	0.728024	1.000000	0.663707	0.712154	0.408116	0.684137
LOR	-0.003694	0.524679	0.541563	0.608651	0.663707	1.000000	0.637469	0.372526	0.645365
CGPA	-0.074289	0.825878	0.810574	0.705254	0.712154	0.637469	1.000000	0.501311	0.882413
Research	-0.005332	0.563398	0.467012	0.427047	0.408116	0.372526	0.501311	1.000000	0.545871
Chance of Admit	0.008505	0.810351	0.792228	0.690132	0.684137	0.645365	0.882413	0.545871	1.000000

	GRE Score	TOEFL Score	University Rating	SOP	LOR	CGPA	Research	Chance of Admit
0	337	118	4	4.5	4.5	9.65	1	0.92
1	324	107	4	4.0	4.5	8.87	1	0.76
2	316	104	3	3.0	3.5	8.00	1	0.72
3	322	110	3	3.5	2.5	8.67	1	0.80
4	314	103	2	2.0	3.0	8.21	0	0.65

	GRE Score	TOEFL Score	University Rating	SOP	LOR	CGPA	Research	Chance of Admit	Label
269	308	108	4	4.5	5.0	8.34	0	0.77	1
106	329	111	4	4.5	4.5	9.18	1	0.87	1
160	315	103	1	1.5	2.0	7.86	0	0.57	0
257	324	100	3	4.0	5.0	8.64	1	0.78	1
370	310	103	2	2.5	2.5	8.24	0	0.72	0
147	326	114	3	3.0	3.0	9.11	1	0.83	1
446	327	118	4	5.0	5.0	9.67	1	0.93	1
15	314	105	3	3.5	2.5	8.30	0	0.54	0
27	298	98	2	1.5	2.5	7.50	1	0.44	0
466	314	99	4	3.5	4.5	8.73	1	0.71	0