Introduction:

This is my first Kernel on Kaggle, I wanted to build foundation with a simple classification model.

I choose this dataset because it is clean and simple, with less number of variables and observations, an ideal dataset for me to work on.

I have structured the notebook into the following tasks:

  1. Importing and exploring the dataset
  2. EDA on the dataset
  3. Defining classification labels
  4. Modelling
  5. Conclusion
  6. References

Importing and exploring the datasets

For importing the packages, I just like to put them in alphabetical order of the package, so that it is easy to manage and review if needed

In [1]:
#Importing the necessary packages

import collections

import matplotlib.pyplot as plt

import numpy as np
import pandas as pd

import scipy
import seaborn as sns

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc, plot_roc_curve, accuracy_score, mean_squared_error
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

import warnings
warnings.filterwarnings('ignore')
In [2]:
#Reading the datasets
data = pd.read_csv("/kaggle/input/graduate-admissions/Admission_Predict_Ver1.1.csv")
data.head()
Out[2]:
Serial No. GRE Score TOEFL Score University Rating SOP LOR CGPA Research Chance of Admit
0 1 337 118 4 4.5 4.5 9.65 1 0.92
1 2 324 107 4 4.0 4.5 8.87 1 0.76
2 3 316 104 3 3.0 3.5 8.00 1 0.72
3 4 322 110 3 3.5 2.5 8.67 1 0.80
4 5 314 103 2 2.0 3.0 8.21 0 0.65
In [3]:
data.describe()
Out[3]:
Serial No. GRE Score TOEFL Score University Rating SOP LOR CGPA Research Chance of Admit
count 500.000000 500.000000 500.000000 500.000000 500.000000 500.00000 500.000000 500.000000 500.00000
mean 250.500000 316.472000 107.192000 3.114000 3.374000 3.48400 8.576440 0.560000 0.72174
std 144.481833 11.295148 6.081868 1.143512 0.991004 0.92545 0.604813 0.496884 0.14114
min 1.000000 290.000000 92.000000 1.000000 1.000000 1.00000 6.800000 0.000000 0.34000
25% 125.750000 308.000000 103.000000 2.000000 2.500000 3.00000 8.127500 0.000000 0.63000
50% 250.500000 317.000000 107.000000 3.000000 3.500000 3.50000 8.560000 1.000000 0.72000
75% 375.250000 325.000000 112.000000 4.000000 4.000000 4.00000 9.040000 1.000000 0.82000
max 500.000000 340.000000 120.000000 5.000000 5.000000 5.00000 9.920000 1.000000 0.97000
In [4]:
data.isnull().sum()
Out[4]:
Serial No.           0
GRE Score            0
TOEFL Score          0
University Rating    0
SOP                  0
LOR                  0
CGPA                 0
Research             0
Chance of Admit      0
dtype: int64

Good that there are no missing values in the dataset. It makes our data pre-processing very much easier.

Now let's check the correlation between the variables.

In [5]:
corr = data.corr()
corr.style.background_gradient(cmap='coolwarm')
Out[5]:
Serial No. GRE Score TOEFL Score University Rating SOP LOR CGPA Research Chance of Admit
Serial No. 1.000000 -0.103839 -0.141696 -0.067641 -0.137352 -0.003694 -0.074289 -0.005332 0.008505
GRE Score -0.103839 1.000000 0.827200 0.635376 0.613498 0.524679 0.825878 0.563398 0.810351
TOEFL Score -0.141696 0.827200 1.000000 0.649799 0.644410 0.541563 0.810574 0.467012 0.792228
University Rating -0.067641 0.635376 0.649799 1.000000 0.728024 0.608651 0.705254 0.427047 0.690132
SOP -0.137352 0.613498 0.644410 0.728024 1.000000 0.663707 0.712154 0.408116 0.684137
LOR -0.003694 0.524679 0.541563 0.608651 0.663707 1.000000 0.637469 0.372526 0.645365
CGPA -0.074289 0.825878 0.810574 0.705254 0.712154 0.637469 1.000000 0.501311 0.882413
Research -0.005332 0.563398 0.467012 0.427047 0.408116 0.372526 0.501311 1.000000 0.545871
Chance of Admit 0.008505 0.810351 0.792228 0.690132 0.684137 0.645365 0.882413 0.545871 1.000000
In [6]:
#Removing the serial number column as it adds no correlation to any columns
data = data.drop(columns = ["Serial No."])

#The column "Chance of Admit" has a trailing space which is removed
data = data.rename(columns={"Chance of Admit ": "Chance of Admit"})

data.head()
Out[6]:
GRE Score TOEFL Score University Rating SOP LOR CGPA Research Chance of Admit
0 337 118 4 4.5 4.5 9.65 1 0.92
1 324 107 4 4.0 4.5 8.87 1 0.76
2 316 104 3 3.0 3.5 8.00 1 0.72
3 322 110 3 3.5 2.5 8.67 1 0.80
4 314 103 2 2.0 3.0 8.21 0 0.65

Exploratory data analysis

The main EDA that I performed on this dataset is to see how the variables are distributed, to check if the variables are distributed normally. For that the pair plot is used to check the histogram of the variables as well as for the scatter plot to see how the variables are corelated to each other.

In [7]:
plt.hist(data["Chance of Admit"])
plt.xlabel("Chance of Admit")
plt.ylabel("Count")
plt.show()
In [8]:
sns.pairplot(data)
Out[8]:
<seaborn.axisgrid.PairGrid at 0x7fc32f15f810>
In [9]:
sns.kdeplot(data["Chance of Admit"], data["GRE Score"], cmap="Blues", shade=True, shade_lowest=False)
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc32cee1590>
In [10]:
sns.kdeplot(data["Chance of Admit"], data["University Rating"], cmap="Blues", shade=True, shade_lowest=False)
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc32ac8f9d0>
In [11]:
sns.kdeplot(data["GRE Score"], data["University Rating"], cmap="Blues", shade=True, shade_lowest=False)
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc32ac14610>
In [12]:
sns.scatterplot(data["GRE Score"], data["University Rating"])
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc32ab96450>

Defining the class labels for classification

For ease of working with the classifier, it will be nice to have a 50/50 split on the data.

For class balance, let us assume that the bottom 50% of the observations fall in class 0 (no or less chance of admit), and the top 50% of the observations fall in class 1.

Binning the Chance of Admit variable and seeing where the 50% lies

In [13]:
collections.Counter([i-i%0.1+0.1 for i in data["Chance of Admit"]])
Out[13]:
Counter({1.0: 61,
         0.8: 132,
         0.9: 94,
         0.7000000000000001: 116,
         0.5: 31,
         0.6: 58,
         0.4: 8})
In [14]:
data['Label'] = np.where(data["Chance of Admit"] <= 0.72, 0, 1)
print(data['Label'].value_counts())
data.sample(10)
0    252
1    248
Name: Label, dtype: int64
Out[14]:
GRE Score TOEFL Score University Rating SOP LOR CGPA Research Chance of Admit Label
269 308 108 4 4.5 5.0 8.34 0 0.77 1
106 329 111 4 4.5 4.5 9.18 1 0.87 1
160 315 103 1 1.5 2.0 7.86 0 0.57 0
257 324 100 3 4.0 5.0 8.64 1 0.78 1
370 310 103 2 2.5 2.5 8.24 0 0.72 0
147 326 114 3 3.0 3.0 9.11 1 0.83 1
446 327 118 4 5.0 5.0 9.67 1 0.93 1
15 314 105 3 3.5 2.5 8.30 0 0.54 0
27 298 98 2 1.5 2.5 7.50 1 0.44 0
466 314 99 4 3.5 4.5 8.73 1 0.71 0

We now have 252 observations in class 0 and 248 observations in class 1, which is good enough balance that we are expecting

Checking variable importance

Let us now check what variables are important for out labels. For checking variable importance, we will use a basic decision tree classifier and then check what is the variable importance within the classifier

In [15]:
#Checking feature importance with DTree classifier
# define the model
model = DecisionTreeClassifier()

x = data.drop(columns = ['Chance of Admit', 'Label'])
y = data['Label']

# fit the model
model.fit(x, y)

# get importance
importance = model.feature_importances_

# summarize feature importance
for i,v in enumerate(importance):
    print('Feature: %0d, Score: %.5f' % (i,v))

feat_importances = pd.Series(model.feature_importances_, index=x.columns)
feat_importances.nsmallest(7).plot(kind='barh')
Feature: 0, Score: 0.12840
Feature: 1, Score: 0.05670
Feature: 2, Score: 0.01696
Feature: 3, Score: 0.05105
Feature: 4, Score: 0.03277
Feature: 5, Score: 0.66132
Feature: 6, Score: 0.05280
Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc32aae5510>

Modeling

Splitting the dataset into train and test and seeing the size

In [16]:
x_train, x_test, y_train, y_test = x[:400], x[400:], y[:400], y[400:]
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)
(400, 7)
(400,)
(100, 7)
(100,)
In [17]:
def plot_roc(false_positive_rate, true_positive_rate, roc_auc):
    plt.title('Receiver Operating Characteristic')
    plt.plot(false_positive_rate,true_positive_rate, color='red',label = 'AUC = %0.2f' % roc_auc)
    plt.legend(loc = 'lower right')
    plt.plot([0, 1], [0, 1],linestyle='--')
    plt.axis('tight')
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    plt.show()

Model 1: Logistic regression

In [18]:
parameters = [
    {
        'penalty' : ['l1', 'l2', 'elasticnet'],
        'C' : [0.1, 0.4, 0.5],
        'random_state' : [0]
    }
]

gscv = GridSearchCV(LogisticRegression(),parameters,scoring='accuracy')
gscv.fit(x_train, y_train)

print('Best parameters set:')
print(gscv.best_params_)
print()

print("*"*50)
print("Train classification report: ")
print("*"*50)
print(classification_report(gscv.predict(x_train), y_train))
print(confusion_matrix(gscv.predict(x_train), y_train))

print()
print("*"*50)
print("Test classification report: ")
print("*"*50)
print(classification_report(gscv.predict(x_test), y_test))
print(confusion_matrix(gscv.predict(x_test), y_test))

#Crossvalidation:
cvs = cross_val_score(estimator = LogisticRegression(), 
                      X = x_train, y = y_train, cv = 12)

print()
print("*"*50)
print(cvs.mean())
print(cvs.std())
Best parameters set:
{'C': 0.4, 'penalty': 'l2', 'random_state': 0}

**************************************************
Train classification report: 
**************************************************
              precision    recall  f1-score   support

           0       0.83      0.84      0.84       193
           1       0.85      0.84      0.85       207

    accuracy                           0.84       400
   macro avg       0.84      0.84      0.84       400
weighted avg       0.84      0.84      0.84       400

[[163  30]
 [ 33 174]]

**************************************************
Test classification report: 
**************************************************
              precision    recall  f1-score   support

           0       0.79      0.92      0.85        48
           1       0.91      0.77      0.83        52

    accuracy                           0.84       100
   macro avg       0.85      0.84      0.84       100
weighted avg       0.85      0.84      0.84       100

[[44  4]
 [12 40]]

**************************************************
0.8526440879382055
0.09280462777913863
In [19]:
lr = LogisticRegression(C= 0.1, penalty= 'l2', random_state= 0)
lr.fit(x_train,y_train)

y_pred = lr.predict(x_test)
y_proba=lr.predict_proba(x_test)

false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_proba[:,1])
roc_auc = auc(false_positive_rate, true_positive_rate)

plot_roc(false_positive_rate, true_positive_rate, roc_auc)

print('Accurancy Score :',accuracy_score(y_test, y_pred))

cm=confusion_matrix(y_test,y_pred)
print(cm)
Accurancy Score : 0.89
[[50  6]
 [ 5 39]]

Model 2: Decision tree

In [20]:
parameters = [
    {
        'criterion' : ['gini', 'entropy'],
        'max_depth' : [3, 4, 5],
        'min_samples_split' : [10, 20, 5],
        'random_state': [0],
        
    }
]

gscv = GridSearchCV(DecisionTreeClassifier(),parameters,scoring='accuracy')
gscv.fit(x_train, y_train)

print('Best parameters set:')
print(gscv.best_params_)
print()

print("*"*50)
print("Train classification report: ")
print("*"*50)
print(classification_report(gscv.predict(x_train), y_train))
print(confusion_matrix(gscv.predict(x_train), y_train))

print()
print("*"*50)
print("Test classification report: ")
print("*"*50)
print(classification_report(gscv.predict(x_test), y_test))
print(confusion_matrix(gscv.predict(x_test), y_test))

#Crossvalidation:
cvs = cross_val_score(estimator = DecisionTreeClassifier(), 
                      X = x_train, y = y_train, cv = 12)

print()
print("*"*50)
print(cvs.mean())
print(cvs.std())
Best parameters set:
{'criterion': 'gini', 'max_depth': 4, 'min_samples_split': 20, 'random_state': 0}

**************************************************
Train classification report: 
**************************************************
              precision    recall  f1-score   support

           0       0.89      0.89      0.89       196
           1       0.90      0.90      0.90       204

    accuracy                           0.90       400
   macro avg       0.89      0.89      0.89       400
weighted avg       0.90      0.90      0.90       400

[[175  21]
 [ 21 183]]

**************************************************
Test classification report: 
**************************************************
              precision    recall  f1-score   support

           0       0.89      0.93      0.91        54
           1       0.91      0.87      0.89        46

    accuracy                           0.90       100
   macro avg       0.90      0.90      0.90       100
weighted avg       0.90      0.90      0.90       100

[[50  4]
 [ 6 40]]

**************************************************
0.7824569221628045
0.04057699462747589
In [21]:
dt = DecisionTreeClassifier(criterion= 'gini', max_depth= 3, min_samples_split= 10, 
                            random_state= 0)
dt.fit(x_train,y_train)

y_pred = dt.predict(x_test)
y_proba=dt.predict_proba(x_test)

false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_proba[:,1])
roc_auc = auc(false_positive_rate, true_positive_rate)

plot_roc(false_positive_rate, true_positive_rate, roc_auc)

print('Accurancy Score :',accuracy_score(y_test, y_pred))

cm=confusion_matrix(y_test,y_pred)
print(cm)
Accurancy Score : 0.88
[[48  8]
 [ 4 40]]

Model 3: Random forest

In [22]:
parameters = [
    {
        'n_estimators': np.arange(10, 40, 5),
        'criterion' : ['gini', 'entropy'],
        'max_depth' : [3, 4, 5],
        'min_samples_split' : [10, 20, 5],
        'random_state': [0],
        
    }
]

gscv = GridSearchCV(RandomForestClassifier(),parameters,scoring='accuracy')
gscv.fit(x_train, y_train)

print('Best parameters set:')
print(gscv.best_params_)
print()

print("*"*50)
print("Train classification report: ")
print("*"*50)
print(classification_report(gscv.predict(x_train), y_train))
print(confusion_matrix(gscv.predict(x_train), y_train))

print()
print("*"*50)
print("Test classification report: ")
print("*"*50)
print(classification_report(gscv.predict(x_test), y_test))
print(confusion_matrix(gscv.predict(x_test), y_test))

#Crossvalidation:
cvs = cross_val_score(estimator = RandomForestClassifier(), 
                      X = x_train, y = y_train, cv = 12)

print()
print("*"*50)
print(cvs.mean())
print(cvs.std())
Best parameters set:
{'criterion': 'entropy', 'max_depth': 5, 'min_samples_split': 10, 'n_estimators': 15, 'random_state': 0}

**************************************************
Train classification report: 
**************************************************
              precision    recall  f1-score   support

           0       0.91      0.91      0.91       197
           1       0.91      0.92      0.91       203

    accuracy                           0.91       400
   macro avg       0.91      0.91      0.91       400
weighted avg       0.91      0.91      0.91       400

[[179  18]
 [ 17 186]]

**************************************************
Test classification report: 
**************************************************
              precision    recall  f1-score   support

           0       0.93      0.95      0.94        55
           1       0.93      0.91      0.92        45

    accuracy                           0.93       100
   macro avg       0.93      0.93      0.93       100
weighted avg       0.93      0.93      0.93       100

[[52  3]
 [ 4 41]]

**************************************************
0.8449197860962565
0.05592883956989427
In [23]:
rf = RandomForestClassifier(criterion= 'gini', max_depth= 5, 
                            min_samples_split= 10, n_estimators= 15, 
                            random_state= 0)
rf.fit(x_train,y_train)

y_pred = rf.predict(x_test)
y_proba=rf.predict_proba(x_test)

false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_proba[:,1])
roc_auc = auc(false_positive_rate, true_positive_rate)

plot_roc(false_positive_rate, true_positive_rate, roc_auc)

print('Accurancy Score :',accuracy_score(y_test, y_pred))

cm=confusion_matrix(y_test,y_pred)
print(cm)
Accurancy Score : 0.91
[[50  6]
 [ 3 41]]

Model 4: Gradient boost classifier

In [24]:
parameters = [
    {
        'learning_rate': [0.01, 0.02, 0.002],
        'n_estimators' : np.arange(10, 100, 5),
        'max_depth' : [3, 4, 5],
        'min_samples_split' : [10, 20, 5],
        'random_state': [0],
        
    }
]

gscv = GridSearchCV(GradientBoostingClassifier(),parameters,scoring='accuracy')
gscv.fit(x_train, y_train)

print('Best parameters set:')
print(gscv.best_params_)
print()

print("*"*50)
print("Train classification report: ")
print("*"*50)
print(classification_report(gscv.predict(x_train), y_train))
print(confusion_matrix(gscv.predict(x_train), y_train))

print()
print("*"*50)
print("Test classification report: ")
print("*"*50)
print(classification_report(gscv.predict(x_test), y_test))
print(confusion_matrix(gscv.predict(x_test), y_test))

#Crossvalidation:
cvs = cross_val_score(estimator = GradientBoostingClassifier(), 
                      X = x_train, y = y_train, cv = 12)

print()
print("*"*50)
print(cvs.mean())
print(cvs.std())
Best parameters set:
{'learning_rate': 0.01, 'max_depth': 4, 'min_samples_split': 20, 'n_estimators': 60, 'random_state': 0}

**************************************************
Train classification report: 
**************************************************
              precision    recall  f1-score   support

           0       0.91      0.90      0.91       197
           1       0.91      0.91      0.91       203

    accuracy                           0.91       400
   macro avg       0.91      0.91      0.91       400
weighted avg       0.91      0.91      0.91       400

[[178  19]
 [ 18 185]]

**************************************************
Test classification report: 
**************************************************
              precision    recall  f1-score   support

           0       0.91      0.94      0.93        54
           1       0.93      0.89      0.91        46

    accuracy                           0.92       100
   macro avg       0.92      0.92      0.92       100
weighted avg       0.92      0.92      0.92       100

[[51  3]
 [ 5 41]]

**************************************************
0.8349673202614379
0.05831019979499152
In [25]:
gbm = GradientBoostingClassifier(learning_rate= 0.02, max_depth= 3, 
                                 min_samples_split= 10, n_estimators= 80, 
                                 random_state= 0)
gbm.fit(x_train,y_train)

y_pred = gbm.predict(x_test)
y_proba = gbm.predict_proba(x_test)

false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_proba[:,1])
roc_auc = auc(false_positive_rate, true_positive_rate)

plot_roc(false_positive_rate, true_positive_rate, roc_auc)

print('Accurancy Score :',accuracy_score(y_test, y_pred))

cm=confusion_matrix(y_test,y_pred)
print(cm)
Accurancy Score : 0.92
[[52  4]
 [ 4 40]]

Conclusion:

In this kernel I have learnt and demonstrated how a simple two class binary classification is performed with this dataset.

Please upvote the kernel if you like it, and to motivate me!

Hopefully, this is first of many of my kernels on Kaggle!

References:

I refered to lot of other kernels and notebooks as well as lot of stack overflow for the coding doubts, here are the prominent ones that I refered to. Thanks to all the contributers!

  1. https://stackoverflow.com/questions/15697350/binning-frequency-distribution-in-python
  2. https://machinelearningmastery.com/calculate-feature-importance-with-python/
  3. https://www.kaggle.com/kralmachine/analyzing-the-graduate-admission-eda-ml
In [26]:
#for submission using the random forest
y_proba=rf.predict(x_test)
np.sqrt(mean_squared_error(y_proba, y_test))
Out[26]:
0.3