Introduction:

After working on the college admissions challenge as a prediction problem, modelling the propensity of a student getting admitted into a particular college, I wanted to take a different approach and see which rank of college a student has the best possibility of getting admitted into.

I have already performed EDA on this dataset in the first notebook that I published. Please take a look at that notebook and upvote it if you like it!

In this notebook, I look at the data from a recommendation perspective to see which rank of college a student has the best shot of getting admitted to, given their other features.

The sections in this notebook are:

  1. Introduction
  2. The approach
  3. Importing and exploring the datasets
  4. Model building
  5. Getting recommendation from the model
  6. Conclusion

The approach

For working this from a recommendation perspective, we train 5 regression models, one for each university rank from 1 to 5. The first regression model computes a score, the probability of the student getting admitted into a university with rank 1; the second computes the same score for a university with rank 2; and so on.

Finally, all these scores are compared against each other, and together they recommend the university rank the student has the best possibility of getting admitted to.

For this purpose we use the other features given below:

  1. GRE Score
  2. TOEFL Score
  3. SOP
  4. LOR
  5. CGPA
  6. Research

And we will use the variable Chance of Admit as the y variable.

We will use the variable University Rating to split out the data corresponding to each rating and then train 5 independent models, one for each of the 5 university rankings.

For prediction, we will run a single student record through all 5 models and compare the resulting scores to recommend the best university ranking for that student.
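To make the comparison step concrete, here is a toy illustration with made-up scores (hypothetical numbers, not real model output; the full implementation follows below):

#Toy illustration of the final comparison step (made-up scores)
#Each key is a university rank, each value a predicted chance of admission
scores = {1: 0.62, 2: 0.71, 3: 0.78, 4: 0.74, 5: 0.66}

best_rank = max(scores, key = scores.get)
print(best_rank)  #3: the rank this student has the best shot at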

Importing and exploring the datasets

When importing packages, I like to keep them in alphabetical order, so that they are easy to manage and review if needed.

In [1]:
import matplotlib.pyplot as plt

import numpy as np
import pandas as pd

import scipy
import seaborn as sns

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

Since there were two files, I did not want to leave out any records, so I imported both datasets, concatenated them into a single dataframe and then dropped duplicates.

In [2]:
#Reading the datasets
data_v1 = pd.read_csv("/kaggle/input/graduate-admissions/Admission_Predict_Ver1.1.csv")
data_v0 = pd.read_csv("/kaggle/input/graduate-admissions/Admission_Predict.csv")
data = pd.concat([data_v1, data_v0])

print(data.shape)

data.head()
(900, 9)
Out[2]:
Serial No. GRE Score TOEFL Score University Rating SOP LOR CGPA Research Chance of Admit
0 1 337 118 4 4.5 4.5 9.65 1 0.92
1 2 324 107 4 4.0 4.5 8.87 1 0.76
2 3 316 104 3 3.0 3.5 8.00 1 0.72
3 4 322 110 3 3.5 2.5 8.67 1 0.80
4 5 314 103 2 2.0 3.0 8.21 0 0.65
In [3]:
data = data.drop_duplicates()
data.shape
Out[3]:
(500, 9)

It turns out there are no extra records between the two datasets: after dropping duplicates we are back to 500 rows, the size of Ver1.1, so whatever records were in the first dataset were also present in Ver1.1.
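If you want to verify that directly, a quick sanity check (a sketch, assuming both files share identical column names and serial numbers) is to left-join the first dataset against Ver1.1 and count the rows that fail to match:

#Rows of data_v0 that have no exact match in data_v1; 0 is expected
merged = data_v0.merge(data_v1, how = "left", indicator = True)
print((merged["_merge"] == "left_only").sum())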

In [4]:
#Removing the serial number column as it is just a row identifier with no predictive value
data = data.drop(columns = ["Serial No."])

#The column "Chance of Admit" has a trailing space which is removed
data = data.rename(columns={"Chance of Admit ": "Chance of Admit"})

data.head()
Out[4]:
GRE Score TOEFL Score University Rating SOP LOR CGPA Research Chance of Admit
0 337 118 4 4.5 4.5 9.65 1 0.92
1 324 107 4 4.0 4.5 8.87 1 0.76
2 316 104 3 3.0 3.5 8.00 1 0.72
3 322 110 3 3.5 2.5 8.67 1 0.80
4 314 103 2 2.0 3.0 8.21 0 0.65

Model building

In [5]:
def get_training_data(df):
    """
    This function splits the data into X and y variables and returns them
    """
    X = df.drop(columns = ["University Rating", "Chance of Admit"])
    y = df["Chance of Admit"]
    
    return X, y
In [6]:
def train_model(university_rating):
    """
    1. Takes the subset only for one university rating
    2. Invokes the get_training_data function,
    3. Fits a linear regression model
    4. Cross validates it
    5. Returns the model object and the metrics for cross validation
    """
    #Filtering the data dataframe for one university rating
    df = data[data["University Rating"] == university_rating]
    print(df.shape)
    
    #Splitting into X and y for regression
    X, y = get_training_data(df)
    
    regressor = LinearRegression()
    regressor.fit(X, y)
    
    #5-fold cross-validation; cross_val_score's default scoring for a regressor is R^2
    metric = cross_val_score(regressor, X, y, cv = 5)
    
    return regressor, metric

Now let us train the 5 models and save them to a dictionary.

In [7]:
university_ratings = data["University Rating"].unique()

university_recommendations = {}

for u in university_ratings:
    regressor, metric = train_model(u)
    university_recommendations["University ranking " + str(u)] = {'model': regressor, 'metric': metric}
(105, 8)
(162, 8)
(126, 8)
(73, 8)
(34, 8)
In [8]:
university_recommendations
Out[8]:
{'University ranking 4': {'model': LinearRegression(),
  'metric': array([0.70503399, 0.88116779, 0.43667585, 0.94462783, 0.90363293])},
 'University ranking 3': {'model': LinearRegression(),
  'metric': array([0.04196118, 0.22399362, 0.14130896, 0.58430267, 0.78058286])},
 'University ranking 2': {'model': LinearRegression(),
  'metric': array([-0.64325416,  0.32028836,  0.48696248,  0.34106336,  0.65715079])},
 'University ranking 5': {'model': LinearRegression(),
  'metric': array([0.89144242, 0.85894631, 0.87964782, 0.78529031, 0.85653355])},
 'University ranking 1': {'model': LinearRegression(),
  'metric': array([ 0.91791431, -0.03123308,  0.5077833 ,  0.58622725,  0.77048993])}}

The cross-validation summary does not look good: several folds score low or even negative R^2, especially for rankings 1, 2 and 3.
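To put numbers on that, we can summarise the fold scores of each model (a quick sketch over the dictionary built above):

#Mean and spread of the 5 fold scores per university ranking
for uni, entry in university_recommendations.items():
    scores = entry["metric"]
    print(uni, "| mean R^2:", round(scores.mean(), 3), "| std:", round(scores.std(), 3))

The means for rankings 1 to 3 sit well below the roughly 0.85 of ranking 5, and the spread across folds is large, which is expected given how few records each model is trained on.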

Getting recommendation from the model

We will prepare a test set of 20 random observations to see how well the models perform!

In [9]:
test = data.sample(20)
test = test.drop(columns = ["Chance of Admit", "University Rating"])
test.head()
Out[9]:
GRE Score TOEFL Score SOP LOR CGPA Research
499 327 113 4.5 4.5 9.04 0
78 296 95 3.0 2.0 7.54 1
254 321 114 4.0 5.0 9.12 0
472 327 116 4.0 4.5 9.48 1
266 312 105 2.0 2.5 8.45 0
In [10]:
predictions = {}

for uni in university_recommendations.keys():
    model = university_recommendations[uni]["model"]
    
    predictions[uni] = model.predict(test)
    
pred = pd.DataFrame(predictions)
pred.head(10)
Out[10]:
University ranking 4 University ranking 3 University ranking 2 University ranking 5 University ranking 1
0 0.818540 0.797745 0.805333 0.805250 0.844485
1 0.505933 0.524233 0.475614 0.620970 0.473850
2 0.813073 0.808833 0.827293 0.795664 0.876459
3 0.897444 0.886831 0.895612 0.906306 0.902600
4 0.605242 0.665792 0.672100 0.638099 0.663432
5 0.676663 0.657201 0.623715 0.746365 0.646300
6 0.700796 0.709100 0.707895 0.716795 0.712413
7 0.808943 0.770826 0.760663 0.848153 0.753670
8 0.651540 0.648149 0.624311 0.722527 0.659528
9 0.843120 0.837785 0.826909 0.865957 0.853813

As you can see, each column of the above dataframe holds a student's predicted propensity of getting admitted into a university of that rank!
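To turn these scores into the actual recommendation described in the approach section, we pick, for each student, the ranking whose model produced the highest score. A minimal sketch using the pred dataframe above:

#For each row (student), take the column (university ranking) with the highest score
recommended = pred.idxmax(axis = 1)
print(recommended.head())

For the first row above, for example, this returns "University ranking 1", since 0.844 is the largest of its five scores.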

Conclusion:

The limitation of this technique on this dataset is not having enough data. There were only 500 records, and they were not evenly distributed across the 5 university ratings: the subset sizes printed above range from 34 records to 162!

But this technique can generally be expected to work with a larger amount of data. Because of the limited data, I did not split it into train and test sets. With a proper train/test split, a ranking metric such as MAP or MAR (Mean Average Precision or Mean Average Recall) could have been incorporated.
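As an illustration, a held-out evaluation for a single ranking could look like the sketch below (this reuses get_training_data from above; train_test_split and the 80/20 split are my own choices here, not part of the notebook's pipeline):

from sklearn.model_selection import train_test_split

#Example for ranking 3; the same pattern applies to the other rankings
X, y = get_training_data(data[data["University Rating"] == 3])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

model = LinearRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))  #R^2 on the held-out 20%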

Please let me know about your feedback on this notebook in the comments section below! And upvote this notebook if you found it interesting!