Introduction:

After working on the college admissions challenge as a prediction problem, modelling the propensity of a student getting admitted into a particular college, I wanted to take a different approach and see which rank of college a student has the best possibility of getting admitted into.

I have already performed EDA on this dataset in the first notebook that I published. Please take a look at that notebook and upvote it if you like it!

In this notebook, I look at the data from a recommendation perspective to see which rank of college a student has the best shot of getting admitted to, given their other features.

The sections in this notebook are:

  1. Introduction
  2. The approach
  3. Importing and exploring the datasets
  4. Model building
  5. Getting recommendation from the model
  6. Conclusion

The approach

For working this from a recommendation perspective, we train 5 regression models, one for each university rank from 1 to 5. The first regression model computes a score, the probability of the student getting admitted into a university with rank 1; the second computes the same score for a university with rank 2; and so on.

Finally, all these scores are compared against each other, and together they recommend the university rank the student has the best possibility of getting admitted to.

For this purpose we use the other features given below:

  1. GRE Score
  2. TOEFL Score
  3. SOP
  4. LOR
  5. CGPA
  6. Research

And we will use the variable Chance of Admit as the y variable.

We will use the variable University Rating to split out the data corresponding to each rating and then train 5 independent models, one for each of the 5 university rankings.

For prediction, we will run a single student record through all 5 models and compare the resulting scores to recommend the best university ranking for that student.
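To make the comparison step concrete, here is a toy illustration with made-up scores (hypothetical numbers, not real model output; the full implementation follows below):

#Toy illustration of the final comparison step (made-up scores)
#Each key is a university rank, each value a predicted chance of admission
scores = {1: 0.62, 2: 0.71, 3: 0.78, 4: 0.74, 5: 0.66}

best_rank = max(scores, key = scores.get)
print(best_rank)  #3: the rank this student has the best shot at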

Importing and exploring the datasets

When importing packages, I like to keep them in alphabetical order, so that they are easy to manage and review if needed.

In [1]:
import matplotlib.pyplot as plt

import numpy as np
import pandas as pd

import scipy
import seaborn as sns

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

Since there were two files, I did not want to leave out any records, so I imported both datasets, concatenated them into a single dataframe and then dropped duplicates.

In [2]:
#Reading the datasets
data_v1 = pd.read_csv("/kaggle/input/graduate-admissions/Admission_Predict_Ver1.1.csv")
data_v0 = pd.read_csv("/kaggle/input/graduate-admissions/Admission_Predict.csv")
data = pd.concat([data_v1, data_v0])

print(data.shape)

data.head()
(900, 9)
Out[2]:
Serial No. GRE Score TOEFL Score University Rating SOP LOR CGPA Research Chance of Admit
0 1 337 118 4 4.5 4.5 9.65 1 0.92
1 2 324 107 4 4.0 4.5 8.87 1 0.76
2 3 316 104 3 3.0 3.5 8.00 1 0.72
3 4 322 110 3 3.5 2.5 8.67 1 0.80
4 5 314 103 2 2.0 3.0 8.21 0 0.65
In [3]:
data = data.drop_duplicates()
data.shape
Out[3]:
(500, 9)

It turns out there are no extra records between the two datasets: after dropping duplicates we are back to 500 rows, the size of Ver1.1, so whatever records were in the first dataset were also present in Ver1.1.
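If you want to verify that directly, a quick sanity check (a sketch, assuming both files share identical column names and serial numbers) is to left-join the first dataset against Ver1.1 and count the rows that fail to match:

#Rows of data_v0 that have no exact match in data_v1; 0 is expected
merged = data_v0.merge(data_v1, how = "left", indicator = True)
print((merged["_merge"] == "left_only").sum())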

In [4]:
#Removing the serial number column as it is just a row identifier with no predictive value
data = data.drop(columns = ["Serial No."])

#The column "Chance of Admit" has a trailing space which is removed
data = data.rename(columns={"Chance of Admit ": "Chance of Admit"})

data.head()
Out[4]:
GRE Score TOEFL Score University Rating SOP LOR CGPA Research Chance of Admit
0 337 118 4 4.5 4.5 9.65 1 0.92
1 324 107 4 4.0 4.5 8.87 1 0.76
2 316 104 3 3.0 3.5 8.00 1 0.72
3 322 110 3 3.5 2.5 8.67 1 0.80
4 314 103 2 2.0 3.0 8.21 0 0.65

Model building

In [5]:
def get_training_data(df):
    """
    This function splits the data into X and y variables and returns them
    """
    X = df.drop(columns = ["University Rating", "Chance of Admit"])
    y = df["Chance of Admit"]
    
    return X, y
In [6]:
def train_model(university_rating):
    """
    1. Takes the subset only for one university rating
    2. Invokes the get_training_data function,
    3. Fits a linear regression model
    4. Cross validates it
    5. Returns the model object and the metrics for cross validation
    """
    #Filtering the data dataframe for one university rating
    df = data[data["University Rating"] == university_rating]
    print(df.shape)
    
    #Splitting into X and y for regression
    X, y = get_training_data(df)
    
    regressor = LinearRegression()
    regressor.fit(X, y)
    
    #5-fold cross-validation; cross_val_score's default scoring for a regressor is R^2
    metric = cross_val_score(regressor, X, y, cv = 5)
    
    return regressor, metric

Now let us train the 5 models and save them to a dictionary.

In [7]:
university_ratings = data["University Rating"].unique()

university_recommendations = {}

for u in university_ratings:
    regressor, metric = train_model(u)
    university_recommendations["University ranking " + str(u)] = {'model': regressor, 'metric': metric}
(105, 8)
(162, 8)
(126, 8)
(73, 8)
(34, 8)
In [8]:
university_recommendations
Out[8]:
{'University ranking 4': {'model': LinearRegression(),
  'metric': array([0.70503399, 0.88116779, 0.43667585, 0.94462783, 0.90363293])},
 'University ranking 3': {'model': LinearRegression(),
  'metric': array([0.04196118, 0.22399362, 0.14130896, 0.58430267, 0.78058286])},
 'University ranking 2': {'model': LinearRegression(),
  'metric': array([-0.64325416,  0.32028836,  0.48696248,  0.34106336,  0.65715079])},
 'University ranking 5': {'model': LinearRegression(),
  'metric': array([0.89144242, 0.85894631, 0.87964782, 0.78529031, 0.85653355])},
 'University ranking 1': {'model': LinearRegression(),
  'metric': array([ 0.91791431, -0.03123308,  0.5077833 ,  0.58622725,  0.77048993])}}

The cross-validation summary does not look good: several folds score low or even negative R^2, especially for rankings 1, 2 and 3.
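To put numbers on that, we can summarise the fold scores of each model (a quick sketch over the dictionary built above):

#Mean and spread of the 5 fold scores per university ranking
for uni, entry in university_recommendations.items():
    scores = entry["metric"]
    print(uni, "| mean R^2:", round(scores.mean(), 3), "| std:", round(scores.std(), 3))

The means for rankings 1 to 3 sit well below the roughly 0.85 of ranking 5, and the spread across folds is large, which is expected given how few records each model is trained on.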

Getting recommendation from the model

We will prepare a test set of 20 random observations to see how well the models perform!

In [9]:
test = data.sample(20)
test = test.drop(columns = ["Chance of Admit", "University Rating"])
test.head()
Out[9]:
GRE Score TOEFL Score SOP LOR CGPA Research
499 327 113 4.5 4.5 9.04 0
78 296 95 3.0 2.0 7.54 1
254 321 114 4.0 5.0 9.12 0
472 327 116 4.0 4.5 9.48 1
266 312 105 2.0 2.5 8.45 0
In [10]:
predictions = {}

for uni in university_recommendations.keys():
    model = university_recommendations[uni]["model"]
    
    predictions[uni] = model.predict(test)
    
pred = pd.DataFrame(predictions)
pred.head(10)
Out[10]:
University ranking 4 University ranking 3 University ranking 2 University ranking 5 University ranking 1
0 0.818540 0.797745 0.805333 0.805250 0.844485
1 0.505933 0.524233 0.475614 0.620970 0.473850
2 0.813073 0.808833 0.827293 0.795664 0.876459
3 0.897444 0.886831 0.895612 0.906306 0.902600
4 0.605242 0.665792 0.672100 0.638099 0.663432
5 0.676663 0.657201 0.623715 0.746365 0.646300
6 0.700796 0.709100 0.707895 0.716795 0.712413
7 0.808943 0.770826 0.760663 0.848153 0.753670
8 0.651540 0.648149 0.624311 0.722527 0.659528
9 0.843120 0.837785 0.826909 0.865957 0.853813

As you can see, each column of the above dataframe holds a student's predicted propensity of getting admitted into a university of that rank!
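To turn these scores into the actual recommendation described in the approach section, we pick, for each student, the ranking whose model produced the highest score. A minimal sketch using the pred dataframe above:

#For each row (student), take the column (university ranking) with the highest score
recommended = pred.idxmax(axis = 1)
print(recommended.head())

For the first row above, for example, this returns "University ranking 1", since 0.844 is the largest of its five scores.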

Conclusion:

The limitation of this technique on this dataset is not having enough data. There were only 500 records, and they were not evenly distributed across the 5 university ratings: the subset sizes printed above range from 34 records to 162!

But this technique can generally be expected to work with a larger amount of data. Because of the limited data, I did not split it into train and test sets. With a proper train/test split, a ranking metric such as MAP or MAR (Mean Average Precision or Mean Average Recall) could have been incorporated.
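As an illustration, a held-out evaluation for a single ranking could look like the sketch below (this reuses get_training_data from above; train_test_split and the 80/20 split are my own choices here, not part of the notebook's pipeline):

from sklearn.model_selection import train_test_split

#Example for ranking 3; the same pattern applies to the other rankings
X, y = get_training_data(data[data["University Rating"] == 3])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

model = LinearRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))  #R^2 on the held-out 20%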

Please let me know about your feedback on this notebook in the comments section below! And upvote this notebook if you found it interesting!