After working on the college admissions challenge as a model that predicts a student's chance of being admitted to a particular college, I wanted to take a different approach and see which rank of college a student has the best possibility of getting admitted into.
I have already performed EDA in the first notebook that I published on this dataset. Please take a look at this notebook and upvote if you like it!
In this notebook, I look at this data from a recommendation perspective to see which rank of college a student has the best shot of getting admitted to, given their other features.
The sections in this notebook are:
To work this from a recommendation perspective, we train 5 regression models, one for each university rank from 1 to 5. The first regression model computes a score representing the probability of the student getting admitted into a university with rank 1, the second regression model computes the score for rank 2, and so on.
Finally, all these scores are compared against each other, and together they recommend the university rank which the student has the best possibility of getting admitted to.
For this purpose we use the other features given below:
We will use the variable Chance of Admit as the target (y) variable.
We will use the variable University Rating to take out the data corresponding to each university rank and then train 5 independent models, one for each of the 5 rankings.
For prediction, given a single student record, we score it with all 5 models and compare the scores with each other to recommend the best university rank for the student.
When importing packages, I like to put them in alphabetical order, so that they are easy to manage and review if needed.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
Since there were two files, I did not want to leave out any records, so I imported both datasets, concatenated them into a single dataset, and then dropped duplicates.
#Reading the datasets
data_v1 = pd.read_csv("/kaggle/input/graduate-admissions/Admission_Predict_Ver1.1.csv")
data_v0 = pd.read_csv("/kaggle/input/graduate-admissions/Admission_Predict.csv")
data = pd.concat([data_v1, data_v0])
print(data.shape)
data.head()
data = data.drop_duplicates()
data.shape
It turns out there are no extra records between the two datasets: every record in the first dataset (Admission_Predict.csv) was also present in V1.1.
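A quick way to verify this kind of overlap is pandas' `merge` with `indicator=True`. A minimal sketch with toy frames standing in for the two files (the values here are made up):

```python
import pandas as pd

# Toy stand-ins for the two admission files (hypothetical values)
v0 = pd.DataFrame({"GRE Score": [320, 310], "CGPA": [8.5, 8.0]})
v1 = pd.DataFrame({"GRE Score": [320, 310, 300], "CGPA": [8.5, 8.0, 7.5]})

# indicator=True adds a _merge column telling where each row came from
overlap = v0.merge(v1, how="outer", indicator=True)

# Rows present only in v0 would show up as "left_only"
only_in_v0 = overlap[overlap["_merge"] == "left_only"]
print(len(only_in_v0))  # 0 means every v0 row also appears in v1
```

If the `left_only` count is zero, the first file is a subset of the second, which matches what `drop_duplicates` showed above.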
#Removing the serial number column as it carries no predictive information
data = data.drop(columns = ["Serial No."])
#The column "Chance of Admit" has a trailing space which is removed
data = data.rename(columns={"Chance of Admit ": "Chance of Admit"})
data.head()
def get_training_data(df):
    """
    This function splits the data into X and y variables and returns them
    """
    X = df.drop(columns = ["University Rating", "Chance of Admit"])
    y = df["Chance of Admit"]
    return X, y
def train_model(university_rating):
    """
    1. Takes the subset only for one university rating
    2. Invokes the get_training_data function,
    3. Fits a linear regression model
    4. Cross-validates it
    5. Returns the model object and the metrics for cross-validation
    """
    #Filtering for one university rating from the data dataframe
    df = data[data["University Rating"] == university_rating]
    print(df.shape)
    #Splitting into X and y for regression
    X, y = get_training_data(df)
    regressor = LinearRegression()
    regressor.fit(X, y)
    metric = cross_val_score(regressor, X, y, cv = 5)
    return regressor, metric
Now let us train the 5 models and save each one to a dictionary.
university_ratings = data["University Rating"].unique()
university_recommendations = {}
for u in university_ratings:
    regressor, metric = train_model(u)
    university_recommendations["University ranking " + str(u)] = {'model': regressor, 'metric': metric}
university_recommendations
The cross-validation scores do not look good!
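To put a number on that, one can report the mean and standard deviation of each model's scores. A sketch with made-up score arrays in place of the real `metric` values stored above:

```python
import numpy as np

# Hypothetical cross-validation arrays, one per university rank
cv_scores = {
    "University ranking 1": np.array([0.42, 0.38, 0.45, 0.40, 0.35]),
    "University ranking 2": np.array([0.55, 0.60, 0.52, 0.58, 0.50]),
}

# cross_val_score defaults to R^2 for regressors: closer to 1 is better,
# and negative means worse than predicting a constant
for name, scores in cv_scores.items():
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```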
We will prepare test data with 20 random observations to see how well the models perform!
test = data.sample(20)
test = test.drop(columns = ["Chance of Admit", "University Rating"])
test.head()
predictions = {}
for uni in university_recommendations.keys():
    model = university_recommendations[uni]["model"]
    predictions[uni] = model.predict(test)
pred = pd.DataFrame(predictions)
pred.head(10)
As you can see, each column of the above dataframe represents the student's propensity to get admitted into a university of that rank!
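The final recommendation step described earlier (picking the rank with the highest score) can be made explicit with `idxmax` across the columns. A sketch with a toy predictions frame (hypothetical scores, not the model's actual output):

```python
import pandas as pd

# Toy stand-in for the pred DataFrame of per-rank scores (hypothetical values)
pred = pd.DataFrame({
    "University ranking 1": [0.55, 0.70],
    "University ranking 2": [0.60, 0.65],
    "University ranking 3": [0.58, 0.72],
})

# idxmax along axis=1 picks, for each student, the column with the highest score
pred["Recommended"] = pred.idxmax(axis=1)
print(pred["Recommended"].tolist())
# ['University ranking 2', 'University ranking 3']
```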
The limitation of this technique on this dataset is the lack of data. There were only 500 records, and they were not evenly distributed across the university ratings!
But generally this technique can be expected to work with a larger amount of data. Because of the data limitation, I did not split the data into train and test sets. With a proper split, a ranking metric such as MAP or MAR (Mean Average Precision or Mean Average Recall) could have been incorporated.
Please let me know about your feedback on this notebook in the comments section below! And upvote this notebook if you found it interesting!