Predict Graduation Rates (2019 U.S. College Data)
Date: 11 Jun 2020 Tag(s): Jupyter Notebook, R, ggplot, College/University, Models Categories: Machine Learning, ML, Regression

Overview #
This analysis creates and compares several models that attempt to predict a college's graduation rate from a variety of variables that could affect students: economic, academic, enrollment, faculty, and so on. Linear regression models built from the dataset's significant variables were compared to narrow down the relevant predictors. From there, other types of models were created for further comparison: random forest and k-nearest neighbors (KNN).
The notebook explores the various factors that might affect a university's graduation rate and whether a model can be created that could help university administrators predict what the graduation rate might be.
NOTE:
- This notebook is built using R instead of Python; see the project README for more info on how to use the notebook
- A deeper data analysis was done on this same dataset, see: Public and Private Graduation Rates Analysis (2019 U.S. College Data)
View Jupyter Notebook #
The notebook goes step-by-step through the project; please follow the directions and cell order if you would like to replicate the results.
- Click the "View Notebook" button to open the rendered notebook in a new tab
- Click the "GitHub" button to view the project in the GitHub portfolio repo
Methodology #
A series of linear regression models was created with variables significantly correlated with Grad.Rate and compared to see which variables made the best model. From there, several other models were created (random forest and k-nearest neighbors) and compared to find the best one possible for predicting the graduation rate.
Correlation Matrix #
A correlation matrix was created to find which variables were highly correlated with Grad.Rate. From the matrix below, we can see that the variable with the highest correlation with Grad.Rate is Outstate at 0.54, with perc.alumni coming in close at 0.48. Unfortunately, neither of these variables has a particularly strong correlation with Grad.Rate, and the rest of the variables fare far worse.
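The correlations above can be reproduced with base R's cor(). A minimal sketch, assuming the prepared data frame is named clean_data (the name used in the model formulas below):

```r
# Assumption: clean_data is the prepared data frame from the notebook.
# Keep only numeric columns so cor() is well defined.
numeric_cols <- clean_data[sapply(clean_data, is.numeric)]

# Correlation of every numeric variable with Grad.Rate,
# sorted so the strongest predictors (e.g. Outstate, perc.alumni) appear first.
cors <- cor(numeric_cols)[, "Grad.Rate"]
sort(cors[names(cors) != "Grad.Rate"], decreasing = TRUE)
```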
Linear Models Comparisons #
Four linear regression models were fitted with various combinations of variables from the dataset to find which created the best model; their formulas are below:
Model 1 #
- Formula (all variables):
lm(formula = Grad.Rate ~ ., data = clean_data)
Model 2 #
- Formula (significant variables from the first model):
lm(formula = Grad.Rate ~ perc.alumni + Expend + Outstate + Room.Board + Top25perc + Apps + Private + Personal + PhD + P.Undergrad)
Model 3 #
- Formula (significant variables from the second model):
lm(formula = Grad.Rate ~ perc.alumni + Expend + Outstate + Room.Board + Top25perc + Apps + Private + Personal + P.Undergrad)
Model 4 #
- Formula (significant variables from the third model):
lm(formula = Grad.Rate ~ perc.alumni + Expend + Outstate + Top25perc + Apps + Private)
Below is a graph of the models' R^2 and RMSE scores:
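For reference, the scores for any one of these models can be computed directly from the fitted lm object. A sketch for model 4, assuming the clean_data data frame from the formulas above (the notebook's exact evaluation code may differ):

```r
# Fit model 4 using the formula listed above.
model4 <- lm(Grad.Rate ~ perc.alumni + Expend + Outstate +
               Top25perc + Apps + Private,
             data = clean_data)

# R^2 comes straight from the model summary.
r2 <- summary(model4)$r.squared

# RMSE on the training data: root mean of the squared residuals.
rmse <- sqrt(mean(residuals(model4)^2))
```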
Linear Analysis Plots #
Analysis graphs of the models were plotted to better understand their performance on the data. The main one we want to look at is the Normal Q-Q plot in the top-right corner: we want the grouping of points to be as close to the dotted line as possible.
NOTE: The other two analysis plots can be viewed in the linked notebook
Below are the plots for the 1st model and the 4th model:
Model 1: All variables
Model 4: perc.alumni + Expend + Outstate + Top25perc + Apps + Private
From model 1 to model 4, we can see the point groupings on the Normal Q-Q plots get closer to the line, with the last model (model 4) having the points closest to the line. All of these graphs still have skew in the right-hand tail, but the fourth model is the best fit of them all.
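Diagnostic panels like these come from base R's plot() method for lm objects. A minimal sketch (model4 here stands in for any fitted model from the comparison):

```r
# Assumption: model4 is an lm object fitted as shown earlier.
# plot.lm() draws the standard diagnostic panels; a 2x2 layout puts
# the Normal Q-Q plot in the top-right corner, as discussed above.
par(mfrow = c(2, 2))
plot(model4)

# To draw only the Normal Q-Q plot, request panel 2:
# plot(model4, which = 2)
```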
Other Models: Random Forest and K-Nearest Neighbors #
For comparison, models were created using Random Forest and K-Nearest Neighbors (KNN) using the variable structure in models 1 and 4.
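One way to set up such a comparison is with the caret package, which reports RMSE and R^2 for both model types under the same resampling scheme. This is a hedged sketch: the model name clean_rf1 appears in the notebook, but the training control and tuning choices here are assumptions, not the notebook's exact setup.

```r
library(caret)

# Assumption: 10-fold cross-validation; the notebook may use a different scheme.
ctrl <- trainControl(method = "cv", number = 10)

# Random forest on all variables (analogous to the notebook's clean_rf1).
clean_rf1 <- train(Grad.Rate ~ ., data = clean_data,
                   method = "rf", trControl = ctrl)

# KNN is distance-based, so predictors are centered and scaled first.
clean_knn1 <- train(Grad.Rate ~ ., data = clean_data,
                    method = "knn",
                    preProcess = c("center", "scale"),
                    trControl = ctrl)

# Resampled RMSE and Rsquared for each tuning value.
clean_rf1$results
clean_knn1$results
```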
Overall, the model with the lowest RMSE (Root Mean Square Error) and the highest R^2 (also known as the Coefficient of Determination) is the 1st random forest model (clean_rf1). The random forest models fared about the same as (just slightly better than) the linear models and, curiously, the KNN models fared far worse than both the linear and random forest models. However, the R^2 is still very low, and the RMSE means the model can only predict a university's graduation rate to within roughly the RMSE margin. In terms of a graduation rate, that is quite a large margin, so this model should probably not be used to predict a university's graduation rate.
Conclusion #
In conclusion, based on all of the analysis, I am confident in saying that this dataset does not have enough data, or the right variables measured, to predict the graduation rate of these universities to a confident degree. The models' RMSEs and R^2s are not good enough to use the models for predicting university graduation rates. I would recommend that more data be collected, along with other variables of measure, to get a better spread of data and more accurate models.
I think perhaps more years of data collection are needed, and/or other variables measured, for a better reflection of what might influence a student's ability to graduate.