A Systematic Comparison of Predictive Analytics Methods in Higher Education

Colleges and universities spend hundreds of millions of dollars on predictive analytics software. However, many of these products are operated by private companies that provide little if any transparency about the underlying modeling that drives the risk ratings they generate. The goal of this project is to compare the performance of predictive models when we systematically vary the methods we use to generate risk ratings and the data assumptions that underlie our modeling decisions.


The Problem

Many predictive analytics software products are operated by private companies that provide little (if any) transparency about the underlying modeling that drives the risk ratings they generate, creating multiple risks for students and institutions:

  • Models may vary substantially in the accuracy with which they identify the probability that a student will complete (or drop out of) college, leading to inefficient and ineffective investment of institutional resources.
  • Individual colleges and universities spend hundreds of thousands of dollars annually on private predictive analytics software; however, it is possible that simpler and much less expensive modeling approaches work as well as more complex models.
  • Models can be designed and operated in a way that incorporates bias against disadvantaged or underrepresented groups, and may lead to privileged students receiving a disproportionate share of institutional resources.

There is not enough transparency around predictive analytics products used by colleges and universities.


Deep Dive into the Data

  • Our sample consists of students who enrolled at a VCCS college as a degree-seeking student for at least one term, with an initial enrollment term between Summer 2007 (when full data availability begins) and Summer 2012 (the last cohort for whom we can observe six years of potential degree completion). We define credential-seeking status as being enrolled in a college-level curriculum of study that would lead to a VCCS credential.
  • The sample includes 333,494 unique students, 34.6% of whom earned a degree within six years of initial enrollment. We randomly divide this full sample into a training set and a validation set (see the sketch below).
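
As a rough illustration of the training/validation split, the snippet below shows one common way to hold out a validation set. The file name, column names, 70/30 proportion, and random seed are illustrative assumptions, not details taken from the project.

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Hypothetical student-level extract; column names are placeholders.
    students = pd.read_csv("vccs_students.csv")

    # Hold out a validation set, stratifying on the completion outcome so
    # both sets preserve the overall completion rate (roughly 34.6%).
    train, validation = train_test_split(
        students,
        test_size=0.30,        # assumed proportion, not the project's actual choice
        random_state=42,
        stratify=students["completed_within_6yr"],
    )
    print(len(train), len(validation))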

The Innovation

  • We compare modeling strategies across a variety of criteria, including differences in predictive power, which student characteristics are most predictive of completion or withdrawal, and the types of students for whom the models perform best and worst in terms of predictive accuracy.
  • Among the model variations we test:
    • Traditional OLS and logistic regression methods compared to random forest, gradient boosting, and neural network methods for prediction (a simplified comparison sketch follows this list)
    • Inclusion/exclusion of: demographic predictors; NSC data; employment information, both pre-enrollment and during enrollment; students who attended non-VCCS institutions after their initial VCCS enrollment (from the training sample); and students who earned a degree within the first two years of enrollment (from the training sample).
    • Alternative strategies to test how well models can predict eventual degree completion with differing levels of predictive information about the student.
    • Models using only overall academic predictors, e.g. cumulative GPA.
    • Comparison of results with the outcome specified as any degree, versus associate degree or higher.
    • Giving more weight to precision versus recall.
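
Continuing the hypothetical split sketched above, a minimal illustration of this kind of model comparison might look as follows. The feature columns, hyperparameters, and reported metrics are assumptions for illustration, not the project's actual specification.

    from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import precision_score, recall_score, roc_auc_score

    # Placeholder feature set; the real models draw on far richer predictors.
    feature_cols = ["cumulative_gpa", "credits_attempted", "age_at_entry"]
    X_train, y_train = train[feature_cols], train["completed_within_6yr"]
    X_val, y_val = validation[feature_cols], validation["completed_within_6yr"]

    candidates = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "random_forest": RandomForestClassifier(n_estimators=500, random_state=42),
        "gradient_boosting": GradientBoostingClassifier(random_state=42),
    }

    # Fit each candidate on the training set and score it on the held-out
    # validation set, reporting discrimination (AUC) plus precision and recall.
    for name, model in candidates.items():
        model.fit(X_train, y_train)
        prob = model.predict_proba(X_val)[:, 1]
        # Shifting this threshold is one simple way to trade precision against recall.
        pred = (prob >= 0.5).astype(int)
        print(f"{name}: AUC={roc_auc_score(y_val, prob):.3f}, "
              f"precision={precision_score(y_val, pred):.3f}, "
              f"recall={recall_score(y_val, pred):.3f}")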

The Results

We are currently in the development stage for these models.


The Project Team


Ben Castleman,
Founder and Director, Nudge4

Kelli Bird,
Research Director

Yifeng Song,
Chief Data Scientist

Our Funders