
Overview

Our aim is to classify whether a college applicant will be accepted by at least one school, or be waitlisted/rejected by every school, in the set of Harvard, Stanford, and Princeton. An accurate classification can narrow the list of schools an applicant targets, making their time and effort more efficient and lowering the total cost of application fees. It also provides insight into the important factors and thresholds in assessing a candidate for college admission.


We used decision trees, nearest neighbor, and simple logistic regression to output a classification from a mix of categorical and continuous variables. Logistic regression performed best on our test set, and the features with the largest weights were unweighted GPA and select minority ethnicities.

Figure 1: Visualization of data with the features of unweighted GPA (modified to have a maximum value of 1) and SAT scores.

About the Raw Data

The data come from a College Confidential thread, where students self-report admission results and personal statistics. The posts were manually read and entered into a format suitable for the classifiers. The full list of collected features is gender, race, location, weighted and unweighted GPA, school type (public/private), class rank, SAT I and SAT II scores, ACT scores, extracurriculars, and whether they applied Early Decision. For classification labels, the results of accepted, waitlisted, and rejected from the set of schools each student applied to were collected.


Above, Figure 1 shows a high-level visualization of our data through the display of just two features, unweighted GPA and SAT scores, with the class labels.


In total, there were 421 observations. The data were split 80/20 for training/testing: 337 observations to train and validate, and 84 observations to test. To evaluate the models, leave-one-out cross-validation was used because of the small dataset.
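The split-and-evaluate procedure can be sketched as follows. This is a scikit-learn analogue, not the project's actual Weka pipeline, and the features and labels below are random stand-ins for the real applicant data:

```python
# Sketch of the 80/20 split plus leave-one-out cross-validation.
# Synthetic data stand in for the real 421 applicant records.
import numpy as np
from sklearn.model_selection import train_test_split, LeaveOneOut, cross_val_score
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(421, 5))          # 421 observations, as in the report
y = rng.integers(0, 2, size=421)       # 1 = accepted somewhere, 0 = not

# 80/20 train/test split (337 to train and validate, 84 to test).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=84, random_state=0)

# Leave-one-out CV on the training portion: one fold per observation,
# so each model is validated on a single held-out applicant.
scores = cross_val_score(LogisticRegression(), X_train, y_train, cv=LeaveOneOut())
cv_accuracy = scores.mean()
```

Leave-one-out is attractive here precisely because the dataset is small: every observation gets used for validation without shrinking the training folds much.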

About the Processed Data

The data were pre-processed before being fed into the classifiers. Gender, race, location, unweighted GPA, school type, SAT I scores, and Early Decision were kept unchanged. A new binary feature indicating whether the student was a Valedictorian or Salutatorian was derived from class rank, and class rank itself was converted into a percentile. ACT scores were converted to SAT scores through conversion tables found in this pdf. SAT II scores were averaged together.
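A minimal sketch of these pre-processing steps is below. The record fields and the partial ACT-to-SAT table are illustrative assumptions, not the actual conversion values or schema the project used:

```python
# Sketch of the pre-processing described above, on one applicant record.
# The ACT-to-SAT mapping is an assumed excerpt, not the real table.
ACT_TO_SAT = {36: 1600, 35: 1570, 34: 1530, 33: 1500, 32: 1470}

def preprocess(record):
    out = dict(record)
    rank, class_size = record["class_rank"], record["class_size"]
    # New binary feature: Valedictorian (rank 1) or Salutatorian (rank 2).
    out["val_or_sal"] = 1 if rank <= 2 else 0
    # Class rank converted to a percentile of the graduating class.
    out["rank_percentile"] = 100.0 * (1 - (rank - 1) / class_size)
    # Convert ACT to the SAT scale when only an ACT score was reported.
    if record.get("sat") is None and record.get("act") is not None:
        out["sat"] = ACT_TO_SAT[record["act"]]
    # Average all reported SAT II subject-test scores into one feature.
    if record.get("sat2_scores"):
        out["sat2_avg"] = sum(record["sat2_scores"]) / len(record["sat2_scores"])
    return out

applicant = {"class_rank": 1, "class_size": 200, "sat": None,
             "act": 34, "sat2_scores": [780, 800]}
features = preprocess(applicant)
```

For this hypothetical valedictorian, `features` would carry `val_or_sal = 1`, a rank percentile of 100, an SAT of 1530 converted from the ACT, and an SAT II average of 790.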

Results

All models below were trained and evaluated with the help of Weka.


Baseline

ZeroR accuracy is 61%. This, along with Figure 1, suggests that the sample of applicants found on College Confidential is unusually accomplished on average.
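ZeroR simply predicts the majority class for every applicant, so its accuracy equals the majority class's share of the data. A minimal sketch, with an assumed label mix chosen only to illustrate the 61% figure:

```python
# ZeroR baseline: always predict the most common training label.
# The 26/16 label mix below is illustrative, not the real distribution.
from collections import Counter

def zero_r(train_labels):
    majority, _ = Counter(train_labels).most_common(1)[0]
    return lambda _x: majority          # ignore the input entirely

labels = ["accepted"] * 26 + ["rejected"] * 16   # ~61% majority class
predict = zero_r(labels)
accuracy = sum(predict(None) == y for y in labels) / len(labels)
```

Any learned model has to beat this number to be worth using.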


Simple Logistic Regression

Logistic regression gave the highest cross-validation accuracy, 69%, and achieved 68% accuracy on the test set.
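The reason logistic regression also tells us which features matter is that its learned coefficients can be read directly as feature weights. A sketch of that inspection step with scikit-learn on synthetic data (the feature names and data are stand-ins; the project used Weka's SimpleLogistic):

```python
# Fit logistic regression, then read off the per-feature weights.
# Synthetic data: the label is built to depend mostly on the first
# feature, so it should receive the largest weight.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
feature_names = ["unweighted_gpa", "sat", "early_decision", "minority", "rank_pct"]
X = rng.normal(size=(337, len(feature_names)))
y = (X[:, 0] + 0.1 * rng.normal(size=337) > 0).astype(int)

model = LogisticRegression().fit(X, y)
weights = dict(zip(feature_names, model.coef_[0]))
strongest = max(weights, key=lambda f: abs(weights[f]))
```

In the real model, this kind of inspection is what surfaced unweighted GPA and select minority ethnicities as the heaviest-weighted features.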


Decision Tree (J48)

The unpruned tree achieved a cross-validation accuracy of 62%.
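A rough scikit-learn analogue of the unpruned tree is below. J48 is Weka's C4.5 implementation while `DecisionTreeClassifier` is CART, so this is only a sketch of the setup, on synthetic data:

```python
# Unpruned decision tree: with no depth limit or pruning penalty,
# the tree grows until every leaf is pure, memorizing the training set.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(337, 5))
y = (X[:, 0] > 0).astype(int)

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
train_accuracy = tree.score(X, y)   # perfect on training data
```

That perfect training fit is exactly why the unpruned tree's cross-validation accuracy (62%) barely beats the 61% baseline: it overfits the training folds.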


Figure 2: Stylization of learned decision tree


Nearest Neighbor (IBk)

Applicants go on the forum to gauge their chances of admission by comparing themselves to others with similar profiles and results, so nearest neighbor is a natural algorithm to use here. 3-NN improves on J48's cross-validation accuracy, achieving 65%.
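The idea maps directly onto the algorithm: classify a new applicant by majority vote among the three most similar past applicants. A sketch with scikit-learn (the project used Weka's IBk) on two made-up clusters of applicants:

```python
# 3-NN: predict by majority vote of the three nearest training points.
# Tiny synthetic dataset: (unweighted GPA, SAT) pairs for six applicants.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[3.90, 1550], [4.00, 1580], [3.95, 1560],   # accepted
                    [3.20, 1200], [3.30, 1250], [3.10, 1180]],  # not accepted
                   dtype=float)
y_train = np.array([1, 1, 1, 0, 0, 0])

knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
pred = knn.predict([[3.92, 1555], [3.25, 1210]])
```

Each query applicant here lands squarely inside one cluster, so all three of its nearest neighbors share a label and the vote is unanimous. On real data, scaling the features first matters, since raw SAT scores would otherwise dominate the distance.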

Full Report

This page gives an overview of our project with some supporting detail. Please read the full report here.
