- Configure a postgres database with the KDD Cup 2014 DonorsChoose dataset, using the tables `donations.csv`, `essays.csv`, `projects.csv`, and `resources.csv`. Note: Kaggle hosts a more recent DonorsChoose dataset, which includes data from as recently as May 2018. It includes a similar set of variables, but in a different schema.
- Create a new python environment and install python prerequisites from `requirements.txt`: `pip install -r requirements.txt`
- Create a `database.yaml` file with your credentials (see the sketch after this list).
- Run `database_preparation.py`. This executes the queries in `database_prep_queries` against your configured database, improving database performance and generating several time-aggregate features.
- Run `main.py`. This will run the Triage experiment defined in `donors-choose-config.yaml`.
- Run `model_selection.ipynb`. Be sure to update `experiment_id` to match the experiment hash generated when you ran `main.py`.
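For reference, here is a minimal sketch of what the `database.yaml` credentials file might contain, assuming standard Postgres connection fields; the exact key names this repo expects may differ:

```yaml
# database.yaml -- assumed field names; adjust to your environment
host: localhost
port: 5432
db: donors_choose
user: your_username
pass: your_password
```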
DonorsChoose is a nonprofit that addresses the education funding gap through crowdfunding. Since 2000, they have facilitated $970 million in donations to 40 million students in the United States.
However, approximately one third of all projects posted on DonorsChoose do not reach their funding goal within four months of posting.
This project will help DonorsChoose shrink the education funding gap by ensuring that more projects reach their funding goals. We will create an early warning system that identifies newly-posted projects that are likely to fail to meet their funding goals, allowing DonorsChoose to target those projects with an intervention such as a donation matching grant.
We use four tables from the DonorsChoose database:
Name | Description | Primary Key | Used? |
---|---|---|---|
projects | Basic metadata including teacher, class, and school information, and project asking price. | projectid | yes |
resources | Information about the classroom supply or resource to be funded. Product category, per-unit price, quantity requested, etc. | projectid | yes |
essays | Text of funding request. | projectid | yes |
donations | Table of donations. Donor information, donation amount, messages from donors. Zero to many rows per project. | donationid | yes |
We performed some initial processing of the source database to improve database performance and ensure compatibility with Triage. The altered tables are stored in a second database schema, leaving the raw data intact.
Triage expects each feature and label row to be identified by a primary key called `entity_id`. For convenience, we renamed `projectid` (our entity primary key) to `entity_id`.
We replaced the source database's string (postgres `varchar(32)`) `projectid` key with integer keys. Triage requires integer entity ids, and integer keys improve performance on joins and group operations.
We created primary key constraints on `projectid` in all tables (and a foreign key constraint on `donations.projectid`). These constraints create indexes on each of those columns, improving performance in label and feature generation.
Let's start by stating our qualitative understanding of the problem. Then, we'll translate that into a formal problem framing, using the language of the Triage experiment config file.
Once a DonorsChoose project has been posted, it can receive donations for four months. If it doesn't reach its funding goal by then, it is considered unfunded.
DonorsChoose wants to institute a program where a group of projects at risk of falling short on funding are selected to receive extra support: enrollment in a matching grant program funded by DonorsChoose's corporate partners, and prominent placement on the DonorsChoose project discovery page.
These interventions have limited capacity: funding is limited, and only a few projects at a time can really benefit from extra promotion on the homepage. Each day, DonorsChoose chooses a few newly-posted projects to be enrolled in these interventions, based on information in their application for funding. Each month, 50 projects in total are enrolled in the interventions.
Therefore, our goal is to build a machine learning model that identifies the 50 projects posted each month that are most likely to fail to reach their funding goal.
We'll use the earliest available data in feature generation. Historical information on project performance is likely useful in predicting the performance of new projects in the same locations, or under the same teachers.
feature_start_time: '2000-01-01'
feature_end_time: '2013-06-01'
We're most interested in evaluating the performance of our models on data from recent years. We select a dataset starting in mid-2011, after the platform's period of rapid growth, and running through the end of 2013, the last complete year of data.
label_start_time: '2011-09-02'
label_end_time: '2013-06-01'
Starting our label range with September 1, 2011 causes Triage to generate a useless 13th training set which contains a single day's worth of projects. We start our data on September 2, 2011 to avoid this.
Each month, the previous month's data becomes available for training a new model.
model_update_frequency: '1month'
Our model will make predictions once a month, on the previous month's unlabeled data. Our one month test set length reflects this.
test_durations: ['1month']
Patterns in the DonorsChoose data can change significantly within a year. We use one-month training sets to ensure that our models capture trends from recent data.
max_training_histories: ['1month']
When the model is in production, DonorsChoose will evaluate new projects daily. We use a one-day as-of-date frequency to simulate the rate at which DonorsChoose will access the model's predictions.
training_as_of_date_frequencies: ['1day']
test_as_of_date_frequencies: ['1day']
A project's label timespan is the amount of time that must pass from when it is posted, to when its label can be determined. In our case, each project has a four month label timespan.
training_label_timespans: ['4month']
test_label_timespans: ['4month']
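Taken together, these choices form the `temporal_config` block of `donors-choose-config.yaml`; assembled, it looks roughly like this (the surrounding config file contains additional sections not shown here):

```yaml
temporal_config:
    feature_start_time: '2000-01-01'
    feature_end_time: '2013-06-01'
    label_start_time: '2011-09-02'
    label_end_time: '2013-06-01'
    model_update_frequency: '1month'
    training_as_of_date_frequencies: ['1day']
    test_as_of_date_frequencies: ['1day']
    max_training_histories: ['1month']
    test_durations: ['1month']
    training_label_timespans: ['4month']
    test_label_timespans: ['4month']
```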
Here's a timechop diagram representing our temporal config:
Under our framing, each project can have one of two outcomes:
- Fully funded: Total donations in the four months following posting were equal to or greater than the requested amount.
- Not fully funded: Total donations in the four months following posting were less than the requested amount.
We generate our label with a query that sums total donations to each project and calculates a binary variable representing whether the project went unfunded (`1`) or met its goal (`0`).
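Below is a minimal sketch of how such a label query might be expressed in Triage's `label_config` block. The schema, table, and column names (`data.projects`, `data.donations`, `total_asking_price`, `date_posted`, `amount`) are illustrative assumptions, not the project's actual query:

```yaml
label_config:
  name: 'not_fully_funded'
  query: |
    -- illustrative schema, table, and column names
    select
        p.entity_id,
        (coalesce(sum(d.amount), 0) < p.total_asking_price)::int as outcome
    from data.projects p
    left join data.donations d
        on d.entity_id = p.entity_id
        and d.donation_date < '{as_of_date}'::date + interval '{label_timespan}'
    where p.date_posted = '{as_of_date}'::date
    group by p.entity_id, p.total_asking_price
```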
Since our intervention is resource-constrained and limited to 50 projects each month, we are concerned with minimizing false positives. We track precision@50_abs: how our models perform on precision among the 50 projects predicted to be at highest risk of going unfunded.
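In Triage config terms, this evaluation focus can be captured in the `scoring` section, roughly as follows (a sketch assuming Triage's standard scoring keys):

```yaml
scoring:
  testing_metric_groups:
    - metrics: ['precision@']
      thresholds:
        top_n: [50]
```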
We implement two categories of features. The first are features that we read directly from the database, raw, or with basic transformations. These include information like teacher and school demographics, type and price of requested classroom resources, and essay length.
Triage can generate these features directly from our source data, without us performing any manual transforms or aggregations.
The second category of features are temporal aggregations of historical donation information. These answer questions like "how did a posting teacher's previous projects perform?" and "how did previous projects at the originating school perform?"
These aggregations would be too complex to perform with Triage's feature aggregation system, so we generate them manually and store them alongside the source data.
The DDL statements that create these features are stored in `precompute_queries`.

Note: in `donors-choose-config.yaml`, we define all feature aggregates over the default interval `all`. This parameter isn't relevant to this project, because all of our time-aggregate features are calculated outside of Triage.
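For reference, a single Triage feature aggregation over the `all` interval might look like the sketch below; the `prefix`, `from_obj`, and column names are illustrative assumptions rather than the project's actual configuration:

```yaml
feature_aggregations:
  - prefix: 'project'
    from_obj: 'optimized.projects'          # assumed schema/table name
    knowledge_date_column: 'date_posted'    # assumed column name
    aggregates_imputation:
      all:
        type: 'zero'
    aggregates:
      - quantity: 'total_asking_price'      # assumed column name
        metrics: ['max']
    intervals: ['all']
    groups: ['entity_id']
```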
Our model grid includes three model function candidates and three baseline model specifications.
Model function candidates:
- `sklearn.ensemble.RandomForestClassifier`
- `sklearn.linear_model.LogisticRegression`
- `sklearn.tree.DecisionTreeClassifier`

Baselines:

- `sklearn.tree.DecisionTreeClassifier` (max_depth = 2)
- `sklearn.dummy.DummyClassifier` (predicting our label's base rate)
- Triage's `PercentileRankOneFeature`, which ranks entities based on the value of a single feature (here, project total asking price)
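Here is a sketch of how this grid might be declared in the `grid_config` section of `donors-choose-config.yaml`. The hyperparameter values are illustrative (the random forest values mirror the selected model groups shown below), and the import path and parameter name for `PercentileRankOneFeature` are assumptions:

```yaml
grid_config:
  'sklearn.ensemble.RandomForestClassifier':
    n_estimators: [1000, 10000]
    max_depth: [5, 10]
    max_features: [12]
    min_samples_split: [25, 50]
  'sklearn.linear_model.LogisticRegression':
    penalty: ['l1', 'l2']             # illustrative values
    C: [0.01, 1]
  'sklearn.tree.DecisionTreeClassifier':
    max_depth: [2, 5, 10]             # max_depth = 2 doubles as a simple baseline
  'sklearn.dummy.DummyClassifier':
    strategy: ['prior']               # predicts the label's base rate
  'triage.component.catwalk.baselines.rankers.PercentileRankOneFeature':
    feature: ['total_asking_price']   # assumed feature name and import path
```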
We use Auditioner to manage model selection. Plotting precision@50_abs over time shows that our model groups are generally working well, with most performing better than baselines.
Our logistic regression model groups tend to perform worse than our random forests. The difference in performance (as much as 0.25) is too large to be justified by logistic regression's potentially higher interpretability. Plain decision trees also seem to perform consistently worse than random forests.
We use Auditioner to perform some coarse filtering, eliminating the worst-performing model groups:
- Dropping model groups that achieved precision@50 worse than 0.5 in at least one test set
- Dropping model groups that had a regret (difference in performance from the best-performing model group) of 0.2 or greater during at least one month
Performance in the resulting set of model groups ranges from ~0.5 to 0.8, well above the prior rate of ~0.3. Looking pretty good so far.
With a basic Auditioner model selection grid, variance-penalized average precision (penalty = 0.5) and best current precision appear to minimize regret.
Criteria | Average regret (precision@50_abs) |
---|---|
Best current precision | 0.0905 |
Most frequently within 0.1 of best | 0.0942 |
Most frequently within 0.03 of best | 0.0996 |
Random model group | 0.1087 |
Best current precision, which selects the best-performing model group to serve as the predictor for the next month, minimizes average regret and beats a random baseline.
This criterion selects three random forest model groups for the next period:
Model type | max_depth | max_features | n_estimators | min_samples_split |
---|---|---|---|---|
RandomForestClassifier | 5 | 12 | 1000 | 25 |
RandomForestClassifier | 10 | 12 | 1000 | 50 |
RandomForestClassifier | 10 | 12 | 10000 | 50 |