
Roadmap


20 July 2021

1. Model Training

  • Additional Tidystream improvements (?)

  • Evaluate model building with vs. without embedding

    measure: model accuracy, run time, file size

    For this, use Class 1 and Class 2 (or Class 4, the unions), building each with and without embedding. The output is 4 models with metrics for their builds; see the first sketch after this list.
    
  • Evaluate model building as a function of input size (do we need this?); see the second sketch after this list.
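
A minimal sketch of the with/without-embedding comparison, assuming tidymodels + textrecipes, data frames `train` and `test` with a `text` column and a factor `class` outcome, and a pretrained embedding tibble `glove_embeddings` (all hypothetical names):

```r
library(tidymodels)
library(textrecipes)

# Two preprocessing recipes: plain tf-idf vs. pretrained word embeddings.
rec_tfidf <- recipe(class ~ text, data = train) |>
  step_tokenize(text) |>
  step_tokenfilter(text, max_tokens = 1000) |>
  step_tfidf(text)

rec_embed <- recipe(class ~ text, data = train) |>
  step_tokenize(text) |>
  step_word_embeddings(text, embeddings = glove_embeddings)

# Fit one workflow and record the three metrics we care about:
# accuracy, run time, saved-model file size.
fit_and_measure <- function(rec, label) {
  wf <- workflow() |> add_recipe(rec) |> add_model(logistic_reg())
  run_time <- system.time(fitted <- fit(wf, data = train))["elapsed"]
  preds <- predict(fitted, new_data = test) |> bind_cols(test)
  path <- tempfile(fileext = ".rds")
  saveRDS(fitted, path)
  tibble(
    model      = label,
    accuracy   = accuracy(preds, truth = class, estimate = .pred_class)$.estimate,
    run_time_s = unname(run_time),
    size_mb    = file.size(path) / 1e6
  )
}

# Run once per class (Class 1 and Class 2) for the 4 builds total.
bind_rows(
  fit_and_measure(rec_tfidf, "no_embedding"),
  fit_and_measure(rec_embed, "with_embedding")
)
```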

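And a minimal timing sketch for the input-size question, reusing the same hypothetical `train` frame and tf-idf recipe:

```r
library(tidymodels)
library(textrecipes)

sizes <- c(1000, 5000, 10000, 50000)  # assumed grid; adjust to the data

build_times <- purrr::map_dfr(sizes, function(n) {
  sampled <- dplyr::slice_sample(train, n = n)
  wf <- workflow() |>
    add_recipe(
      recipe(class ~ text, data = sampled) |>
        step_tokenize(text) |>
        step_tokenfilter(text, max_tokens = 1000) |>
        step_tfidf(text)
    ) |>
    add_model(logistic_reg())
  elapsed <- system.time(fit(wf, data = sampled))["elapsed"]
  tibble(n_rows = n, run_time_s = unname(elapsed))
})
```
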
2. AWS/server setup for Model Training

  • Figure out how to set up model builds for all 15 classes

    note: this doesn't have to be optimized. We need to build the 15 models as a one-off, even if 
    that means 15 instances running overnight. See the sketch below.
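
A minimal sketch of the one-off build loop, assuming a hypothetical `build_class_model()` helper that wraps the recipe-and-fit code from section 1; on AWS the same loop can be sharded so each instance builds one class:

```r
# One-off: fit and save a model for each of the 15 classes.
dir.create("models", showWarnings = FALSE)
for (class_id in 1:15) {
  fitted <- build_class_model(class_id)  # hypothetical helper wrapping fit()
  saveRDS(fitted, file.path("models", sprintf("class_%02d.rds", class_id)))
}
```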
    

3. Scoring Orgs

  • Evaluate how to deal with orgs having different text sources.

    For this, take one class and build 3 models: one using only IRS text, one using only about-page text, and one using both. 
    Then apply those three models to orgs that have only IRS text, only web text, or both, making a 3x3 test 
    (see the first sketch after this list). The question is which model is best and most efficient for each of those three groups.
    
  • Evaluate batch size for model prediction

    Using the 10k random file, test applying a recipe and outputting predictions (as 0-to-1 probabilities) for 
    batches of different sizes (500; 1,000; 5,000; 10,000). Here we're concerned only with run time and memory 
    use. The aim is to find a batch size that we can iterate through (in parallel) so that we can output 
    predictions for the complete text data set (~900k orgs) for each of the 15 classes; see the second sketch after this list.
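
A minimal sketch of the 3x3 test, assuming `train` and `holdout` frames with `irs_text` and `web_text` columns, a factor `class` outcome, and a `source_group` column on the holdout marking which sources each org actually has (all hypothetical names); orgs missing a source carry empty strings so every model can score every group:

```r
library(tidymodels)
library(textrecipes)

model_formulas <- list(
  irs_only = class ~ irs_text,
  web_only = class ~ web_text,
  both     = class ~ irs_text + web_text
)

# Fit one workflow whose predictors are set by the formula.
fit_source_model <- function(f) {
  rec <- recipe(f, data = train) |>
    step_tokenize(all_predictors()) |>
    step_tokenfilter(all_predictors(), max_tokens = 1000) |>
    step_tfidf(all_predictors())
  workflow() |> add_recipe(rec) |> add_model(logistic_reg()) |> fit(data = train)
}

# 3 models x 3 org groups -> 9 accuracy cells.
grid <- purrr::map_dfr(names(model_formulas), function(m) {
  fitted <- fit_source_model(model_formulas[[m]])
  purrr::map_dfr(c("irs_only", "web_only", "both"), function(g) {
    grp <- dplyr::filter(holdout, source_group == g)
    preds <- predict(fitted, new_data = grp)
    tibble(model = m, org_group = g,
           accuracy = yardstick::accuracy_vec(grp$class, preds$.pred_class))
  })
})
```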
    
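A minimal sketch of the batch-size timing, assuming a fitted workflow `fitted` from the builds above and the 10k random file loaded as `orgs_10k` (hypothetical names); the `.pred_*` columns from `type = "prob"` are the 0-to-1 probabilities. Per-batch memory could be checked separately, e.g. with bench::mark(), which also reports allocations:

```r
library(tidymodels)

batch_sizes <- c(500, 1000, 5000, 10000)

batch_times <- purrr::map_dfr(batch_sizes, function(b) {
  # Split row indices into consecutive batches of size b.
  idx <- split(seq_len(nrow(orgs_10k)), ceiling(seq_len(nrow(orgs_10k)) / b))
  elapsed <- system.time(
    for (rows in idx) {
      predict(fitted, new_data = orgs_10k[rows, ], type = "prob")
    }
  )["elapsed"]
  tibble(batch_size = b, n_batches = length(idx), run_time_s = unname(elapsed))
})
```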

4. AWS/server setup for Model Application

  • Figure out how to set up outputting predicted scores for the complete text data set for each of the 15 models; a sketch follows.
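
A minimal sketch of the full scoring pass, assuming the 15 saved models from the one-off build, the complete text data in `all_orgs` with an `org_id` column (hypothetical names), and whatever batch size the test above settles on. furrr parallelizes across local cores; the same loop could instead be sharded across AWS instances, one class per instance:

```r
library(tidymodels)
library(future)
library(furrr)
plan(multisession)

batch_size <- 5000  # placeholder; use the winner of the batch-size test
batches <- split(all_orgs, ceiling(seq_len(nrow(all_orgs)) / batch_size))

dir.create("scores", showWarnings = FALSE)
for (class_id in 1:15) {
  fitted <- readRDS(sprintf("models/class_%02d.rds", class_id))
  scores <- future_map_dfr(batches, function(b) {
    dplyr::bind_cols(
      dplyr::select(b, org_id),                     # hypothetical id column
      predict(fitted, new_data = b, type = "prob")  # 0-to-1 probabilities
    )
  })
  readr::write_csv(scores, sprintf("scores/class_%02d.csv", class_id))
}
```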