Roadmap
20 July 2021
- Additional Tidystream improvements (?)
- Evaluate model building with vs. without embeddings
  Measure: model accuracy, run time, and file size. For this, use Class 1 and Class 2 (or Class 4; that's unions). The output is 4 models (each class, with and without embeddings) with metrics for their builds. See the build-and-measure sketch after this list.
- Evaluate model building as a function of input size (do we need this?)
- Figure out how to set up model builds for all 15 classes
  Note: this doesn't exactly have to be optimized. We need to build the 15 models as a one-off, even if that takes 15 instances running overnight for a day. See the class-loop sketch after this list.
- Evaluate how to deal with orgs having different text sources
  For this, take one class and build 3 models: one using only IRS text, one using only about-page text, and one using both. Then apply those three models to orgs that have only IRS text, only web text, or both, so it's a 3x3 test. The question is which model is best and most efficient to use for each of those three groups. See the 3x3 sketch after this list.
- Evaluate batch size for model prediction
  Using the 10k random file, test applying a recipe and outputting predictions (as a 0-to-1 probability) for batches of different sizes (500; 1,000; 5,000; 10,000). Here we're concerned only with run time and memory use. The aim is to find a batch size that we can iterate through (in parallel) so that we can output predictions for the complete text data set (~900k orgs) for each of the 15 classes. See the batch-timing sketch after this list.
- Figure out how to set up outputting predicted scores for the complete text data set for each of the 15 models. See the scoring-loop sketch after this list.
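
Build-and-measure sketch (for the embeddings comparison). A minimal, illustrative harness only, assuming the project uses tidymodels/textrecipes; the `train`/`test` data frames (with a `text` column and a two-level factor `label`), the `glove_embeddings` table, and the regularized logistic regression spec are all placeholder assumptions, not settled choices. The same harness could be rerun on subsamples if we decide to test model building as a function of input size.

```r
# Sketch only: time one model build and record accuracy and saved-model size,
# with or without a word-embedding step. `train`/`test` (with `text` and a
# two-level factor `label`) and `glove_embeddings` are hypothetical inputs.
library(tidymodels)
library(textrecipes)

build_and_measure <- function(train, test, use_embedding = FALSE) {
  rec <- recipe(label ~ text, data = train) %>%
    step_tokenize(text) %>%
    step_tokenfilter(text, max_tokens = 1000)

  rec <- if (use_embedding) {
    # pre-trained embedding table: first column of tokens, then numeric dims
    step_word_embeddings(rec, text, embeddings = glove_embeddings)
  } else {
    step_tfidf(rec, text)
  }

  wf <- workflow() %>%
    add_recipe(rec) %>%
    add_model(logistic_reg(penalty = 0.01) %>% set_engine("glmnet"))

  elapsed <- system.time(fitted <- fit(wf, data = train))["elapsed"]

  # file size of the serialized fitted workflow
  path <- tempfile(fileext = ".rds")
  saveRDS(fitted, path)

  acc <- predict(fitted, test) %>%
    bind_cols(test) %>%
    accuracy(truth = label, estimate = .pred_class)

  tibble(
    embedding    = use_embedding,
    accuracy     = acc$.estimate,
    run_time_sec = unname(elapsed),
    file_size_mb = file.size(path) / 1e6
  )
}
```

Running it for Class 1 and Class 2 with `use_embedding = TRUE` and `FALSE` would give the 4 models plus a small table of build metrics.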
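
Class-loop sketch (for the one-off 15-class builds). Same assumptions as the sketch above (tidymodels/textrecipes loaded); the `class_1`–`class_15` indicator columns and the `models/` output directory are hypothetical.

```r
# Sketch only: one-off build of one model per class, saved to disk.
# Assumes `train` has `text` plus binary label columns class_1 ... class_15.
class_cols <- paste0("class_", 1:15)
dir.create("models", showWarnings = FALSE)

for (cls in class_cols) {
  dat <- dplyr::mutate(train, label = factor(.data[[cls]]))
  wf <- workflow() %>%
    add_recipe(
      recipe(label ~ text, data = dat) %>%
        step_tokenize(text) %>%
        step_tokenfilter(text, max_tokens = 1000) %>%
        step_tfidf(text)
    ) %>%
    add_model(logistic_reg(penalty = 0.01) %>% set_engine("glmnet"))
  saveRDS(fit(wf, data = dat), file.path("models", paste0(cls, ".rds")))
}
```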
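
3x3 sketch (for the text-source comparison). Purely illustrative: `orgs`, the `irs_text`/`web_text` columns, and the `fit_text_model()`/`score_accuracy()` helpers are hypothetical stand-ins for however we end up fitting and scoring; in practice each cell would also need a held-out test split.

```r
# Sketch only: cross 3 training variants with 3 evaluation groups.
library(dplyr)
library(purrr)
library(tidyr)

with_text <- function(d, col) mutate(d, text = .data[[col]])

train_sets <- list(
  irs_only = orgs %>% filter(!is.na(irs_text)) %>% with_text("irs_text"),
  web_only = orgs %>% filter(!is.na(web_text)) %>% with_text("web_text"),
  both     = orgs %>% filter(!is.na(irs_text), !is.na(web_text)) %>%
    mutate(text = paste(irs_text, web_text))
)

eval_groups <- list(
  irs_only = orgs %>% filter(!is.na(irs_text),  is.na(web_text)) %>% with_text("irs_text"),
  web_only = orgs %>% filter( is.na(irs_text), !is.na(web_text)) %>% with_text("web_text"),
  both     = orgs %>% filter(!is.na(irs_text), !is.na(web_text)) %>%
    mutate(text = paste(irs_text, web_text))
)

models <- map(train_sets, fit_text_model)   # hypothetical fitting helper

results <- expand_grid(model = names(models), group = names(eval_groups)) %>%
  mutate(accuracy = map2_dbl(
    model, group,
    ~ score_accuracy(models[[.x]], eval_groups[[.y]])  # hypothetical scorer
  ))
```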
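
Batch-timing sketch (for the batch-size test). Assumes a fitted workflow `fitted_wf` and the 10k random file already loaded as `sample_10k` with a `text` column; this only measures run time, and memory would need to be watched separately.

```r
# Sketch only: time prediction over batches of a given size.
library(dplyr)
library(purrr)

time_batches <- function(fitted_wf, data, batch_size) {
  batch_id <- ceiling(seq_len(nrow(data)) / batch_size)
  batches  <- split(data, batch_id)

  per_batch <- map_dbl(batches, function(b) {
    system.time(
      predict(fitted_wf, new_data = b, type = "prob")  # 0-1 probabilities
    )["elapsed"]
  })

  tibble(
    batch_size     = batch_size,
    n_batches      = length(batches),
    total_sec      = sum(per_batch),
    mean_sec_batch = mean(per_batch)
  )
}

batch_results <- map_dfr(c(500, 1000, 5000, 10000),
                         ~ time_batches(fitted_wf, sample_10k, .x))
```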
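
Scoring-loop sketch (for scoring the complete ~900k-org text data set with all 15 models). The file names, the `org_id` column, the positive-class column name, and the batch size are all placeholders; parallelizing across chunks or models (e.g. with furrr) is left out to keep the sketch short.

```r
# Sketch only: score the full text data set in batches with all 15 saved
# models, appending one probability column per class to a results file.
library(dplyr)
library(purrr)
library(readr)

model_paths <- file.path("models", paste0("class_", 1:15, ".rds"))
models <- set_names(map(model_paths, readRDS), paste0("class_", 1:15))

batch_size <- 5000   # whichever size the batch test above settles on

score_chunk <- function(chunk, pos) {
  # ".pred_1" assumes the positive level of each label factor was coded "1"
  scores <- map(models, ~ predict(.x, new_data = chunk, type = "prob")$.pred_1) %>%
    as_tibble() %>%
    rename_with(~ paste0("score_", .x))
  out <- bind_cols(select(chunk, org_id), scores)   # `org_id` is assumed
  write_csv(out, "predicted_scores.csv", append = pos > 1)
}

# `full_text.csv` stands in for the complete text data set (needs a `text`
# column matching what the recipes expect).
read_csv_chunked(
  "full_text.csv",
  callback   = SideEffectChunkCallback$new(score_chunk),
  chunk_size = batch_size
)
```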