
Roadmap


20 July 2021

1. Model Training

  • Additional Tidystream improvements (?)

  • Evaluate model building with vs. without embedding

    measure: model accuracy, run time, file size

    For this, use Class 1 and Class 2 (or Class 4, the unions), building each with and without embedding. The output is 4 models with metrics for their builds; see the first sketch after this list.
    
  • Evaluate model building as a function of input size (do we need this?); see the second sketch after this list.
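
A minimal sketch of the with/without-embedding comparison, assuming tidymodels + textrecipes, data frames `train` and `test` with a `text` column and a factor `class` outcome, and a pretrained embedding tibble `glove_embeddings` (all hypothetical names):

```r
library(tidymodels)
library(textrecipes)

# Two preprocessing recipes: plain tf-idf vs. pretrained word embeddings.
rec_tfidf <- recipe(class ~ text, data = train) |>
  step_tokenize(text) |>
  step_tokenfilter(text, max_tokens = 1000) |>
  step_tfidf(text)

rec_embed <- recipe(class ~ text, data = train) |>
  step_tokenize(text) |>
  step_word_embeddings(text, embeddings = glove_embeddings)

# Fit one workflow and record the three metrics we care about:
# accuracy, run time, saved-model file size.
fit_and_measure <- function(rec, label) {
  wf <- workflow() |> add_recipe(rec) |> add_model(logistic_reg())
  run_time <- system.time(fitted <- fit(wf, data = train))["elapsed"]
  preds <- predict(fitted, new_data = test) |> bind_cols(test)
  path <- tempfile(fileext = ".rds")
  saveRDS(fitted, path)
  tibble(
    model      = label,
    accuracy   = accuracy(preds, truth = class, estimate = .pred_class)$.estimate,
    run_time_s = unname(run_time),
    size_mb    = file.size(path) / 1e6
  )
}

# Run once per class (Class 1 and Class 2) for the 4 builds total.
bind_rows(
  fit_and_measure(rec_tfidf, "no_embedding"),
  fit_and_measure(rec_embed, "with_embedding")
)
```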

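And a minimal timing sketch for the input-size question, reusing the same hypothetical `train` frame and tf-idf recipe:

```r
library(tidymodels)
library(textrecipes)

sizes <- c(1000, 5000, 10000, 50000)  # assumed grid; adjust to the data

build_times <- purrr::map_dfr(sizes, function(n) {
  sampled <- dplyr::slice_sample(train, n = n)
  wf <- workflow() |>
    add_recipe(
      recipe(class ~ text, data = sampled) |>
        step_tokenize(text) |>
        step_tokenfilter(text, max_tokens = 1000) |>
        step_tfidf(text)
    ) |>
    add_model(logistic_reg())
  elapsed <- system.time(fit(wf, data = sampled))["elapsed"]
  tibble(n_rows = n, run_time_s = unname(elapsed))
})
```
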
2. AWS/server setup for Model Training

  • Figure out how to set up model builds for all 15 classes

    note: this doesn't have to be optimized. We need to build the 15 models as a one-off, even if 
    that means 15 instances running overnight. See the sketch below.
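
A minimal sketch of the one-off build loop, assuming a hypothetical `build_class_model()` helper that wraps the recipe-and-fit code from section 1; on AWS the same loop can be sharded so each instance builds one class:

```r
# One-off: fit and save a model for each of the 15 classes.
dir.create("models", showWarnings = FALSE)
for (class_id in 1:15) {
  fitted <- build_class_model(class_id)  # hypothetical helper wrapping fit()
  saveRDS(fitted, file.path("models", sprintf("class_%02d.rds", class_id)))
}
```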
    

3. Scoring Orgs

  • Evaluate how to deal with orgs having different text sources.

    For this, take one class and build 3 models: one using only IRS text, one using only about-page text, and one using both. 
    Then apply those three models to orgs that have only IRS text, only web text, or both, making a 3x3 test 
    (see the first sketch after this list). The question is which model is best and most efficient for each of those three groups.
    
  • Evaluate batch size for model prediction

    Using the 10k random file, test applying a recipe and outputting predictions (as 0-to-1 probabilities) for 
    batches of different sizes (500; 1,000; 5,000; 10,000). Here we're concerned only with run time and memory 
    use. The aim is to find a batch size that we can iterate through (in parallel) so that we can output 
    predictions for the complete text data set (~900k orgs) for each of the 15 classes; see the second sketch after this list.
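
A minimal sketch of the 3x3 test, assuming `train` and `holdout` frames with `irs_text` and `web_text` columns, a factor `class` outcome, and a `source_group` column on the holdout marking which sources each org actually has (all hypothetical names); orgs missing a source carry empty strings so every model can score every group:

```r
library(tidymodels)
library(textrecipes)

model_formulas <- list(
  irs_only = class ~ irs_text,
  web_only = class ~ web_text,
  both     = class ~ irs_text + web_text
)

# Fit one workflow whose predictors are set by the formula.
fit_source_model <- function(f) {
  rec <- recipe(f, data = train) |>
    step_tokenize(all_predictors()) |>
    step_tokenfilter(all_predictors(), max_tokens = 1000) |>
    step_tfidf(all_predictors())
  workflow() |> add_recipe(rec) |> add_model(logistic_reg()) |> fit(data = train)
}

# 3 models x 3 org groups -> 9 accuracy cells.
grid <- purrr::map_dfr(names(model_formulas), function(m) {
  fitted <- fit_source_model(model_formulas[[m]])
  purrr::map_dfr(c("irs_only", "web_only", "both"), function(g) {
    grp <- dplyr::filter(holdout, source_group == g)
    preds <- predict(fitted, new_data = grp)
    tibble(model = m, org_group = g,
           accuracy = yardstick::accuracy_vec(grp$class, preds$.pred_class))
  })
})
```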
    
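A minimal sketch of the batch-size timing, assuming a fitted workflow `fitted` from the builds above and the 10k random file loaded as `orgs_10k` (hypothetical names); the `.pred_*` columns from `type = "prob"` are the 0-to-1 probabilities. Per-batch memory could be checked separately, e.g. with bench::mark(), which also reports allocations:

```r
library(tidymodels)

batch_sizes <- c(500, 1000, 5000, 10000)

batch_times <- purrr::map_dfr(batch_sizes, function(b) {
  # Split row indices into consecutive batches of size b.
  idx <- split(seq_len(nrow(orgs_10k)), ceiling(seq_len(nrow(orgs_10k)) / b))
  elapsed <- system.time(
    for (rows in idx) {
      predict(fitted, new_data = orgs_10k[rows, ], type = "prob")
    }
  )["elapsed"]
  tibble(batch_size = b, n_batches = length(idx), run_time_s = unname(elapsed))
})
```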

4. AWS/server setup for Model Application

  • Figure out how to set up outputting predicted scores for the complete text data set for each of the 15 models; a sketch follows.
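
A minimal sketch of the full scoring pass, assuming the 15 saved models from the one-off build, the complete text data in `all_orgs` with an `org_id` column (hypothetical names), and whatever batch size the test above settles on. furrr parallelizes across local cores; the same loop could instead be sharded across AWS instances, one class per instance:

```r
library(tidymodels)
library(future)
library(furrr)
plan(multisession)

batch_size <- 5000  # placeholder; use the winner of the batch-size test
batches <- split(all_orgs, ceiling(seq_len(nrow(all_orgs)) / batch_size))

dir.create("scores", showWarnings = FALSE)
for (class_id in 1:15) {
  fitted <- readRDS(sprintf("models/class_%02d.rds", class_id))
  scores <- future_map_dfr(batches, function(b) {
    dplyr::bind_cols(
      dplyr::select(b, org_id),                     # hypothetical id column
      predict(fitted, new_data = b, type = "prob")  # 0-to-1 probabilities
    )
  })
  readr::write_csv(scores, sprintf("scores/class_%02d.csv", class_id))
}
```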