Risklist module for production #631
Merged
55 commits
a563694
listmaking WIP
thcrock 9750c3e
forgot migraton
thcrock 360f8f9
WIP
tweddielin 999a46f
alembic add label_value to list_predictions table
tweddielin 372d9c8
add docstrings
tweddielin 16645bc
move risklist a layer above
tweddielin 914ad76
create risklist module
tweddielin c92bd8b
__init__lpy
tweddielin d3c3ba9
fix alembic reversion and replace metta.generate_uuid with filename_f…
tweddielin 0e92fb0
Fix down revision of production schema migration
thcrock dbd4578
Fix alembic revisions
thcrock f7d49e5
Enable github checks on this branch too
thcrock dee930f
Closer to getting tests to run
thcrock 1769b00
Add CLI for risklist
thcrock 52c9ff0
Risklist docs stub
thcrock 173167a
Break up data gathering into experiment and matrix, use pytest fixtur…
thcrock f6b2d02
Modify schema for list prediction metadata
thcrock acffa67
fix conflicts and add helper functions for getting imputed features
tweddielin 43c1919
Handle other imputation flag cases, fix tracking indentation error
thcrock 7dfb7e1
Add more tests, fill out doc page
thcrock cc9fe4a
Fix exception name typo
thcrock 5951565
use timechop and planner to create matrix_metadata for production
tweddielin 537f6c8
retrain and predict forward
tweddielin b429540
rename to retrain_definition
tweddielin 0045aa5
reusing random seeds from existing models
shaycrk 9dc3697
fix tests (write experiment to test db)
shaycrk da870d5
unit test for reusing model random seeds
shaycrk 6768ee5
add docstring
shaycrk 7d6a420
only store random seed in experiment runs
shaycrk b8fe6d8
DB migration to remove random seed from experiments table
shaycrk 8207fcd
debugging
shaycrk 45c9d68
debug model trainer tests
shaycrk a665e7e
debug catwalk utils tests
shaycrk ead882b
debug catwalk integration test
shaycrk de85f10
use public method
tweddielin ad860cd
Merge remote-tracking branch 'origin/kit_rand_seed' into list_making
tweddielin 40466d5
alembic merge
tweddielin 83c7385
reuse random seed
tweddielin f97089b
use timechop for getting retrain information
tweddielin 6f0af1c
create retrain model hash in retrain level instead of model_trainer l…
tweddielin 42bccaa
move util functions to utils
tweddielin 3ec377f
fix cli and docs
tweddielin 1c4da24
update docs
tweddielin 35bd978
use reconstructed feature dict
tweddielin 9f5a099
add RetrainModel and Retrain
tweddielin ba84822
remove break point
tweddielin 83e0f66
change experiment_runs to triage_runs
tweddielin d6f14f5
get retrain_config
tweddielin d76359b
explicitly include run_type in joins to triage_runs
shaycrk 9698500
DB migration updates
shaycrk a8a29f1
update argument name in docs
shaycrk 694edcc
ensure correct temporal config is used for predicting forward
shaycrk 583e9bd
debug
shaycrk 815a258
debug
shaycrk 5e183fe
Merge branch 'master' into list_making
shaycrk
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,87 @@ | ||
# Retrain and Predict | ||
Use an existing model group to retrain a new model on all the data up to the current date and then predict forward into the future. | ||
|
||
## Examples | ||
Both examples assume you have already run a Triage Experiment in the past, and know these two pieces of information: | ||
1. A `model_group_id` from a Triage model group that you want to use to retrain a model and generate predictions | ||
2. A `prediction_date` to generate your predictions on. | ||
|
||
### CLI | ||
`triage retrainpredict <model_group_id> <prediction_date>` | ||
|
||
Example: | ||
`triage retrainpredict 30 2021-04-04` | ||
|
||
The `retrainpredict` command assumes the current directory is the 'project path' used to train models and write matrices; this can be overridden with the `--project-path` option. | ||
|
||
### Python | ||
The `Retrainer` class from the `triage.predictlist` module can be used to retrain a model and predict forward. | ||
|
||
```python | ||
from triage.predictlist import Retrainer | ||
from triage import create_engine | ||
|
||
retrainer = Retrainer( | ||
    db_engine=create_engine(<your-db-info>), | ||
    project_path='/home/you/triage/project2', | ||
    model_group_id=36, | ||
) | ||
retrainer.retrain(prediction_date='2021-04-04') | ||
retrainer.predict(prediction_date='2021-04-04') | ||
|
||
``` | ||
|
||
## Output | ||
The retrained model is stored similarly to the matrices created during an Experiment: | ||
- Raw Matrix saved to the matrices directory in project storage | ||
- Raw Model saved to the trained_model directory in project storage | ||
- Retrained Model info saved in a table (triage_metadata.models) where model_comment = 'retrain_2021-04-04 21:19:09.975112' | ||
- Predictions saved in a table (triage_production.predictions) | ||
- Prediction metadata (tiebreaking, random seed) saved in a table (triage_production.prediction_metadata) | ||
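The list above says a retrained model can be recognised by its `model_comment` in `triage_metadata.models`. A minimal sketch of that lookup follows. It is self-contained, so it uses an in-memory SQLite database as a stand-in for the real Triage database; the helper name `find_models_by_comment`, the two-column table, and the hash values are assumptions made for illustration, not part of Triage.

```python
import sqlite3


def find_models_by_comment(conn, model_comment):
    """Return (model_hash, model_comment) rows whose comment matches exactly."""
    return conn.execute(
        "select model_hash, model_comment from triage_metadata.models "
        "where model_comment = ?",
        (model_comment,),
    ).fetchall()


# Stand-in database: an attached in-memory schema named like the real one.
# The real table has more columns; only the two used here are created.
conn = sqlite3.connect(":memory:")
conn.execute("attach database ':memory:' as triage_metadata")
conn.execute(
    "create table triage_metadata.models (model_hash text, model_comment text)"
)
conn.executemany(
    "insert into triage_metadata.models values (?, ?)",
    [
        ("abc123", None),  # hypothetical model trained by an Experiment
        ("def456", "retrain_2021-04-04 21:19:09.975112"),  # hypothetical retrained model
    ],
)

retrained = find_models_by_comment(conn, "retrain_2021-04-04 21:19:09.975112")
print(retrained)
```

Against a real project database the same query would go through your own database engine, and the placeholder syntax depends on the driver in use.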
|
||
|
||
# Predictlist | ||
If you would like to generate a list of predictions on an already-trained Triage model with new data, you can use the 'Predictlist' module. | ||
|
||
# Predict Forward with Existing Model | ||
Use an existing model object to generate predictions on new data. | ||
|
||
## Examples | ||
Both examples assume you have already run a Triage Experiment in the past, and know these two pieces of information: | ||
1. A `model_id` from a Triage model that you want to use to generate predictions | ||
2. An `as_of_date` to generate your predictions on. | ||
|
||
### CLI | ||
`triage predictlist <model_id> <as_of_date>` | ||
|
||
Example: | ||
`triage predictlist 46 2019-05-06` | ||
|
||
The `predictlist` command assumes the current directory is the 'project path' used to find models and write matrices; this can be overridden with the `--project-path` option. | ||
|
||
### Python | ||
|
||
The `predict_forward_with_existed_model` function from the `triage.predictlist` module can be used similarly to the CLI, with the addition of the database engine and project storage as inputs. | ||
```python
from triage.predictlist import predict_forward_with_existed_model | ||
from triage import create_engine | ||
|
||
predict_forward_with_existed_model( | ||
    db_engine=create_engine(<your-db-info>), | ||
    project_path='/home/you/triage/project2', | ||
    model_id=46, | ||
    as_of_date='2019-05-06' | ||
) | ||
``` | ||
|
||
## Output | ||
The Predictlist is stored similarly to the matrices created during an Experiment: | ||
- Raw Matrix saved to the matrices directory in project storage | ||
- Predictions saved in a table (triage_production.predictions) | ||
- Prediction metadata (tiebreaking, random seed) saved in a table (triage_production.prediction_metadata) | ||
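One way to sanity-check the predictions table after a run is the comparison used in this PR's tests: every row in `triage_production.predictions` should join to a cohort row on `(entity_id, as_of_date)`. The sketch below reproduces that check against an in-memory SQLite stand-in so that it runs on its own; the helper name `predictions_match_cohort`, the cohort table name `cohort_demo`, and the sample rows are assumptions for illustration.

```python
import sqlite3


def predictions_match_cohort(conn, cohort_table):
    """True when every prediction row joins to a cohort row on (entity_id, as_of_date)."""
    total = conn.execute(
        "select count(*) from triage_production.predictions"
    ).fetchone()[0]
    # cohort_table is interpolated as an identifier, so it must come from
    # trusted configuration, never from user input.
    matched = conn.execute(
        "select count(*) from triage_production.predictions "
        f"join triage_production.{cohort_table} using (entity_id, as_of_date)"
    ).fetchone()[0]
    return total == matched


# Stand-in database with only the columns the check needs.
conn = sqlite3.connect(":memory:")
conn.execute("attach database ':memory:' as triage_production")
conn.execute(
    "create table triage_production.predictions "
    "(entity_id integer, as_of_date text, score real)"
)
conn.execute(
    "create table triage_production.cohort_demo (entity_id integer, as_of_date text)"
)
conn.executemany(
    "insert into triage_production.cohort_demo values (?, ?)",
    [(1, "2019-05-06"), (2, "2019-05-06")],
)
conn.executemany(
    "insert into triage_production.predictions values (?, ?, ?)",
    [(1, "2019-05-06", 0.9), (2, "2019-05-06", 0.1)],
)

all_in_cohort = predictions_match_cohort(conn, "cohort_demo")

# A prediction for an entity outside the cohort makes the check fail.
conn.execute(
    "insert into triage_production.predictions values (3, '2019-05-06', 0.5)"
)
still_in_cohort = predictions_match_cohort(conn, "cohort_demo")
```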
|
||
## Notes | ||
- The cohort and features for the Predictlist are all inferred from the Experiment that trained the given model_id (as defined by the experiment_models table). | ||
- The feature list ensures that imputation flag columns are present for any columns that either needed to be imputed in the training process, or that needed to be imputed in the predictlist dataset. | ||
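The second note can be illustrated with a small sketch: start from the training matrix's columns and add a flag column for any feature that needs imputation only in the new data. This is not the code from this PR; the function name `with_imputation_flags` and the `_imp` flag suffix are assumptions chosen for the example.

```python
def with_imputation_flags(train_columns, newly_imputed, flag_suffix="_imp"):
    """Return train_columns plus a flag column for each newly imputed feature.

    Flags already present from training are kept once; flags needed only for
    the new data are appended in the order given.
    """
    columns = list(train_columns)
    present = set(columns)
    for feature in newly_imputed:
        flag = feature + flag_suffix
        if flag not in present:
            columns.append(flag)
            present.add(flag)
    return columns


# 'age' was imputed during training, so its flag already exists;
# 'income' only needs imputation in the new data.
train_columns = ["age", "age_imp", "income"]
columns = with_imputation_flags(train_columns, ["age", "income"])
print(columns)
```

Keeping the training-time flags and appending the new ones means a column the model saw during training is never dropped, even when no value needed imputation in the new data.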
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,110 @@ | ||
from triage.predictlist import Retrainer, predict_forward_with_existed_model, train_matrix_info_from_model_id, experiment_config_from_model_id | ||
from triage.validation_primitives import table_should_have_data | ||
|
||
|
||
def test_predict_forward_with_existed_model_should_write_predictions(finished_experiment): | ||
    # given a model id and as-of-date <= today | ||
    # and the model id is trained and is linked to an experiment with feature and cohort config | ||
    # generate records in triage_production.predictions | ||
    # the # of records should equal the size of the cohort for that date | ||
    model_id = 1 | ||
    as_of_date = '2014-01-01' | ||
    predict_forward_with_existed_model( | ||
        db_engine=finished_experiment.db_engine, | ||
        project_path=finished_experiment.project_storage.project_path, | ||
        model_id=model_id, | ||
        as_of_date=as_of_date | ||
    ) | ||
    table_should_have_data( | ||
        db_engine=finished_experiment.db_engine, | ||
        table_name="triage_production.predictions", | ||
    ) | ||
|
||
|
||
def test_predict_forward_with_existed_model_should_be_same_shape_as_cohort(finished_experiment): | ||
    model_id = 1 | ||
    as_of_date = '2014-01-01' | ||
    predict_forward_with_existed_model( | ||
        db_engine=finished_experiment.db_engine, | ||
        project_path=finished_experiment.project_storage.project_path, | ||
        model_id=model_id, | ||
        as_of_date=as_of_date) | ||
|
||
    num_records_matching_cohort = finished_experiment.db_engine.execute( | ||
        f'''select count(*) | ||
        from triage_production.predictions | ||
        join triage_production.cohort_{finished_experiment.config['cohort_config']['name']} using (entity_id, as_of_date) | ||
        ''' | ||
    ).first()[0] | ||
|
||
    num_records = finished_experiment.db_engine.execute( | ||
        'select count(*) from triage_production.predictions' | ||
    ).first()[0] | ||
    assert num_records_matching_cohort == num_records | ||
|
||
|
||
def test_predict_forward_with_existed_model_matrix_record_is_populated(finished_experiment): | ||
    model_id = 1 | ||
    as_of_date = '2014-01-01' | ||
    predict_forward_with_existed_model( | ||
        db_engine=finished_experiment.db_engine, | ||
        project_path=finished_experiment.project_storage.project_path, | ||
        model_id=model_id, | ||
        as_of_date=as_of_date) | ||
|
||
    matrix_records = list(finished_experiment.db_engine.execute( | ||
        "select * from triage_metadata.matrices where matrix_type = 'production'" | ||
    )) | ||
    assert len(matrix_records) == 1 | ||
|
||
|
||
def test_experiment_config_from_model_id(finished_experiment): | ||
    model_id = 1 | ||
    experiment_config = experiment_config_from_model_id(finished_experiment.db_engine, model_id) | ||
    assert experiment_config == finished_experiment.config | ||
|
||
|
||
def test_train_matrix_info_from_model_id(finished_experiment): | ||
    model_id = 1 | ||
    (train_matrix_uuid, matrix_metadata) = train_matrix_info_from_model_id(finished_experiment.db_engine, model_id) | ||
    assert train_matrix_uuid | ||
    assert matrix_metadata | ||
|
||
|
||
def test_retrain_should_write_model(finished_experiment): | ||
    # given a model group id and prediction_date | ||
    # and the model group is trained and is linked to an experiment with feature and cohort config | ||
    # create matrix for retraining a model | ||
    # generate records in production models | ||
    # retrain_model_hash should be the same as model_hash in triage_metadata.models | ||
    model_group_id = 1 | ||
    prediction_date = '2014-03-01' | ||
|
||
    retrainer = Retrainer( | ||
        db_engine=finished_experiment.db_engine, | ||
        project_path=finished_experiment.project_storage.project_path, | ||
        model_group_id=model_group_id, | ||
    ) | ||
    retrain_info = retrainer.retrain(prediction_date) | ||
    model_comment = retrain_info['retrain_model_comment'] | ||
|
||
    records = [ | ||
        row | ||
        for row in finished_experiment.db_engine.execute( | ||
            f"select model_hash from triage_metadata.models where model_comment = '{model_comment}'" | ||
        ) | ||
    ] | ||
    assert len(records) == 1 | ||
    assert retrainer.retrain_model_hash == records[0][0] | ||
|
||
    retrainer.predict(prediction_date) | ||
|
||
    table_should_have_data( | ||
        db_engine=finished_experiment.db_engine, | ||
        table_name="triage_production.predictions", | ||
    ) | ||
|
||
    matrix_records = list(finished_experiment.db_engine.execute( | ||
        f"select * from triage_metadata.matrices where matrix_uuid = '{retrainer.predict_matrix_uuid}'" | ||
    )) | ||
    assert len(matrix_records) == 1 |
Should this be `- Predictlist:`? (I can't remember if YAML cares about the whitespace between the `-` and the list item.)