Risklist module for production #631
Conversation
Codecov Report
@@ Coverage Diff @@
## master #631 +/- ##
==========================================
- Coverage 82.99% 82.86% -0.14%
==========================================
Files 87 90 +3
Lines 5763 5865 +102
==========================================
+ Hits 4783 4860 +77
- Misses 980 1005 +25
Continue to review full report at Codecov.
src/tests/test_risklist.py
Outdated
    model_storage_engine=project_storage.model_storage_engine(),
    model_id=model_id,
    as_of_date=as_of_date)
table_should_have_data(
This simple assertion was a good start but we should go further. Does it make sense to make assertions about the size of the table? How about the contents? We'd expect all of these rows to have the same date/model id and stuff like that, right?
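For instance, something along these lines (a rough sketch of the kind of content assertions meant here; the table and column names are assumptions based on this PR, and `db_engine`, `model_id`, and `as_of_date` come from the existing test setup):

```python
# Sketch of additional content assertions (table/column names are assumptions
# based on this PR, and db_engine/model_id/as_of_date come from the test setup)
rows = list(db_engine.execute(
    "select model_id, as_of_date from triage_production.predictions"
))
assert len(rows) > 0
# every row should refer to the model and date we generated the list for
assert all(row["model_id"] == model_id for row in rows)
assert all(row["as_of_date"] == as_of_date for row in rows)
```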
Also, if I recall correctly, the model_metadata.matrices table will also get a row since we used the MatrixBuilder; we should make sure that row looks reasonable.
class ProductionMatrixType(object):
    string_name = "production"
    prediction_obj = ListPrediction
Since we're introducing a new matrix type, I think `is_test` has to be set in some way. How does `is_test` get used for the other ones? Depending on how it's used, I could see either False or True making more sense for `ProductionMatrixType`. `is_test` might be a poor name, serving as a proxy for some other type of behavior, and we should rename it in a way that makes a value for `ProductionMatrixType` more obvious.
I think `is_test` is only used in `ModelEvaluator` to decide if it should use `testing_metric_groups` or `training_metric_groups`. For `ProductionMatrixType`, at least for generating a risk list, there's no evaluation involved. I'm not sure why it needs `is_test`, regardless of good or poor naming.
Hmmm... would it make sense to remove the `is_test` boolean entirely and simply check if `matrix_type.string_name == 'test'` in evaluation.py? As it is, it seems a little redundant with the string name, especially if it doesn't fit well with the production matrix type here.
I'm not fond of comparisons against magic strings. What about making it more specifically about the metric groups? Like what if the field was metric_config_key, and for train it was 'training_metric_groups', for test it was 'testing_metric_groups', and for production it was None?
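Something like this, for concreteness (a hypothetical sketch of that proposal; these `metric_config_key` attributes don't exist in Triage today and the classes are shown only for illustration):

```python
# Hypothetical sketch of the metric_config_key proposal; these attributes do
# not exist in Triage today and the classes are shown only for illustration.
class TrainMatrixType(object):
    string_name = "train"
    metric_config_key = "training_metric_groups"


class TestMatrixType(object):
    string_name = "test"
    metric_config_key = "testing_metric_groups"


class ProductionMatrixType(object):
    string_name = "production"
    metric_config_key = None  # no evaluation is run for production matrices
```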
Yeah, that's a fair point about avoiding magic strings -- having it keep track of the appropriate `metric_groups` types it needs to use makes sense to me if that's the only place `is_test` is getting used downstream regardless.
Sorry, just coming into this discussion quite late. It seems to me that the discussion here is exposing decisions we committed to early on and haven't tried to address.
I feel like we always saw the `matrix_type` as a bad pattern but implemented it for expediency to flag (poorly) the different behavior of label imputation in train/test for the inspection problem type. Is my memory accurate? There were conceptual issues with this all along because `matrix_type` was never a good name and it prevented us from reusing matrices from testing for training in the EWS problem type.
It also seems like which metrics to use is not properly a property of the matrix but of the task being performed. Perhaps we should be passing the metrics groups to the evaluator instead of encoding in the matrix what the evaluator should do to it.
I think maybe we can move forward with this as a way to get the feature we want now, but this discussion is making the illogic of these matrix attributes more apparent and we may have some deeper technical debt to service here rather than trying to make the current situation totally coherent.
I don't know if you all discussed this on Tuesday, but here are a couple of possible solutions for Triage v 5.0.0. The first is probably easier to implement, but I think it is the less good solution:
- Eliminate both the `matrix_type` and `is_test` properties of matrices
- Include missing labels as missing in all matrices
- Replace `include_missing_labels_in_train_as` with two keys: `impute_missing_labels_at_train` and `impute_missing_labels_at_test`. These can be `True` or `False`. If `True`, you must also specify the value. We already have lots of duplicated keys for train/test behavior, so this fits an existing configuration pattern. These can be set to defaults when there are inspection and EWS entry points (see the config sketch after these lists)
- The Architect writes matrices without label imputation, allowing them to be reused for any purpose
- Catwalk does the label imputation on the fly when it is training or testing and always writes the labels to the database. This has high storage costs, but the reason we had to decide on the label imputation at matrix-building time initially was because we had not yet conceived of the `train_results` schema. Now that Triage can write train predictions, it can write train labels to the database, and the `matrix_type` concept is no longer needed.
- The ModelEvaluator takes only a single set of metrics, and Catwalk creates a separate ModelEvaluator object for training and testing.
- Separate matrix storage, rather than separate matrix types, is introduced for Production and Development

An alternative:

- Eliminate both the `matrix_type` and `is_test` properties of matrices
- Replace `include_missing_labels_in_train_as` with two keys: `impute_missing_labels_at_train` and `impute_missing_labels_at_test`. These can be `True` or `False`. If `True`, you must also specify the value. We already have lots of duplicated keys for train/test behavior, so this fits an existing configuration pattern. These can be set to defaults when there are inspection and EWS entry points
- Instead of storing labels in the train/test matrices, we begin storing the design matrices and the labels separately. Train labels and test labels have separate stores, and the imputation behavior is a key in the label metadata. This reduces (but does not eliminate, as in the first proposal) duplication of storage in the EWS case.
- We introduce new metadata tables for Labels and experiment-label pairs, and add a `labels_id` column to many of the places where we also have a `matrix_id` column (models, predictions, evaluations, etc.). This increases db storage a little, but not as much as having to store all the train labels in the db.
- Include all entities in the cohort in all design matrices
- The `replace` pathway checks the imputation flag on the labels to determine whether the Architect needs to create new labels
- Design matrices are combined with the correct labels on reading, and rows are dropped if there is no label and the imputation key is `False`
- The ModelEvaluator takes only a single set of metrics, and Catwalk creates a separate ModelEvaluator object for training and testing.
- You can still skip writing train results to conserve db storage
- Production/predict-forward/retrain-and-predict uses design matrices but not labels
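To make the proposed label-imputation keys concrete, the configuration might look roughly like this (a hypothetical sketch expressed as a Python dict; none of these keys exist in Triage today, and the `*_value` key name is invented here since the comment only says a value must be supplied when the flag is `True`):

```python
# Hypothetical config sketch -- none of these keys exist in Triage today. The
# key names come from the proposal above, except the *_value key, which is
# invented here just to show where the imputation value would be supplied.
label_config = {
    "impute_missing_labels_at_train": True,
    "impute_missing_labels_at_train_value": 0,
    "impute_missing_labels_at_test": False,  # unlabeled rows would be dropped
}
```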
Moved the above comment to #855
src/triage/risklist/__init__.py
Outdated
imputation_table_tasks = OrderedDict()
with db_engine.begin() as conn:
    for aggregation in collate_aggregations:
        feature_prefix = aggregation.prefix
I think this 'reconstruct feature dictionary' section could use either some breaking up into smaller functions or some more comments. It seems like we're doing a few steps here, and the code is kind of dense
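One possible shape for that refactor (purely an illustrative sketch; the function name is made up and the `information_schema` lookup is just one way of rebuilding the feature dictionary from the aggregations):

```python
# Purely illustrative sketch of one way to break this section up; the function
# name is made up and the information_schema lookup is just one possible
# implementation of "rebuild the feature dictionary from the aggregations".
def reconstruct_feature_dict(collate_aggregations, db_engine):
    """Rebuild {imputed feature table -> feature column names}."""
    feature_dict = {}
    for aggregation in collate_aggregations:
        # e.g. '"features"."prefix_aggregation_imputed"' -> (schema, table)
        schema, table = (
            aggregation.get_table_name(imputed=True).replace('"', "").split(".")
        )
        feature_dict[table] = [
            row["column_name"]
            for row in db_engine.execute(
                """select column_name from information_schema.columns
                   where table_schema = %s and table_name = %s""",
                schema, table,
            )
            if row["column_name"] not in ("entity_id", "as_of_date")
        ]
    return feature_dict
```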
src/triage/risklist/__init__.py
Outdated
matrix_metadata = {
    'as_of_times': [as_of_date],
    'matrix_id': str(as_of_date) + '_prediction',
How about we change 'prediction' to 'risklist' and include the model id in this 'matrix_id'?
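e.g. (a tiny sketch of the suggested format; the ordering and separator are guesses):

```python
# Sketch of the suggested matrix_id format (ordering and separator are guesses)
model_id = 123
as_of_date = "2021-06-01"
matrix_id = f"{model_id}_{as_of_date}_risklist"  # -> "123_2021-06-01_risklist"
```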
src/triage/risklist/__init__.py
Outdated
    replace=True,
)

matrix_metadata = {
In general, I think we should re-do the matrix metadata creation a little bit. It should be easier to create than this. You can lean on Planner._make_metadata, although it should be made public since we're using it. In fact, we should potentially make it its own function!
If you look at what's in the usual metadata, you'll see that the metadata we've assembled for these production matrices doesn't really fulfill the pattern we've been establishing with the other matrix types, which is meant for people looking at the metadata later to get useful information out of it. For instance, 'end_time', 'cohort_name', and 'feature_names' should totally be in there, and it's easier to put those in (among other keys like `indices` and `state` that have defaults in Planner) if we lean on the code that already exists.
So, to start, I'd look at fulfilling the contract that Timechop creates with its temporal metadata. See timechop/timechop.py, 'define_train_matrix' and 'define_test_matrices'. This code should create temporal metadata that fits that template somehow which can then be fed into Planner.make_metadata along with the non-temporal information.
Oh, in addition to the changes to the core stuff up above, I think this PR should also see:
I'm with @thcrock, just a couple little things to add.
class ProductionMatrixType(object):
    string_name = "production"
`verbose_name` or `matrix_name` might be more exact.
src/triage/risklist/__init__.py
Outdated
    join model_metadata.models on (models.train_matrix_uuid = matrices.matrix_uuid)
    where model_id = %s
"""
results = list(db_engine.execute(get_experiment_query, model_id))
You might just want to use `first()` here:
- results = list(db_engine.execute(get_experiment_query, model_id))
+ result = db_engine.execute(get_experiment_query, model_id).first()
src/triage/risklist/__init__.py
Outdated
with db_engine.begin() as conn:
    for aggregation in collate_aggregations:
        feature_prefix = aggregation.prefix
        feature_group = aggregation.get_table_name(imputed=True).split('.')[1].replace('"', '')
we have a helper for this, no? seems icky 😸
Revisiting this; since we are prioritizing this before 608, we probably shouldn't spend too much time revising this as it will change. We may still break things out into functions for clarity, and make intention more clear with comments.
Question: Can the list respect the subset? i.e., if I pass a subset query (or queries), make a list (or lists) of the top K in the subset(s).
Regarding subsets: We haven't actually scoped any top-k filtering in this PR, which I assume is where subsets would be more important to take into account. This just computes predictions for the cohort, so you could subset afterwards by joining those tables. We figured in a later pass we could create some utilities to filter this list (e.g. top k, subsets).
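For example, post-hoc filtering might look roughly like this (a sketch; `my_subset_table`, `k`, and the `score` column are assumptions, not anything this PR creates):

```python
# Sketch of post-hoc filtering by joining against a user-provided subset table;
# my_subset_table, k and the score column are assumptions, not part of this PR.
top_k_in_subset = list(db_engine.execute(
    """
    select p.entity_id, p.as_of_date, p.score
    from triage_production.predictions p
    join my_subset_table s using (entity_id)
    where p.model_id = %s and p.as_of_date = %s
    order by p.score desc
    limit %s
    """,
    model_id, as_of_date, k,
))
```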
sa.Column("rank_pct", sa.Float(), nullable=True), | ||
sa.Column("matrix_uuid", sa.Text(), nullable=True), | ||
sa.Column("test_label_window", sa.Interval(), nullable=True), | ||
sa.ForeignKeyConstraint(["model_id"], ["results.models.model_id"]), |
This schema should be model_metadata (Alena hit this error)
I actually ended up doing this when I was fixing conflicts (I had to fix the Alembic revisions to get the tests working after conflict fixing)
src/triage/risklist/__init__.py
Outdated
    Returns: (dict) a dictionary of all information needed for making the risk list
"""
get_experiment_query = """
I'm wondering if this makes sense to rewrite as an ORM query, if for no other reason than to be resilient to the frequent schema and table-name changes that Triage ends up having.
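A rough idea of what that might look like (assumes the declarative classes `Matrix` and `Model` in `triage.component.results_schema` expose these attributes; not checked against the current schema):

```python
# Sketch of an ORM version of the lookup; assumes the declarative classes
# Matrix and Model in triage.component.results_schema expose these attributes.
from sqlalchemy.orm import sessionmaker
from triage.component.results_schema import Matrix, Model

session = sessionmaker(bind=db_engine)()
try:
    result = (
        session.query(Matrix, Model)
        .join(Model, Model.train_matrix_uuid == Matrix.matrix_uuid)
        .filter(Model.model_id == model_id)
        .first()
    )
finally:
    session.close()
```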
@shaycrk Yeah, can you take a look at the new commits? Mostly about the table changes.
docs/sources/predictlist/index.md
Outdated
1. A `model_id` from a Triage model that you want to use to generate predictions
2. An `as_of_date` to generate your predictions on.
1. A `model_group_id` from a Triage model group that you want to use to retrain a model and generate prediction
2. A `today` to generate your predictions on.
I think we changed the argument to `prediction_date` rather than `today`, didn't we?
self.retrain_hash = save_retrain_and_get_hash(retrain_config, self.db_engine)

with get_for_update(self.db_engine, Retrain, self.retrain_hash) as retrain:
    retrain.prediction_date = prediction_date
Is the idea here that we don't want the prediction date to be part of the retrain hash? But then if we re-run the same retrain and predict config with a different prediction date, won't we end up overwriting the date in the table by doing this?
I think the prediction date is part of the retrain hash. The retrain_config has the prediction date, and the retrain hash is generated by `save_retrain_and_get_hash`.
src/triage/predictlist/__init__.py
Outdated
self.training_label_timespan = self.experiment_config['temporal_config']['training_label_timespans'][0]
self.test_label_timespan = self.experiment_config['temporal_config']['test_label_timespans'][0]
I think these need to be determined with respect to the specific model group, not just the first entry in the experiment config (that might not be the one that was actually used for this model group). Likewise for the `max_training_history`.
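e.g., something along these lines rather than indexing `[0]` into the config (just a sketch; the `labeling_window`, `train_end_time`, and schema names are assumptions that haven't been verified here):

```python
# Sketch: look up the label timespan actually used by the most recent model in
# this model group rather than taking temporal_config[...][0]. The schema and
# column names (e.g. labeling_window, train_end_time) are assumptions here.
row = db_engine.execute(
    """
    select matrices.labeling_window
    from triage_metadata.models
    join triage_metadata.matrices
      on (models.train_matrix_uuid = matrices.matrix_uuid)
    where models.model_group_id = %s
    order by models.train_end_time desc
    limit 1
    """,
    model_group_id,
).first()
training_label_timespan = row["labeling_window"] if row else None
```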
)

def get_temporal_config_for_retrain(self, prediction_date):
Since we're only predicting for a specific date, should we explicitly set the `test_duration` to `0day`?
Thanks @tweddielin! I took a look through the new commits and added a few inline comments about the temporal config, but that's definitely getting a lot closer. A few overall thoughts on the database changes:
Thoughts on things I was actually tagged for (as opposed to reviving long-dead conversations):
op.add_column('experiment_runs', sa.Column('run_hash', sa.Text(), nullable=True), schema='triage_metadata')
op.drop_column('experiment_runs', 'experiment_hash', schema='triage_metadata')
Here, should we just rename the `experiment_hash` column to `run_hash` to avoid losing any existing data in the experiment hash column during the migration? Likewise, it would be good to set the `run_type` for any existing rows as well via an `UPDATE` statement:
op.execute("UPDATE triage_metadata.experiment_runs SET run_type='experiment'")
    schema='triage_metadata',
)

op.alter_column('models', 'built_in_experiment_run', nullable=False, new_column_name='built_in_triage_run', schema='triage_metadata')
maybe add here (to avoid any potential for data loss):
op.execute("CREATE TABLE triage_metadata.deprecated_models_built_by_experiment AS SELECT model_id, model_hash, built_by_experiment FROM triage_metadata.models")
def downgrade():
    op.execute("ALTER TABLE triage_metadata.triage_runs RENAME TO experiment_runs")
    op.drop_column('experiment_runs', 'run_type', schema='triage_metadata')
    op.drop_column('experiment_runs', 'run_hash', schema='triage_metadata')
As above, probably rename this back to `experiment_hash` rather than dropping the column.
src/triage/predictlist/utils.py
Outdated
get_experiment_query = '''select experiments.config
from triage_metadata.triage_runs
join triage_metadata.models on (triage_runs.id = models.built_in_triage_run)
join triage_metadata.experiments on (experiments.experiment_hash = triage_runs.run_hash)
probably shouldn't expect collisions, but it might be good to be explicit here, e.g.
join triage_metadata.experiments on (experiments.experiment_hash = triage_runs.run_hash and triage_runs.run_type='experiment')
src/triage/predictlist/utils.py
Outdated
join triage_metadata.models
    on (triage_runs.id = models.built_in_triage_run)
join triage_metadata.experiments
    on (experiments.experiment_hash = triage_runs.run_hash)
As above, might be good to add `and triage_runs.run_type='experiment'` to the join condition.
src/triage/predictlist/utils.py
Outdated
query = """ | ||
SELECT re.config FROM triage_metadata.models m | ||
LEFT JOIN triage_metadata.triage_runs r ON m.built_in_triage_run = r.id | ||
LEFT JOIN triage_metadata.retrain re on re.retrain_hash = r.run_hash |
Here, maybe add `and triage_runs.run_type='retrain'` to the join.
Thanks @tweddielin -- saw you had added a couple of commits so I took a quick pass. It looks pretty good to me overall, just a couple of small inline things and then I think we should be in a good place to merge!
Went ahead and made the last few changes, which were pretty minor. I think the logic is generally in good shape and the tests are passing, so will go ahead and merge to master now.
@shaycrk sorry just saw this now!! but thanks for making those changes!
- `triage.predictlist` module for generating the prediction in production.
- `predict_forward_with_existed_model()` is a function which takes in a `model_id` and an `as_of_date`, and it will put the list in the `predictions` table in the `triage_production` schema.
- The `Retrainer` class is created for a more general purpose: retraining a model given a `model_group_id` using all data up to the current date and then predicting forward.
- `Retrainer.retrain()` takes in a `today` argument, since I don't want to confuse it with `as_of_date` in the training phase. The `retrain` method will go back one time split from `today`, where the `as_of_date` is `today` - `training_label_timespan`, to create a matrix for training. It will use all the info from the experiment config of the `model_group_id` to retrain a model, save the model in the `<project_path>/trained_models` directory with a new `model_hash`, and store a new record (`model_id`) in `triage_metadata.models` with `model_comment` = `retrain_<today's date>` to differentiate it from other models with the same `model_group_id`.
- The `predict` method also takes in a `today` argument. Based on the experiment_config and the new `model_id`, it will create a test matrix up to `today` and use the retrained model to make predictions. The results are stored in `triage_production.predictions` and `triage_production.prediction_metadata`.
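A rough usage sketch based on this summary (the `db_engine`/`project_path` arguments and the exact signatures are assumptions not spelled out above, and the review comments suggest the `today` argument later became `prediction_date`):

```python
# Rough usage sketch based on the summary above. The db_engine / project_path
# arguments and the exact signatures are assumptions not spelled out in this
# summary, and review comments suggest `today` was renamed to `prediction_date`.
from sqlalchemy import create_engine
from triage.predictlist import Retrainer, predict_forward_with_existed_model

db_engine = create_engine("postgresql://user:pass@host/dbname")
project_path = "/path/to/triage/project"

# Predict forward with an existing model for a single as_of_date
predict_forward_with_existed_model(
    db_engine, project_path, model_id=123, as_of_date="2021-06-01"
)

# Retrain a model group on all data up to the given date, then predict forward
retrainer = Retrainer(db_engine, project_path, model_group_id=5)
retrainer.retrain("2021-06-01")
retrainer.predict("2021-06-01")
```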