
Risklist module for production #631

Merged: 55 commits into master from list_making on Aug 27, 2021

Conversation

@tweddielin (Contributor) commented Mar 13, 2019:

  • Created a triage.predictlist module for generating predictions in production.
  • predict_forward_with_existed_model() is a function that takes a model_id and an as_of_date and writes the resulting list to the predictions table in the triage_production schema.
  • A Retrainer class is created for the more general purpose of retraining a model for a given model_group_id, using all data up to the current date, and then predicting forward.
  • Retrainer.retrain() takes a today argument, since I don't want to confuse it with as_of_date in the training phase. The retrain method goes back one time split from today, so that the training as_of_date is today - training_label_timespan, to create the training matrix. It uses all the information from the experiment config of the model_group_id to retrain a model, saves the model in the <project_path>/trained_models directory with a new model_hash, and stores a new record (model_id) in triage_metadata.models with model_comment = retrain_<today's date> to differentiate it from other models with the same model_group_id.
  • The predict method also takes a today argument. Based on the experiment_config and the new model_id, it creates a test matrix up to today and uses the retrained model to make predictions. The results are stored in triage_production.predictions and triage_production.prediction_metadata. (A usage sketch follows.)
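
A minimal usage sketch of the two entry points described above (the keyword arguments and engine setup are assumptions based on this description, not the final API; note that today was later renamed to prediction_date during review):

from datetime import date

from sqlalchemy import create_engine

# Module path per the description above; exact signatures may differ in the merged code.
from triage.predictlist import predict_forward_with_existed_model, Retrainer

db_engine = create_engine("postgresql://...")

# Case 1: generate predictions from an existing model for a single date.
# Results land in triage_production.predictions.
predict_forward_with_existed_model(
    db_engine=db_engine,
    project_path="/path/to/project",
    model_id=42,
    as_of_date=date(2021, 8, 1),
)

# Case 2: retrain a model group on all data up to "today", then predict forward.
retrainer = Retrainer(
    db_engine=db_engine,
    project_path="/path/to/project",
    model_group_id=7,
)
retrainer.retrain(today=date(2021, 8, 1))
retrainer.predict(today=date(2021, 8, 1))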

@codecov-io commented Mar 13, 2019:

Codecov Report

Merging #631 into master will decrease coverage by 0.13%.
The diff coverage is 76.72%.


@@            Coverage Diff             @@
##           master     #631      +/-   ##
==========================================
- Coverage   82.99%   82.86%   -0.14%     
==========================================
  Files          87       90       +3     
  Lines        5763     5865     +102     
==========================================
+ Hits         4783     4860      +77     
- Misses        980     1005      +25
Impacted Files Coverage Δ
...alembic/versions/1b990cbc04e4_production_schema.py 0% <0%> (ø)
...64786a9fe85_add_label_value_to_prodcution_table.py 0% <0%> (ø)
...c/triage/component/architect/feature_generators.py 84.58% <100%> (+0.11%) ⬆️
src/triage/risklist/__init__.py 100% <100%> (ø)
src/triage/component/results_schema/schema.py 97.77% <100%> (+0.01%) ⬆️
src/triage/component/architect/builders.py 89.13% <84.61%> (+0.4%) ⬆️
src/triage/component/catwalk/storage.py 92.63% <88.88%> (-0.18%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

model_storage_engine=project_storage.model_storage_engine(),
model_id=model_id,
as_of_date=as_of_date)
table_should_have_data(
Contributor:

This simple assertion was a good start but we should go further. Does it make sense to make assertions about the size of the table? How about the contents? We'd expect all of these rows to have the same date/model id and stuff like that, right?
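
A sketch of the stronger assertions being suggested (table and column names follow this PR's triage_production.predictions table; db_engine, model_id, and as_of_date come from the test setup):

rows = db_engine.execute(
    "SELECT model_id, as_of_date FROM triage_production.predictions"
).fetchall()
assert len(rows) > 0                             # the table actually has data
assert {row[0] for row in rows} == {model_id}    # every row has the expected model id
assert {row[1] for row in rows} == {as_of_date}  # and the expected as-of date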

Contributor:

Also, if I recall correctly, the model_metadata.matrices table will also get a row since we used the MatrixBuilder; we should make sure that row looks reasonable.

Contributor:

class ProductionMatrixType(object):
    string_name = "production"
    prediction_obj = ListPrediction

Contributor:

Since we're introducing a new matrix type, I think is_test has to be set in some way. How does is_test get used for the other ones? Depending on how it's used, I could see either False or True making more sense for ProductionMatrixType. is_test might be a poor name, serving as a proxy for some other type of behavior, and we should rename it in a way that makes a value for ProductionMatrixType more obvious.

Author (@tweddielin):

I think is_test is only used in ModelEvaluator to decide whether it should use testing_metric_groups or training_metric_groups. For ProductionMatrixType, at least for generating a risk list, there's no evaluation involved. I'm not sure why it needs is_test at all, regardless of good or poor naming.

Contributor:

Hmmm... would it make sense to remove the is_test boolean entirely and simply check if matrix_type.string_name == 'test' in evaluation.py? As it is, it seems a little redundant with the string name, especially if it doesn't fit well with the production matrix type here.

Contributor:

I'm not fond of comparisons against magic strings. What about making it more specifically about the metric groups? Like what if the field was metric_config_key, and for train it was 'training_metric_groups', for test it was 'testing_metric_groups', and for production it was None?
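
A rough sketch of that proposal, modeled on the ProductionMatrixType class quoted above (the train/test class bodies and the evaluation.py lookup are illustrative, not the merged implementation):

class TrainMatrixType(object):
    string_name = "train"
    metric_config_key = "training_metric_groups"


class TestMatrixType(object):
    string_name = "test"
    metric_config_key = "testing_metric_groups"


class ProductionMatrixType(object):
    string_name = "production"
    metric_config_key = None  # no evaluation when generating a risk list


# In evaluation.py, the magic-string comparison then becomes a field lookup:
# metric_groups = config[matrix_type.metric_config_key] if matrix_type.metric_config_key else []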

Contributor:

Yeah, that's a fair point about avoiding magic strings -- having it keep track of the appropriate metric_groups types it needs to use makes sense to me if that's the only place is_test is getting used downstream regardless.

@ecsalomon (Contributor), Aug 1, 2021:

Sorry, just coming into this discussion quite late. It seems to me that the discussion here is exposing decisions we committed to early on and haven't tried to address.

I feel like we always saw the matrix_type as a bad pattern but implemented it for expediency to flag (poorly) the different behavior of label imputation in train/test for the inspection problem type. Is my memory accurate? There were conceptual issues with this all along because matrix_type was never a good name and it prevented us from reusing matrices from testing for training in the EWS problem type.

It also seems like which metrics to use is not properly a property of the matrix but of the task being performed. Perhaps we should be passing the metrics groups to the evaluator instead of encoding in the matrix what the evaluator should do to it.

I think maybe we can move forward with this as a way to get the feature we want now, but this discussion is making the illogic of these matrix attributes more apparent and we may have some deeper technical debt to service here rather than trying to make the current situation totally coherent.

@ecsalomon (Contributor), Aug 7, 2021:

I don't know if you all discussed this on Tuesday, but here are a couple of possible solutions for Triage v5.0.0. The first is probably easier to implement, but I think it is the less good solution:

  • Eliminate both the matrix_type and is_test properties of matrices
  • Include missing labels as missing in all matrices
  • Replace include_missing_labels_in_train_as with two keys: impute_missing_labels_at_train and impute_missing_labels_at_test. These can be True or False. If True, you must also specify the value. We already have lots of duplicated keys for train/test behavior, so this fits an existing configuration pattern. These can be set to defaults when there are inspection and EWS entry points
  • The Architect writes matrices without label imputation, allowing them to be reused for any purpose
  • Catwalk does the label imputation on the fly when it is training or testing and always writes the labels to the database. This has high storage costs, but the reason we had to decide on the label imputation at matrix-building time initially was because we had not yet conceived of the train_results schema. Now that Triage can write train predictions, it can write train labels to the database, and the matrix_type concept is no longer needed.
  • The ModelEvaluator takes only a single set of metrics, and Catwalk creates a separate ModelEvaluator object for training and testing.
  • Separate matrix storage, rather than separate matrix types, is introduced for Production and Development

An alternative:

  • Eliminate both the matrix_type and is_test properties of matrices
  • Replace include_missing_labels_in_train_as with two keys: impute_missing_labels_at_train and impute_missing_labels_at_test. These can be True or False. If True, you must also specify the value. We already have lots of duplicated keys for train/test behavior, so this fits an existing configuration pattern. These can be set to defaults when there are inspection and EWS entry points. (A config sketch follows this list.)
  • Instead of storing labels in the train/test matrices, we begin storing the design matrices and the labels separately. Train labels and test labels have separate stores, and the imputation behavior is a key in the label metadata. This reduces (but does not eliminate as in the first proposal) duplication of storage in the EWS case.
  • We introduce new metadata tables for Labels and experiment-label pairs, and add a labels_id column to many of the places where we also have a matrix_id column (models, predictions, evaluations, etc.). This increases db storage a little but not as much as having to store all the train labels in the db.
  • Include all entities in the cohort in all design matrices
  • The replace pathway checks the imputation flag on the labels to determine whether the Architect needs to create new labels
  • Design matrices are combined with the correct labels on reading, and rows are dropped if there is no label and the imputation key is False
  • The ModelEvaluator takes only a single set of metrics, and Catwalk creates a separate ModelEvaluator object for training and testing.
  • You can still skip writing train results to conserve db storage
  • Production/predict-forward/retrain-and-predict uses design matrices but not labels
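
Both proposals hinge on the same configuration change; a sketch of the hypothetical replacement keys, shown as the parsed label config dict (the *_value key name is invented here purely for illustration):

label_config = {
    "query": "...",  # the label query, unchanged
    # replaces include_missing_labels_in_train_as:
    "impute_missing_labels_at_train": True,
    "impute_missing_labels_at_train_value": False,  # required when the flag above is True
    "impute_missing_labels_at_test": False,  # missing-label rows are dropped at test time
}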

Contributor:

Moved the above comment to #855

imputation_table_tasks = OrderedDict()
with db_engine.begin() as conn:
    for aggregation in collate_aggregations:
        feature_prefix = aggregation.prefix
Contributor:

I think this 'reconstruct feature dictionary' section could use either some breaking up into smaller functions or some more comments. It seems like we're doing a few steps here, and the code is kind of dense


matrix_metadata = {
    'as_of_times': [as_of_date],
    'matrix_id': str(as_of_date) + '_prediction',
Contributor:

How about we change 'prediction' to 'risklist' and include the model id in this 'matrix_id'?

replace=True,
)

matrix_metadata = {
Contributor:

In general, I think we should re-do the matrix metadata creation a little bit. It should be easier to create than this. You can lean on Planner._make_metadata, although it should be made public since we're using it. In fact, we should potentially make it its own function!

If you look at what's in the usual metadata, you'll see that the metadata we've assembled for these production matrices doesn't really fulfill the pattern we've been establishing with the other matrix types, which is meant for people looking at the metadata later to get useful information out of. For instance, 'end_time', 'cohort_name', and 'feature_names', should totally be in there, and it's easier to put those in (among other keys like indices and state that have defaults in Planner) if we lean on the code that already exists.

So, to start, I'd look at fulfilling the contract that Timechop creates with its temporal metadata. See timechop/timechop.py, 'define_train_matrix' and 'define_test_matrices'. This code should create temporal metadata that fits that template somehow which can then be fed into Planner.make_metadata along with the non-temporal information.
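
For concreteness, a sketch of the fuller metadata dict being described (keys drawn from this comment and from the temporal metadata Timechop produces; values here are hypothetical placeholders):

from datetime import date

model_id = 42  # hypothetical
as_of_date = date(2021, 8, 1)
cohort_config = {"name": "active_entities", "query": "..."}  # hypothetical
feature_names = ["feature_a", "feature_b"]  # hypothetical

matrix_metadata = {
    # temporal keys, shaped like Timechop's define_test_matrices output:
    "as_of_times": [as_of_date],
    "end_time": as_of_date,
    # non-temporal keys, with defaults that Planner already supplies:
    "matrix_id": f"{model_id}_{as_of_date}_risklist",  # per the naming suggestion above
    "matrix_type": "production",
    "cohort_name": cohort_config["name"],
    "feature_names": feature_names,
    "indices": ["entity_id", "as_of_date"],
    "state": cohort_config["query"],
}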

@thcrock (Contributor) commented Mar 13, 2019:

Oh, in addition to the changes to the core stuff up above, I think this PR should also see:

  • CLI integration, including a test. See src/tests/test_cli.py for how we do tests of the CLI
  • Its own page on the documentation site explaining how to use it

@jesteria (Member) left a comment:

I'm with @thcrock, just a couple little things to add.



class ProductionMatrixType(object):
    string_name = "production"
Member:

verbose_name or matrix_name might be more exact

join model_metadata.models on (models.train_matrix_uuid = matrices.matrix_uuid)
where model_id = %s
"""
results = list(db_engine.execute(get_experiment_query, model_id))
Member:

You might just want to use first() here:

Suggested change
results = list(db_engine.execute(get_experiment_query, model_id))
result = db_engine.execute(get_experiment_query, model_id).first()

src/triage/risklist/__init__.py:
with db_engine.begin() as conn:
    for aggregation in collate_aggregations:
        feature_prefix = aggregation.prefix
        feature_group = aggregation.get_table_name(imputed=True).split('.')[1].replace('"', '')
Member:

we have a helper for this, no? seems icky 😸

Contributor:

This section will have to be redone once #608 is merged so I think its implementation will change. Or maybe #608 will have to be changed once this is merged!

Contributor:

Revisiting this: since we are prioritizing this before #608, we probably shouldn't spend too much time revising this section, as it will change. We may still break things out into functions for clarity and make the intention clearer with comments.

@ecsalomon (Contributor):

Question: Can the list respect the subset? i.e., if I pass a subset query (or queries), make a list (or lists) of the top K in the subset(s).

@thcrock (Contributor) commented Mar 15, 2019:

Regarding subsets: We haven't actually scoped any top-k filtering in this PR, which I assume is where subsets would be more important to take into account. This just computes predictions for the cohort, so you could subset afterwards by joining those tables. We figured in a later pass we could create some utilities to filter this list (e.g. top k, subsets).
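
Until those utilities exist, a post-hoc join along these lines would pull the top k within a subset (my_subset is a hypothetical table of entity_id/as_of_date pairs, and column names are assumed to mirror the existing predictions tables):

# db_engine is a SQLAlchemy engine connected to the triage database.
top_k_in_subset = """
    SELECT p.entity_id, p.as_of_date, p.score
    FROM triage_production.predictions p
    JOIN my_subset s USING (entity_id, as_of_date)
    WHERE p.model_id = %(model_id)s
    ORDER BY p.score DESC
    LIMIT %(k)s
"""
rows = db_engine.execute(top_k_in_subset, {"model_id": 42, "k": 100}).fetchall()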

sa.Column("rank_pct", sa.Float(), nullable=True),
sa.Column("matrix_uuid", sa.Text(), nullable=True),
sa.Column("test_label_window", sa.Interval(), nullable=True),
sa.ForeignKeyConstraint(["model_id"], ["results.models.model_id"]),
@thcrock (Contributor), Apr 4, 2019:

This schema should be model_metadata (Alena hit this error)

Contributor:

I actually ended up doing this when I was fixing conflicts (I had to fix the Alembic revisions to get the tests working after conflict fixing)

Returns: (dict) a dictionary of all information needed for making the risk list

"""
get_experiment_query = """
Contributor:

I'm wondering if this makes sense to rewrite as an ORM query, if for no other reason than to be resilient to the frequent schema changes and table-name changes that Triage ends up having.

@shaycrk (Contributor):
Hey @tweddielin -- just wanted to quickly check back in on this PR. Let me know if I can help out at all!

@tweddielin (Author):
@shaycrk Yeah, can you take a look at the new commits? Mostly about the table changes.

  • I added Retrain and RetrainModel to the triage_metadata schema, similar to Experiment and ExperimentModel in terms of the relationship.
  • I also added run_type and retrain_hash to ExperimentRun. (I haven't renamed it to TriageRun yet. Since that will affect a lot of code, I wonder if we want to do it in another branch, or in the next commit after you have taken a look at the changes.)
  • built_by_retrain and built_in_triage_run are added to Model.
  • Behavior-wise, when Retrainer is run, a new run_id is added to ExperimentRun (to be renamed TriageRun later) with a valid retrain_hash and a null experiment_hash. Some relevant information is added to the Retrain table as well. The newly created model is added to Model with a valid built_by_retrain (retrain_hash) and built_in_triage_run (the run_id of the TriageRun). (A schema sketch follows.)
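
Roughly, the new objects described above (a hedged SQLAlchemy sketch based on this comment; the retrain_models table name and the column types are assumptions, not the exact merged schema):

from sqlalchemy import Column, DateTime, Text
from sqlalchemy.dialects.postgresql import JSONB
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()


class Retrain(Base):
    """One row per retrain, keyed by hash -- the analogue of Experiment."""
    __tablename__ = "retrain"
    __table_args__ = {"schema": "triage_metadata"}
    retrain_hash = Column(Text, primary_key=True)
    config = Column(JSONB)
    prediction_date = Column(DateTime)


class RetrainModel(Base):
    """Links retrains to the models they built -- the analogue of ExperimentModel."""
    __tablename__ = "retrain_models"
    __table_args__ = {"schema": "triage_metadata"}
    retrain_hash = Column(Text, primary_key=True)
    model_hash = Column(Text, primary_key=True)


# ExperimentRun additionally gains run_type and retrain_hash columns, and
# Model gains built_by_retrain and built_in_triage_run.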

1. A `model_id` from a Triage model that you want to use to generate predictions
2. An `as_of_date` to generate your predictions on.
1. A `model_group_id` from a Triage model group that you want to use to retrain a model and generate prediction
2. A `today` to generate your predictions on.
Contributor:

I think we changed the argument to prediction_date rather than today, didn't we?

Comment on lines +326 to +329
self.retrain_hash = save_retrain_and_get_hash(retrain_config, self.db_engine)

with get_for_update(self.db_engine, Retrain, self.retrain_hash) as retrain:
    retrain.prediction_date = prediction_date
Contributor:

Is the idea here that we don't want the prediction date to be part of the retrain hash? But then if we re-run the same retrain and predict config with a different prediction date, won't we end up overwriting the date in the table by doing this?

Author (@tweddielin):

I think the prediction date is part of the retrain hash. The retrain_config has the prediction date and the retrain hash is generated by save_retrain_and_get_hash.

Comment on lines 214 to 215
self.training_label_timespan = self.experiment_config['temporal_config']['training_label_timespans'][0]
self.test_label_timespan = self.experiment_config['temporal_config']['test_label_timespans'][0]
Contributor:

I think these need to be determined with respect to the specific model group, not just the first entry in the experiment config (that might not be the one that was actually used for this model group). Likewise for the max_training_history

)

def get_temporal_config_for_retrain(self, prediction_date):
Contributor:

since we're only predicting for a specific date, should probably explicitly set the test_duration to 0day?

@shaycrk (Contributor) commented Jul 31, 2021:

Thanks @tweddielin!

I took a look through the new commits and added a few inline comments about the temporal config, but that's definitely getting a lot closer.

A few overall thoughts on the database changes:

  • I think the plan was to remove the built_by_experiment column from the models table (and move this deprecated data to a separate table in order to avoid data loss in the migration) since this ends up being redundant with the link out to the run_id that built the model?
  • Likewise, I don't think we need to add a built_by_retrain column since that should also be available from joining to the runs table
  • In the runs table itself, I think it might be better to have a run_type_hash (or something similar) rather than separate experiment_hash and retrain_hash columns to provide more flexibility if we add other types down the road. I don't have strong feelings here, though, so curious on your thoughts (also maybe @thcrock, @ecsalomon, and @nanounanue have opinions?)
  • I would vote for going ahead and doing the rename from experiment_runs to triage_runs just to sort of rip the band-aid off all at once here, but curious if others have opinions on that as well?

@ecsalomon (Contributor):

Thoughts on things I was actually tagged for (as opposed to reviving long dead conversations):

  • I agree with @tweddielin that the renaming should be a separate PR. This one is already quite big
  • I think I agree with @shaycrk about the columns for runs. I also have a vague sense that we might not really need so many new categories/tables/columns in general -- i.e., maybe we drop the experiment label from triage runs/results. Are the attributes of runs, models, (matrices...,) etc. different across experiments and retrains (and potentially predicts without retraining)? Granted, I haven't read the code too in depth, so I could be missing something.

Comment on lines 22 to 23
op.add_column('experiment_runs', sa.Column('run_hash', sa.Text(), nullable=True), schema='triage_metadata')
op.drop_column('experiment_runs', 'experiment_hash', schema='triage_metadata')
@shaycrk (Contributor), Aug 23, 2021:

Here should we just rename the experiment_hash column to run_hash to avoid losing any existing data in the experiment hash column during the migration? Likewise, would be good to set the run_type for any existing rows as well via an UPDATE statement:

op.execute("UPATE triage_metadata.experiment_runs SET run_type='experiment'")

schema='triage_metadata',
)

op.alter_column('models', 'built_in_experiment_run', nullable=False, new_column_name='built_in_triage_run', schema='triage_metadata')
Contributor:

maybe add here (to avoid any potential for data loss):

op.execute("CREATE TABLE triage_metadata.deprecated_models_built_by_experiment AS SELECT model_id, model_hash, built_by_experiment FROM triage_metadata.models")

def downgrade():
    op.execute("ALTER TABLE triage_metadata.triage_runs RENAME TO experiment_runs")
    op.drop_column('experiment_runs', 'run_type', schema='triage_metadata')
    op.drop_column('experiment_runs', 'run_hash', schema='triage_metadata')
Contributor:

as above, probably rename this back to experiment_hash rather than dropping the column

get_experiment_query = '''select experiments.config
from triage_metadata.triage_runs
join triage_metadata.models on (triage_runs.id = models.built_in_triage_run)
join triage_metadata.experiments on (experiments.experiment_hash = triage_runs.run_hash)
Contributor:

probably shouldn't expect collisions, but it might be good to be explicit here, e.g.

join triage_metadata.experiments on (experiments.experiment_hash = triage_runs.run_hash and triage_runs.run_type='experiment')

join triage_metadata.models
on (triage_runs.id = models.built_in_triage_run)
join triage_metadata.experiments
on (experiments.experiment_hash = triage_runs.run_hash)
Contributor:

As above, might be good to add and triage_runs.run_type='experiment' to the join condition

query = """
SELECT re.config FROM triage_metadata.models m
LEFT JOIN triage_metadata.triage_runs r ON m.built_in_triage_run = r.id
LEFT JOIN triage_metadata.retrain re on re.retrain_hash = r.run_hash
Contributor:

Here, maybe add and triage_runs.run_type='retrain' to the join

@shaycrk (Contributor) commented Aug 23, 2021:

Thanks @tweddielin -- saw you had added a couple of commits so I took a quick pass. It looks pretty good to me overall, just a couple of small inline things and then I think we should be in a good place to merge!

@shaycrk (Contributor) left a comment:

Went ahead and made the last few changes, which were pretty minor. I think the logic is generally in good shape and the tests are passing, so will go ahead and merge to master now.

@shaycrk merged commit 537813a into master on Aug 27, 2021
@shaycrk deleted the list_making branch on August 27, 2021 at 03:39
@tweddielin (Author):

@shaycrk sorry just saw this now!! but thanks for making those changes!
