Integrate new sale outlier_reason column into ingest and analysis
#423
Conversation
wagnerlmichael
left a comment
This looks good. I think a model run would be good to make sure this isn't breaking the pipeline.
When ccao-data/data-architecture#977 lands, I can get started on the report refactor.
pipeline/00-ingest.R
Outdated
```r
sale.sv_outlier_reason3,
sale.sv_run_id,
sale.outlier_reason,
sale.flag_run_id,
```
[Thought]:
I think overall we lose a little bit of information here.
In the prior version we had all of the outlier reasons. We also had the run_id for each flag so that if needed, we would be able to back out the source of the flag determination.
I'm trying to think how we would back out the source of a Review-based outlier. Previously we put some amount of focus on the idea that we could trace anything back to its source. If we have an Excel workbook that holds the source data for a review determination, and then we run the model, is the best way to think about that to look in the Excel workbook data or the `sale.flag_review` table?
Would the way to tell if the outlier source came from review be looking at the prefix from outlier_reason?
I'm not 100% sure this is high priority enough to be thinking this carefully, but it is something that I wanted to flag.
No, I think these are excellent questions! Thanks for thinking hard about this.
> Would the way to tell if the outlier source came from review be looking at the prefix from `outlier_reason`?
That's what I'm imagining -- if the outlier reason starts with "Algorithm", then the sales validation model was the determining factor; if it starts with "Review", then sale review was the determining factor. Not particularly friendly for machines to parse (though splitting on the colon should be fine) but I lean towards thinking that a dedicated column like outlier_source does not have a concrete use such that it would be worth the squeeze just yet. Definitely open to pushback on that, though.
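To make the colon-split idea concrete, here is a minimal sketch of deriving the flag source from the reason prefix. This is purely illustrative: the function name and the example reason strings are hypothetical, not actual values from `outlier_reason`.

```python
def outlier_source(outlier_reason):
    """Derive the flag source ("Algorithm" or "Review") from an outlier
    reason string by splitting on the first colon.

    Hypothetical sketch: assumes reasons look like "Algorithm: ..." or
    "Review: ...". Returns None for sales with no reason, mirroring how
    the column is null when a sale has not been flagged or reviewed.
    """
    if outlier_reason is None:
        return None
    # Split only on the first colon so reasons containing colons survive
    return outlier_reason.split(":", 1)[0].strip()


print(outlier_source("Algorithm: Anomaly (high)"))  # -> "Algorithm"
print(outlier_source(None))  # -> None
```

The same split would be easy to replicate in SQL if a dedicated `outlier_source` column ever did earn its keep.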
> We also had the `run_id` for each flag so that if needed, we would be able to back out the source of the flag determination.
We still have the run ID, so we should still be able to back out the algorithmic flags, at least. However, I think you're right that there's no exact way to tie a given sale in the training data to the sale_review flags that we used to produce our final is_outlier decision, since review flags don't have an exact equivalent of "run ID". I suspect the review flags for a given doc no will probably not change all that much over time, though I wouldn't go so far as to bet they will never change -- if we had confidence they would never change, after all, we wouldn't need any more info, since the doc number would uniquely identify the review flags across all time. But I think it's worth being defensive by assuming that it's possible those flags may change in the future.
So far I can think of two options, neither of which strikes me as perfect:
- Use `sale.flag_review.source_file` as the review equivalent of a "run ID", and bring it into `default.vw_pin_sale` and the ingest training data under a column name like `review_source_file`. I'm more confident assuming that the combination of `doc_no`/`source_file` in `sale.vw_outlier` will stay stable over time. This approach would have the nice property of only requiring us to import one additional field, and having that field mirror `flag_run_id` pretty closely. However, it tightly couples us to the current Excel-based review process, which we hope to refactor to an iasWorld-based review process sometime this year; the notion of a "source file" probably won't make much sense if we ever get around to moving review into iasWorld, which means we'd have to refactor this whole pipeline if that happens (and we hope it will).
- Duplicate all of the flag columns from `sale.flag_review` in the ingest so that we have a copy of them if they ever change. This option avoids tightly coupling to the Excel solution like option 1 does, and it is robust to additional rounds of review that may change flags. But it means adding (at least) four new columns to the training data that we have to support, hence adding more complexity to an already very complex view. It also means that if we ever add new information that we care about to the review process in the future, we'll have to keep adding new columns, without the ability to delete the old ones; in that way, the data model is also somewhat inflexible.
I'm going to ruminate on this tonight to see if I can come up with something better! To me, the key is to find an option that will work well for both the Excel-based review process and a future iasWorld-based review process, but that won't require adding too much complexity, and that we can easily accomplish in the next day or two. It's hard to get all of those things right at the same time, but it would be nice if we didn't have to sacrifice one for the other.
```sql
CASE
    WHEN sale.has_review
    THEN review_json
END AS sv_review_json,
```
Casting this to null in case the sale is missing review, which I think is more intuitive and more closely matches how the sv_outlier_reason{N} columns behave.
@wagnerlmichael I made a few changes here in alignment with ccao-data/data-architecture#977. The basic gist is that I'm reducing the number of changes we're making to this data model to preserve backwards-compatibility as much as possible; the only new columns should be …
```r
# lagged features.
#
# Many of these columns come from our sales validation workflow. Some notes
# about these columns:
#
# - `sv_is_outlier` combines our algorithmic sales validation flags with
#   findings from human review to produce a final boolean outlier decision.
#   It is never null, even when a sale has not been flagged or reviewed.
#
# - `sv_outlier_reason` is a verbose explanation of the decision behind
#   `sv_is_outlier`. If a sale has not been flagged or reviewed, it will be
#   null.
#
# - `sv_outlier_reason{N}`, where 1 <= N <= 3, are three columns that we query
#   directly from the output of our algorithmic sales validation pipeline,
#   recording the reasons why the pipeline flagged the sale as an outlier.
#   The names of these columns are somewhat confusing given their similarity to
#   the `sv_outlier_reason` column, but they are important for backwards
#   compatibility. Since `sv_is_outlier` factors in findings from reviewers in
#   addition to this algorithmic pipeline, this field may not match up
#   intuitively with the boolean decision in `sv_is_outlier`.
#
# - `sv_review_json` is a JSON string storing the raw findings from sale review.
#   Its schema is not guaranteed to have a consistent structure, so it's not
#   useful for serious analysis or programming tasks. However, since the field
#   is intended to act only as an archival record of the exact state of the
#   review findings at the point in time of a given model run, we think it is
#   preferable for the field to have a simple-but-maintainable format instead of
#   a sophisticated-but-complex format. (This kind of archival record is
#   important because it is theoretically possible that a sale may get
#   re-reviewed between model runs, in which case the review tables in our
#   data lake would not exactly match the review data that existed at the
#   time of a historical model run.)
```
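As a rough illustration of the archival intent (not an endorsement of programmatic use), a consumer of `sv_review_json` would need to parse it defensively, assuming nothing about its schema. The helper name and the example payload below are hypothetical:

```python
import json


def load_review_record(sv_review_json):
    """Parse an archival review JSON string, tolerating nulls and schema drift.

    Hypothetical sketch: since the payload's structure is not guaranteed,
    we return whatever object parses (or None), rather than mapping fields
    to a fixed schema.
    """
    if sv_review_json is None:
        return None
    try:
        return json.loads(sv_review_json)
    except json.JSONDecodeError:
        # Malformed archival payloads are treated as missing, not fatal
        return None


# Hypothetical payload; real review findings may have a different shape
record = load_review_record('{"reviewer": "jdoe", "is_outlier": true}')
print(record)
```

This keeps the column usable as a point-in-time snapshot even if the review tables are later re-reviewed and drift away from what a historical model run saw.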
In addition to these docs, I've also documented the sales val fields on model.training_data in the data catalog for ease of reference in ccao-data/data-architecture#977.
This is ready for a final look @wagnerlmichael! Thinking about our IRL discussion more, I actually think it would be quite useful for you to branch off of my branch and try to run the ingest (after switching out the table reference to …
Superseded by #427.
This PR builds off of #417, migrating a few more spots of the code from the legacy `sv_is_outlier` column to pull from `default.vw_pin_sale.is_outlier`, and switching to the new `outlier_reason` column that we are adding to `default.vw_pin_sale` in ccao-data/data-architecture#977.

The one change that I know this PR neglects to make is updating the `reports/performance/_outliers.qmd` report. That report is going to need a substantial refactor in order to use the new data model, and I want to leave that up to Michael to own.