
Conversation

@jeancochrane (Member) commented Jan 24, 2026

This PR builds on #417, migrating a few more spots in the code from the legacy sv_is_outlier column to pull from default.vw_pin_sale.is_outlier, and switching to the new outlier_reason column that we are adding to default.vw_pin_sale in ccao-data/data-architecture#977.

The one change that I know this PR neglects to make is updating the reports/performance/_outliers.qmd report. That report is going to need a substantial refactor in order to use the new data model, and I want to leave it up to Michael to own that.

@jeancochrane marked this pull request as ready for review on January 24, 2026 at 02:20
@wagnerlmichael (Member) left a comment:


This looks good. I think a model run would be worthwhile to make sure this isn't breaking the pipeline.

When ccao-data/data-architecture#977 lands, I can get started on the report refactor.

sale.sv_outlier_reason3,
sale.sv_run_id,
sale.outlier_reason,
sale.flag_run_id,
@wagnerlmichael (Member) commented:

[Thought]:
I think overall we lose a little bit of information here.

In the prior version we had all of the outlier reasons. We also had the run_id for each flag so that if needed, we would be able to back out the source of the flag determination.

I'm trying to think about how we would back out the source of a Review-based outlier. Previously we put some focus on the idea that we could trace anything back to its source. If we have an Excel workbook that holds the source data for a review determination, and then we run the model, is the best way to think about that to look at the Excel workbook data or the sale.flag_review table?

Would the way to tell whether the outlier source came from review be to look at the prefix of outlier_reason?

I'm not 100% sure this is high priority enough to be thinking this carefully, but it is something that I wanted to flag.

@jeancochrane (Member, Author) replied Jan 27, 2026:

No, I think these are excellent questions! Thanks for thinking hard about this.

> Would the way to tell whether the outlier source came from review be to look at the prefix of outlier_reason?

That's what I'm imagining -- if the outlier reason starts with "Algorithm", then the sales validation model was the determining factor; if it starts with "Review", then sale review was the determining factor. It's not particularly friendly for machines to parse (though splitting on the colon should be fine), but I lean towards thinking that a dedicated column like outlier_source doesn't yet have a concrete enough use case to be worth the squeeze. Definitely open to pushback on that, though.
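For illustration, here's a minimal sketch of that parsing in Athena SQL, assuming outlier_reason values look like "Algorithm: <detail>" or "Review: <detail>" (the exact prefix format is still being settled in ccao-data/data-architecture#977):

```sql
-- Hypothetical sketch: derive the flag source from the outlier_reason prefix.
-- Assumes reasons look like 'Algorithm: <detail>' or 'Review: <detail>'.
SELECT
    doc_no,
    outlier_reason,
    -- split_part returns the text before the first colon, e.g. 'Algorithm'
    split_part(outlier_reason, ':', 1) AS outlier_source
FROM default.vw_pin_sale
WHERE is_outlier
```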

> We also had the run_id for each flag so that if needed, we would be able to back out the source of the flag determination.

We still have the run ID, so we should still be able to back out the algorithmic flags, at least. However, I think you're right that there's no exact way to tie a given sale in the training data to the sale_review flags that we used to produce our final is_outlier decision, since review flags don't have an exact equivalent of a "run ID". I suspect the review flags for a given doc no won't change all that much over time, though I wouldn't go so far as to bet they will never change -- if we had confidence they would never change, after all, we wouldn't need any more info, since the doc number would uniquely identify the review flags across all time. But I think it's worth being defensive and assuming that those flags may change in the future.

So far I can think of two options, neither of which strikes me as perfect:

  1. Use sale.flag_review.source_file as the review equivalent of a "run ID", and bring it into default.vw_pin_sale and the ingest training data under a column name like review_source_file (rough sketch after this list). I'm more confident assuming that the combination of doc_no/source_file in sale.vw_outlier will stay stable over time. This approach has the nice property of only requiring us to import one additional field, and of having that field mirror flag_run_id pretty closely. However, it tightly couples us to the current Excel-based review process, which we hope to refactor into an iasWorld-based review process sometime this year -- the notion of a "source file" probably won't make much sense if we ever get around to moving review into iasWorld, which means we'd have to refactor this whole pipeline if that happens (and we hope it will).
  2. Duplicate all of the flag columns from sale.flag_review in the ingest so that we have a copy of them if they ever change. This option avoids tightly coupling to the Excel solution like we saw in option 1, and it is robust to additional rounds of review that may change flags. But it means adding (at least) four new columns to the training data that we have to support, hence adding more complexity to an already very complex view. It also means that if we ever add new information that we care about to the review process in the future, we'll have to keep adding new columns, without the ability to delete the old ones -- in that way, the data model is also somewhat inflexible.
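To make option 1 concrete, here's a rough sketch of the kind of join it implies. The join key and column names are assumptions for illustration, not a final design; in particular, I'm assuming sale.flag_review keys on doc_no:

```sql
-- Hypothetical sketch of option 1: surface the review source file alongside
-- the outlier columns. Names are illustrative, not a committed schema.
SELECT
    pin_sale.doc_no,
    pin_sale.is_outlier,
    pin_sale.outlier_reason,
    review.source_file AS review_source_file
FROM default.vw_pin_sale AS pin_sale
LEFT JOIN sale.flag_review AS review
    ON pin_sale.doc_no = review.doc_no
```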

I'm going to ruminate on this tonight to see if I can come up with something better! To me, the key is to find an option that will work well for both the Excel-based review process and a future iasWorld-based review process, but that won't require adding too much complexity, and that we can easily accomplish in the next day or two. It's hard to get all of those things right at the same time, but it would be nice if we didn't have to sacrifice one for the other.

@jeancochrane changed the base branch from 414-incorporate-sales-val-analyst-overrides-into-the-model to master on January 27, 2026 at 21:55
Comment on lines +137 to +140
CASE
WHEN sale.has_review
THEN review_json
END AS sv_review_json,
@jeancochrane (Member, Author) commented:

This casts the value to null when the sale is missing review, which I think is more intuitive and more closely matches how the sv_outlier_reason{N} columns behave.

@jeancochrane (Member, Author) commented:
@wagnerlmichael I made a few changes here in alignment with ccao-data/data-architecture#977. The basic gist is that I'm reducing the number of changes we're making to this data model to preserve backwards compatibility as much as possible; the only new columns should be sv_review_json and sv_outlier_reason. Let me know what you think!

@jeancochrane changed the title from "Integrate new outlier_reason column into ingest and analysis" to "Integrate new sale outlier_reason column into ingest and analysis" on Jan 28, 2026
Comment on lines +88 to +120
# lagged features.
#
# Many of these columns come from our sales validation workflow. Some notes
# about these columns:
#
# - `sv_is_outlier` combines our algorithmic sales validation flags with
# findings from human review to produce a final boolean outlier decision.
# It is never null, even when a sale has not been flagged or reviewed.
#
# - `sv_outlier_reason` is a verbose explanation of the decision behind
# `sv_is_outlier`. If a sale has not been flagged or reviewed, it will be
# null.
#
# - `sv_outlier_reason{N}`, where 1 <= N <= 3, are three columns that we query
# directly from the output of our algorithmic sales validation pipeline,
# recording the reasons why the pipeline flagged the sale as an outlier.
# The names of these columns are somewhat confusing given their similarity to
# the `sv_outlier_reason` column, but they are important for backwards
# compatibility. Since `sv_is_outlier` factors in findings from reviewers in
# addition to this algorithmic pipeline, these fields may not match up
# intuitively with the boolean decision in `sv_is_outlier`.
#
# - `sv_review_json` is a JSON string storing the raw findings from sale review.
# Its schema is not guaranteed to be consistent, so it's not
# useful for serious analysis or programming tasks. However, since the field
# is intended to act only as an archival record of the exact state of the
# review findings at the point in time of a given model run, we think it is
# preferable for the field to have a simple-but-maintainable format instead of
# a sophisticated-but-complex format. (This kind of archival record is
# important because it is theoretically possible that a sale may get
# re-reviewed between model runs, in which case the review tables in our
# data lake would not exactly match the review data that existed at the
# time of a historical model run.)
@jeancochrane (Member, Author) commented:

In addition to these docs, I've also documented the sales val fields on model.training_data in the data catalog for ease of reference in ccao-data/data-architecture#977.
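As a purely illustrative aside, this is roughly how one might peek at the archived review findings in Athena SQL. The `reviewer` key and the `meta_sale_document_num` column are hypothetical, since sv_review_json deliberately doesn't guarantee a schema:

```sql
-- Hypothetical sketch: inspect archived review findings for flagged sales.
-- The '$.reviewer' key is illustrative only; sv_review_json has no fixed
-- schema, so any extraction like this is best-effort.
SELECT
    meta_sale_document_num,
    sv_outlier_reason,
    json_extract_scalar(sv_review_json, '$.reviewer') AS reviewer
FROM model.training_data
WHERE sv_is_outlier
    AND sv_review_json IS NOT NULL
```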

@jeancochrane (Member, Author) commented Jan 28, 2026:

This is ready for a final look, @wagnerlmichael! Thinking about our IRL discussion more, I actually think it would be quite useful for you to branch off my branch and try to run the ingest (after switching the table reference to z_ci_jeancochrane_fixup_is_outlier_default.vw_pin_sale). I haven't actually tested this, so it would be helpful to shake out any issues with this PR before we merge it.
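Concretely, the swap for a test run would look something like this (a sketch; the column list is illustrative, only the table reference matters):

```sql
-- Hypothetical sketch of the table swap for testing this branch.
-- In production the ingest reads FROM default.vw_pin_sale; for this PR's
-- test run, point at the CI build of the view instead:
SELECT
    sale.doc_no,
    sale.is_outlier,
    sale.outlier_reason
FROM z_ci_jeancochrane_fixup_is_outlier_default.vw_pin_sale AS sale
LIMIT 10
```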

@jeancochrane (Member, Author) commented:
Superseded by #427.
