Integrate new sale outlier_reason column into ingest and analysis
#423
Conversation
wagnerlmichael
left a comment
This looks good. I think a model run would be good to make sure this isn't breaking the pipeline.
When ccao-data/data-architecture#977 lands, I can get started on the report refactor.
pipeline/00-ingest.R
Outdated
```r
sale.sv_outlier_reason3,
sale.sv_run_id,
sale.outlier_reason,
sale.flag_run_id,
```
[Thought]:
I think overall we lose a little bit of information here.
In the prior version we had all of the outlier reasons. We also had the run_id for each flag so that if needed, we would be able to back out the source of the flag determination.
I'm trying to think how we would back out the source of a Review-based outlier. Previously we put some amount of focus on the idea that we could trace anything back to its source. If we have an Excel workbook that holds the source data for a review determination, and then we run the model, is the best way to think about that to look in the Excel workbook data or the `sale.flag_review` table?
Would the way to tell if the outlier source came from review be looking at the prefix from outlier_reason?
I'm not 100% sure this is high priority enough to be thinking this carefully, but it is something that I wanted to flag.
No, I think these are excellent questions! Thanks for thinking hard about this.
> Would the way to tell if the outlier source came from review be looking at the prefix from `outlier_reason`?
That's what I'm imagining -- if the outlier reason starts with "Algorithm", then the sales validation model was the determining factor; if it starts with "Review", then sale review was the determining factor. Not particularly friendly for machines to parse (though splitting on the colon should be fine) but I lean towards thinking that a dedicated column like outlier_source does not have a concrete use such that it would be worth the squeeze just yet. Definitely open to pushback on that, though.
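To make the colon-split idea concrete, here is a minimal sketch of deriving the flag source from the reason prefix. This is purely illustrative: the function name and the example reason strings are hypothetical, not actual values from `outlier_reason`.

```python
def outlier_source(outlier_reason):
    """Derive the flag source ("Algorithm" or "Review") from an outlier
    reason string by splitting on the first colon.

    Hypothetical sketch: assumes reasons look like "Algorithm: ..." or
    "Review: ...". Returns None for sales with no reason, mirroring how
    the column is null when a sale has not been flagged or reviewed.
    """
    if outlier_reason is None:
        return None
    # Split only on the first colon so reasons containing colons survive
    return outlier_reason.split(":", 1)[0].strip()


print(outlier_source("Algorithm: Anomaly (high)"))  # -> "Algorithm"
print(outlier_source(None))  # -> None
```

The same split would be easy to replicate in SQL if a dedicated `outlier_source` column ever did earn its keep.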
> We also had the `run_id` for each flag so that if needed, we would be able to back out the source of the flag determination.
We still have the run ID, so we should still be able to back out the algorithmic flags, at least. However, I think you're right that there's no exact way to tie a given sale in the training data to the sale_review flags that we used to produce our final is_outlier decision, since review flags don't have an exact equivalent of "run ID". I suspect the review flags for a given doc no will probably not change all that much over time, though I wouldn't go so far as to bet they will never change -- if we had confidence they would never change, after all, we wouldn't need any more info, since the doc number would uniquely identify the review flags across all time. But I think it's worth being defensive by assuming that it's possible those flags may change in the future.
So far I can think of two options, neither of which strikes me as perfect:
- Use `sale.flag_review.source_file` as the review equivalent of a "run ID", and bring it into `default.vw_pin_sale` and the ingest training data under a column name like `review_source_file`. I'm more confident assuming that the combination of `doc_no`/`source_file` in `sale.vw_outlier` will stay stable over time. This approach would have the nice property of only requiring us to import one additional field, and having that field mirror `flag_run_id` pretty closely. However, it tightly couples us to the current Excel-based review process, which we hope to refactor to an iasWorld-based review process sometime this year; the notion of a "source file" probably won't make much sense if we ever get around to moving review into iasWorld, which means we'd have to refactor this whole pipeline if that happens (and we hope it will).
- Duplicate all of the flag columns from `sale.flag_review` in the ingest so that we have a copy of them if they ever change. This option avoids tightly coupling to the Excel solution like option 1 does, and it is robust to additional rounds of review that may change flags. But it means adding (at least) four new columns to the training data that we have to support, hence adding more complexity to an already very complex view. It also means that if we ever add new information that we care about to the review process in the future, we'll have to keep adding new columns, without the ability to delete the old ones; in that way, the data model is also somewhat inflexible.
I'm going to ruminate on this tonight to see if I can come up with something better! To me, the key is to find an option that will work well for both the Excel-based review process and a future iasWorld-based review process, but that won't require adding too much complexity, and that we can easily accomplish in the next day or two. It's hard to get all of those things right at the same time, but it would be nice if we didn't have to sacrifice one for the other.
```sql
CASE
    WHEN sale.has_review
    THEN review_json
END AS sv_review_json,
```
Casting this to null in case the sale is missing review, which I think is more intuitive and more closely matches how the sv_outlier_reason{N} columns behave.
@wagnerlmichael I made a few changes here in alignment with ccao-data/data-architecture#977. The basic gist is that I'm reducing the number of changes we're making to this data model to preserve backwards-compatibility as much as possible; the only new columns should be …
```r
# lagged features.
#
# Many of these columns come from our sales validation workflow. Some notes
# about these columns:
#
# - `sv_is_outlier` combines our algorithmic sales validation flags with
#   findings from human review to produce a final boolean outlier decision.
#   It is never null, even when a sale has not been flagged or reviewed.
#
# - `sv_outlier_reason` is a verbose explanation of the decision behind
#   `sv_is_outlier`. If a sale has not been flagged or reviewed, it will be
#   null.
#
# - `sv_outlier_reason{N}`, where 1 <= N <= 3, are three columns that we query
#   directly from the output of our algorithmic sales validation pipeline,
#   recording the reasons why the pipeline flagged the sale as an outlier.
#   The names of these columns are somewhat confusing given their similarity to
#   the `sv_outlier_reason` column, but they are important for backwards
#   compatibility. Since `sv_is_outlier` factors in findings from reviewers in
#   addition to this algorithmic pipeline, this field may not match up
#   intuitively with the boolean decision in `sv_is_outlier`.
#
# - `sv_review_json` is a JSON string storing the raw findings from sale review.
#   Its schema is not guaranteed to have a consistent structure, so it's not
#   useful for serious analysis or programming tasks. However, since the field
#   is intended to act only as an archival record of the exact state of the
#   review findings at the point in time of a given model run, we think it is
#   preferable for the field to have a simple-but-maintainable format instead of
#   a sophisticated-but-complex format. (This kind of archival record is
#   important because it is theoretically possible that a sale may get
#   re-reviewed between model runs, in which case the review tables in our
#   data lake would not exactly match the review data that existed at the
#   time of a historical model run.)
```
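As a rough illustration of the archival intent (not an endorsement of programmatic use), a consumer of `sv_review_json` would need to parse it defensively, assuming nothing about its schema. The helper name and the example payload below are hypothetical:

```python
import json


def load_review_record(sv_review_json):
    """Parse an archival review JSON string, tolerating nulls and schema drift.

    Hypothetical sketch: since the payload's structure is not guaranteed,
    we return whatever object parses (or None), rather than mapping fields
    to a fixed schema.
    """
    if sv_review_json is None:
        return None
    try:
        return json.loads(sv_review_json)
    except json.JSONDecodeError:
        # Malformed archival payloads are treated as missing, not fatal
        return None


# Hypothetical payload; real review findings may have a different shape
record = load_review_record('{"reviewer": "jdoe", "is_outlier": true}')
print(record)
```

This keeps the column usable as a point-in-time snapshot even if the review tables are later re-reviewed and drift away from what a historical model run saw.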
In addition to these docs, I've also documented the sales val fields on model.training_data in the data catalog for ease of reference in ccao-data/data-architecture#977.
This is ready for a final look @wagnerlmichael! Thinking about our IRL discussion more, I actually think it would be quite useful for you to branch off of my branch and try to run the ingest (after switching out the table reference to …
Superseded by #427.
This PR builds off of #417, migrating a few more spots of the code from the legacy `sv_is_outlier` column to pull from `default.vw_pin_sale.is_outlier`, and switching to the new `outlier_reason` column that we are adding to `default.vw_pin_sale` in ccao-data/data-architecture#977.

The one change that I know this PR neglects to make is updating the `reports/performance/_outliers.qmd` report. That report is going to need a substantial refactor in order to use the new data model, and I want to leave that up to Michael to own.