Skip to content

Conversation

@wagnerlmichael
Copy link
Member

@wagnerlmichael wagnerlmichael commented Jan 13, 2026

This PR adds two columns:

  • outlier_reason - an array column that contains all of the analyst and algorithmic outlier reasons.
  • source_is_outlier - A column that indicates whether or not we used the analyst override or the fallback statistical model for the is_outlier column. This column should make analytics comparing the two types of outliers sources easier, and also make it simpler to integrate the manual override integration into the model outliers report.

Sample query for inspection:

SELECT * FROM "z_ci_966_update_defaultvw_pin_sale_with_a_holistic_outlier_reason_field_default"."vw_pin_sale"
where is_arms_length is not null limit 50;

@wagnerlmichael wagnerlmichael changed the title WIP Add outlier_reason and source_is_outlier columns to default.vw_pin_sale Jan 15, 2026
OR flag_override.has_class_change IS NOT NULL
OR flag_override.has_characteristic_change IS NOT NULL
OR flag_override.requires_field_check IS NOT NULL
THEN 'analyst'
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In is_outlier and source_is_outlier all of the null checking strikes me as perhaps a little brittle? One idea is we could add a test to make sure that there is at least one non null field in the columns that drive is_outlier in sale.flag_override

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Praise] This is a really good point! I think tests are a good idea, although I also think we may be missing a layer of abstraction -- see my comment on the PR body for details. If we decide to move in the direction I suggest, we can talk about tests that would make sense for that data model.

@wagnerlmichael wagnerlmichael marked this pull request as ready for review January 16, 2026 16:56
@wagnerlmichael wagnerlmichael requested a review from a team as a code owner January 16, 2026 16:56
Copy link
Member

@jeancochrane jeancochrane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think your point about the null-checking logic being kind of brittle here is a very good one, and to me indicates a smell that we might be missing a layer of abstraction. What do you think about either a CTE or a view that is dedicated solely to combining our sales val algorithm with our analyst overrides, such that we can reduce duplication of brittle null-checking logic?

I'm imagining a schema like this:

# Algorithm fields
sv_is_outlier
sv_is_ptax_outlier
sv_is_heuristic_outlier
sv_outlier_reason1
sv_outlier_reason2
sv_outlier_reason3
sv_run_id
sv_version

# Analyst fields
ovrd_is_outlier
ovrd_is_arms_length
ovrd_is_flip
ovrd_has_class_change
ovrd_has_characteristic_change
ovrd_requires_field_check
ovrd_work_drawer

# Combined fields
has_sv_flag
has_flag_override

# Maybe we include this field, or maybe we leave it up to `default.vw_pin_sale`
is_outlier

I think the key changes here would be:

  • Clearly prefixing the override fields, per your + Tim's feedback
  • Encapsulating the logic that determines whether a sale has been flagged and/or reviewed in the has_sv_flag and has_flag_override columns, which would then allow downstream columns like is_outlier, source_is_outlier, and outlier_reason to more clearly define their conditionals based on the presence or absence of certain types of flagging, not the presence or absence of specific null fields

This logic could either live as a CTE in default.vw_pin_sale or as its own dedicated view in the sale database (e.g. sale.vw_flag). I'm agnostic as to which direction to go, though I'll note that one advantage of a view is that we would have the option of defining tests on it that should theoretically run quite fast, because each test would have to scan a lot less data than any given test on default.vw_pin_sale. An advantage of a CTE, on the other hand, is that it's one less file a reader has to skip to when they're trying to understand the structure of default.vw_pin_sale.

Curious what you think about all this! I've kept my review pretty high-level because I think this will be a sizeable change if we decide to undertake it.

OR flag_override.has_class_change IS NOT NULL
OR flag_override.has_characteristic_change IS NOT NULL
OR flag_override.requires_field_check IS NOT NULL
THEN 'analyst'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Praise] This is a really good point! I think tests are a good idea, although I also think we may be missing a layer of abstraction -- see my comment on the PR body for details. If we decide to move in the direction I suggest, we can talk about tests that would make sense for that data model.

@TimCookCountyDS
Copy link

Quick question re:

# Algorithm fields
sv_is_outlier
sv_is_ptax_outlier
sv_is_heuristic_outlier
sv_outlier_reason1
sv_outlier_reason2
sv_outlier_reason3
sv_run_id
sv_version

# Analyst fields
ovrd_is_outlier
ovrd_is_arms_length
ovrd_is_flip
ovrd_has_class_change
ovrd_has_characteristic_change
ovrd_requires_field_check
ovrd_work_drawer

# Combined fields
has_sv_flag
has_flag_override

# Maybe we include this field, or maybe we leave it up to `default.vw_pin_sale`
is_outlier

In this case would the "combined final judgment (our human review + sv_algorithm flagged)" as to whether something is an outlier be reflected by the "is_outlier" column?

(Also, I would think a view rather than CTE might be a little more helpful - in terms of having this mapped out clearly + s, for the purposes of later comparing any other outlier-flagging approaches or analysis that we want to do- will defer to your guys judgment though.)

@jeancochrane
Copy link
Member

In this case would the "combined final judgment (our human review + sv_algorithm flagged)" as to whether something is an outlier be reflected by the "is_outlier" column?

Yup! Though I'm uncertain whether it makes sense for that logic to live in this proposed view or to continue to live in default.vw_pin_sale. I lean toward default.vw_pin_sale for two reasons (though my opinion is very weakly held):

  1. The determination would rely on the logic that has_sv_flag and has_flag_override encapsulates, so the view would require an additional subquery in order to aggregate those columns to up to is_outlier; this is technically feasible but makes the view slightly more complex
  2. It strikes me as plausible that at some point we may want to change the final is_outlier logic so that it considers the state of data outside sale.flag and sale.flag_override. If we were to do this, then suddenly the final is_outlier logic would be distinct from the is_outlier logic in the view, which means we would have yet another *_is_outlier column to maintain with its own distinct meaning.

@TimCookCountyDS
Copy link

In this case would the "combined final judgment (our human review + sv_algorithm flagged)" as to whether something is an outlier be reflected by the "is_outlier" column?

Yup! Though I'm uncertain whether it makes sense for that logic to live in this proposed view or to continue to live in default.vw_pin_sale. I lean toward default.vw_pin_sale for two reasons (though my opinion is very weakly held):

  1. The determination would rely on the logic that has_sv_flag and has_flag_override encapsulates, so the view would require an additional subquery in order to aggregate those columns to up to is_outlier; this is technically feasible but makes the view slightly more complex
  2. It strikes me as plausible that at some point we may want to change the final is_outlier logic so that it considers the state of data outside sale.flag and sale.flag_override. If we were to do this, then suddenly the final is_outlier logic would be distinct from the is_outlier logic in the view, which means we would have yet another *_is_outlier column to maintain with its own distinct meaning.

These both make sense to me- especially point number 2, with regard to future scaling/experiments.

@wagnerlmichael
Copy link
Member Author

Thanks all for all of the thoughts. I'm gonna move forward with the view!

@jeancochrane
Copy link
Member

@wagnerlmichael I think this PR is superceded by #977, right?

@wagnerlmichael
Copy link
Member Author

@wagnerlmichael I think this PR is superceded by #977, right?

Yep, I think this one can be closed out

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Update default.vw_pin_sale with a holistic outlier_reason field

3 participants