@jeancochrane jeancochrane commented Jan 23, 2026

Overview

This PR builds off of #967, suggesting a refactored data model that factors out the logic that combines our algorithmic sale flags with information from our human reviewers in order to produce a final outlier determination with clear reasons.

I went a little bit further with the data model than I expected -- I hope it's not too confusing! Though big refactors can be risky, this one feels appropriate because this data model is still very new. I'd be happy to walk through the changes on a call if it's too much to review from scratch.

See ccao-data/model-res-avm#423 for the corresponding res model PR.

Data model changes

New models that this PR introduces:

  • sale.vw_flag: View that pulls the most recent version of each algorithmically flagged sale from sale.flag
  • sale.vw_outlier: View that combines algorithmic sale flags with human-reviewed sale attributes in order to produce a final outlier determination and corresponding reasons

Changes to existing models:

  • Renamed sale.flag_override to sale.flag_review for clarity (we are not directly using the information to "override" sales val flags, we are just incorporating that information into our final decision, so "review" seems more neutral)
  • Added new column default.vw_pin_sale.is_outlier reflecting the final decision on outlier status based on sale.vw_outlier (which pulls from sale.flag and sale.flag_review)
  • Added new column default.vw_pin_sale.outlier_reason reflecting a human-readable string with the reason behind the sale's outlier status (also pulled from sale.vw_outlier)
  • Renamed audit trail columns in default.vw_pin_sale that come directly from sale.flag and sale.flag_review to use the prefixes flag_* and review_*, to make the provenance of each column more obvious, e.g. is_arms_length ➡️ review_is_arms_length
    • Thanks to Michael and Tim for this idea!

Open questions

  • Should we update default.vw_pin_sale_combined? I haven't done so yet because I'm not sure of the status of it. Are we still using it? I almost wonder if we should remove that view to reduce complexity, now that we are getting new sales on a more regular schedule.

@jeancochrane jeancochrane changed the base branch from 966-update-defaultvw_pin_sale-with-a-holistic-outlier_reason-field to master January 24, 2026 00:38
@jeancochrane jeancochrane marked this pull request as ready for review January 24, 2026 02:19

@wagnerlmichael wagnerlmichael left a comment

This looks great. I left a few comments and questions. Many of them are me thinking out loud about outlier_reason.

Regarding the dbt unit tests, I saw you linked the docs, but I still think it would be useful to connect out of band to do a bit of knowledge sharing.

Regarding default.vw_pin_sale_combined, I think that is a good question. I'm curious to see what Billy and Nicole think. I'm also not exactly sure what the conversations have been about the regular sale ingest cadence.

I took a look at how much is_outlier disagrees with flag_is_outlier for sales where has_review = True:

SELECT
    is_outlier,
    flag_is_outlier,
    COUNT(*) AS row_count
FROM "z_ci_jeancochrane_fixup_is_outlier_sale"."vw_outlier"
WHERE has_review = true
GROUP BY
    is_outlier,
    flag_is_outlier
ORDER BY
    is_outlier,
    flag_is_outlier;

There is some disagreement (10%-ish), but so far the total number of outliers is almost unchanged.
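To boil the grouped counts down to a single disagreement rate, something like the following might work. This is only a minimal sketch: it runs against Python's built-in sqlite3 rather than Athena, the `vw_outlier` table here is a stand-in with just the three columns from the query above, and the rows are made-up toy values, not real sales.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Stand-in for the CI copy of sale.vw_outlier (assumption: only these columns matter here)
conn.execute(
    "CREATE TABLE vw_outlier ("
    "is_outlier INTEGER, flag_is_outlier INTEGER, has_review INTEGER)"
)
# Made-up toy rows for illustration only, NOT real sales data
rows = [
    (1, 1, 1),  # reviewed, final decision agrees with the flag
    (0, 0, 1),  # reviewed, agrees
    (0, 0, 1),  # reviewed, agrees
    (0, 1, 1),  # reviewed, final decision disagrees with the flag
    (1, 1, 0),  # not reviewed, excluded by the WHERE clause
]
conn.executemany("INSERT INTO vw_outlier VALUES (?, ?, ?)", rows)

reviewed, disagreements = conn.execute(
    """
    SELECT
        COUNT(*) AS reviewed,
        SUM(CASE WHEN is_outlier <> flag_is_outlier THEN 1 ELSE 0 END)
            AS disagreements
    FROM vw_outlier
    WHERE has_review = 1
    """
).fetchone()

disagreement_share = disagreements / reviewed
print(f"{disagreements} of {reviewed} reviewed sales disagree "
      f"({disagreement_share:.0%})")
```

The same `SUM(CASE ...)` expression should carry over to the real view, with `has_review = true` in place of the integer comparison.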

Comment on lines 79 to 91
-- for the market, or if a non-arm's-length sale is close
-- to market price, then the information from that sale is
-- still useful for our valuation models
WHEN has_flag AND flag_is_outlier
    THEN
    CASE
        WHEN review_is_flip
            THEN
            'Review: Flip'
        WHEN NOT review_is_arms_length
            THEN
            'Review: Non-Arms-Length'
        ELSE

Member

[Question]: Could it make sense to switch these to something like "Review: Flip, Algorithm: $algorithm_reason", since those are the operative conditions that produce the outlier?

Perhaps that would be confusing for a downstream data user?

Member Author

Ah, that's an interesting idea, I like it. Since the only really relevant part of the algorithmic process (currently) is price, maybe we do something like Review: Flip, Algorithm: High Price? Or something like Review + Algorithm: High Price Flip?

Member

I've spent maybe a little too much time trying to think about which one of these is better and I'm still pretty split. I think the second one is more concise and tells a bit more of a story.

The first one however is a bit more modular, and perhaps in the future its structure lends itself to more easily incorporate classifications that depend on a mix of both review and algorithm reasons. I'm good with either one!

Member Author

> The first one however is a bit more modular, and perhaps in the future its structure lends itself to more easily incorporate classifications that depend on a mix of both review and algorithm reasons.

Yeah, I was thinking that too! I think modularity is more important than concision in this case, so I'll move forward with that.
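For reference, the modular format could be composed roughly like this. It is a hedged sketch, not the code in the PR: it uses Python's built-in sqlite3 instead of Athena, the `sale` table and its rows are hypothetical, taking the algorithmic part from `flag_outlier_reason1` is an assumption, and the `ELSE` branch of the real `CASE` is omitted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical stand-in table; column names follow the snippet under review
conn.execute(
    """
    CREATE TABLE sale (
        sale_id INTEGER,
        has_flag INTEGER,
        flag_is_outlier INTEGER,
        flag_outlier_reason1 TEXT,
        review_is_flip INTEGER,
        review_is_arms_length INTEGER
    )
    """
)
# Made-up rows for illustration only
conn.executemany(
    "INSERT INTO sale VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 1, 1, "High Price", 1, 1),  # price outlier and a flip
        (2, 1, 1, "High Price", 0, 0),  # price outlier, non-arms-length
        (3, 1, 0, None, 1, 1),          # flip, but not a price outlier
    ],
)

# Assumption: the review part and the algorithm part are joined with ", "
REASON_SQL = """
SELECT
    sale_id,
    CASE
        WHEN has_flag AND flag_is_outlier
            THEN
            CASE
                WHEN review_is_flip
                    THEN 'Review: Flip, Algorithm: ' || flag_outlier_reason1
                WHEN NOT review_is_arms_length
                    THEN 'Review: Non-Arms-Length, Algorithm: '
                    || flag_outlier_reason1
            END
    END AS outlier_reason
FROM sale
ORDER BY sale_id
"""

reasons = dict(conn.execute(REASON_SQL).fetchall())
for sale_id, reason in reasons.items():
    print(sale_id, reason)
```

Keeping the two parts as separate, comma-joined segments is what makes it easy to add classifications later that mix review and algorithm reasons.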

Member Author

Done in b6562bc.

jeancochrane commented Jan 27, 2026

@wagnerlmichael I took a stab at another refactor in 4b0f4fe to reorient the data model so that it better supports archiving flag/review state in the res and condo models (see ccao-data/model-res-avm#423). Key changes include:

  • Switching back to flag_outlier_reason{N} from the proposed array field flag_outlier_reasons
    • My reasoning here is that downstream consumers are already using the *_outlier_reason{N} schema. It doesn't feel ideal to me, but there's no need to change it right now, and keeping it as-is preserves backwards compatibility. If you disagree and feel strongly that we should switch to an array structure, let me know and we can reconsider
  • Adding a new review_json field that is a JSON object storing the raw state of the review findings
  • Fixing docs to reflect these changes

Take a look and let me know what you think!

@wagnerlmichael wagnerlmichael left a comment

Looks great, awesome work on this new view! I think it is going to be nice to work with. One small confirmation where a comment might be helpful, but I don't feel particularly strongly about it.

Comment on lines +222 to +223
OR outlier_reason LIKE 'Review: Non-Arms-Length%'
OR outlier_reason LIKE 'Review: Flip%'

Member

Although a "Non-arms-length" value doesn't necessarily lead to an outlier, this string match works because outlier_reason only contains Review: Non-Arms-Length if it was properly paired with the price outlier and therefore determined to be an outlier. Is that right?

@jeancochrane jeancochrane Jan 28, 2026

That's right! It felt a little risky to document this reasoning at this point in the query, since I can easily imagine us tweaking the logic up above in the outlier_reason CTE (e.g. expanding the types of algorithmic flags that would determine an outlier for a flip/non-arms-length sale) while forgetting to update the explanatory comment down here. As a compromise, in af0dcfc I beefed up the comment in this section to point readers to the comments on the outlier_reason CTE for details on why each reason is or is not an outlier.
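The prefix match discussed in this thread can be sanity-checked with a small sketch like the one below. It assumes the modular `Review: ..., Algorithm: ...` reason format, uses Python's built-in sqlite3 rather than Athena, and the table contents (including the placeholder "Other reason" string) are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vw_outlier (outlier_reason TEXT)")
# Hypothetical reason strings for illustration only
example_reasons = [
    "Review: Flip, Algorithm: High Price",
    "Review: Non-Arms-Length, Algorithm: High Price",
    "Other reason (hypothetical)",
    None,  # sale with no outlier reason
]
conn.executemany(
    "INSERT INTO vw_outlier VALUES (?)", [(r,) for r in example_reasons]
)

# Same predicates as the lines under review. Note that SQLite's LIKE is
# case-insensitive for ASCII by default, unlike Athena's, so this only
# demonstrates the prefix behavior, not case handling.
matched = [
    row[0]
    for row in conn.execute(
        """
        SELECT outlier_reason
        FROM vw_outlier
        WHERE outlier_reason LIKE 'Review: Non-Arms-Length%'
            OR outlier_reason LIKE 'Review: Flip%'
        ORDER BY outlier_reason
        """
    )
]
print(matched)
```

Because the match is on the prefix only, it keeps working if the algorithmic suffix changes, but it would need updating if the `Review:` labels themselves were renamed.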

@jeancochrane

@wagnerlmichael This should be ready for a final round of review! My commits today starting with e0da4d9 are exclusively dedicated to cleaning up docs to reflect the final data model.

@wagnerlmichael

> @wagnerlmichael This should be ready for a final round of review! My commits today starting with e0da4d9 are exclusively dedicated to cleaning up docs to reflect the final data model.

This looks good to me! Thanks for beefing up the docs.

@jeancochrane

I forgot about the Core Team review requirement here, so we'll need @wrridgeway to take a quick look at this before we merge to make sure we didn't do anything obviously bad.

@wrridgeway wrridgeway left a comment

From my end everything looks pretty solid here, I can't find anything that sticks out as wrong. I'm assuming all the unit tests for sale.vw_outlier would catch any unexpected results produced by the conditional logic.

@jeancochrane jeancochrane merged commit ec33672 into master Feb 3, 2026
8 checks passed
@jeancochrane jeancochrane deleted the jeancochrane/fixup-is-outlier branch February 3, 2026 21:12
jeancochrane added a commit to ccao-data/model-res-avm that referenced this pull request Feb 3, 2026
…and simplify (#427)

This PR reworks the performance report to work with our [new sales val
data model additions](ccao-data/data-architecture#977) and
adds two things to the report:
- outlier numbers (raw and proportion) per year
- outlier proportion maps per nbhd incorporated from our geography group
testing

It also reorganizes the outlier reasons that are displayed.

---------

Co-authored-by: Jean Cochrane <jean@jeancochrane.com>
Co-authored-by: Jean Cochrane <jeancochrane@users.noreply.github.com>
@@ -66,14 +66,14 @@ SELECT
vps.sale_price,
vps.sale_date,
vps.sale_filter_is_outlier,

Member

@jeancochrane should this become is_outlier?

Member Author

sale_filter_is_outlier is now just an alias for is_outlier, in order to support backwards-compatibility for downstream consumers that are still relying on sale_filter_is_outlier (like the market tracker, I believe). It would probably be helpful in the long term to switch downstream consumers to is_outlier and remove this legacy field, but I don't think it's an urgent task.

COALESCE(outlier.is_outlier, FALSE) AS sale_filter_is_outlier,

COALESCE(outlier.is_outlier, FALSE) AS is_outlier,
