-
Notifications
You must be signed in to change notification settings - Fork 5
Refactor outlier columns in default.vw_pin_sale to incorporate human review
#977
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…e` to `sale.vw_outlier`
wagnerlmichael
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great. I left a few comments and questions. Many of them are me thinking out loud about outlier_reason.
Regarding the DBT unit tests, I saw you linked the docs, but I still think it would be useful to connect out of band to do a bit of knowledge sharing.
Regarding the default.vw_pin_sale_combined, I think that is a good question. I'm curious to see what Billy and Nicole would think. I'm also not exactly sure what conversations have been like about the regular sale ingest cadence.
I took a look at how much is_outlier disagrees with flag_is_outlier for sales where has_review = True:
SELECT
is_outlier,
flag_is_outlier,
COUNT(*) AS row_count
FROM "z_ci_jeancochrane_fixup_is_outlier_sale"."vw_outlier"
WHERE has_review = true
GROUP BY
is_outlier,
flag_is_outlier
ORDER BY
is_outlier,
flag_is_outlier;There is some disagreement (10% ish), but so far on the total number of outliers is almost unchanged
| -- for the market, or if a non-arm's-length sale is close | ||
| -- to market price, then the information from that sale is | ||
| -- still useful for our valuation models | ||
| WHEN has_flag AND flag_is_outlier | ||
| THEN | ||
| CASE | ||
| WHEN review_is_flip | ||
| THEN | ||
| 'Review: Flip' | ||
| WHEN NOT review_is_arms_length | ||
| THEN | ||
| 'Review: Non-Arms-Length' | ||
| ELSE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Question]: Could it make sense to switch these to something like "Review: Flip, Algorithm: $algorithm_reason" ? Since those are the operative conditions that produce the outlier?
Perhaps that would be confusing for a down stream data user?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, that's an interesting idea, I like it. Since the only really relevant part of the algorithmic process (currently) is price, maybe we do something like Review: Flip, Algorithm: High Price? Or something like Review + Algorithm: High Price Flip?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've spent maybe a little too much time trying to think about which one of these is better and I'm still pretty split. I think the second one is more concise and tells a bit more of a story.
The first one however is a bit more modular, and perhaps in the future its structure lends itself to more easily incorporate classifications that depend on a mix of both review and algorithm reasons. I'm good with either one!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first one however is a bit more modular, and perhaps in the future its structure lends itself to more easily incorporate classifications that depend on a mix of both review and algorithm reasons.
Yeah, I was thinking that too! I think modularity is more important than concision in this case, so I'll move forward with that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in b6562bc.
|
@wagnerlmichael I took a stab at another refactor in 4b0f4fe to reorient the data model for better support for archiving flag/review state in the res and condo models (see ccao-data/model-res-avm#423). Key changes include:
Take a look and let me know what you think! |
wagnerlmichael
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, awesome work on this new view! I think it is going to be nice to work with. One small confirmation where a comment might be helpful, but I don't feel particularly strongly about it.
| OR outlier_reason LIKE 'Review: Non-Arms-Length%' | ||
| OR outlier_reason LIKE 'Review: Flip%' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although a "Non-arms-length" value doesn't necessarily lead to an outlier, this string match works because outlier_reason only contains Review: Non-Arms-Length if it was properly paired with the price outlier and therefore determined to be an outlier. Is that right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's right! It felt a little bit risky to document this reasoning at this point in the query, since I can easily imagine us tweaking the logic up above in the outlier_reason CTE, e.g. to decide to expand the types of algorithmic flags that would determine an outlier for a flip/non-arms-length sale, while forgetting to update the explanatory comment down here. As a compromise, I beefed up the comment in this section to point readers to the comments on the outlier_reason CTE for details on why each reason is or is not an outlier in af0dcfc.
…ult.vw_pin_sale`
|
@wagnerlmichael This should be ready for a final round of review! My commits today starting with e0da4d9 are exclusively dedicated to cleaning up docs to reflect the final data model. |
This looks good to me! Thanks for beefing up the docs. |
…default.vw_pin_sale`
|
I forgot about the Core Team review requirement here, so we'll need @wrridgeway to take a quick look at this before we merge to make sure we didn't do anything obviously bad. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From my end everything looks pretty solid here, I can't find anything that sticks out as wrong. I'm assuming all the unit tests for sale.vw_outlier would catch any unexpected results produced by the conditional logic.
…and simplify (#427) This PR reworks the performance report to work with out [new sales val data model additions](ccao-data/data-architecture#977) and add two things to the report: - outlier numbers (raw and proportion) per year - outlier proportion maps per nbhd incorporated from our geography group testing It also reorganizes the outlier reasons that are displayed --------- Co-authored-by: Jean Cochrane <jean@jeancochrane.com> Co-authored-by: Jean Cochrane <jeancochrane@users.noreply.github.com>
| @@ -66,14 +66,14 @@ SELECT | |||
| vps.sale_price, | |||
| vps.sale_date, | |||
| vps.sale_filter_is_outlier, | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jeancochrane should this become is_outlier?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sale_filter_is_outlier is now just an alias for is_outlier, in order to support backwards-compatibility for downstream consumers that are still relying on sale_filter_is_outlier (like the market tracker, I believe). It would probably be helpful in the long term to switch downstream consumers to is_outlier and remove this legacy field, but I don't think it's an urgent task.
| COALESCE(outlier.is_outlier, FALSE) AS sale_filter_is_outlier, |
| COALESCE(outlier.is_outlier, FALSE) AS is_outlier, |
Overview
This PR builds off of #967, suggesting a refactored data model that factors out the logic that combines our algorithmic sale flags with information from our human reviewers in order to produce a final outlier determination with clear reasons.
I went a little bit further with the data model than I expected -- I hope it's not too confusing! Though big refactors can be risky, this one feels appropriate because this data model is still very new. I'd be happy to walk through the changes on a call if it's too much to review from scratch.
See ccao-data/model-res-avm#423 for the corresponding res model PR.
Data model changes
New models that this PR introduces:
sale.vw_flag: View that pulls the most recent version of each algorithmically flagged sale fromsale.flagsale.vw_outlier: View that combines algorithmic sale flags with human-reviewed sale attributes in order to produce a final outlier determination and corresponding reasonsChanges to existing models:
sale.flag_overridetosale.flag_reviewfor clarity (we are not directly using the information to "override" sales val flags, we are just incorporating that information into our final decision, so "review" seems more neutral)default.vw_pin_sale.is_outlierreflecting the final decision on outlier status based onsale.vw_outlier(which pulls fromsale.flagandsale.flag_review)default.vw_pin_sale.outlier_reasonreflecting a human-readable string with the reason behind the sale's outlier status (also pulled fromsale.vw_outlier)default.vw_pin_salethat come directly fromsale.flagandsale.flag_reviewto use the prefixesflag_*andreview_*, to make the provenance of each column more obvious, e.g.is_arms_length➡️review_is_arms_lengthOpen questions
default.vw_pin_sale_combined? I haven't done so yet because I'm not sure of the status of it. Are we still using it? I almost wonder if we should remove that view to reduce complexity, now that we are getting new sales on a more regular schedule.