-
Notifications
You must be signed in to change notification settings - Fork 5
Add outlier_reason and source_is_outlier columns to default.vw_pin_sale
#967
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add outlier_reason and source_is_outlier columns to default.vw_pin_sale
#967
Conversation
outlier_reason and source_is_outlier columns to default.vw_pin_sale
| OR flag_override.has_class_change IS NOT NULL | ||
| OR flag_override.has_characteristic_change IS NOT NULL | ||
| OR flag_override.requires_field_check IS NOT NULL | ||
| THEN 'analyst' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In is_outlier and source_is_outlier all of the null checking strikes me as perhaps a little brittle? One idea is we could add a test to make sure that there is at least one non null field in the columns that drive is_outlier in sale.flag_override
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Praise] This is a really good point! I think tests are a good idea, although I also think we may be missing a layer of abstraction -- see my comment on the PR body for details. If we decide to move in the direction I suggest, we can talk about tests that would make sense for that data model.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think your point about the null-checking logic being kind of brittle here is a very good one, and to me indicates a smell that we might be missing a layer of abstraction. What do you think about either a CTE or a view that is dedicated solely to combining our sales val algorithm with our analyst overrides, such that we can reduce duplication of brittle null-checking logic?
I'm imagining a schema like this:
# Algorithm fields
sv_is_outlier
sv_is_ptax_outlier
sv_is_heuristic_outlier
sv_outlier_reason1
sv_outlier_reason2
sv_outlier_reason3
sv_run_id
sv_version
# Analyst fields
ovrd_is_outlier
ovrd_is_arms_length
ovrd_is_flip
ovrd_has_class_change
ovrd_has_characteristic_change
ovrd_requires_field_check
ovrd_work_drawer
# Combined fields
has_sv_flag
has_flag_override
# Maybe we include this field, or maybe we leave it up to `default.vw_pin_sale`
is_outlierI think the key changes here would be:
- Clearly prefixing the override fields, per your + Tim's feedback
- Encapsulating the logic that determines whether a sale has been flagged and/or reviewed in the
has_sv_flagandhas_flag_overridecolumns, which would then allow downstream columns likeis_outlier,source_is_outlier, andoutlier_reasonto more clearly define their conditionals based on the presence or absence of certain types of flagging, not the presence or absence of specific null fields
This logic could either live as a CTE in default.vw_pin_sale or as its own dedicated view in the sale database (e.g. sale.vw_flag). I'm agnostic as to which direction to go, though I'll note that one advantage of a view is that we would have the option of defining tests on it that should theoretically run quite fast, because each test would have to scan a lot less data than any given test on default.vw_pin_sale. An advantage of a CTE, on the other hand, is that it's one less file a reader has to skip to when they're trying to understand the structure of default.vw_pin_sale.
Curious what you think about all this! I've kept my review pretty high-level because I think this will be a sizeable change if we decide to undertake it.
| OR flag_override.has_class_change IS NOT NULL | ||
| OR flag_override.has_characteristic_change IS NOT NULL | ||
| OR flag_override.requires_field_check IS NOT NULL | ||
| THEN 'analyst' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Praise] This is a really good point! I think tests are a good idea, although I also think we may be missing a layer of abstraction -- see my comment on the PR body for details. If we decide to move in the direction I suggest, we can talk about tests that would make sense for that data model.
|
Quick question re: # Algorithm fields
sv_is_outlier
sv_is_ptax_outlier
sv_is_heuristic_outlier
sv_outlier_reason1
sv_outlier_reason2
sv_outlier_reason3
sv_run_id
sv_version
# Analyst fields
ovrd_is_outlier
ovrd_is_arms_length
ovrd_is_flip
ovrd_has_class_change
ovrd_has_characteristic_change
ovrd_requires_field_check
ovrd_work_drawer
# Combined fields
has_sv_flag
has_flag_override
# Maybe we include this field, or maybe we leave it up to `default.vw_pin_sale`
is_outlierIn this case would the "combined final judgment (our human review + sv_algorithm flagged)" as to whether something is an outlier be reflected by the "is_outlier" column? (Also, I would think a view rather than CTE might be a little more helpful - in terms of having this mapped out clearly + s, for the purposes of later comparing any other outlier-flagging approaches or analysis that we want to do- will defer to your guys judgment though.) |
Yup! Though I'm uncertain whether it makes sense for that logic to live in this proposed view or to continue to live in
|
These both make sense to me- especially point number 2, with regard to future scaling/experiments. |
|
Thanks all for all of the thoughts. I'm gonna move forward with the view! |
|
@wagnerlmichael I think this PR is superceded by #977, right? |
Yep, I think this one can be closed out |
This PR adds two columns:
outlier_reason- an array column that contains all of the analyst and algorithmic outlier reasons.source_is_outlier- A column that indicates whether or not we used the analyst override or the fallback statistical model for theis_outliercolumn. This column should make analytics comparing the two types of outliers sources easier, and also make it simpler to integrate the manual override integration into the model outliers report.Sample query for inspection: