-
-
Notifications
You must be signed in to change notification settings - Fork 134
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix expect_column_most_common_value_to_be_in_set handling of ties #259
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! Took a quick look only but have some comments.
macros/schema_tests/aggregate_functions/expect_column_most_common_value_to_be_in_set.sql
Outdated
Show resolved
Hide resolved
macros/schema_tests/aggregate_functions/expect_column_most_common_value_to_be_in_set.sql
Outdated
Show resolved
Hide resolved
macros/schema_tests/aggregate_functions/expect_column_most_common_value_to_be_in_set.sql
Outdated
Show resolved
Hide resolved
macros/schema_tests/aggregate_functions/expect_column_most_common_value_to_be_in_set.sql
Show resolved
Hide resolved
macros/schema_tests/aggregate_functions/expect_column_most_common_value_to_be_in_set.sql
Outdated
Show resolved
Hide resolved
select | ||
value_field | ||
from | ||
value_count_top_n | ||
where | ||
value_field not in (select value_field from unique_set_values) | ||
|
||
), | ||
most_common_values_in_set as ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tbh, I still don't follow why we need this whole section. Maybe I don't fully understand what we mean by ties, but I would have thought changing the ranking to rank
from row_number
essentially allows for ties, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using rank will fix issues with "top_n" but I don't believe it will entirely solve the "ties_okay" issue:
what we want (to match Great Expectations' test) is to succeed for partial matches when ties_okay is set to true i.e.
models:
- name: data_test
columns:
- name: col_string_b
tests:
- dbt_expectations.expect_column_most_common_value_to_be_in_set:
value_set: ['ab', 'abc', 'abcd']
top_n: 1
ties_okay: true
quote_values: false
Should succeed even though b
is as common as all the other values in the data but is missing in the input. If we simply use rank, the test will fail since "most_common_values_not_in_set" only tells us whether or not all most common values were in the set.
Creating "most_common_values_in_set" allows us to check if any of the input values (['ab', 'abc', 'abcd']
in previous example) are contained in the top_n most common values (i.e. any partial match) and will succeed if there is any match (even with a missing b
).
Maybe rewriting it as
{# Get the partial matches for ties_okay #}
most_common_values_in_set as (
select
value_field
from
value_count_top_n
{{ dbt.intersect() }}
select
value_field
from
unique_set_values
),
is clearer?
Another option is to remove most_common_values_in_set and use "having counts":
validation_errors as (
select value_field
from most_common_values_not_in_set
{% if ties_okay -%}
group by 1
{# Check that a partial match exists between the input set and the top n value
i.e. that the count of remaining most common values not in set is smaller
than the total count of most common values #}
having not (
(select count(*) from most_common_values_not_in_set)
<
(select count(*) from value_count_top_n)
)
{%- endif -%}
)
This is more condensed but IMO, the having
mixed with select count
makes this harder to understand.
But maybe I'm missing an obvious way to get a partial match without "most_common_values_in_set"? Or maybe we don't want "ties_okay" to behave like that and only do full matches and fail for partial matches?
Any reason for the closing?
No response was ever given to the comment I made 3 months ago and the issue
is still open.
…On Fri, 1 Sept 2023 at 17:45, Claus Herther ***@***.***> wrote:
Closed #259 <#259>.
—
Reply to this email directly, view it on GitHub
<#259 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AID4QCE6GZWMZ7VVJTY2W4TXYH7KZANCNFSM6AAAAAAYFIHHFE>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
Issue this PR Addresses/Closes
Closes #258
Add expected-to-fail tests for expect_column_most_common_value_to_be_in_set (#207)
Summary of Changes
Fixes tie handling of expect_column_most_common_value_to_be_in_set by rank()-ing the occurrences of column values instead of row_number()-ing them.
Adds a ties_okay to validate partial matches in case of ties.
Why Do We Need These Changes
expect_column_most_common_value_to_be_in_set doesn't work reliably when multiple column values have the same occurrence.
Reviewers
@clausherther