Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix expect_column_most_common_value_to_be_in_set handling of ties #259

Conversation

lookslikeitsnot
Copy link
Contributor

Issue this PR Addresses/Closes

Closes #258
Add expected-to-fail tests for expect_column_most_common_value_to_be_in_set (#207)

Summary of Changes

Fixes tie handling of expect_column_most_common_value_to_be_in_set by rank()-ing the occurrences of column values instead of row_number()-ing them.
Adds a ties_okay to validate partial matches in case of ties.

Why Do We Need These Changes

expect_column_most_common_value_to_be_in_set doesn't work reliably when multiple column values have the same occurrence.

Reviewers

@clausherther

@clausherther clausherther self-requested a review May 18, 2023 18:55
Copy link
Contributor

@clausherther clausherther left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! Took a quick look only but have some comments.

select
value_field
from
value_count_top_n
where
value_field not in (select value_field from unique_set_values)

),
most_common_values_in_set as (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tbh, I still don't follow why we need this whole section. Maybe I don't fully understand what we mean by ties, but I would have thought changing the ranking to rank from row_number essentially allows for ties, no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using rank will fix issues with "top_n" but I don't believe it will entirely solve the "ties_okay" issue:
what we want (to match Great Expectations' test) is to succeed for partial matches when ties_okay is set to true i.e.

models:
  - name: data_test
    columns:  
      - name: col_string_b
        tests:
          - dbt_expectations.expect_column_most_common_value_to_be_in_set:
              value_set: ['ab', 'abc', 'abcd']
              top_n: 1
              ties_okay: true
              quote_values: false

Should succeed even though b is as common as all the other values in the data but is missing in the input. If we simply use rank, the test will fail since "most_common_values_not_in_set" only tells us whether or not all most common values were in the set.
Creating "most_common_values_in_set" allows us to check if any of the input values (['ab', 'abc', 'abcd'] in previous example) are contained in the top_n most common values (i.e. any partial match) and will succeed if there is any match (even with a missing b).
Maybe rewriting it as

{# Get the partial matches for ties_okay #}
most_common_values_in_set as (
    select 
        value_field 
    from 
        value_count_top_n 
    {{ dbt.intersect() }}
    select 
        value_field 
    from 
        unique_set_values
),

is clearer?
Another option is to remove most_common_values_in_set and use "having counts":

validation_errors as (
    
    select value_field 
    from most_common_values_not_in_set
    {% if ties_okay -%}
    group by 1
    {# Check that a partial match exists between the input set and the top n value 
        i.e. that the count of remaining most common values not in set is smaller 
        than the total count of most common values #}
    having not (
        (select count(*) from most_common_values_not_in_set) 
        < 
        (select count(*) from value_count_top_n)
    )
    {%- endif -%}
) 

This is more condensed but IMO, the having mixed with select count makes this harder to understand.
But maybe I'm missing an obvious way to get a partial match without "most_common_values_in_set"? Or maybe we don't want "ties_okay" to behave like that and only do full matches and fail for partial matches?

@lookslikeitsnot
Copy link
Contributor Author

lookslikeitsnot commented Sep 1, 2023 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] expect_column_most_common_value_to_be_in_set handling of ties
2 participants