Fix expect_column_most_common_value_to_be_in_set handling of ties #259

Closed
Changes from 5 commits
6 changes: 5 additions & 1 deletion README.md
@@ -1009,8 +1009,12 @@ tests:
value_set: [0.5]
top_n: 1
quote_values: true # (Optional. Default is 'true'.)
data_type: "decimal" # (Optional. Default is 'decimal')
data_type: "decimal" # (Optional. Default is adapter-specific equivalent of 'decimal' with a scale provided by dbt.
# Using decimal/numeric without scale might result in unexpected behaviour with Snowflake where scale
# defaults to 0 resulting in values being rounded)
strictly: false # (Optional. Default is 'false'. Adds an 'or equal to' to the comparison operator for min/max)
ties_okay: true # (Optional. Default is 'false'. If true, the expectation will succeed if values outside
# the designated set are as common (but not more common) than designated values)
```

### [expect_column_max_to_be_between](macros/schema_tests/aggregate_functions/expect_column_max_to_be_between.sql)
16 changes: 8 additions & 8 deletions integration_tests/models/schema_tests/data_test.sql
@@ -1,8 +1,8 @@
select
1 as idx,
'2020-10-21' as date_col,
cast(0 as {{ dbt.type_float() }}) as col_numeric_a,
cast(1 as {{ dbt.type_float() }}) as col_numeric_b,
cast(0 as {{ dbt.type_numeric() }}) as col_numeric_a,
cast(1 as {{ dbt.type_numeric() }}) as col_numeric_b,
'a' as col_string_a,
'b' as col_string_b,
cast(null as {{ dbt.type_string() }}) as col_null,
@@ -13,8 +13,8 @@ union all
select
2 as idx,
'2020-10-22' as date_col,
1 as col_numeric_a,
0 as col_numeric_b,
cast(1 as {{ dbt.type_numeric() }}) as col_numeric_a,
cast(0 as {{ dbt.type_numeric() }}) as col_numeric_b,
'b' as col_string_a,
'ab' as col_string_b,
null as col_null,
@@ -25,8 +25,8 @@ union all
select
3 as idx,
'2020-10-23' as date_col,
0.5 as col_numeric_a,
0.5 as col_numeric_b,
cast(0.5 as {{ dbt.type_numeric() }}) as col_numeric_a,
cast(0.5 as {{ dbt.type_numeric() }}) as col_numeric_b,
'c' as col_string_a,
'abc' as col_string_b,
null as col_null,
@@ -37,8 +37,8 @@ union all
select
4 as idx,
'2020-10-23' as date_col,
0.5 as col_numeric_a,
0.5 as col_numeric_b,
cast(0.5 as {{ dbt.type_numeric() }}) as col_numeric_a,
cast(0.5 as {{ dbt.type_numeric() }}) as col_numeric_b,
'c' as col_string_a,
'abcd' as col_string_b,
null as col_null,
116 changes: 116 additions & 0 deletions integration_tests/models/schema_tests/schema.yml
@@ -505,6 +505,63 @@ models:
value_set: [0.5]
top_n: 1
quote_values: false
# Expect success if all most common values at all n levels are in set
- dbt_expectations.expect_column_most_common_value_to_be_in_set:
value_set: [0.5, 0, 1]
top_n: 2
quote_values: false
# Expect failure if not all most common values at all n levels are in set
- dbt_expectations.expect_column_most_common_value_to_be_in_set:
value_set: [0.5, 0]
top_n: 2
quote_values: false
config:
error_if: "=0"
warn_if: "<>1"
# Expect success if some of the most common values at all n levels are in set and ties_okay is true
- dbt_expectations.expect_column_most_common_value_to_be_in_set:
value_set: [0.5, 0]
top_n: 2
ties_okay: true
quote_values: false
# Expect success if any of the top 2 most common levels value are in set and ties_okay is true
- dbt_expectations.expect_column_most_common_value_to_be_in_set:
value_set: [0]
top_n: 2
ties_okay: true
quote_values: false
# Expect success if any of the top most common level value is in set and ties_okay is true
- dbt_expectations.expect_column_most_common_value_to_be_in_set:
value_set: [0.5]
top_n: 2
ties_okay: true
quote_values: false
# Expect error if value is in column but not most common
- dbt_expectations.expect_column_most_common_value_to_be_in_set:
value_set: [1]
top_n: 1
quote_values: false
config:
error_if: "=0"
warn_if: "<>1"
# Expect error if value is in column but not most common and ties_okay is true
- dbt_expectations.expect_column_most_common_value_to_be_in_set:
value_set: [1]
top_n: 1
ties_okay: true
quote_values: false
config:
error_if: "=0"
warn_if: "<>1"
# Expect error if value not in column at any level
- dbt_expectations.expect_column_most_common_value_to_be_in_set:
value_set: [123456789]
top_n: >
(select count(*) from {{ref('data_test')}})
quote_values: false
config:
error_if: "=0"
warn_if: "<>3"
- dbt_expectations.expect_column_values_to_be_increasing:
sort_column: col_numeric_a
strictly: false
@@ -538,6 +595,65 @@ models:
- dbt_expectations.expect_column_values_to_not_be_in_set:
value_set: ['a','c']
quote_values: true
# Expect error if not all most common values are in the set
- dbt_expectations.expect_column_most_common_value_to_be_in_set:
value_set: ['b']
top_n: 1
config:
error_if: "=0"
warn_if: "<3"
# Expect success if not all most common values are in the set but ties_okay is set
- dbt_expectations.expect_column_most_common_value_to_be_in_set:
value_set: ['b']
top_n: 1
ties_okay: true
# Expect error if none of the most common values are in the set and ties_okay is set
- dbt_expectations.expect_column_most_common_value_to_be_in_set:
value_set: ['invalid_value']
top_n: 1
ties_okay: true
config:
error_if: "=0"
warn_if: "<4"
# Expect success if not all most common values are in the set but ties_okay is set
# and the set contains extra values
- dbt_expectations.expect_column_most_common_value_to_be_in_set:
value_set: ['b', 'invalid_value']
top_n: 1
ties_okay: true
# Expect success if not all most common values are in the set but ties_okay is set
# and value is not first one of the column naturally ordered
- dbt_expectations.expect_column_most_common_value_to_be_in_set:
value_set: ['ab']
top_n: 1
ties_okay: true
# Expect success if all most common values are in the set
- dbt_expectations.expect_column_most_common_value_to_be_in_set:
value_set: ['b', 'ab', 'abc', 'abcd']
top_n: 1
# Expect success if all most common values are in the set
# and the set contains extra values
- dbt_expectations.expect_column_most_common_value_to_be_in_set:
value_set: ['b', 'ab', 'abc', 'abcd', 'invalid_value']
top_n: 1
# Expect error if none of the most common values are in the set
# and the set contains extra values
- dbt_expectations.expect_column_most_common_value_to_be_in_set:
value_set: ['invalid_value1', 'invalid_value2', 'invalid_value3', 'invalid_value4', 'invalid_value5']
top_n: 1
config:
error_if: "=0"
warn_if: "<4"
# Expect error if none of the most common values are in the set
# and the set contains extra values
- dbt_expectations.expect_column_most_common_value_to_be_in_set:
value_set: ['invalid_value1', 'invalid_value2', 'invalid_value3', 'invalid_value4', 'invalid_value5']
top_n: >
(select count(*) from {{ref('data_test')}})
ties_okay: true
config:
error_if: "=0"
warn_if: "<4"
- dbt_expectations.expect_column_value_lengths_to_be_between:
min_value: 1
max_value: 4
Original file line number Diff line number Diff line change
@@ -3,12 +3,15 @@
value_set,
top_n,
quote_values=True,
data_type="decimal",
row_condition=None
data_type=None,
row_condition=None,
ties_okay=False
) -%}

{# For Snowflake, using a default 'decimal' instead of dbt.type_numeric()
rounds up the value when casting #}
{% set data_type = dbt.type_numeric() if not data_type else data_type %}
{{ adapter.dispatch('test_expect_column_most_common_value_to_be_in_set', 'dbt_expectations') (
model, column_name, value_set, top_n, quote_values, data_type, row_condition
model, column_name, value_set, top_n, quote_values, data_type, row_condition, ties_okay
) }}

{%- endtest %}
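
The comment in this hunk attributes the new default to Snowflake rounding values when a bare `decimal` is used. A minimal Python sketch of that effect follows, assuming a fixed-scale cast rounds half away from zero. The rounding mode and the helper name `cast_to_scale` are illustrative assumptions, not taken from the PR or from Snowflake's documentation.

```python
from decimal import Decimal, ROUND_HALF_UP


def cast_to_scale(value: str, scale: int) -> Decimal:
    """Hypothetical helper: round `value` to `scale` fractional digits.

    Stands in for a SQL cast to a fixed-scale numeric type. ROUND_HALF_UP is
    an assumption about the warehouse's rounding mode.
    """
    quantum = Decimal(10) ** -scale  # scale=0 -> 1, scale=2 -> 0.01
    return Decimal(value).quantize(quantum, rounding=ROUND_HALF_UP)


# With scale 0 (what the README note says a bare decimal means on Snowflake),
# the cast value no longer equals the 0.5 stored in the column.
scale_zero = cast_to_scale("0.5", 0)
# With an explicit scale the value survives the cast.
scale_two = cast_to_scale("0.5", 2)

print(scale_zero, scale_two)
```

Under these assumptions a `value_set` of `[0.5]` would be compared as `1` after the cast, which is the kind of mismatch the switch to `dbt.type_numeric()` is meant to avoid.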
@@ -19,9 +22,10 @@
top_n,
quote_values,
data_type,
row_condition
row_condition,
ties_okay
) %}

{% set data_type = data_type %}
with value_counts as (

select
@@ -48,7 +52,7 @@ value_counts_ranked as (

select
*,
row_number() over(order by value_count desc) as value_count_rank
rank() over(order by value_count desc) as value_count_rank
from
value_counts

@@ -60,7 +64,7 @@ value_count_top_n as (
from
value_counts_ranked
where
value_count_rank = {{ top_n }}
value_count_rank <= {{ top_n }}

),
set_values as (
@@ -83,15 +87,44 @@ unique_set_values as (
set_values

),
validation_errors as (
-- values from the model that are not in the set
most_common_values_not_in_set as (
select
value_field
from
value_count_top_n
where
value_field not in (select value_field from unique_set_values)

),
most_common_values_in_set as (
Contributor


tbh, I still don't follow why we need this whole section. Maybe I don't fully understand what we mean by ties, but I would have thought changing the ranking to rank from row_number essentially allows for ties, no?

Contributor Author


Using rank fixes the issues with "top_n", but I don't believe it entirely solves the "ties_okay" issue.
What we want (to match the Great Expectations test) is to succeed on partial matches when ties_okay is set to true, i.e.

models:
  - name: data_test
    columns:  
      - name: col_string_b
        tests:
          - dbt_expectations.expect_column_most_common_value_to_be_in_set:
              value_set: ['ab', 'abc', 'abcd']
              top_n: 1
              ties_okay: true
              quote_values: true

This should succeed even though b is as common as every other value in the data but is missing from the input set. If we simply use rank, the test will fail, since "most_common_values_not_in_set" only tells us whether all of the most common values were in the set.
Creating "most_common_values_in_set" lets us check whether any of the input values (['ab', 'abc', 'abcd'] in the previous example) are among the top_n most common values (i.e. any partial match), and the test succeeds if there is any match (even with b missing).
Maybe rewriting it as

{# Get the partial matches for ties_okay #}
most_common_values_in_set as (
    select 
        value_field 
    from 
        value_count_top_n 
    {{ dbt.intersect() }}
    select 
        value_field 
    from 
        unique_set_values
),

is clearer?
Another option is to remove most_common_values_in_set and use "having counts":

validation_errors as (
    
    select value_field 
    from most_common_values_not_in_set
    {% if ties_okay -%}
    group by 1
    {# Check that a partial match exists between the input set and the top n value 
        i.e. that the count of remaining most common values not in set is smaller 
        than the total count of most common values #}
    having not (
        (select count(*) from most_common_values_not_in_set) 
        < 
        (select count(*) from value_count_top_n)
    )
    {%- endif -%}
) 

This is more condensed, but IMO mixing having with select count makes it harder to understand.
But maybe I'm missing an obvious way to get a partial match without "most_common_values_in_set"? Or maybe we don't want "ties_okay" to behave like that, and should only accept full matches and fail on partial matches?
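
To make the `rank` versus `row_number` point in this thread concrete, here is a small Python sketch of the two ranking rules applied to the value counts of `col_string_b` and `col_numeric_a` from `data_test.sql`. The helper names and the alphabetical tie-break for `row_number` are assumptions for illustration; in SQL the order among tied rows is unspecified.

```python
from collections import Counter


def row_number_ranks(values):
    """Mimic row_number() over (order by value_count desc): ties get distinct ranks."""
    counts = Counter(values)
    # Tie-break by string form only to make the sketch deterministic.
    ordered = sorted(counts, key=lambda v: (-counts[v], str(v)))
    return {value: position for position, value in enumerate(ordered, start=1)}


def rank_ranks(values):
    """Mimic rank() over (order by value_count desc): tied counts share a rank."""
    counts = Counter(values)
    # rank = 1 + number of distinct values with a strictly larger count
    return {
        value: 1 + sum(1 for other in counts.values() if other > count)
        for value, count in counts.items()
    }


col_string_b = ["b", "ab", "abc", "abcd"]  # every value occurs once
col_numeric_a = [0, 1, 0.5, 0.5]           # 0.5 occurs twice

# Before the PR: row_number() with `value_count_rank = top_n`
old_top_1 = [v for v, r in row_number_ranks(col_string_b).items() if r == 1]
# After the PR: rank() with `value_count_rank <= top_n`
new_top_1 = sorted(v for v, r in rank_ranks(col_string_b).items() if r <= 1)
new_top_2 = sorted(v for v, r in rank_ranks(col_numeric_a).items() if r <= 2)

print(old_top_1)  # one arbitrary member of the four-way tie
print(new_top_1)  # all four tied values
print(new_top_2)  # 0.5 at rank 1, then 0 and 1 tied at rank 2
```

Ranking with `rank` makes every tied value visible to the later CTEs, but it does not by itself decide whether a partial match should pass, which is the question raised in the comment above.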

select
value_field
from
value_count_top_n
{{ dbt.except() }}
select
value_field
from
most_common_values_not_in_set
),
validation_errors as (
{% if ties_okay -%}
select
*
from
most_common_values_not_in_set
where
{#
If the intersection between the most common values and the values in the set is not empty,
succeed. Otherwise fail the test and select all the most common values from the column.
#}
(
select count(*)
from most_common_values_in_set
) = 0
{%- else -%}
select *
from most_common_values_not_in_set
{%- endif -%}
)

select *
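
The final `validation_errors` logic in this hunk can be modeled with plain sets. The following Python sketch is an approximation under stated assumptions: `top_values` stands for the rows of `value_count_top_n`, and the function name and signature are hypothetical. The inputs mirror the `col_string_b` cases added to `schema.yml`.

```python
def validation_errors(top_values, value_set, ties_okay=False):
    """Return the values the test would report as failing rows (sketch)."""
    top = set(top_values)
    # most_common_values_not_in_set: top-ranked values missing from value_set
    not_in_set = top - set(value_set)
    # most_common_values_in_set: value_count_top_n EXCEPT the CTE above
    in_set = top - not_in_set
    if ties_okay and in_set:
        # At least one top-ranked value is in the set, so no rows are returned.
        return []
    return sorted(not_in_set, key=str)


# All four values of col_string_b tie at rank 1, so top_n: 1 yields all of them.
top = ["b", "ab", "abc", "abcd"]

strict = validation_errors(top, ["b"])
lenient = validation_errors(top, ["b"], ties_okay=True)
no_match = validation_errors(top, ["invalid_value"], ties_okay=True)

print(len(strict), len(lenient), len(no_match))
```

The three results (3, 0 and 4 failing rows) line up with the `warn_if: "<3"`, plain-success and `warn_if: "<4"` expectations in the integration tests above.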