Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix expect_column_most_common_value_to_be_in_set handling of ties #259
Fix expect_column_most_common_value_to_be_in_set handling of ties #259
Changes from 5 commits
db855a5
3e742e4
4a7501a
8a1906b
6cdf6ce
bbbffd2
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tbh, I still don't follow why we need this whole section. Maybe I don't fully understand what we mean by ties, but I would have thought changing the ranking to
rank
fromrow_number
essentially allows for ties, no?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using rank will fix issues with "top_n" but I don't believe it will entirely solve the "ties_okay" issue:
what we want (to match Great Expectations' test) is to succeed for partial matches when ties_okay is set to true i.e.
Should succeed even though
b
is as common as all the other values in the data but is missing in the input. If we simply use rank, the test will fail since "most_common_values_not_in_set" only tells us whether or not all most common values were in the set.Creating "most_common_values_in_set" allows us to check if any of the input values (
['ab', 'abc', 'abcd']
in previous example) are contained in the top_n most common values (i.e. any partial match) and will succeed if there is any match (even with a missingb
).Maybe rewriting it as
is clearer?
Another option is to remove most_common_values_in_set and use "having counts":
This is more condensed but IMO, the
having
mixed withselect count
makes this harder to understand.But maybe I'm missing an obvious way to get a partial match without "most_common_values_in_set"? Or maybe we don't want "ties_okay" to behave like that and only do full matches and fail for partial matches?