Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make numpy_util.match work for non-integer inputs #95

Merged
merged 6 commits into from
Aug 20, 2024
Merged

make numpy_util.match work for non-integer inputs #95

merged 6 commits into from
Aug 20, 2024

Conversation

esheldon
Copy link
Owner

closes #79

@esheldon esheldon requested a review from erykoff August 20, 2024 15:13
Copy link
Collaborator

@erykoff erykoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think more care needs to be taken for the floating point matching case.

esutil/tests/test_numpy_util.py Outdated Show resolved Hide resolved
@@ -1563,7 +1567,7 @@ def match(arr1input, arr2input, presorted=False):
sub1 = np.searchsorted(arr1, arr2, sorter=st1)

# check for out-of-bounds at the high end if necessary
if arr2.max() > arr1.max():
if is_string or arr2.max() > arr1.max():
(bad,) = np.where(sub1 == arr1.size)
sub1[bad] = arr1.size - 1
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't comment below on the PR because GH. But in the case of floating point inputs I don't think we want (sub2,) = np.where(arr1[st1[sub1]] == arr2) or (sub2,) = np.where(arr1[sub1] == arr2). Instead in these cases we need np.isclose() with some suitable defaults for rtol and atol and a way to override. (The numpy defaults for rtol and atol seem appropriate for 32-bit floats and not 64-bit doubles if that is relevant as well).

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did mean this to be an exact match test. It is not a common use case, but not unheard of: you have written the same data out to multiple binary files and the only way to match them is through some fields you expect to match exactly

But maybe we could either add a keyword "close" or "inexact", or a separate function aimed at floating point.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't comment on those lines because they weren't close enough to your changes. It's always been a GH review problem.

Anyway, either we (a) make it super clear that this has to be an exact floating point match, or (b) I think that adding a separate function or a keyword would make sense. Maybe that's a separate PR though, so if you just update the docstring now that would be sufficient.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The doc says: This means arr1[ind1] == arr2[ind2] is true for all corresponding pairs, is that sufficient?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that floating point data should be called out explicitly. E.g. For floating-point data this implies exact matching with no floating-point tolerance.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

empty arrays if no matches are found. This means arr1[ind1] == arr2[ind2]
is true for all corresponding pairs. For floating-point data this implies
exact matching with no floating-point tolerance. The data type can be
string or bytes.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-> The data type can be int, float, string, or bytes?

@esheldon esheldon merged commit e2eca31 into master Aug 20, 2024
20 checks passed
@esheldon esheldon deleted the sstr branch August 20, 2024 18:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

numpy_util.match should work for strings
2 participants