-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
make numpy_util.match work for non-integer inputs #95
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think more care needs to be taken for the floating point matching case.
@@ -1563,7 +1567,7 @@ def match(arr1input, arr2input, presorted=False): | |||
sub1 = np.searchsorted(arr1, arr2, sorter=st1) | |||
|
|||
# check for out-of-bounds at the high end if necessary | |||
if arr2.max() > arr1.max(): | |||
if is_string or arr2.max() > arr1.max(): | |||
(bad,) = np.where(sub1 == arr1.size) | |||
sub1[bad] = arr1.size - 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't comment below on the PR because GH. But in the case of floating point inputs I don't think we want (sub2,) = np.where(arr1[st1[sub1]] == arr2)
or (sub2,) = np.where(arr1[sub1] == arr2)
. Instead in these cases we need np.isclose()
with some suitable defaults for rtol
and atol
and a way to override. (The numpy defaults for rtol
and atol
seem appropriate for 32-bit floats and not 64-bit doubles if that is relevant as well).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did mean this to be an exact match test. It is not a common use case, but not unheard of: you have written the same data out to multiple binary files and the only way to match them is through some fields you expect to match exactly
But maybe we could either add a keyword "close" or "inexact", or a separate function aimed at floating point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't comment on those lines because they weren't close enough to your changes. It's always been a GH review problem.
Anyway, either we (a) make it super clear that this has to be an exact floating point match, or (b) I think that adding a separate function or a keyword would make sense. Maybe that's a separate PR though, so if you just update the docstring now that would be sufficient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The doc says: This means arr1[ind1] == arr2[ind2] is true for all corresponding pairs
, is that sufficient?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that floating point data should be called out explicitly. E.g. For floating-point data this implies exact matching with no floating-point tolerance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
esutil/numpy_util.py
Outdated
empty arrays if no matches are found. This means arr1[ind1] == arr2[ind2] | ||
is true for all corresponding pairs. For floating-point data this implies | ||
exact matching with no floating-point tolerance. The data type can be | ||
string or bytes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-> The data type can be int, float, string, or bytes
?
closes #79