make numpy_util.match work for non-integer inputs #95

esheldon · 2024-08-20T15:11:51Z

closes #79

erykoff

I think more care needs to be taken for the floating point matching case.

esutil/tests/test_numpy_util.py

erykoff · 2024-08-20T16:46:09Z

esutil/numpy_util.py

@@ -1563,7 +1567,7 @@ def match(arr1input, arr2input, presorted=False):
    sub1 = np.searchsorted(arr1, arr2, sorter=st1)

    # check for out-of-bounds at the high end if necessary
-    if arr2.max() > arr1.max():
+    if is_string or arr2.max() > arr1.max():
        (bad,) = np.where(sub1 == arr1.size)
        sub1[bad] = arr1.size - 1
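
To illustrate the clamp in this hunk: np.searchsorted returns arr1.size for any value larger than every element of arr1, and that is not a valid index. The following is a minimal sketch with made-up float arrays, not the esutil code itself. It clamps unconditionally instead of checking max() first, which is an assumption made here to keep the example short.

```python
import numpy as np

# Made-up inputs for illustration only
arr1 = np.array([3.0, 1.0, 2.0])
arr2 = np.array([2.0, 5.0, 0.5])

st1 = np.argsort(arr1)  # indices that sort arr1
sub1_raw = np.searchsorted(arr1, arr2, sorter=st1)
# 5.0 is larger than everything in arr1, so it maps to arr1.size (3)

# Clamp out-of-bounds indices so they can be used to index arr1
sub1 = sub1_raw.copy()
(bad,) = np.where(sub1 == arr1.size)
sub1[bad] = arr1.size - 1

# Exact comparison: only true matches survive the == test
matched = arr1[st1[sub1]] == arr2
print(sub1_raw, sub1, matched)
```

The clamped entry points at a valid element, and the later equality test rejects it because the values differ.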


I can't comment on the lines further down in the PR because of GH review limitations. But for floating-point inputs I don't think we want (sub2,) = np.where(arr1[st1[sub1]] == arr2) or (sub2,) = np.where(arr1[sub1] == arr2). Instead, in these cases we need np.isclose() with suitable defaults for rtol and atol and a way to override them. (The numpy defaults for rtol and atol seem appropriate for 32-bit floats rather than 64-bit doubles, if that is relevant.)
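
A minimal sketch of what a tolerance-based variant could look like. The name match_close, its signature, and the default rtol and atol values are all hypothetical and are not part of esutil. It assumes the values in arr1 are separated by much more than the tolerance, so that checking only the nearest neighbour is enough.

```python
import numpy as np

def match_close(arr1, arr2, rtol=1e-9, atol=0.0):
    """Hypothetical helper (not in esutil): match floats within a tolerance.

    Returns (ind1, ind2) such that arr1[ind1] is close to arr2[ind2].
    """
    arr1 = np.asarray(arr1, dtype=float)
    arr2 = np.asarray(arr2, dtype=float)
    if arr1.size == 0 or arr2.size == 0:
        empty = np.zeros(0, dtype=np.intp)
        return empty, empty

    st1 = np.argsort(arr1)
    sorted1 = arr1[st1]

    # Insertion point of each arr2 value, clamped to valid indices
    hi = np.clip(np.searchsorted(sorted1, arr2), 0, arr1.size - 1)
    lo = np.clip(hi - 1, 0, arr1.size - 1)

    # Pick whichever neighbour in sorted arr1 is nearer to the arr2 value
    use_lo = np.abs(sorted1[lo] - arr2) < np.abs(sorted1[hi] - arr2)
    nearest = np.where(use_lo, lo, hi)

    # Keep only pairs that agree within the requested tolerance
    close = np.isclose(sorted1[nearest], arr2, rtol=rtol, atol=atol)
    (ind2,) = np.where(close)
    ind1 = st1[nearest[ind2]]
    return ind1, ind2

# 0.1 + 0.2 differs from 0.3 in the last bit but matches within tolerance
ind1, ind2 = match_close([0.3, 1.0, 2.0], [0.1 + 0.2, 5.0, 2.0])
print(ind1, ind2)
```

Both neighbours are checked because a value slightly above an element of arr1 gets an insertion point one past that element.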

I did mean this to be an exact match test. It is not a common use case, but not unheard of: you have written the same data out to multiple binary files, and the only way to match them is through some fields you expect to match exactly.
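
A small sketch of why exact matching is reasonable for that use case. The arrays are made up, and tobytes/frombuffer stand in for writing and reading a binary file. A binary round trip preserves every bit, so == holds, whereas values that were recomputed can differ in the last bit.

```python
import numpy as np

values = np.array([0.1, 0.2, 0.3])

# Stand-in for writing to and reading from a binary file
roundtrip = np.frombuffer(values.tobytes(), dtype=values.dtype)
exact_roundtrip = values == roundtrip

# Same quantities, but the last one is recomputed rather than copied
recomputed = np.array([0.1, 0.2, 0.1 + 0.2])
exact_recomputed = values == recomputed
close_recomputed = np.isclose(values, recomputed)

print(exact_roundtrip, exact_recomputed, close_recomputed)
```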

But maybe we could either add a keyword "close" or "inexact", or a separate function aimed at floating point.

I couldn't comment on those lines because they weren't close enough to your changes. It's always been a GH review problem.

Anyway, either (a) we make it super clear that this has to be an exact floating-point match, or (b) we add a separate function or a keyword. Maybe (b) is a separate PR though, so if you just update the docstring now that would be sufficient.

The doc says "This means arr1[ind1] == arr2[ind2] is true for all corresponding pairs." Is that sufficient?

I think that floating-point data should be called out explicitly, e.g. "For floating-point data this implies exact matching with no floating-point tolerance."

erykoff · 2024-08-20T17:46:57Z

esutil/numpy_util.py

+    empty arrays if no matches are found.  This means arr1[ind1] == arr2[ind2]
+    is true for all corresponding pairs.  For floating-point data this implies
+    exact matching with no floating-point tolerance.  The data type can be
+    string or bytes.


-> The data type can be int, float, string, or bytes?
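
To make the documented contract concrete across those data types, here is a sketch that uses np.intersect1d with return_indices=True as a stand-in. This is not esutil's match, and the ordering of the returned indices may differ from what match returns. It only demonstrates that arr1[ind1] == arr2[ind2] holds exactly for int, float, string, and bytes inputs.

```python
import numpy as np

# Made-up example inputs, one pair per data type
cases = {
    "int": (np.array([2, 1, 3]), np.array([3, 9, 1])),
    "float": (np.array([2.5, 1.5, 3.5]), np.array([3.5, 9.0, 1.5])),
    "string": (np.array(["b", "a", "c"]), np.array(["c", "x", "a"])),
    "bytes": (np.array([b"b", b"a", b"c"]), np.array([b"c", b"x", b"a"])),
}

results = {}
for name, (arr1, arr2) in cases.items():
    # Stand-in for match(): indices of the values common to both arrays
    _, ind1, ind2 = np.intersect1d(arr1, arr2, return_indices=True)
    # The documented contract: exact equality for every matched pair
    results[name] = bool(np.all(arr1[ind1] == arr2[ind2]))
    print(name, ind1, ind2, results[name])
```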

make numpy_util.match work for non-integer inputs

444cbf6

esheldon requested a review from erykoff August 20, 2024 15:13

erykoff reviewed Aug 20, 2024

esheldon added 4 commits August 20, 2024 13:03

rename test to test_match_nomatch

079c0b0

update doc

bae187d

update release notes, bump version

95abee4

better note

19551bd

erykoff approved these changes Aug 20, 2024

docs

6273689

esheldon merged commit e2eca31 into master Aug 20, 2024
20 checks passed

esheldon deleted the sstr branch August 20, 2024 18:07
