Skip to content

Commit

Permalink
apacheGH-37515: [C++] Remove memory address optimization from `Chunke…
Browse files Browse the repository at this point in the history
…dArray::Equals(const std::shared_ptr<arrow::ChunkedArray>& other)` if the `ChunkedArray` can have `NaN` values (apache#37579)

### Rationale for this change

`ChunkedArray::Equals(const std::shared_ptr<ChunkedArray>& other)` assumes that if the two `ChunkedArray`s share the same memory address, then they must be equal. However, this optimization doesn't take into account that `NaN` values are not considered equal by default. Consequently, this can lead to surprising, inconsistent results from a user's perspective. For example, `ChunkedArray::Equals(const std::shared_ptr<ChunkedArray>& other)`  and `ChunkedArray::Equals(const ChunkedArray& other)` may return different results.

 The program below illustrates this inconsistency:

```c++
#include <arrow/api.h>
#include <arrow/type.h>

#include <iostream>
#include <math.h>
#include <sstream>

arrow::Result<std::shared_ptr<arrow::ChunkedArray>> make_chunked_array() {
    arrow::NumericBuilder<arrow::DoubleType> builder;
    
    std::shared_ptr<arrow::Array> array;
    ARROW_RETURN_NOT_OK(builder.AppendValues({0, 1, NAN, 2, 4}));
    ARROW_RETURN_NOT_OK(builder.Finish(&array));
    
    return arrow::ChunkedArray::Make({array});
}

int main(int argc, char *argv[])
{
    auto maybe_chunked_array = make_chunked_array();
    if (!maybe_chunked_array.ok()) {
        return -1;
    }
    auto chunked_array = std::move(maybe_chunked_array).ValueUnsafe();
    auto array = chunked_array->chunk(0);
    
    std::stringstream stream;

    stream << "chunked_array contents: ";
    stream << "\n\n";
    stream << chunked_array->ToString();
    stream << "\n\n";
    stream << "chunked_array->Equals(chunked_array): ";
    stream << chunked_array->Equals(chunked_array);
    stream << "chunked_array->Equals(*chunked_array): ";
    stream << chunked_array->Equals(*chunked_array);
    
    std::cout << stream.str() << std::endl;
}
```

Here is the output of this program:

```shell
chunked_array contents: 

[
  [
    0,
    1,
    nan,
    2,
    4
  ]
]

chunked_array->Equals(chunked_array): 1
chunked_array->Equals(*chunked_array): 0
```

### What changes are included in this PR?

Updated `ChunkedArray::Equals(const std::shared_ptr<ChunkedArray>& other)` to only return `true` early IF:
   - The two share the same address AND
   - They cannot have `NaN` values 

If both of those conditions are not satisfied, `ChunkedArray::Equals(const std::shared_ptr<ChunkedArray>& other)` will do the element-by-element comparison.

### Are these changes tested?

Yes. I added a new test case called `EqualsSameAddressWithNaNs` to `chunked_array_test.cc`.

### Are there any user-facing changes?

Yes. `ChunkedArray::Equals(const std::shared_ptr<arrow::ChunkedArray>& other)` may return `false` even if the two `ChunkedArray`s have the same memory address. This will only occur if the `ChunkedArray`'s contain `NaN` values.

* Closes: apache#37515

Authored-by: Sarah Gilmore <sgilmore@mathworks.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
  • Loading branch information
sgilmore10 authored Sep 7, 2023
1 parent 7f604b2 commit 85ec07e
Show file tree
Hide file tree
Showing 2 changed files with 50 additions and 3 deletions.
24 changes: 21 additions & 3 deletions cpp/src/arrow/chunked_array.cc
Original file line number Diff line number Diff line change
Expand Up @@ -30,6 +30,7 @@
#include "arrow/pretty_print.h"
#include "arrow/status.h"
#include "arrow/type.h"
#include "arrow/type_traits.h"
#include "arrow/util/checked_cast.h"
#include "arrow/util/logging.h"

Expand Down Expand Up @@ -111,13 +112,30 @@ bool ChunkedArray::Equals(const ChunkedArray& other) const {
.ok();
}

bool ChunkedArray::Equals(const std::shared_ptr<ChunkedArray>& other) const {
if (this == other.get()) {
return true;
namespace {

bool mayHaveNaN(const arrow::DataType& type) {
if (type.num_fields() == 0) {
return is_floating(type.id());
} else {
for (const auto& field : type.fields()) {
if (mayHaveNaN(*field->type())) {
return true;
}
}
}
return false;
}

} // namespace

bool ChunkedArray::Equals(const std::shared_ptr<ChunkedArray>& other) const {
if (!other) {
return false;
}
if (this == other.get() && !mayHaveNaN(*type_)) {
return true;
}
return Equals(*other.get());
}

Expand Down
29 changes: 29 additions & 0 deletions cpp/src/arrow/chunked_array_test.cc
Original file line number Diff line number Diff line change
Expand Up @@ -146,6 +146,35 @@ TEST_F(TestChunkedArray, EqualsDifferingMetadata) {
ASSERT_TRUE(left.Equals(right));
}

TEST_F(TestChunkedArray, EqualsSameAddressWithNaNs) {
auto chunk_with_nan1 = ArrayFromJSON(float64(), "[0, 1, 2, NaN]");
auto chunk_without_nan1 = ArrayFromJSON(float64(), "[3, 4, 5]");
ArrayVector chunks1 = {chunk_with_nan1, chunk_without_nan1};
ASSERT_OK_AND_ASSIGN(auto chunked_array_with_nan1, ChunkedArray::Make(chunks1));
ASSERT_FALSE(chunked_array_with_nan1->Equals(chunked_array_with_nan1));

auto chunk_without_nan2 = ArrayFromJSON(float64(), "[6, 7, 8, 9]");
ArrayVector chunks2 = {chunk_without_nan1, chunk_without_nan2};
ASSERT_OK_AND_ASSIGN(auto chunked_array_without_nan1, ChunkedArray::Make(chunks2));
ASSERT_TRUE(chunked_array_without_nan1->Equals(chunked_array_without_nan1));

auto int32_array = ArrayFromJSON(int32(), "[0, 1, 2]");
auto float64_array_with_nan = ArrayFromJSON(float64(), "[0, 1, NaN]");
ArrayVector arrays1 = {int32_array, float64_array_with_nan};
std::vector<std::string> fieldnames = {"Int32Type", "Float64Type"};
ASSERT_OK_AND_ASSIGN(auto struct_with_nan, StructArray::Make(arrays1, fieldnames));
ArrayVector chunks3 = {struct_with_nan};
ASSERT_OK_AND_ASSIGN(auto chunked_array_with_nan2, ChunkedArray::Make(chunks3));
ASSERT_FALSE(chunked_array_with_nan2->Equals(chunked_array_with_nan2));

auto float64_array_without_nan = ArrayFromJSON(float64(), "[0, 1, 2]");
ArrayVector arrays2 = {int32_array, float64_array_without_nan};
ASSERT_OK_AND_ASSIGN(auto struct_without_nan, StructArray::Make(arrays2, fieldnames));
ArrayVector chunks4 = {struct_without_nan};
ASSERT_OK_AND_ASSIGN(auto chunked_array_without_nan2, ChunkedArray::Make(chunks4));
ASSERT_TRUE(chunked_array_without_nan2->Equals(chunked_array_without_nan2));
}

TEST_F(TestChunkedArray, SliceEquals) {
random::RandomArrayGenerator gen(42);

Expand Down

0 comments on commit 85ec07e

Please sign in to comment.