Skip to content

Conversation

@joe-ucp
Copy link

@joe-ucp joe-ucp commented Nov 18, 2025

Here’s a version that fits their template and avoids mentioning Python:


Which issue does this PR close?

Rationale for this change

The bitwise operations used across arrow-buffer, arrow-arith, and arrow-select were spread across several overlapping helpers, which made it difficult to:

  • Know which APIs were the “canonical” bitwise primitives, and
  • Reason about and fix bugs involving offsets and null bitmaps, especially in nullif.

While working on bitwise kernels, we also found that nullif in arrow-select was fragile around sliced arrays, offsets, and null-count handling. This PR centralizes the bitwise logic and makes nullif explicitly respect the ArrayData bitmap layout invariants (length, offset, validity mapping).

What changes are included in this PR?

High-level:

  • Documented the bitmap layout contract

    • Added a documentation file (docs/arraydata_bitmap_layout_contract.md) that spells out:

      • How len and offset relate to validity bits,
      • How new arrays should be constructed (offset = 0),
      • And the nullif validity rule: result_valid(i) = V(i) & !C(i).
  • Strengthened core bitwise testing in arrow-buffer / arrow-arith

    • Added/extended tests that:

      • Compare new Buffer/MutableBuffer bitwise APIs against existing helpers across random data,
      • Exercise offsets and tail bits (including (offset + len) boundary cases),
      • Check that in-place (MutableBuffer) operations produce the same bits as allocating versions.
    • Added rand as a dev-dependency of arrow-arith to support reproducible randomized tests.

  • Refactored nullif to follow the layout contract

    • Introduced a helper:

      fn compute_nullif_validity(
          left_valid: &Buffer,
          left_offset: usize,
          cond_mask: &Buffer,
          cond_offset: usize,
          len: usize,
      ) -> Buffer

      which implements the core V(i) & !C(i) validity logic over buffers and bit offsets.

    • Updated nullif to:

      • Derive correct bit offsets from ArrayData for the left validity and condition mask,

      • Always build new result ArrayData with offset = 0,

      • Slice value/offset buffers so element 0 in the result corresponds to logical index 0 for:

        • Primitive arrays,
        • Boolean arrays,
        • Utf8 / LargeUtf8 arrays,
        • Struct arrays (with correct top-level validity propagation).
  • Nullif-focused tests

    • Added a focused unit test for compute_nullif_validity that exercises:

      • No-offset cases,
      • (offset + len) % 8 == 0 boundaries,
      • Cross-byte bit ranges.
    • Verified and fixed behavior so that the existing nullif tests all pass, including:

      • Offset regressions for ints and booleans,
      • String and struct slice tests,
      • null_count regression tests,
      • The existing nullif_fuzz test.

Net effect: nullif is now a thin consumer of the documented bitmap layout contract and the consolidated bitwise APIs.

Are these changes tested?

Yes.

The following were run locally:

# Core buffer + arithmetic crates
cargo test -p arrow-buffer --lib
cargo test -p arrow-arith --lib

# Select crate, including all nullif tests
cargo test -p arrow-select --lib
cargo test -p arrow-select --lib nullif -- --nocapture

All of the above pass, including the previously failing nullif regression and fuzz tests.

Are there any user-facing changes?

There are no breaking changes to public APIs.

User-visible effects:

  • nullif behavior is now strictly aligned with the documented bitmap layout contract for:

    • Sliced arrays,
    • Offsets,
    • Null masks,
    • Strings and structs.
  • This should only be observable as bug fixes in corner cases (e.g., nulls at specific offsets, struct/string slices), not API changes.

No new configuration or feature flags are introduced; this is primarily a correctness and maintainability improvement around existing behavior.

@github-actions github-actions bot added the arrow Changes to the arrow crate label Nov 18, 2025
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @joe-ucp

This PR is likely too large for me to be able to realistically review it in any near term timeline

Would you be willing to work on some more focused PRs that each do some of the work described in #8806?

For example, I have already made a PR for the first item

Perhaps you could make a PR for

Add BooleanArray::binary and BooleanArray::unary functions that use the new Buffer functions internally

(you could base it on #8854 or use bitwise_unary_op_helper directly)

@alamb
Copy link
Contributor

alamb commented Nov 18, 2025

Splitting out the nullif improvements into its own PR would also help review substantially

@joe-ucp joe-ucp marked this pull request as draft November 18, 2025 22:50
@joe-ucp
Copy link
Author

joe-ucp commented Nov 18, 2025

@alamb Thanks so much for the detailed feedback and for pointing me at #8806 and #8854, I really appreciate it.

You’re absolutely right that this PR ended up too large and tries to do too many things at once. I’m still very new to contributing on GitHub and to this project, so this was me exploring the space a bit too aggressively 😅

I’m happy to rework this into more focused PRs:

First, I’ll open a small PR that just adds BooleanArray::binary / BooleanArray::unary wired up to the new buffer bitwise APIs (based on #8854 / the existing helpers).

Separately, I’ll split out the nullif / layout-contract improvement into their own PR to make that review more manageable.

I’ll start on the BooleanArray PR tomorrow and follow up with the nullif one after that. Once those are up, I’m also happy to keep iterating on the rest of the work described in #8806 in smaller reviewable chunks.

Thanks again for taking the time to review and for the guidance! 🫡

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Consolidate bitwise operation implementations

2 participants