Conversation

@alamb (Contributor) commented Nov 16, 2025

…thods, deprecate old methods

Which issue does this PR close?

Rationale for this change

  1. bitwise_bin_op_helper and bitwise_unary_op_helper are somewhat hard to find and use,
    as explained in WIP: special case bitwise ops when buffers are u64 aligned #8807

  2. I want to optimize bitwise operations even more heavily (see WIP: special case bitwise ops when buffers are u64 aligned #8807), so I want the implementations centralized so I can focus the effort there

Also, I think these APIs cover the use case explained by @jorstmann in #8561:

Building a new buffer by starting from an empty state and incrementally appending new bits (append_value, append_slice, append_packed_range and similar methods).

By creating a method on Buffer directly, it is easier to find, and it is clearer that
a new Buffer is being created.

What changes are included in this PR?

Changes:

  1. Add Buffer::from_bitwise_unary and Buffer::from_bitwise_binary methods that do the same thing as bitwise_unary_op_helper and bitwise_bin_op_helper but are easier to find and use
  2. Deprecate bitwise_unary_op_helper and bitwise_bin_op_helper in favor
    of the new Buffer methods
  3. Document the new methods, with examples (specifically that the bitwise operations
    operate on bits, not bytes, and shouldn't do any cross-byte operations)
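To make the no-cross-byte point concrete, here is a standalone sketch in plain Rust (not the new Buffer API): because bitwise operations act on each bit independently, applying them byte-by-byte over a packed bitmap is equivalent to applying them bit-by-bit, and bits never mix across byte boundaries.

```rust
// Sketch (not the arrow-rs API): bitwise ops on packed bitmaps act
// independently on each bit, so applying `op` byte-by-byte is equivalent
// to applying it bit-by-bit -- there is no cross-byte interaction.
fn bitwise_binary(left: &[u8], right: &[u8], op: impl Fn(u8, u8) -> u8) -> Vec<u8> {
    assert_eq!(left.len(), right.len());
    left.iter().zip(right).map(|(&l, &r)| op(l, r)).collect()
}

fn main() {
    let l = [0b1100_1010u8, 0xFF];
    let r = [0b1010_1010u8, 0x0F];
    let and = bitwise_binary(&l, &r, |a, b| a & b);
    // each byte is AND-ed in place; no bit leaks into a neighboring byte
    assert_eq!(and, vec![0b1000_1010, 0x0F]);
}
```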

Are these changes tested?

Yes, new doc tests

Are there any user-facing changes?

New APIs, some deprecated

@github-actions github-actions bot added the arrow Changes to the arrow crate label Nov 16, 2025
@alamb alamb force-pushed the alamb/bitwise_ops branch from 3c68505 to 69e68a1 on November 16, 2025 14:02
@alamb alamb force-pushed the alamb/bitwise_ops branch from 69e68a1 to d5a3604 on November 16, 2025 14:04
@alamb (Contributor, Author) commented Nov 16, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/bitwise_ops (d5a3604) to ca4a0ae diff
BENCH_NAME=boolean_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench boolean_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_bitwise_ops
Results will be posted here when complete

@alamb (Contributor, Author) commented Nov 16, 2025

🤖: Benchmark completed

Details

group         alamb_bitwise_ops                      main
-----         -----------------                      ----
and           1.00    272.6±1.27ns        ? ?/sec    1.00    272.7±0.86ns        ? ?/sec
and_sliced    1.00   1096.3±7.89ns        ? ?/sec    1.00   1094.7±3.34ns        ? ?/sec
not           1.00    213.1±0.25ns        ? ?/sec    1.00    214.2±1.06ns        ? ?/sec
not_sliced    1.01    965.5±1.32ns        ? ?/sec    1.00    960.6±3.89ns        ? ?/sec
or            1.01    255.1±0.63ns        ? ?/sec    1.00    253.8±1.86ns        ? ?/sec
or_sliced     1.00   1228.0±7.56ns        ? ?/sec    1.00  1227.8±18.85ns        ? ?/sec

@alamb (Contributor, Author) commented Nov 16, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/bitwise_ops (d5a3604) to ca4a0ae diff
BENCH_NAME=buffer_bit_ops
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench buffer_bit_ops
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_bitwise_ops
Results will be posted here when complete

@alamb (Contributor, Author) commented Nov 16, 2025

🤖: Benchmark completed

Details

group                                alamb_bitwise_ops                      main
-----                                -----------------                      ----
buffer_binary_ops/and                1.00    259.6±0.56ns    55.1 GB/sec    1.00    258.9±2.00ns    55.2 GB/sec
buffer_binary_ops/and_with_offset    1.12   1486.1±2.12ns     9.6 GB/sec    1.00   1322.8±9.40ns    10.8 GB/sec
buffer_binary_ops/or                 1.00    239.3±0.60ns    59.8 GB/sec    1.07    256.3±1.96ns    55.8 GB/sec
buffer_binary_ops/or_with_offset     1.00   1355.4±2.50ns    10.6 GB/sec    1.10  1484.8±14.40ns     9.6 GB/sec
buffer_unary_ops/not                 1.14    257.5±0.71ns    37.0 GB/sec    1.00    225.9±3.19ns    42.2 GB/sec
buffer_unary_ops/not_with_offset     1.00    868.1±2.51ns    11.0 GB/sec    1.34  1160.1±14.15ns     8.2 GB/sec

@alamb (Contributor, Author) commented Nov 16, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/bitwise_ops (d5a3604) to ca4a0ae diff
BENCH_NAME=boolean_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench boolean_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_bitwise_ops
Results will be posted here when complete

@alamb (Contributor, Author) commented Nov 16, 2025

🤖: Benchmark completed

Details

group         alamb_bitwise_ops                      main
-----         -----------------                      ----
and           1.00    272.4±1.45ns        ? ?/sec    1.00    273.1±1.36ns        ? ?/sec
and_sliced    1.00   1096.0±1.60ns        ? ?/sec    1.00   1095.1±2.77ns        ? ?/sec
not           1.00    213.8±0.29ns        ? ?/sec    1.00    214.0±0.40ns        ? ?/sec
not_sliced    1.00    965.6±9.77ns        ? ?/sec    1.00    961.8±5.75ns        ? ?/sec
or            1.00    254.1±0.66ns        ? ?/sec    1.01    255.6±0.41ns        ? ?/sec
or_sliced     1.00   1225.5±2.12ns        ? ?/sec    1.00   1226.9±7.43ns        ? ?/sec

@alamb (Contributor, Author) commented Nov 16, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/bitwise_ops (d5a3604) to ca4a0ae diff
BENCH_NAME=buffer_bit_ops
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench buffer_bit_ops
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_bitwise_ops
Results will be posted here when complete

@alamb (Contributor, Author) commented Nov 16, 2025

🤖: Benchmark completed

Details

group                                alamb_bitwise_ops                      main
-----                                -----------------                      ----
buffer_binary_ops/and                1.00    259.7±0.55ns    55.1 GB/sec    1.00    259.3±4.36ns    55.2 GB/sec
buffer_binary_ops/and_with_offset    1.13   1486.2±3.20ns     9.6 GB/sec    1.00   1320.5±3.78ns    10.8 GB/sec
buffer_binary_ops/or                 1.00    239.2±0.34ns    59.8 GB/sec    1.07    256.2±0.89ns    55.8 GB/sec
buffer_binary_ops/or_with_offset     1.00   1355.8±4.32ns    10.6 GB/sec    1.09   1483.7±4.32ns     9.6 GB/sec
buffer_unary_ops/not                 1.13    257.1±0.97ns    37.1 GB/sec    1.00    226.6±1.72ns    42.1 GB/sec
buffer_unary_ops/not_with_offset     1.00    863.6±3.06ns    11.0 GB/sec    1.32   1139.4±2.91ns     8.4 GB/sec

@alamb (Contributor, Author) commented Nov 18, 2025

The benchmarks show a slowdown for some operations for some reason

buffer_binary_ops/and_with_offset 1.13 1486.2±3.20ns 9.6 GB/sec 1.00 1320.5±3.78ns 10.8 GB/sec

However, given how short the benchmark runs are, I am thinking maybe this is cache-line effects or something.

I have an idea for making the benchmarks less noisy (basically run them in a 100x loop)
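The 100x-loop idea can be sketched with plain std timing (a hypothetical illustration, not the gh_compare_arrow.sh harness): at roughly a microsecond per call, a single invocation is dominated by allocator and cache noise, and averaging many calls per measurement smooths that out.

```rust
use std::time::{Duration, Instant};

// Sketch: time a kernel in a 100x inner loop and report the per-call
// average, amortizing per-measurement noise across many invocations.
fn time_avg<F: FnMut()>(mut f: F, iters: u32) -> Duration {
    let start = Instant::now();
    for _ in 0..iters {
        f();
    }
    start.elapsed() / iters
}

fn main() {
    // black_box keeps the allocation from being optimized away
    let d = time_avg(|| { std::hint::black_box(vec![0u8; 1024]); }, 100);
    println!("avg per call: {d:?}");
}
```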

let rem = op(left_chunks.remainder_bits(), right_chunks.remainder_bits());
// we are counting bits starting from the least significant bit, so to_le_bytes should be correct
let rem = &rem.to_le_bytes()[0..remainder_bytes];
buffer.extend_from_slice(rem);
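As an aside, the remainder slicing in the quoted snippet can be checked in isolation (a standalone sketch, not the arrow-rs code): Arrow packs validity bits least-significant-bit first, so the low bytes of the remainder u64, taken in little-endian order, are exactly the trailing bytes of the bitmap.

```rust
// Sketch: take only the low `remainder_bytes` of the final u64 chunk,
// in little-endian order, matching LSB-first bit packing.
fn remainder_bytes_of(rem: u64, remainder_bits: usize) -> Vec<u8> {
    let remainder_bytes = remainder_bits.div_ceil(8);
    rem.to_le_bytes()[..remainder_bytes].to_vec()
}

fn main() {
    // 12 trailing bits -> 2 bytes: low byte 0x0A first, then 0x0F
    assert_eq!(remainder_bytes_of(0xF0A, 12), vec![0x0A, 0x0F]);
}
```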
Contributor:

This might do an extra allocation? Other places avoid this by preallocating the final u64 needed for the remainder as well (see collect_bool)

@alamb (Contributor, Author) commented Nov 19, 2025:

That is a good call -- I will make the change.

However, this is the same code the current bitwise_binary_op uses, so I would expect no performance difference 🤔

https://github.com/apache/arrow-rs/pull/8854/files#diff-e7a951ab8abfeef1016ed4427a3aef25be5be470454caa1e1dd93e56968316b5L122

Contributor:

I agree; however, allocations during benchmarking seem to make the results very noisy.

@alamb (Contributor, Author):

🤔 I tried this

    pub fn from_bitwise_binary_op<F>(
        left: impl AsRef<[u8]>,
        left_offset_in_bits: usize,
        right: impl AsRef<[u8]>,
        right_offset_in_bits: usize,
        len_in_bits: usize,
        mut op: F,
    ) -> Buffer
    where
        F: FnMut(u64, u64) -> u64,
    {
        let left_chunks = BitChunks::new(left.as_ref(), left_offset_in_bits, len_in_bits);
        let right_chunks = BitChunks::new(right.as_ref(), right_offset_in_bits, len_in_bits);

        let remainder_bytes = ceil(left_chunks.remainder_len(), 8);
        // if it evenly divides into u64 chunks
        let buffer = if remainder_bytes == 0 {
            let chunks = left_chunks
                .iter()
                .zip(right_chunks.iter())
                .map(|(left, right)| op(left, right));
            // Soundness: the `BitChunks` iterator correctly reports its upper bound
            unsafe { MutableBuffer::from_trusted_len_iter(chunks) }
        } else {
            // Compute last u64 here so that we can reserve exact capacity
            let rem = op(left_chunks.remainder_bits(), right_chunks.remainder_bits());

            let chunks = left_chunks
                .iter()
                .zip(right_chunks.iter())
                .map(|(left, right)| op(left, right))
                .chain(std::iter::once(rem));
            // Soundness: the `BitChunks` iterator correctly reports its upper bound,
            // and so does the `chain` iterator
            let mut buffer = unsafe { MutableBuffer::from_trusted_len_iter(chunks) };
            // Adjust the length down if last u64 is not fully used
            let extra_bytes = 8 - remainder_bytes;
            buffer.truncate(buffer.len() - extra_bytes);
            buffer
        };
        buffer.into()
    }

But it seems to be slower.
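The chain-then-truncate approach above can be illustrated with plain Vecs (a sketch of the idea, not the MutableBuffer code): emit all full u64 chunks plus one final u64 holding the remainder, then trim the bytes the remainder does not use.

```rust
// Sketch with plain Vecs (not MutableBuffer): append the remainder as a
// whole u64, then truncate away its unused high bytes.
fn pack(chunks: &[u64], rem: u64, remainder_bits: usize) -> Vec<u8> {
    let mut out: Vec<u8> = chunks
        .iter()
        .chain(std::iter::once(&rem))
        .flat_map(|c| c.to_le_bytes())
        .collect();
    let remainder_bytes = remainder_bits.div_ceil(8);
    // keep all full chunks plus only the bytes the remainder occupies
    out.truncate(chunks.len() * 8 + remainder_bytes);
    out
}

fn main() {
    let out = pack(&[u64::MAX], 0x0A, 4);
    assert_eq!(out.len(), 9); // 8 full bytes + 1 remainder byte
    assert_eq!(out[8], 0x0A);
}
```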

@alamb (Contributor, Author):

I also tried making a version of MutableBuffer::from_trusted_len_iter that adds additional capacity, and it didn't seem to help either (perhaps because the benchmarks happen to avoid reallocation 🤔)

    /// Like [`from_trusted_len_iter`] but can add additional capacity at the end
    /// in case the caller wants to add more data after the initial iterator.
    #[inline]
    pub unsafe fn from_trusted_len_iter_with_additional_capacity<T: ArrowNativeType, I: Iterator<Item = T>>(
        iterator: I,
        additional_capacity: usize,
    ) -> Self {
        let item_size = std::mem::size_of::<T>();
        let (_, upper) = iterator.size_hint();
        let upper = upper.expect("from_trusted_len_iter requires an upper limit");
        let len = upper * item_size;

        let mut buffer = MutableBuffer::new(len + additional_capacity);

        let mut dst = buffer.data.as_ptr();
        for item in iterator {
            // note how there is no reserve here (compared with `extend_from_iter`)
            let src = item.to_byte_slice().as_ptr();
            unsafe { std::ptr::copy_nonoverlapping(src, dst, item_size) };
            dst = unsafe { dst.add(item_size) };
        }
        assert_eq!(
            unsafe { dst.offset_from(buffer.data.as_ptr()) } as usize,
            len,
            "Trusted iterator length was not accurately reported"
        );
        buffer.len = len;
        buffer
    }

@Dandandan (Contributor) commented Nov 19, 2025:

There is also an extend_from_trusted_len_iter in MutableBuffer? Another option is to use Vec::extend here as well.
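The Vec::extend option can be sketched standalone (plain Vec, not MutableBuffer): reserving the exact byte count up front (full chunks plus remainder) keeps the whole build in a single allocation, and extend fills it without reallocating.

```rust
// Sketch of the Vec::extend option: reserve exact capacity, then extend
// with the chunk bytes and the remainder bytes -- one allocation total.
fn build(chunks: impl ExactSizeIterator<Item = u64>, rem: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(chunks.len() * 8 + rem.len());
    out.extend(chunks.flat_map(u64::to_le_bytes));
    out.extend_from_slice(rem);
    out
}

fn main() {
    let out = build([1u64, 2].into_iter(), &[0xAA]);
    assert_eq!(out.len(), 17); // 2 chunks * 8 bytes + 1 remainder byte
    assert_eq!(out[16], 0xAA);
}
```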

F: FnMut(u64) -> u64,
{
// reserve capacity and set length so we can get a typed view of u64 chunks
let mut result =
Contributor:

As we overwrite the results, we shouldn't need to initialize/zero out the array.
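A minimal sketch of the no-zeroing idea (not the arrow-rs change itself): since the op overwrites every slot, the output can start uninitialized; write through the spare capacity and only set the length once everything is written.

```rust
// Sketch: skip zero-initialization by writing into spare capacity, then
// setting the length after every slot has been initialized.
fn map_chunks(input: &[u64], op: impl Fn(u64) -> u64) -> Vec<u64> {
    let mut out: Vec<u64> = Vec::with_capacity(input.len());
    for (slot, &v) in out.spare_capacity_mut().iter_mut().zip(input) {
        slot.write(op(v));
    }
    // SAFETY: the first input.len() elements were just initialized above
    unsafe { out.set_len(input.len()) };
    out
}

fn main() {
    assert_eq!(map_chunks(&[0, u64::MAX], |x| !x), vec![u64::MAX, 0]);
}
```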

@Dandandan (Contributor) commented:

> The benchmarks show a slowdown for some operations for some reason
>
> buffer_binary_ops/and_with_offset 1.13 1486.2±3.20ns 9.6 GB/sec 1.00 1320.5±3.78ns 10.8 GB/sec
>
> However, given the duration of the benchmark, I am thinking maybe this is cache lines or something.
>
> I have an idea of how to improve the benchmarks so they are less noisy (basically run them in a 100x loop)

Might it also be because of the allocation? It looks like and_with_offset and and are not over power-of-two inputs.

