Conversation

@alamb (Contributor) commented Nov 16, 2025

…thods, deprecate old methods

Which issue does this PR close?

Rationale for this change

  1. bitwise_bin_op_helper and bitwise_unary_op_helper are somewhat hard to find and use,
    as explained in WIP: special case bitwise ops when buffers are u64 aligned #8807

  2. I want to optimize bitwise operations even more heavily (see WIP: special case bitwise ops when buffers are u64 aligned #8807), so I want the implementations centralized so I can focus the effort there

Also, I think these APIs cover the use case explained by @jorstmann in #8561:

Building a new buffer by starting from an empty state and incrementally appending new bits (append_value, append_slice, append_packed_range and similar methods).

By creating a method on Buffer directly, it is easier to find, and it is clearer that
a new Buffer is being created.

What changes are included in this PR?

Changes:

  1. Add Buffer::from_bitwise_unary and Buffer::from_bitwise_binary methods that do the same thing as bitwise_unary_op_helper and bitwise_bin_op_helper but are easier to find and use
  2. Deprecate bitwise_unary_op_helper and bitwise_bin_op_helper in favor
    of the new Buffer methods
  3. Document the new methods, with examples (specifically that the bitwise operations
    operate on bits, not bytes, and shouldn't do any cross-byte operations)
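To make the no-cross-byte point concrete, here is a standalone sketch in plain Rust (not the new Buffer API): because bitwise operations act on each bit independently, applying them byte-by-byte over a packed bitmap is equivalent to applying them bit-by-bit, and bits never mix across byte boundaries.

```rust
// Sketch (not the arrow-rs API): bitwise ops on packed bitmaps act
// independently on each bit, so applying `op` byte-by-byte is equivalent
// to applying it bit-by-bit -- there is no cross-byte interaction.
fn bitwise_binary(left: &[u8], right: &[u8], op: impl Fn(u8, u8) -> u8) -> Vec<u8> {
    assert_eq!(left.len(), right.len());
    left.iter().zip(right).map(|(&l, &r)| op(l, r)).collect()
}

fn main() {
    let l = [0b1100_1010u8, 0xFF];
    let r = [0b1010_1010u8, 0x0F];
    let and = bitwise_binary(&l, &r, |a, b| a & b);
    // each byte is AND-ed in place; no bit leaks into a neighboring byte
    assert_eq!(and, vec![0b1000_1010, 0x0F]);
}
```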

Are these changes tested?

Yes, new doc tests

Are there any user-facing changes?

New APIs, some deprecated

@github-actions github-actions bot added the arrow Changes to the arrow crate label Nov 16, 2025
@alamb alamb force-pushed the alamb/bitwise_ops branch from 3c68505 to 69e68a1 on November 16, 2025 14:02
@alamb alamb force-pushed the alamb/bitwise_ops branch from 69e68a1 to d5a3604 on November 16, 2025 14:04
@alamb (Contributor, Author) commented Nov 16, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/bitwise_ops (d5a3604) to ca4a0ae diff
BENCH_NAME=boolean_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench boolean_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_bitwise_ops
Results will be posted here when complete

@alamb (Contributor, Author) commented Nov 16, 2025

🤖: Benchmark completed

Details

group         alamb_bitwise_ops                      main
-----         -----------------                      ----
and           1.00    272.6±1.27ns        ? ?/sec    1.00    272.7±0.86ns        ? ?/sec
and_sliced    1.00   1096.3±7.89ns        ? ?/sec    1.00   1094.7±3.34ns        ? ?/sec
not           1.00    213.1±0.25ns        ? ?/sec    1.00    214.2±1.06ns        ? ?/sec
not_sliced    1.01    965.5±1.32ns        ? ?/sec    1.00    960.6±3.89ns        ? ?/sec
or            1.01    255.1±0.63ns        ? ?/sec    1.00    253.8±1.86ns        ? ?/sec
or_sliced     1.00   1228.0±7.56ns        ? ?/sec    1.00  1227.8±18.85ns        ? ?/sec

@alamb (Contributor, Author) commented Nov 16, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/bitwise_ops (d5a3604) to ca4a0ae diff
BENCH_NAME=buffer_bit_ops
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench buffer_bit_ops
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_bitwise_ops
Results will be posted here when complete

@alamb (Contributor, Author) commented Nov 16, 2025

🤖: Benchmark completed

Details

group                                alamb_bitwise_ops                      main
-----                                -----------------                      ----
buffer_binary_ops/and                1.00    259.6±0.56ns    55.1 GB/sec    1.00    258.9±2.00ns    55.2 GB/sec
buffer_binary_ops/and_with_offset    1.12   1486.1±2.12ns     9.6 GB/sec    1.00   1322.8±9.40ns    10.8 GB/sec
buffer_binary_ops/or                 1.00    239.3±0.60ns    59.8 GB/sec    1.07    256.3±1.96ns    55.8 GB/sec
buffer_binary_ops/or_with_offset     1.00   1355.4±2.50ns    10.6 GB/sec    1.10  1484.8±14.40ns     9.6 GB/sec
buffer_unary_ops/not                 1.14    257.5±0.71ns    37.0 GB/sec    1.00    225.9±3.19ns    42.2 GB/sec
buffer_unary_ops/not_with_offset     1.00    868.1±2.51ns    11.0 GB/sec    1.34  1160.1±14.15ns     8.2 GB/sec

@alamb (Contributor, Author) commented Nov 16, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/bitwise_ops (d5a3604) to ca4a0ae diff
BENCH_NAME=boolean_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench boolean_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_bitwise_ops
Results will be posted here when complete

@alamb (Contributor, Author) commented Nov 16, 2025

🤖: Benchmark completed

Details

group         alamb_bitwise_ops                      main
-----         -----------------                      ----
and           1.00    272.4±1.45ns        ? ?/sec    1.00    273.1±1.36ns        ? ?/sec
and_sliced    1.00   1096.0±1.60ns        ? ?/sec    1.00   1095.1±2.77ns        ? ?/sec
not           1.00    213.8±0.29ns        ? ?/sec    1.00    214.0±0.40ns        ? ?/sec
not_sliced    1.00    965.6±9.77ns        ? ?/sec    1.00    961.8±5.75ns        ? ?/sec
or            1.00    254.1±0.66ns        ? ?/sec    1.01    255.6±0.41ns        ? ?/sec
or_sliced     1.00   1225.5±2.12ns        ? ?/sec    1.00   1226.9±7.43ns        ? ?/sec

@alamb (Contributor, Author) commented Nov 16, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/bitwise_ops (d5a3604) to ca4a0ae diff
BENCH_NAME=buffer_bit_ops
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench buffer_bit_ops
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_bitwise_ops
Results will be posted here when complete

@alamb (Contributor, Author) commented Nov 16, 2025

🤖: Benchmark completed

Details

group                                alamb_bitwise_ops                      main
-----                                -----------------                      ----
buffer_binary_ops/and                1.00    259.7±0.55ns    55.1 GB/sec    1.00    259.3±4.36ns    55.2 GB/sec
buffer_binary_ops/and_with_offset    1.13   1486.2±3.20ns     9.6 GB/sec    1.00   1320.5±3.78ns    10.8 GB/sec
buffer_binary_ops/or                 1.00    239.2±0.34ns    59.8 GB/sec    1.07    256.2±0.89ns    55.8 GB/sec
buffer_binary_ops/or_with_offset     1.00   1355.8±4.32ns    10.6 GB/sec    1.09   1483.7±4.32ns     9.6 GB/sec
buffer_unary_ops/not                 1.13    257.1±0.97ns    37.1 GB/sec    1.00    226.6±1.72ns    42.1 GB/sec
buffer_unary_ops/not_with_offset     1.00    863.6±3.06ns    11.0 GB/sec    1.32   1139.4±2.91ns     8.4 GB/sec

@alamb (Contributor, Author) commented Nov 18, 2025

The benchmarks show a slowdown for some operations for some reason

buffer_binary_ops/and_with_offset 1.13 1486.2±3.20ns 9.6 GB/sec 1.00 1320.5±3.78ns 10.8 GB/sec

However, given how short the benchmark runs are, I am thinking maybe this is cache-line effects or something.

I have an idea for making the benchmarks less noisy (basically run them in a 100x loop)
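The 100x-loop idea can be sketched with plain std timing (a hypothetical illustration, not the gh_compare_arrow.sh harness): at roughly a microsecond per call, a single invocation is dominated by allocator and cache noise, and averaging many calls per measurement smooths that out.

```rust
use std::time::{Duration, Instant};

// Sketch: time a kernel in a 100x inner loop and report the per-call
// average, amortizing per-measurement noise across many invocations.
fn time_avg<F: FnMut()>(mut f: F, iters: u32) -> Duration {
    let start = Instant::now();
    for _ in 0..iters {
        f();
    }
    start.elapsed() / iters
}

fn main() {
    // black_box keeps the allocation from being optimized away
    let d = time_avg(|| { std::hint::black_box(vec![0u8; 1024]); }, 100);
    println!("avg per call: {d:?}");
}
```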

let rem = op(left_chunks.remainder_bits(), right_chunks.remainder_bits());
// we are counting bits starting from the least significant bit, so to_le_bytes should be correct
let rem = &rem.to_le_bytes()[0..remainder_bytes];
buffer.extend_from_slice(rem);
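As an aside, the remainder slicing in the quoted snippet can be checked in isolation (a standalone sketch, not the arrow-rs code): Arrow packs validity bits least-significant-bit first, so the low bytes of the remainder u64, taken in little-endian order, are exactly the trailing bytes of the bitmap.

```rust
// Sketch: take only the low `remainder_bytes` of the final u64 chunk,
// in little-endian order, matching LSB-first bit packing.
fn remainder_bytes_of(rem: u64, remainder_bits: usize) -> Vec<u8> {
    let remainder_bytes = remainder_bits.div_ceil(8);
    rem.to_le_bytes()[..remainder_bytes].to_vec()
}

fn main() {
    // 12 trailing bits -> 2 bytes: low byte 0x0A first, then 0x0F
    assert_eq!(remainder_bytes_of(0xF0A, 12), vec![0x0A, 0x0F]);
}
```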
Contributor:

This might do an extra allocation? Other places avoid this by preallocating the final u64 needed for the remainder as well (see collect_bool)

@alamb (Contributor, Author) commented Nov 19, 2025:

That is a good call -- I will make the change.

However, this is the same code the current bitwise_binary_op uses, so I would expect no performance difference 🤔

https://github.com/apache/arrow-rs/pull/8854/files#diff-e7a951ab8abfeef1016ed4427a3aef25be5be470454caa1e1dd93e56968316b5L122

Contributor:

I agree; however, allocations during benchmarking seem to make the results very noisy.

@alamb (Contributor, Author):

🤔 I tried this

    pub fn from_bitwise_binary_op<F>(
        left: impl AsRef<[u8]>,
        left_offset_in_bits: usize,
        right: impl AsRef<[u8]>,
        right_offset_in_bits: usize,
        len_in_bits: usize,
        mut op: F,
    ) -> Buffer
    where
        F: FnMut(u64, u64) -> u64,
    {
        let left_chunks = BitChunks::new(left.as_ref(), left_offset_in_bits, len_in_bits);
        let right_chunks = BitChunks::new(right.as_ref(), right_offset_in_bits, len_in_bits);

        let remainder_bytes = ceil(left_chunks.remainder_len(), 8);
        // if it evenly divides into u64 chunks
        let buffer = if remainder_bytes == 0 {
            let chunks = left_chunks
                .iter()
                .zip(right_chunks.iter())
                .map(|(left, right)| op(left, right));
            // Soundness: the `BitChunks` iterator correctly reports its upper bound
            unsafe { MutableBuffer::from_trusted_len_iter(chunks) }
        } else {
            // Compute last u64 here so that we can reserve exact capacity
            let rem = op(left_chunks.remainder_bits(), right_chunks.remainder_bits());

            let chunks = left_chunks
                .iter()
                .zip(right_chunks.iter())
                .map(|(left, right)| op(left, right))
                .chain(std::iter::once(rem));
            // Soundness: the `BitChunks` iterator correctly reports its upper bound,
            // and so does the `chain` iterator
            let mut buffer = unsafe { MutableBuffer::from_trusted_len_iter(chunks) };
            // Adjust the length down if last u64 is not fully used
            let extra_bytes = 8 - remainder_bytes;
            buffer.truncate(buffer.len() - extra_bytes);
            buffer
        };
        buffer.into()
    }

But it seems to be slower.
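The chain-then-truncate approach above can be illustrated with plain Vecs (a sketch of the idea, not the MutableBuffer code): emit all full u64 chunks plus one final u64 holding the remainder, then trim the bytes the remainder does not use.

```rust
// Sketch with plain Vecs (not MutableBuffer): append the remainder as a
// whole u64, then truncate away its unused high bytes.
fn pack(chunks: &[u64], rem: u64, remainder_bits: usize) -> Vec<u8> {
    let mut out: Vec<u8> = chunks
        .iter()
        .chain(std::iter::once(&rem))
        .flat_map(|c| c.to_le_bytes())
        .collect();
    let remainder_bytes = remainder_bits.div_ceil(8);
    // keep all full chunks plus only the bytes the remainder occupies
    out.truncate(chunks.len() * 8 + remainder_bytes);
    out
}

fn main() {
    let out = pack(&[u64::MAX], 0x0A, 4);
    assert_eq!(out.len(), 9); // 8 full bytes + 1 remainder byte
    assert_eq!(out[8], 0x0A);
}
```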

@alamb (Contributor, Author):

I also tried making a version of MutableBuffer::from_trusted_len_iter that adds additional capacity, and it didn't seem to help either (perhaps because the benchmarks happen to avoid reallocation 🤔)

    /// Like [`from_trusted_len_iter`] but can add additional capacity at the end
    /// in case the caller wants to add more data after the initial iterator.
    #[inline]
    pub unsafe fn from_trusted_len_iter_with_additional_capacity<T: ArrowNativeType, I: Iterator<Item = T>>(
        iterator: I,
        additional_capacity: usize,
    ) -> Self {
        let item_size = std::mem::size_of::<T>();
        let (_, upper) = iterator.size_hint();
        let upper = upper.expect("from_trusted_len_iter requires an upper limit");
        let len = upper * item_size;

        let mut buffer = MutableBuffer::new(len + additional_capacity);

        let mut dst = buffer.data.as_ptr();
        for item in iterator {
            // note how there is no reserve here (compared with `extend_from_iter`)
            let src = item.to_byte_slice().as_ptr();
            unsafe { std::ptr::copy_nonoverlapping(src, dst, item_size) };
            dst = unsafe { dst.add(item_size) };
        }
        assert_eq!(
            unsafe { dst.offset_from(buffer.data.as_ptr()) } as usize,
            len,
            "Trusted iterator length was not accurately reported"
        );
        buffer.len = len;
        buffer
    }

@Dandandan (Contributor) commented Nov 19, 2025:

There is also an extend_from_trusted_len_iter in MutableBuffer? Another option is to use Vec::extend here as well.
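The Vec::extend option can be sketched standalone (plain Vec, not MutableBuffer): reserving the exact byte count up front (full chunks plus remainder) keeps the whole build in a single allocation, and extend fills it without reallocating.

```rust
// Sketch of the Vec::extend option: reserve exact capacity, then extend
// with the chunk bytes and the remainder bytes -- one allocation total.
fn build(chunks: impl ExactSizeIterator<Item = u64>, rem: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(chunks.len() * 8 + rem.len());
    out.extend(chunks.flat_map(u64::to_le_bytes));
    out.extend_from_slice(rem);
    out
}

fn main() {
    let out = build([1u64, 2].into_iter(), &[0xAA]);
    assert_eq!(out.len(), 17); // 2 chunks * 8 bytes + 1 remainder byte
    assert_eq!(out[16], 0xAA);
}
```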

F: FnMut(u64) -> u64,
{
// reserve capacity and set length so we can get a typed view of u64 chunks
let mut result =
Contributor:

As we overwrite the results, we shouldn't need to initialize/zero out the array.
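A minimal sketch of the no-zeroing idea (not the arrow-rs change itself): since the op overwrites every slot, the output can start uninitialized; write through the spare capacity and only set the length once everything is written.

```rust
// Sketch: skip zero-initialization by writing into spare capacity, then
// setting the length after every slot has been initialized.
fn map_chunks(input: &[u64], op: impl Fn(u64) -> u64) -> Vec<u64> {
    let mut out: Vec<u64> = Vec::with_capacity(input.len());
    for (slot, &v) in out.spare_capacity_mut().iter_mut().zip(input) {
        slot.write(op(v));
    }
    // SAFETY: the first input.len() elements were just initialized above
    unsafe { out.set_len(input.len()) };
    out
}

fn main() {
    assert_eq!(map_chunks(&[0, u64::MAX], |x| !x), vec![u64::MAX, 0]);
}
```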

@Dandandan (Contributor) commented:

> The benchmarks show a slowdown for some operations for some reason
>
> buffer_binary_ops/and_with_offset 1.13 1486.2±3.20ns 9.6 GB/sec 1.00 1320.5±3.78ns 10.8 GB/sec
>
> However, given the duration of the benchmark, I am thinking maybe this is cache lines or something.
>
> I have an idea of how to improve the benchmarks so they are less noisy (basically run them in a 100x loop)

Might it also be because of the allocation? It looks like and_with_offset and and are not over power-of-two inputs.

