make hash function in RepartitionExec configurable #17648

adriangb · 2025-09-18T19:28:11Z

This is needed immediately for testing in #17632. It is also useful in the long term: users can customize the hash function so that, for example, partitioning is consistent across a distributed environment.

adriangb · 2025-09-18T21:01:04Z

datafusion/physical-plan/src/joins/hash_join/stream.rs

            let mut bitmap = build_side.left_data.visited_indices_bitmap().lock();
            left_indices.iter().flatten().for_each(|x| {
-                bitmap.set_bit(x as usize, true);
+                bitmap.set_bit(usize::try_from(x).expect("should fit"), true);
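
The hunk above swaps an `as` cast for a checked conversion. A minimal sketch of the difference, assuming the index is a 64-bit integer (the diff does not show its type); `u32` is used as the target here only so that the truncation is observable on any platform, and both function names are hypothetical:

```rust
// Explicit import so the sketch also builds on editions before 2021.
use std::convert::TryFrom;

/// `as` truncates silently when the value does not fit in the target type.
fn lossy_cast(x: u64) -> u32 {
    x as u32
}

/// `try_from` reports the same situation as an error (mapped to `None` here),
/// which a caller can turn into a panic with `expect`, as the diff does.
fn checked_cast(x: u64) -> Option<u32> {
    u32::try_from(x).ok()
}
```

With a value one past `u32::MAX`, `lossy_cast` wraps around to zero while `checked_cast` returns `None`, so an out-of-range index fails loudly instead of setting the wrong bit.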


datafusion/physical-plan/src/repartition/hash.rs

jonathanc-n · 2025-09-18T21:40:03Z

cc @rkrishn7

Co-authored-by: Jonathan Chen <chenleejonathan@gmail.com>

adriangb · 2025-09-18T21:49:31Z

datafusion/physical-plan/src/repartition/hash.rs

+        }
+
+        // Create hash buffer and compute hashes using DataFusion's internal algorithm
+        let mut hashes_buffer = vec![0u64; array_len];


I will flag that previously the same vec was re-used with a capacity bump followed by a clear. Now we're creating a new one for each batch. It's some more allocations, but we're also pre-allocating the entire size, etc. I'm not sure if this will be measurable or not.

rkrishn7 · 2025-09-19T02:14:04Z

I think we should maintain that behavior if possible. If the primary goal here is to encapsulate the repartitioning logic in this module, can we add a function for this? Something like:

    /// Calculates the partition used by the repartition operator for each row.
    /// All arrays should have the same length.
    fn compute_partition_indices(
        buf: &mut Vec<u64>,
        arrays: &[ArrayRef],
        num_partitions: usize,
    ) -> Result<()> {
        buf.resize(arrays[0].len(), 0);
        create_hashes(arrays, REPARTITION_RANDOM_STATE, buf)?;
        buf.iter_mut().for_each(|hash| *hash %= num_partitions as u64);
        Ok(())
    }

That way the repartitioning code can simply use this with the once-allocated vector, while the dynamic filter can use the UDF, which still uses this under the hood.

adriangb · 2025-09-20

Yes, agreed, I think that's better. I'll cook something up.
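The buffer-reuse idea discussed in this thread can be sketched without any DataFusion dependency. This is only an illustration under stated assumptions: columns are plain `Vec<u64>` rather than Arrow arrays, `hash_rows` is a hypothetical stand-in for `create_hashes` built on the standard library's `DefaultHasher`, and errors are plain strings. It is not the code that landed in the PR.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for `create_hashes`: hashes each row across all columns.
fn hash_rows(columns: &[Vec<u64>], buf: &mut [u64]) {
    for (row, slot) in buf.iter_mut().enumerate() {
        let mut hasher = DefaultHasher::new();
        for col in columns {
            col[row].hash(&mut hasher);
        }
        *slot = hasher.finish();
    }
}

/// Fills `buf` with one partition index per row, reusing its allocation
/// across calls instead of creating a new vector for every batch.
fn compute_partition_indices(
    buf: &mut Vec<u64>,
    columns: &[Vec<u64>],
    num_partitions: usize,
) -> Result<(), String> {
    if num_partitions == 0 {
        return Err("num_partitions must be non-zero".to_string());
    }
    let num_rows = columns.first().map_or(0, |c| c.len());
    if columns.iter().any(|c| c.len() != num_rows) {
        return Err("all columns must have the same length".to_string());
    }
    // `clear` followed by `resize` keeps the existing capacity when it suffices.
    buf.clear();
    buf.resize(num_rows, 0);
    hash_rows(columns, buf.as_mut_slice());
    for hash in buf.iter_mut() {
        *hash %= num_partitions as u64;
    }
    Ok(())
}
```

A caller would keep one `Vec<u64>` alive for the lifetime of the operator and pass it in for every batch. After the first batch sizes the buffer, later batches of the same or smaller size do not allocate.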

rkrishn7 · 2025-09-19T01:52:06Z

datafusion/physical-plan/src/repartition/hash.rs

+        // Convert all arguments to arrays
+        let arrays = ColumnarValue::values_to_arrays(&args.args)?;
+
+        // Check that all arrays have the same length


I think ColumnarValue::values_to_arrays does this check already
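To illustrate why a second length check in the caller would be redundant, here is a toy model of a conversion helper that already validates lengths while expanding scalars. The `Value` enum and this `values_to_arrays` are hypothetical stand-ins written for this sketch. They are not DataFusion's types or implementation, and the sketch assumes, as the comment above suggests, that the real helper rejects mixed lengths.

```rust
/// Toy stand-in for a columnar value: either a full array or a single scalar.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Array(Vec<i64>),
    Scalar(i64),
}

/// Expands every argument to an array, rejecting arrays of different lengths.
fn values_to_arrays(args: &[Value]) -> Result<Vec<Vec<i64>>, String> {
    let mut len: Option<usize> = None;
    for arg in args {
        if let Value::Array(a) = arg {
            match len {
                Some(expected) if expected != a.len() => {
                    return Err(format!(
                        "mixed lengths: expected {}, found {}",
                        expected,
                        a.len()
                    ));
                }
                _ => len = Some(a.len()),
            }
        }
    }
    // With only scalars, fall back to a single row.
    let len = len.unwrap_or(1);
    Ok(args
        .iter()
        .map(|arg| match arg {
            Value::Array(a) => a.clone(),
            Value::Scalar(s) => vec![*s; len],
        })
        .collect())
}
```

If the helper returns `Ok`, every resulting array has the same length, so the caller can read the length from the first array without checking the rest.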

make hash function in RepartitionExec configurable

8b642cb

adriangb marked this pull request as draft September 18, 2025 19:28

github-actions bot added the physical-plan Changes to the physical-plan crate label Sep 18, 2025

adriangb added 3 commits September 18, 2025 15:28

fix imports

711746f

fix

ca3ba01

fix lints

c28abe1

adriangb marked this pull request as ready for review September 18, 2025 21:00

adriangb commented Sep 18, 2025

adriangb mentioned this pull request Sep 18, 2025

Refactor hash join dynamic filtering for progressive bounds application #17632

jonathanc-n reviewed Sep 18, 2025

adriangb and others added 3 commits September 18, 2025 17:46

Update datafusion/physical-plan/src/repartition/hash.rs

8fe519d

Co-authored-by: Jonathan Chen <chenleejonathan@gmail.com>

fix comment

a46c929

move

7afdeea

adriangb commented Sep 18, 2025

rkrishn7 reviewed Sep 19, 2025
