
Conversation


@yushesp yushesp commented Dec 7, 2025

What changes were proposed in this pull request?

This PR adds a new repartition overload to Dataset[T] that accepts a key extraction function and a custom Partitioner, similar to RDD's partitionBy:

def repartition[K: Encoder](keyFunc: T => K, partitioner: Partitioner): Dataset[T]
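
For example, a minimal usage sketch (assuming the overload lands as written above and that a SparkSession named spark is in scope; Event and the sample data are illustrative only):

  import org.apache.spark.HashPartitioner
  import spark.implicits._

  case class Event(userId: Long, payload: String)

  val events = Seq(Event(1L, "login"), Event(2L, "click"), Event(1L, "logout")).toDS()

  // Each row is keyed by userId; the HashPartitioner then decides which of
  // the 8 output partitions the row is routed to.
  val partitioned = events.repartition((e: Event) => e.userId, new HashPartitioner(8))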

Why are the changes needed?

Currently, Dataset users who want custom partitioning logic must drop down to the RDD API, losing the benefits of Catalyst optimization and the typed Dataset API.

Custom partitioning logic could be useful when:

  • You want to co-partition two datasets by the same key so that joins don't require a shuffle
  • You need custom bucketing logic beyond what HashPartitioner provides

The RDD API has supported custom partitioners via partitionBy since Spark's early days. This PR brings the same capability to the Dataset API.
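
As a concrete sketch of the co-partitioning bullet above (again assuming a SparkSession named spark; BucketByTenant and the Order/Invoice data are illustrative only, and the repartition overload is the one proposed in this PR):

  import org.apache.spark.Partitioner
  import spark.implicits._

  // Hypothetical custom partitioner: a few known hot tenants each get a
  // dedicated partition, everything else is hashed into the remaining buckets.
  class BucketByTenant(hotTenants: Seq[Long], numBuckets: Int) extends Partitioner {
    override def numPartitions: Int = hotTenants.size + numBuckets
    override def getPartition(key: Any): Int = {
      val tenantId = key.asInstanceOf[Long]
      val hotIdx = hotTenants.indexOf(tenantId)
      if (hotIdx >= 0) hotIdx
      else hotTenants.size + Math.floorMod(tenantId.hashCode(), numBuckets)
    }
  }

  case class Order(tenantId: Long, amount: Double)
  case class Invoice(tenantId: Long, total: Double)

  val orders   = Seq(Order(7L, 10.0), Order(3L, 5.0)).toDS()
  val invoices = Seq(Invoice(7L, 10.0), Invoice(3L, 5.0)).toDS()

  val byTenant = new BucketByTenant(hotTenants = Seq(7L, 42L), numBuckets = 16)

  // Using the same key function and the same Partitioner instance on both
  // sides gives both Datasets identical partitioning by tenant id.
  val ordersByTenant   = orders.repartition((o: Order) => o.tenantId, byTenant)
  val invoicesByTenant = invoices.repartition((i: Invoice) => i.tenantId, byTenant)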

Does this PR introduce any user-facing change?

Yes. Adds a new public API method to Dataset:

def repartition[K: Encoder](keyFunc: T => K, partitioner: Partitioner): Dataset[T]

How was this patch tested?

Added unit tests in PlannerSuite.scala covering the basic functionality of the new API.

Was this patch authored or co-authored using generative AI tooling?

Co-Generated-by: Cursor 2.1.46

…artitioner

Add repartition overload that accepts a key function and custom Partitioner:

  def repartition[K: Encoder](keyFunc: T => K, partitioner: Partitioner): Dataset[T]

This brings RDD's partitionBy capability to the Dataset API.
@github-actions github-actions bot added the SQL label Dec 7, 2025
@yushesp yushesp closed this Dec 7, 2025
@yushesp yushesp reopened this Dec 7, 2025
  * The key extraction function is applied to each deserialized row to produce a key,
  * which is then passed to the partitioner to determine the target partition.
  */
  case class CustomFunctionPartitioning(
Member


Isn't #52153 enough to cover the custom partitioning case?

Author


Thanks for flagging this; I wasn't aware of #52153 when I put this together. I've just read through it.

It looks like repartitionById covers cases where the partition logic can be expressed as a column expression, which handles a lot of use cases cleanly.

The gap I was thinking about is reusing existing Partitioner implementations from RDD codebases, or cases where the logic is complex enough that encapsulating it in a testable class is preferable to inline expressions. But I can see an argument that those are niche enough that repartitionById is sufficient.
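
For example, a toy sketch of the "testable class" point (EvenOddPartitioner is purely illustrative, not part of this PR): the routing rule can be exercised without a SparkSession before it is handed to the proposed overload.

  import org.apache.spark.Partitioner

  // The routing rule lives in a plain class, so it can be unit tested directly.
  class EvenOddPartitioner extends Partitioner {
    override def numPartitions: Int = 2
    override def getPartition(key: Any): Int =
      Math.floorMod(key.asInstanceOf[Long], 2L).toInt
  }

  val p = new EvenOddPartitioner
  assert(p.getPartition(4L) == 0)
  assert(p.getPartition(7L) == 1)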

Curious whether there’s appetite for supporting both patterns or if the consensus is that this isn’t needed. Happy to close if so.

