@xianzhe-databricks xianzhe-databricks commented Sep 24, 2025

What changes were proposed in this pull request?

Based on #52370, I propose changing the default Python type for BINARY values in PySpark UDFs from bytearray to bytes.

A Spark conf is provided to opt out of this change.

Why are the changes needed?

bytes is immutable and hashable, and it requires no copy when converting from Arrow, so it is more performant than bytearray.
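To illustrate the immutability and hashability points, here is a minimal plain-Python sketch. It does not involve Spark or Arrow, and the variable names are purely illustrative.

```python
# Plain-Python illustration; no Spark involved. Names are hypothetical.
payload = bytes([1, 2, 3])
mutable = bytearray(payload)

# bytes is hashable, so it can be used as a set member or dict key.
seen = {payload}
is_member = payload in seen

# bytearray is not hashable: hash() raises TypeError.
try:
    hash(mutable)
    bytearray_hashable = True
except TypeError:
    bytearray_hashable = False

# bytes is immutable: item assignment raises TypeError.
try:
    payload[0] = 9
    bytes_mutable = True
except TypeError:
    bytes_mutable = False
```

Hashability matters in UDF bodies that deduplicate binary values or use them as lookup keys, which with bytearray would require an explicit conversion first.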

Does this PR introduce any user-facing change?

Yes. Binary data passed to a PySpark UDF will be bytes instead of bytearray. Any attempt to modify the input data in place will fail because bytes is immutable.
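A sketch of how a UDF body that mutated its input might be adapted. The function names below are hypothetical and the example is plain Python rather than a registered UDF; it only assumes the input arrives as bytes.

```python
# Hypothetical UDF bodies, shown as plain functions for illustration.

def flip_first_byte_in_place(data):
    # Works when data is a bytearray; raises TypeError when data is bytes.
    data[0] ^= 0xFF
    return data


def flip_first_byte(data):
    # Works for both bytes and bytearray: copy into a mutable buffer first.
    buf = bytearray(data)
    buf[0] ^= 0xFF
    return bytes(buf)


try:
    flip_first_byte_in_place(b"\x00\x01")
    in_place_failed = False
except TypeError:
    in_place_failed = True

flipped = flip_first_byte(b"\x00\x01")
```

Copying into a bytearray keeps the mutation local to the function, so the same body behaves identically whether or not the opt-out conf is set.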

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

No

@xianzhe-databricks xianzhe-databricks changed the title first commit [WIP]Default to bytes for BinaryType in PySpark UDF Sep 24, 2025
@xianzhe-databricks xianzhe-databricks changed the title [WIP]Default to bytes for BinaryType in PySpark UDF [WIP]Default to bytes for Binary type in PySpark UDF Sep 24, 2025
@xianzhe-databricks xianzhe-databricks changed the title [WIP]Default to bytes for Binary type in PySpark UDF [SPARK-53696]Default to bytes for Binary type in PySpark UDF Sep 24, 2025
@xianzhe-databricks xianzhe-databricks changed the title [SPARK-53696]Default to bytes for Binary type in PySpark UDF [SPARK-53696][PySpark][SQL]Default to bytes for Binary type in PySpark UDF Sep 24, 2025
@xianzhe-databricks xianzhe-databricks marked this pull request as ready for review September 24, 2025 14:27
@xianzhe-databricks xianzhe-databricks force-pushed the binary-type-defaults-to-bytes branch from 522fceb to ee6c271 Compare September 24, 2025 14:33
@xianzhe-databricks xianzhe-databricks changed the title [SPARK-53696][PySpark][SQL]Default to bytes for Binary type in PySpark UDF [SPARK-53696][PySpark][SQL]Default to bytes for BinaryType in PySpark UDF Sep 24, 2025
@xianzhe-databricks xianzhe-databricks changed the title [SPARK-53696][PySpark][SQL]Default to bytes for BinaryType in PySpark UDF [SPARK-53696][PYTHON]Default to bytes for BinaryType in PySpark UDF Sep 25, 2025