[SPARK-53696][PYTHON]Default to `bytes` for `BinaryType` in PySpark UDF #52438

xianzhe-databricks · 2025-09-24T13:41:04Z

What changes were proposed in this pull request?

Building on #52370, this PR changes the default Python type for BINARY values in PySpark UDFs from `bytearray` to `bytes`.

A Spark conf is provided to opt out of this change.

Why are the changes needed?

`bytes` is immutable and hashable, and requires no copy when converting from Arrow. It is more performant than `bytearray`.
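The immutability and hashability difference can be seen in plain Python, without Spark. The following is a minimal sketch; the variable names are illustrative only and are not part of this patch, and it does not attempt to demonstrate the Arrow conversion cost.

```python
# Same payload held as the two candidate Python types.
payload_bytes = b"\x00\x01\x02"
payload_array = bytearray(payload_bytes)

# bytes is hashable, so it can be used as a dict key or set member.
lookup = {payload_bytes: "seen"}

# bytearray is mutable and therefore unhashable.
try:
    hash(payload_array)
    bytearray_hashable = True
except TypeError:
    bytearray_hashable = False

# bytes rejects item assignment.
try:
    payload_bytes[0] = 0xFF
    bytes_mutable = True
except TypeError:
    bytes_mutable = False

# The two types still compare equal when their contents match.
same_content = payload_bytes == payload_array
```

Code that only reads binary values, or compares them by content, should behave the same with either type.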

Does this PR introduce any user-facing change?

Yes. Binary data passed to a UDF will be `bytes`. Any attempt to modify the input data in place will fail, because `bytes` is immutable.
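As an illustration of the kind of UDF body that is affected, here is a hedged sketch using plain Python functions. The function names `mask_first_byte_inplace` and `mask_first_byte` are hypothetical and the functions are not registered with Spark here; they only show the failure mode and one way to adapt by copying into a `bytearray` first.

```python
def mask_first_byte_inplace(data):
    """Body that assumes `data` is a mutable bytearray."""
    data[0] = 0x00
    return data


def mask_first_byte(data):
    """Accepts bytes or bytearray: copy first, then modify the copy."""
    buf = bytearray(data)  # explicit copy into a mutable buffer
    if buf:
        buf[0] = 0x00
    return bytes(buf)


value = b"\xff\x10\x20"

# In-place modification raises TypeError when the input is bytes.
try:
    mask_first_byte_inplace(value)
    inplace_ok = True
except TypeError:
    inplace_ok = False

# The copying variant works regardless of the input type.
masked = mask_first_byte(value)
```

The copy is paid only by functions that actually need to mutate the value; read-only functions need no change.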

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

No

github-actions bot added SQL DOCS CORE PYTHON AVRO labels Sep 24, 2025

xianzhe-databricks changed the title ~~first commit~~ [WIP]Default to bytes for BinaryType in PySpark UDF Sep 24, 2025

xianzhe-databricks changed the title ~~[WIP]Default to bytes for BinaryType in PySpark UDF~~ [WIP]Default to bytes for Binary type in PySpark UDF Sep 24, 2025

xianzhe-databricks changed the title ~~[WIP]Default to bytes for Binary type in PySpark UDF~~ [SPARK-53696]Default to bytes for Binary type in PySpark UDF Sep 24, 2025

xianzhe-databricks changed the title ~~[SPARK-53696]Default to bytes for Binary type in PySpark UDF~~ [SPARK-53696][PySpark][SQL]Default to bytes for Binary type in PySpark UDF Sep 24, 2025

xianzhe-databricks marked this pull request as ready for review September 24, 2025 14:27

first commit

ee6c271

xianzhe-databricks force-pushed the binary-type-defaults-to-bytes branch from 522fceb to ee6c271 Compare September 24, 2025 14:33

xianzhe-databricks changed the title ~~[SPARK-53696][PySpark][SQL]Default to bytes for Binary type in PySpark UDF~~ [SPARK-53696][PySpark][SQL]Default to bytes for BinaryType in PySpark UDF Sep 24, 2025

fix tests

d369e84

xianzhe-databricks changed the title ~~[SPARK-53696][PySpark][SQL]Default to bytes for BinaryType in PySpark UDF~~ [SPARK-53696][PYTHON]Default to bytes for BinaryType in PySpark UDF Sep 25, 2025
