Probe whether the metadata path is canonicalized in Spark (#1725) #1770

xupefei · 2023-05-17T11:32:22Z

Description

(Cherry-pick of e0f0e91 to branch-2.4)

This PR fixes #1725.

The fix calls the Spark-internal method that generates metadata columns on a crafted path string, to see whether it canonicalizes spaces. If it does, nothing needs to be done on the Delta side; otherwise, we manually canonicalize the obtained metadata column.
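The probe-then-fallback idea can be sketched as below. This is a minimal illustration, not the code in this PR: `render` is a hypothetical stand-in for the Spark-internal method, `ProbePath` is an invented crafted path, and `java.net.URI` percent-encoding is assumed here as the meaning of "canonicalized".

```scala
import java.net.URI

object CanonicalizationProbe {
  // Crafted probe path containing spaces (hypothetical value).
  val ProbePath = "/tmp/probe dir/file with space.parquet"

  // Manual canonicalization: the multi-argument URI constructor
  // percent-encodes characters that are illegal in a path, e.g. ' ' -> "%20".
  def canonicalize(rawPath: String): String =
    new URI(null, null, rawPath, null).toString

  // Probe once: does the renderer already encode the spaces in the probe path?
  def alreadyCanonicalizes(render: String => String): Boolean =
    !render(ProbePath).contains(" ")

  // Use the renderer as-is when it canonicalizes; otherwise canonicalize its output.
  def canonicalizedRenderer(render: String => String): String => String =
    if (alreadyCanonicalizes(render)) render
    else path => canonicalize(render(path))
}
```

The probe matters because encoding twice is not harmless with this encoder: `canonicalize("/a%20b")` yields `/a%2520b`, so the fallback should only run when the renderer leaves spaces untouched.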

Why not use the Spark-internal method on `FileToDvDescriptor` as well, so that both sides of the join are either canonicalized or not canonicalized? Because most Delta methods expect a canonicalized path, so the returned DF must be canonicalized in all cases.
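To illustrate why the two sides of the join have to agree, here is a small sketch using a plain `Map` lookup in place of the actual DataFrame join. The paths, the `fileToDv` map, and the `"dv-0"` descriptor are hypothetical, and `java.net.URI` percent-encoding is assumed as the canonical form.

```scala
import java.net.URI

object JoinKeySketch {
  // Percent-encode characters that are illegal in a path, e.g. ' ' -> "%20".
  def canonicalize(rawPath: String): String =
    new URI(null, null, rawPath, null).toString

  // Hypothetical DV descriptors keyed by canonicalized file path.
  val fileToDv: Map[String, String] =
    Map("/data/my%20table/part-0.parquet" -> "dv-0")

  // Hypothetical per-row metadata paths rendered without canonicalization.
  val rawRowPaths: Seq[String] = Seq("/data/my table/part-0.parquet")

  // Stand-in for the join: look up each row's file path in the DV map.
  def lookup(paths: Seq[String]): Seq[Option[String]] =
    paths.map(fileToDv.get)
}
```

With raw paths the lookup misses (`lookup(rawRowPaths)` gives `Seq(None)`), while `lookup(rawRowPaths.map(canonicalize))` finds `dv-0`.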

How was this patch tested?

Existing tests pass.

ryan-johnson-databricks

How long will we need to keep the split code path? And how can we test both sides meanwhile?

core/src/main/scala/org/apache/spark/sql/delta/commands/DeleteWithDeletionVectorsHelper.scala

vkorukanti · 2023-07-10T13:46:25Z

+    }.toMap
+  }
+
+  /**


Let's add a TODO to remove this when Delta upgrades to Spark 3.5.

Closes #1769. PR targeting `branch-2.4`: #1770. GitOrigin-RevId: 3538b18ff23e81c603acbc7df3930ba1730903f2

felipepessoto · 2023-07-24T18:45:46Z

That’s great.
@xupefei @vkorukanti should we expect the same 2x perf improvement we had in 3.0?

And Spark 3.4.1 compatibility?

“Performance of DELETE using Deletion Vectors improved by more than 2x. This fix improves the file path canonicalization logic by avoiding calling expensive Path.toUri.toString calls for each row in a table, resulting in a several hundred percent speed boost on DELETE operations (only when Deletion Vectors have been enabled on the table).”

vkorukanti · 2023-07-24T21:12:29Z

> That’s great. @xupefei @vkorukanti should we expect the same 2x perf improvement we had in 3.0?

I think so. @xupefei, please correct me if that's not the case.

> And Spark 3.4.1 compatibility?

For DELETE with DVs, yes. For Spark 3.4.1 in general, we may need more testing and work.

felipepessoto · 2023-07-24T22:42:19Z

Are you aware of any other Spark 3.4.1 compatibility issues, or does it just need more bake time?

Thanks.

xupefei · 2023-08-14T07:55:46Z

> I think this is the case. @xupefei Please correct if not the case.

No, the perf improvement is also for Delta 3.0. Previously we always canonicalized the path regardless of whether it had already been canonicalized, which was slow.

fix for Spark 3.4.1

00a568f

xupefei mentioned this pull request Jun 30, 2023

Improve DV path canonicalization #1829

Closed

rebase

fe6d530

ryan-johnson-databricks reviewed Jun 30, 2023

core/src/main/scala/org/apache/spark/sql/delta/commands/DeleteWithDeletionVectorsHelper.scala (outdated)

felipepessoto reviewed Jul 4, 2023

core/src/main/scala/org/apache/spark/sql/delta/commands/DeleteWithDeletionVectorsHelper.scala

vkorukanti approved these changes Jul 10, 2023

address comments

a2f888b

ryan-johnson-databricks approved these changes Jul 19, 2023

vkorukanti changed the title ~~[2.4] Probe whether the metadata path is canonicalized in Spark (#1725)~~ Probe whether the metadata path is canonicalized in Spark (#1725) Jul 24, 2023

vkorukanti merged commit c6100ac into delta-io:branch-2.4 Jul 24, 2023
3 checks passed

xupefei deleted the dv-path-escape-2.4 branch August 14, 2023 07:51
