Probe whether the metadata path is canonicalized in Spark (#1725) #1770
Conversation
How long will we need to keep the split code path? And how can we test both sides meanwhile?
core/src/main/scala/org/apache/spark/sql/delta/commands/DeleteWithDeletionVectorsHelper.scala (resolved review comments)
Let's add a TODO to remove this when Delta upgrades to Spark 3.5.
Closes #1769. PR targeting `branch-2.4`: #1770. GitOrigin-RevId: 3538b18ff23e81c603acbc7df3930ba1730903f2
That’s great. And Spark 3.4.1 compatibility? “Performance of DELETE using Deletion Vectors improved by more than 2x. This fix improves the file path canonicalization logic by avoiding calling expensive Path.toUri.toString calls for each row in a table, resulting in a several hundred percent speed boost on DELETE operations (only when Deletion Vectors have been enabled on the table).”
I think this is the case. @xupefei please correct me if not.
For DELETE with DVs, this is correct. In general for 3.4.1, we may need some more testing/work.
Are you aware of any other Spark 3.4.1 compatibility issues, or does it just need more bake time? Thanks.
No, the perf improvement also applies to Delta 3.0. Previously we always canonicalized the path regardless of whether it had already been canonicalized, which was slow.
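The speedup discussed above comes from skipping canonicalization work when a path needs none. A minimal sketch of the idea in Python, where percent-encoding stands in for Hadoop's `Path.toUri.toString` and the cheap "already canonical" guard is illustrative, not Delta's actual check:

```python
from urllib.parse import quote

def canonicalize(path: str) -> str:
    # Stand-in for the expensive per-row step in Spark: constructing a
    # Hadoop Path and calling Path.toUri.toString. Here we percent-encode,
    # so a space becomes %20.
    return quote(path, safe="/")

def canonicalize_if_needed(path: str) -> str:
    # Cheap guard (illustrative): a path with no spaces is treated as
    # already canonical and returned unchanged, avoiding the costly call.
    if " " not in path:
        return path
    return canonicalize(path)
```

The design point is that the guard runs for every row, so it must be far cheaper than the canonicalization it avoids; most real-world paths contain no characters that need escaping, so the expensive branch is rarely taken.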
Description
(Cherry-pick of e0f0e91 to branch-2.4)
This PR fixes #1725.
The mechanism of this fix is to call the Spark internal method used to generate metadata columns and see whether it canonicalizes spaces in a crafted path string. If it does, we don't need to do anything on the Delta side; otherwise, we manually canonicalize the obtained metadata column.
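The probing logic above can be sketched outside Spark. In this hedged Python sketch, `metadata_path_fn` stands in for Spark's internal metadata-column function (a hypothetical parameter, not a real Spark API), and percent-encoding stands in for Hadoop path canonicalization:

```python
from urllib.parse import quote

def canonicalize(path: str) -> str:
    # Stand-in for Hadoop's Path.toUri.toString: percent-encode special
    # characters, e.g. a space becomes %20.
    return quote(path, safe="/")

def metadata_path_is_canonicalized(metadata_path_fn) -> bool:
    # Probe with a crafted path containing spaces and check whether the
    # engine's metadata-column function already escapes them.
    probe = "/table dir/part 0001.parquet"
    return metadata_path_fn(probe) == canonicalize(probe)

def ensure_canonical(path: str, engine_canonicalizes: bool) -> str:
    # If the engine already canonicalizes, leave the path alone; otherwise
    # canonicalize it ourselves so joins on path strings match.
    return path if engine_canonicalizes else canonicalize(path)
```

If the probe returns `False` (the engine hands back the raw path), the metadata column is canonicalized manually before it is used, so the paths handed to downstream Delta code are canonical in all cases.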
Why not use the Spark internal method on `FileToDvDescriptor`, so that both sides of the join are either canonicalized or not canonicalized? Because most Delta methods expect a canonicalized path, the returned DataFrame must be canonicalized in all cases.
How was this patch tested?
Existing tests continue to pass.