Improve DV path canonicalization #1829
@larsk-db Could you take a look at this PR?
LGTM just a couple of nits
core/src/main/scala/org/apache/spark/sql/delta/commands/DeleteWithDeletionVectorsHelper.scala
## Description
This PR improves the FILE_PATH canonicalization logic by avoiding expensive `Path.toUri.toString` calls for each row in a table. Canonicalized paths are now cached, and the UDF only needs to look them up.
Future improvement is possible for handling huge logs: build `canonicalizedPathMap` in a distributed way.
Related PR targeting the 2.4 branch: #1829.
Existing tests.
Closes #1836
Signed-off-by: Paddy Xu <xupaddy@gmail.com>
GitOrigin-RevId: c4810852f9136c36ec21f3519620ca26ed12bb04
// Build two maps, using Path or String as keys. The one with String keys is used in UDF.
val canonicalizedPathMap = buildCanonicalizedPathMap(txn.deltaLog, candidateFiles)
val canonicalizedPathStringMap =
  canonicalizedPathMap.map { case (k, v) => k.toString -> v }
Why have this second pass, for a single-callsite helper method? Can it just return a string-string map directly, since that's what we ultimately broadcast?
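To make the question above concrete, here is a minimal sketch of the two shapes being compared. It is not the code from this PR: `SinglePassMapSketch`, `twoPass`, `singlePass`, and `canonical` are hypothetical names, `java.net.URI` stands in for the Hadoop `Path` key type, and the canonicalization itself is assumed to be a simple resolve-and-normalize against the table root.

```scala
import java.net.URI

object SinglePassMapSketch {
  // Hypothetical stand-in for the canonicalization step (not the PR's helper).
  private def canonical(root: URI, file: String): String =
    root.resolve(file).normalize().toString

  // Shape under review: build a map keyed by a path object,
  // then convert the keys to strings in a second pass.
  def twoPass(root: URI, files: Seq[String]): Map[String, String] = {
    val byPath: Map[URI, String] =
      files.map { f => URI.create(f) -> canonical(root, f) }.toMap
    byPath.map { case (k, v) => k.toString -> v }
  }

  // Suggested alternative: build the String -> String map directly.
  def singlePass(root: URI, files: Seq[String]): Map[String, String] =
    files.map { f => f -> canonical(root, f) }.toMap
}
```

Under these assumptions both functions return the same map, so a helper with a single call site could return the string-keyed form and skip the intermediate collection.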
Thanks for the comment! I addressed your comment in a later PR #1770.
Closed in favor of #1770.
Description
This PR improves the FILE_PATH canonicalization logic by avoiding expensive `Path.toUri.toString` calls for each row in a table. Canonicalized paths are now cached, and the UDF only needs to look them up.
Future improvement is possible for handling huge logs: build `canonicalizedPathMap` in a distributed way.
How was this patch tested?
Existing tests.
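The caching idea in the description can be illustrated with a small, Spark-free sketch. Everything here is an assumption made for illustration: `CanonicalPathCacheSketch`, `buildCache`, and `canonicalPathFor` are hypothetical names, `java.net.URI` stands in for Hadoop's `Path.toUri.toString`, and the `conversions` counter exists only to show how often the expensive step runs.

```scala
import java.net.URI

object CanonicalPathCacheSketch {
  // Counts how often the "expensive" conversion runs (illustration only).
  var conversions = 0

  // Hypothetical stand-in for the per-path canonicalization named in the PR.
  def canonicalize(tableRoot: String, filePath: String): String = {
    conversions += 1
    new URI(tableRoot + "/").resolve(filePath).normalize().toString
  }

  // Built once, before any rows are processed: one conversion per distinct file.
  def buildCache(tableRoot: String, files: Seq[String]): Map[String, String] =
    files.distinct.map(f => f -> canonicalize(tableRoot, f)).toMap

  // What a UDF body would do per row: a plain map lookup, no URI work.
  def canonicalPathFor(cache: Map[String, String], rowFilePath: String): String =
    cache.getOrElse(rowFilePath, sys.error(s"Unexpected file path: $rowFilePath"))
}
```

With two candidate files and thousands of rows, `canonicalize` runs twice in this sketch, while every row costs only a hash lookup. In a real Spark job the map would presumably be broadcast to executors, as the review discussion mentions.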