Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve DV path canonicalization #1829

Closed
wants to merge 2 commits into from

Conversation

xupefei
Copy link
Contributor

@xupefei xupefei commented Jun 9, 2023

Description

This PR improves the FILE_PATH canonicalization logic by avoiding calling expensive Path.toUri.toString calls for each row in a table. Canonicalized paths are now cached and the UDF just needs to look it up.

Future improvement is possible for handling huge logs: build canonicalizedPathMap in a distributed way.

How was this patch tested?

Existing tests.

@xupefei
Copy link
Contributor Author

xupefei commented Jun 13, 2023

@larsk-db Could you take a look at this PR?

Copy link
Contributor

@larsk-db larsk-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM just a couple of nits

vkorukanti pushed a commit that referenced this pull request Jun 20, 2023
## Description

This PR improves the FILE_PATH canonicalization logic by avoiding calling expensive `Path.toUri.toString` calls for each row in a table. Canonicalized paths are now cached and the UDF just needs to look it up.

Future improvement is possible for handling huge logs: build `canonicalizedPathMap` in a distributed way.

Related PR target the 2.4 branch: #1829.

Existing tests.

Closes #1836

Signed-off-by: Paddy Xu <xupaddy@gmail.com>
GitOrigin-RevId: c4810852f9136c36ec21f3519620ca26ed12bb04
// Build two maps, using Path or String as keys. The one with String keys is used in UDF.
val canonicalizedPathMap = buildCanonicalizedPathMap(txn.deltaLog, candidateFiles)
val canonicalizedPathStringMap =
canonicalizedPathMap.map { case (k, v) => k.toString -> v }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why have this second pass, for a single-callsite helper method? Can it just return a string-string map directly, since that's what we ultimately broadcast?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comment! I addressed your comment in a later PR #1770.

@xupefei
Copy link
Contributor Author

xupefei commented Jun 30, 2023

Closed in favor of #1770.

@xupefei xupefei closed this Jun 30, 2023
@xupefei xupefei deleted the branch-2.4 branch June 30, 2023 08:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants