Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Probe whether the metadata path is canonicalized in Spark (#1725) #1770

Merged
merged 3 commits into from
Jul 24, 2023

Conversation

xupefei
Copy link
Contributor

@xupefei xupefei commented May 17, 2023

Description

(Cherry-pick of e0f0e91 to branch-2.4)

This issue fixes #1725.

The mechanism of this fix is to call the Spark internal method, which is used to generate metadata columns, to see if it will canonicalize spaces in a crafted path string. If the answer is yes, then we don't need to do anything on the Delta side; otherwise, we manually canonicalize the obtained metadata column.

Why don't use the Spark internal method on FileToDvDescriptor, so both sides of the join are either canonicalized or not-canonicalized? Because most Delta methods are expecting a canonicalized path, thus the returned DF must be canonicalized in all cases.

How was this patch tested?

Existing tests didn't fail.

Copy link
Collaborator

@ryan-johnson-databricks ryan-johnson-databricks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How long will we need to keep the split code path? And how can we test both sides meanwhile?

}.toMap
}

/**
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets add a TODO to remove when Delta upgrades to Spark 3.5.

vkorukanti pushed a commit that referenced this pull request Jul 14, 2023
The mechanism of this fix is to call the Spark internal method, which is used to generate metadata columns, to see if it will canonicalize spaces in a crafted path string. If the answer is yes, then we don't need to do anything on the Delta side; otherwise, we manually canonicalize the obtained metadata column.

Why don't use the Spark internal method on `FileToDvDescriptor`, so both sides of the join are either canonicalized or not-canonicalized? Because most Delta methods are expecting a canonicalized path, thus the returned DF must be canonicalized in all cases.

Closes #1769.
PR targeting `branch-2.4`: #1770.

GitOrigin-RevId: 3538b18ff23e81c603acbc7df3930ba1730903f2
@vkorukanti vkorukanti changed the title [2.4] Probe whether the metadata path is canonicalized in Spark (#1725) Probe whether the metadata path is canonicalized in Spark (#1725) Jul 24, 2023
@vkorukanti vkorukanti merged commit c6100ac into delta-io:branch-2.4 Jul 24, 2023
3 checks passed
@felipepessoto
Copy link
Contributor

That’s great.
@xupefei @vkorukanti should we expect the same 2x perf improvement we had in 3.0?

And Spark 3.4.1 compatibility?

Performance of DELETE using Deletion Vectors improved by more than 2x. This fix improves the file path canonicalization logic by avoiding calling expensive Path.toUri.toString calls for each row in a table, resulting in a several hundred percent speed boost on DELETE operations (only when Deletion Vectors have been enabled on the table).”

@vkorukanti
Copy link
Collaborator

vkorukanti commented Jul 24, 2023

That’s great. @xupefei @vkorukanti should we expect the same 2x perf improvement we had in 3.0?

I think this is the case. @xupefei Please correct if not the case.

And Spark 3.4.1 compatibility?

For the DELETE with DV, this is correct. In general for 3.4.1, we may need some more testing/work.

Performance of DELETE using Deletion Vectors improved by more than 2x. This fix improves the file path canonicalization logic by avoiding calling expensive Path.toUri.toString calls for each row in a table, resulting in a several hundred percent speed boost on DELETE operations (only when Deletion Vectors have been enabled on the table).”

@felipepessoto
Copy link
Contributor

Are you aware of any other Spark 3.4.1 compatibility issue? Or just needs more bake time?

Thanks.

@xupefei xupefei deleted the dv-path-escape-2.4 branch August 14, 2023 07:51
@xupefei
Copy link
Contributor Author

xupefei commented Aug 14, 2023

I think this is the case. @xupefei Please correct if not the case.

No the perf improvement is also for Delta 3.0. Previously we always canonicalize the path regardless whether it has already been canonicalized, thus slow.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants