Improve generating and writing out changes in Merge #1854

Closed


johanl-db
Collaborator

Note: This PR is based on #1851 and #1852 and includes the corresponding changes in the first two commits. The actual change starts at the commit "Improve writing changes out in Merge".

This change is part of a larger effort to improve merge performance, see #1827

Description

This change rewrites the way modified data is written out in merge to improve performance. writeAllChanges now generates a dataframe containing all the updated and copied rows to write out by building a large expression that selectively applies the right merge action to each row. This replaces the previous method that relied on applying a function to individual rows.
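To make the idea concrete, here is a minimal sketch in plain Scala of compiling merge clauses into one conditional expression per output column and then evaluating that expression over every row. It is not the code from this PR: the toy `Expr` tree stands in for Spark's expression classes, and `Clause`, `outputExprs`, `applyAll`, and the `s_`-prefixed source columns are hypothetical names chosen for illustration. Deletes, inserts, and CDC output are left out.

```scala
object MergeExprSketch {
  type Row = Map[String, Any]

  // Toy expression tree (hypothetical; a stand-in for Spark expressions).
  sealed trait Expr { def eval(row: Row): Any }
  final case class Col(name: String) extends Expr {
    def eval(row: Row): Any = row.getOrElse(name, null)
  }
  final case class Lit(value: Any) extends Expr {
    def eval(row: Row): Any = value
  }
  final case class IsNotNull(child: Expr) extends Expr {
    def eval(row: Row): Any = child.eval(row) != null
  }
  final case class Eq(left: Expr, right: Expr) extends Expr {
    def eval(row: Row): Any = left.eval(row) == right.eval(row)
  }
  final case class And(left: Expr, right: Expr) extends Expr {
    def eval(row: Row): Any = left.eval(row) == true && right.eval(row) == true
  }
  // The first branch whose condition holds supplies the value.
  final case class CaseWhen(branches: Seq[(Expr, Expr)], otherwise: Expr) extends Expr {
    def eval(row: Row): Any =
      branches
        .collectFirst { case (cond, value) if cond.eval(row) == true => value.eval(row) }
        .getOrElse(otherwise.eval(row))
  }

  // A merge clause: a condition plus the target columns it assigns.
  final case class Clause(condition: Expr, assignments: Map[String, Expr])

  // Built once per merge: one CaseWhen per target column.
  // Columns a clause does not assign, and rows no clause matches, are copied.
  def outputExprs(targetCols: Seq[String], clauses: Seq[Clause]): Map[String, Expr] =
    targetCols.map { c =>
      val branches = clauses.map(cl => cl.condition -> cl.assignments.getOrElse(c, Col(c)))
      c -> CaseWhen(branches, Col(c))
    }.toMap

  // Applies the same prebuilt expressions to every joined row.
  def applyAll(rows: Seq[Row], exprs: Map[String, Expr]): Seq[Row] =
    rows.map(r => exprs.map { case (c, e) => c -> e.eval(r) })

  def demo(): Seq[Row] = {
    val matched = IsNotNull(Col("s_id"))
    val clauses = Seq(
      // WHEN MATCHED AND s_value = 0 THEN UPDATE SET value = -1
      Clause(And(matched, Eq(Col("s_value"), Lit(0))), Map("value" -> Lit(-1))),
      // WHEN MATCHED THEN UPDATE SET value = s_value
      Clause(matched, Map("value" -> Col("s_value")))
    )
    // Rows as they might look after joining target and source.
    val joined = Seq[Row](
      Map("id" -> 1, "value" -> 10, "s_id" -> 1, "s_value" -> 0),
      Map("id" -> 2, "value" -> 20, "s_id" -> 2, "s_value" -> 99),
      Map("id" -> 3, "value" -> 30, "s_id" -> null, "s_value" -> null)
    )
    applyAll(joined, outputExprs(Seq("id", "value"), clauses))
  }
}
```

Calling `MergeExprSketch.demo()` updates the first two rows through different clauses and copies the third row unchanged. The point of the design is that clause selection lives inside the expression, so an engine can plan and optimize it as an ordinary projection instead of calling an opaque function for each row.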

Changes:

  • Move findTouchedFiles and writeAllChanges to a new dedicated trait, ClassicMergeExecutor, which implements the regular merge path when InsertOnlyMergeExecutor is not used.
  • Introduce methods in MergeOutputGeneration to transform the merge clauses into expressions that can be applied to generate the output of the merge operation (both main data and CDC data).

How was this patch tested?

This change fully preserves the behavior of merge, which is extensively tested in MergeIntoSuiteBase, MergeIntoSQLSuite, MergeIntoScalaSuite, MergeCDCSuite, MergeIntoMetricsBase, MergeIntoNotMatchedBySourceSuite.

scottsand-db pushed a commit to scottsand-db/delta that referenced this pull request Jun 26, 2023
This change is part of a larger effort to improve merge performance, see delta-io#1827

## Description
This change rewrites the way modified data is written out in merge to improve performance. `writeAllChanges` now generates a dataframe containing all the updated and copied rows to write out by building a large expression that selectively applies the right merge action to each row. This replaces the previous method that relied on applying a function to individual rows.

Changes:
- Move `findTouchedFiles` and `writeAllChanges` to a new dedicated trait, `ClassicMergeExecutor`, which implements the regular merge path when `InsertOnlyMergeExecutor` is not used.
- Introduce methods in `MergeOutputGeneration` to transform the merge clauses into expressions that can be applied to generate the output of the merge operation (both main data and CDC data).

This change fully preserves the behavior of merge, which is extensively tested in `MergeIntoSuiteBase`, `MergeIntoSQLSuite`, `MergeIntoScalaSuite`, `MergeCDCSuite`, `MergeIntoMetricsBase`, `MergeIntoNotMatchedBySourceSuite`.

Closes delta-io#1854

GitOrigin-RevId: d8c8a0e9439c6710978f2ec345cb94b2b9b19e0e
scottsand-db pushed a commit that referenced this pull request Jun 26, 2023
Closes #1854

GitOrigin-RevId: d8c8a0e9439c6710978f2ec345cb94b2b9b19e0e