Improve generating and writing out changes in Merge
This change is part of a larger effort to improve merge performance; see #1827.

## Description
This change rewrites the way modified data is written out in merge to improve performance. `writeAllChanges` now generates a dataframe containing all the updated and copied rows to write out by building a large expression that selectively applies the right merge action to each row. This replaces the previous method that relied on applying a function to individual rows.
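The idea of compiling the merge clauses into one expression per output column can be sketched with a toy model. This is not Delta's implementation or API: `Clause`, `case_when`, `build_output_exprs`, and the `matched` / `s_value` column names are all hypothetical, and plain Python closures stand in for Spark expressions. Only updates and copies are modelled.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]
Expr = Callable[[Row], Any]  # toy stand-in for an expression evaluated against a row


def col(name: str) -> Expr:
    """Expression that reads a column from the (joined) row."""
    return lambda row: row[name]


def lit(value: Any) -> Expr:
    """Expression that always yields a constant."""
    return lambda row: value


def case_when(branches: List[Tuple[Expr, Expr]], otherwise: Expr) -> Expr:
    """First branch whose condition holds wins, like SQL CASE WHEN ... ELSE."""
    def evaluate(row: Row) -> Any:
        for condition, value in branches:
            if condition(row):
                return value(row)
        return otherwise(row)
    return evaluate


@dataclass
class Clause:
    """Hypothetical merge clause: a condition plus per-column assignments."""
    condition: Expr
    assignments: Dict[str, Expr]  # columns not listed here keep their old value


def build_output_exprs(columns: List[str], clauses: List[Clause]) -> Dict[str, Expr]:
    """Build ONE expression per output column that picks the right action."""
    exprs: Dict[str, Expr] = {}
    for c in columns:
        branches = [(cl.condition, cl.assignments.get(c, col(c))) for cl in clauses]
        # If no clause applies, the row is copied unchanged.
        exprs[c] = case_when(branches, otherwise=col(c))
    return exprs


# Rows as they might look after joining source and target (assumed shape).
joined = [
    {"id": 1, "value": 10, "s_value": 100, "matched": True},
    {"id": 2, "value": 20, "s_value": None, "matched": False},
    {"id": 3, "value": -5, "s_value": 7, "matched": True},
]

clauses = [
    Clause(lambda r: r["matched"] and r["s_value"] > 50, {"value": col("s_value")}),
    Clause(lambda r: r["matched"], {"value": lit(0)}),
]

# The expressions are built once, then applied uniformly to every row.
exprs = build_output_exprs(["id", "value"], clauses)
output = [{c: e(r) for c, e in exprs.items()} for r in joined]
```

The point of the design is that clause selection happens inside the expression itself, so an engine can evaluate it over whole batches of rows instead of calling an opaque per-row function.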

Changes:
- Move `findTouchedFiles` and `writeAllChanges` to a new dedicated trait, `ClassicMergeExecutor`, which implements the regular merge path used when `InsertOnlyMergeExecutor` is not.
- Introduce methods in `MergeOutputGeneration` to transform the merge clauses into expressions that can be applied to generate the output of the merge operation (both main data and CDC data).

This change fully preserves the behavior of merge, which is extensively tested in `MergeIntoSuiteBase`, `MergeIntoSQLSuite`, `MergeIntoScalaSuite`, `MergeCDCSuite`, `MergeIntoMetricsBase`, and `MergeIntoNotMatchedBySourceSuite`.

Closes #1854

GitOrigin-RevId: d8c8a0e9439c6710978f2ec345cb94b2b9b19e0e
johanl-db authored and scottsand-db committed Jun 26, 2023
1 parent 3c264fa commit ead1ff0
Showing 5 changed files with 886 additions and 678 deletions.