[Feature Request] Merge performance improvements #1827
Comments
This is fantastic. That's a pretty good bump in performance!
Great! Related: #1812
## Description

This change is a plain refactor that lays the groundwork for future work on overall merge performance, see #1827. It creates a base merge class that gathers the functionality shared by the current insert-only and classic merge code paths, allowing them to be split apart in a following PR. Shared code to collect merge statistics is already added there.

This is a non-functional refactor covered by extensive merge tests, in particular the MergeIntoMetricsBase tests.

Closes #1834

GitOrigin-RevId: caf346b4136e6738e30bd15219eaaeabbd833bd5
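As a rough picture, the structure this sets up looks something like the sketch below. The executor trait names come from the follow-up PR; everything else is an assumption, not the actual Delta classes.

```scala
// Sketch only: a shared base gathering functionality common to both merge
// code paths so they can be split into separate traits in a follow-up PR.
trait MergeIntoCommandBase { // assumed name
  // Shared merge-statistics collection lives in the base; the real code
  // reports through SQLMetrics and the Delta commit info.
  protected def collectMergeStats(
      rowsUpdated: Long,
      rowsDeleted: Long,
      rowsInserted: Long): Unit = ()
}

// The two execution paths split out in the follow-up PR both extend the base:
trait InsertOnlyMergeExecutor extends MergeIntoCommandBase
trait ClassicMergeExecutor extends MergeIntoCommandBase
```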
This change is part of a larger effort to improve merge performance, see #1827.

## Description

This change rewrites the way modified data is written out in merge to improve performance. `writeAllChanges` now generates a dataframe containing all the updated and copied rows to write out by building a large expression that selectively applies the right merge action to each row. This replaces the previous method, which relied on applying a function to individual rows.

Changes:
- Move `findTouchedFiles` and `writeAllChanges` to a dedicated new trait `ClassicMergeExecutor` implementing the regular merge path, taken when `InsertOnlyMergeExecutor` is not used.
- Introduce methods in `MergeOutputGeneration` to transform the merge clauses into expressions that can be applied to generate the output of the merge operation (both main data and CDC data).

This change fully preserves the behavior of merge, which is extensively tested in `MergeIntoSuiteBase`, `MergeIntoSQLSuite`, `MergeIntoScalaSuite`, `MergeCDCSuite`, `MergeIntoMetricsBase`, and `MergeIntoNotMatchedBySourceSuite`.

Closes #1854

GitOrigin-RevId: d8c8a0e9439c6710978f2ec345cb94b2b9b19e0e
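In spirit, the rewrite folds the merge clauses into one big conditional expression that is evaluated as an ordinary projection. A minimal sketch, assuming made-up clause and column names (this is not the actual `MergeOutputGeneration` API):

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, when}

// Each merge clause contributes (condition, value) for a given output column.
case class ClauseOutput(condition: Column, value: Column)

// Fold the clauses into a single CASE WHEN ... ELSE <copy> END expression,
// so one projection replaces per-row function application.
def outputColumn(clauses: Seq[ClauseOutput], copiedValue: Column): Column =
  clauses.foldRight(copiedValue) { (clause, fallback) =>
    when(clause.condition, clause.value).otherwise(fallback)
  }

// writeAllChanges then amounts to selecting these expressions in one pass
// over the joined source/target rows (aliases "source"/"target" assumed):
def writeAllChangesSketch(joined: DataFrame): DataFrame = {
  val valueOut = outputColumn(
    Seq(ClauseOutput(col("source.key").isNotNull, col("source.value"))),
    copiedValue = col("target.value")) // unmatched target rows copied as-is
  joined.select(col("target.key").as("key"), valueOut.as("value"))
}
```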
Closed by #1854!
@johanl-db @scottsand-db, this issue is closed; are you still proceeding with #1851?
#1852 was stacked on top of #1851 but got merged first, picking up the changes from #1851 in the process, which wasn't intentional. That does mean the changes from #1851 are effectively merged and we can close that PR, though we end up with a single commit for both #1851 and #1852.
This PR extends the existing benchmarks with new test cases dedicated to the MERGE INTO command, with two scale factors: 1GB & 3TB. The following types of test cases are created and can be extended in the future:
- `SingleInsertOnlyTestCase`
- `MultipleInsertOnlyTestCase`
- `DeleteOnlyTestCase`
- `UpsertTestCase`

Each test case uses the same (cloned) target table and defines its source table using the following parameters:
- `fileMatchedFraction`: The fraction of target files sampled to create the source table.
- `rowMatchedFraction`: The fraction of rows sampled in each selected target file to form the rows that match the `ON` condition.
- `rowNotMatchedFraction`: The fraction of rows sampled in each selected target file to form the rows that don't match the `ON` condition.

The target and source tables are created using the `merge-1gb-delta-load`/`merge-3tb-delta-load` workloads, which collect all the source table configurations used in the merge test cases and create the required source tables.

This benchmark is added to measure the impact of a series of changes to the merge command, see #1827. I followed the instructions in https://github.com/delta-io/delta/tree/master/benchmarks to create an EMR cluster and run the new benchmarks. Here are the results comparing the impact of #1827:

Test case | Base duration (s) | Test duration (s) | Improvement ratio
-- | -- | -- | --
delete_only_fileMatchedFraction_0.05_rowMatchedFraction_0.05 | 26.1 | 20.5 | 1.27
multiple_insert_only_fileMatchedFraction_0.05_rowNotMatchedFraction_0.05 | 8.8 | 15.2 | 0.58
multiple_insert_only_fileMatchedFraction_0.05_rowNotMatchedFraction_0.5 | 27.7 | 17.5 | 1.58
multiple_insert_only_fileMatchedFraction_0.05_rowNotMatchedFraction_1.0 | 36.3 | 21.2 | 1.71
single_insert_only_fileMatchedFraction_0.05_rowNotMatchedFraction_0.05 | 14.9 | 14.8 | 1.01
single_insert_only_fileMatchedFraction_0.05_rowNotMatchedFraction_0.5 | 17.5 | 17.3 | 1.01
single_insert_only_fileMatchedFraction_0.05_rowNotMatchedFraction_1.0 | 20.3 | 20.7 | 0.98
upsert_fileMatchedFraction_0.05_rowMatchedFraction_0.01_rowNotMatchedFraction_0.1 | 39.5 | 28.8 | 1.37
upsert_fileMatchedFraction_0.05_rowMatchedFraction_0.0_rowNotMatchedFraction_0.1 | 19.9 | 19.3 | 1.03
upsert_fileMatchedFraction_0.05_rowMatchedFraction_0.1_rowNotMatchedFraction_0.0 | 39.1 | 29.9 | 1.31
upsert_fileMatchedFraction_0.05_rowMatchedFraction_0.1_rowNotMatchedFraction_0.01 | 39.1 | 31.0 | 1.26
upsert_fileMatchedFraction_0.05_rowMatchedFraction_0.5_rowNotMatchedFraction_0.001 | 41.9 | 32.5 | 1.29
upsert_fileMatchedFraction_0.05_rowMatchedFraction_0.99_rowNotMatchedFraction_0.001 | 43.3 | 33.8 | 1.28
upsert_fileMatchedFraction_0.05_rowMatchedFraction_1.0_rowNotMatchedFraction_0.001 | 43.8 | 34.1 | 1.28
upsert_fileMatchedFraction_0.5_rowMatchedFraction_0.01_rowNotMatchedFraction_0.001 | 147.9 | 84.8 | 1.74
upsert_fileMatchedFraction_1.0_rowMatchedFraction_0.01_rowNotMatchedFraction_0.001 | 266.9 | 142.5 | 1.87

Closes #1835

GitOrigin-RevId: 443099e8a02b98fffe5e5a9ec2cecb5d3b8f9537
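For intuition, the sampling these parameters describe can be sketched as below. This is a hand-wavy illustration of the semantics, not the benchmark's actual code; the hypothetical `buildSourceTable` takes one DataFrame per target file.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, negate}

// Illustrative sketch: sample a fraction of the target's files, then within
// each sampled file draw rows that will match the ON condition and rows
// whose keys are remapped so they deliberately miss it.
def buildSourceTable(
    targetFiles: Seq[DataFrame], // hypothetically, one DataFrame per file
    fileMatchedFraction: Double,
    rowMatchedFraction: Double,
    rowNotMatchedFraction: Double): DataFrame = {
  val sampledFiles = scala.util.Random.shuffle(targetFiles)
    .take(math.max(1, (targetFiles.size * fileMatchedFraction).toInt))
  val matching = sampledFiles.map(_.sample(rowMatchedFraction))
  val nonMatching = sampledFiles.map(
    _.sample(rowNotMatchedFraction)
      .withColumn("key", negate(col("key")))) // flipped keys miss the ON condition
  (matching ++ nonMatching).reduce(_ union _)
}
```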
What setup did you use to produce the gains mentioned, e.g. your instance sizes and any configuration overrides?
Feature request
Overview
This feature request covers a number of performance improvements for the MERGE INTO command.
Motivation
I have a prototype of improvements to the MERGE command that shows up to 2x faster execution, with the biggest gains on delete-only merges and merges touching a large number of files. The results were gathered using new benchmark test cases for merge that can be added to the existing benchmarks.
Performance improvement: (benchmark chart omitted; see the results table under #1835 above)
Further details
The following improvements contribute to better overall performance:
Data skipping for matched-only merges using the matched condition: In the case where the merge statement only includes MATCHED clauses, we can benefit from data skipping by using the target-only MATCHED conditions when scanning the target for matches.
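For example, consider a matched-only merge like the following (an illustrative statement; table and column names are assumed, and `spark` is a running SparkSession):

```scala
// Illustrative matched-only merge: both WHEN MATCHED conditions reference
// only the target table, so they can drive data skipping on the target scan.
spark.sql("""
  MERGE INTO target
  USING source
  ON source.key = target.key
  WHEN MATCHED AND target.value = 5 THEN UPDATE SET value = source.value
  WHEN MATCHED AND target.value = 7 THEN DELETE
""")
```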
For such a merge, we can skip scanning files using the predicate:
`target.value = 5 OR target.value = 7`
Dedicated path for insert-only merges with more than one insert clause: The current insert-only code path only supports a single NOT MATCHED clause. We can extend it to support any number of NOT MATCHED clauses, as in the sketch below.
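For instance, an insert-only merge with two NOT MATCHED clauses (illustrative names again) would today fall back to the slower classic path:

```scala
// Illustrative insert-only merge with two NOT MATCHED clauses; extending the
// fast path to multiple clauses lets merges like this avoid the classic path.
spark.sql("""
  MERGE INTO target
  USING source
  ON source.key = target.key
  WHEN NOT MATCHED AND source.value > 0
    THEN INSERT (key, value) VALUES (source.key, source.value)
  WHEN NOT MATCHED
    THEN INSERT (key, value) VALUES (source.key, 0)
""")
```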
More efficient classic merge execution: We can improve overall merge performance by optimizing the way we write out modified rows: instead of applying a function to rows partition by partition, we can build a single large expression that processes all rows at once (see the sketch after the #1854 description above).
Native counter for metrics instead of UDFs: We currently use UDFs to increment metrics across MERGE/UPDATE/DELETE. We can instead introduce a dedicated native expression that can be codegened.
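For context, the UDF-based pattern being replaced looks roughly like this; a minimal sketch with assumed names, not Delta's exact helper:

```scala
import org.apache.spark.sql.execution.metric.SQLMetric
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf

// Sketch of a UDF-based counter: it bumps a SQLMetric as a side effect each
// time a row passes through it (e.g. df.filter(makeMetricUpdateUDF(metric)())).
// Being an opaque UDF, it cannot be code-generated the way a dedicated native
// expression could, which is the motivation for this item.
def makeMetricUpdateUDF(metric: SQLMetric): UserDefinedFunction =
  udf { () => metric.add(1); true }.asNondeterministic()
```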
Project Plan
Willingness to contribute
The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?