Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature Request] [MERGE] Avoid copy rows when match clause is false. Potential for huge perf impact. #1812

Open
2 of 8 tasks
felipepessoto opened this issue Jun 2, 2023 · 4 comments
Labels
bug Something isn't working

Comments

@felipepessoto
Copy link
Contributor

felipepessoto commented Jun 2, 2023

Feature request

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Overview

The method findTouchedFiles in MergeIntoCommand only filter files by the condition (ON clause). Rewriting the entire table even when the match clause is false.

Motivation

This is a big problem when you merge two big tables and match clause is mostly false, but ON clause matches most of the target table, like the example below.

Further details

def showMetrics(targetDataPathParam: String) {
  val deltaTable = io.delta.tables.DeltaTable.forPath(spark, targetDataPathParam)
   val operationMetrics = deltaTable.history(1).select("operationMetrics").head.toString()
   println(s"Metrics for $targetDataPathParam: \r\n$operationMetrics\r\n")
}

spark.conf.set("spark.sql.files.maxRecordsPerFile", 1)

val rows = 1000

val data = spark.range(rows).withColumn("name", lit("John"))

val df = data.toDF("id", "name")

df.write.format("delta").save(tablePath)

val source = data.toDF("id2", "name2")

source.createOrReplaceTempView("test_data")

spark.sql("MERGE INTO delta.`" + tablePath + """` t 
   USING test_data s
   ON s.id2 = t.id
   WHEN MATCHED AND NOT (s.name2 <=> t.name) THEN UPDATE SET t.name = s.name2
   WHEN NOT MATCHED THEN INSERT (id, name) VALUES (s.id2, s.name2)""")

showMetrics(tablePath)

Observed results

numTargetRowsCopied -> 1000
numOutputRows -> 1000
numTargetFilesRemoved -> 1000
numTargetFilesAdded -> 1000

Expected results

numTargetRowsCopied -> 0
numOutputRows -> 0
numTargetFilesRemoved -> 0
numTargetFilesAdded -> 0

Further details

Environment information

  • Delta Lake version: 2.2.0
  • Spark version: 3.3
  • Scala version: 2.12

Willingness to contribute

The Delta Lake Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the Delta Lake code base?

  • Yes. I can contribute a fix for this bug independently.
  • Yes. I would be willing to contribute a fix for this bug with guidance from the Delta Lake community.
  • No. I cannot contribute a bug fix at this time.
@keen85
Copy link

keen85 commented Apr 19, 2024

We just ran into this very bug using Delta 2.2.0 / 2.4.0.
I really think that this is something that should be fixed.

Our workaround is to include the update-condition into the merge-condition as well.

For us, this reduced execution time from 55 minutes to 10 minutes.

@keen85
Copy link

keen85 commented Apr 19, 2024

@johanl-db can you tell if this issue is covered by #1827?

@johanl-db
Copy link
Collaborator

johanl-db commented Apr 19, 2024

@keen85 If your merge only contains whenMatched update/delete clauses (and no whenNotMatched or whenNotMatchedBySource clauses) then it will benefit from #1851 (part of a series of improvements in #1827). You will need to upgrade to Delta 3.0 or above to benefit from that change.

@keen85
Copy link

keen85 commented Apr 19, 2024

thanks a lot @johanl-db.

Should this issue then be marked as resolved?

@felipepessoto felipepessoto changed the title [BUG] Merge copy rows even when match clause is false. Potential for huge perf impact. [Feature Request] [MERGE] Avoid copy rows when match clause is false. Potential for huge perf impact. Jul 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants