[Spark][1.0] Fix a data loss bug in MergeIntoCommand #2128
base: branch-1.0
Conversation
Force-pushed from 3b9af6a to 09c9d83
Seems the fix is needed for all other versions; the same code is in MergeIntoCommand.
@johanl-db Could you please take a look at this?
Thanks @sezruby for contributing a fix.
This pattern of accessing the spark conf is fairly widespread; I suspect a lot of things would break if there were an issue with it. You're seeing different values for
1 => We have logs, and I can find the case in spark 3.2/1.2 too, using
I think the issue exists in higher versions too, as it's the same code.

2, 4 => No repro test, but I guess the repro condition is

3 => I need to confirm whether each MERGE request uses the same spark session or different spark sessions (I guess it's different spark sessions). However, both cases could be the problem; we should throw the exception if the config is changed within the same spark session, too. The problem is that SQLConf.get refers to the cached conf, which comes from getActiveSession(), but the active session can be changed before the SQLConf is cached for the thread. Yes, I know many places use the conf from the active session, but if the config affects correctness, we should avoid that.
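For illustration, here is a minimal standalone sketch (not Delta code; it assumes a local driver-side session and uses the schema-evolution key discussed in this thread) of how SQLConf.get follows whichever session is active on the current thread:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.internal.SQLConf

val spark1 = SparkSession.builder().master("local[*]").appName("conf-demo").getOrCreate()
val spark2 = spark1.newSession() // isolated SQL conf, same SparkContext

// Set the schema-evolution conf only on spark1's session conf.
spark1.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

SparkSession.setActiveSession(spark1)
// SQLConf.get resolves through the active session -> sees "true"
println(SQLConf.get.getConfString("spark.databricks.delta.schema.autoMerge.enabled", "<not set>"))

SparkSession.setActiveSession(spark2)
// Same thread, different active session -> the value is no longer visible
println(SQLConf.get.getConfString("spark.databricks.delta.schema.autoMerge.enabled", "<not set>"))
```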
I confirmed they are using multiple threads (Python) to run MERGE commands against tables concurrently, but they are not changing the config nor creating a new session for each thread. I suspect the default session is set when one of the threads exits withActive.
Here's a PySpark repro:

```python
# repro.py
import tempfile
from multiprocessing.pool import ThreadPool

from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from delta.tables import DeltaTable

spark = SparkSession.builder.appName("Test").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")

with tempfile.TemporaryDirectory() as temp_dir:
    source_path = f"{temp_dir}/source"
    target_path = f"{temp_dir}/target"

    def f(spark):
        print(f"spark = {spark}")
        print(f"active session = {SparkSession.getActiveSession()}")
        print(f"autoMerge from SQLConf.get = {spark._jvm.org.apache.spark.sql.internal.SQLConf.get().getConfString('spark.databricks.delta.schema.autoMerge.enabled', '<not set>')}")
        spark.range(20) \
            .withColumn("x", col("id")) \
            .withColumn("y", col("id")) \
            .write.mode("overwrite").format("delta").save(source_path)
        spark.range(1) \
            .withColumn("x", col("id")) \
            .write.mode("overwrite").format("delta").save(target_path)
        # Idk why but DeltaTable.forPath(spark, target_path) fails
        target = DeltaTable(spark, spark._jvm.io.delta.tables.DeltaTable.forPath(spark._jsparkSession, target_path))
        source = spark.read.format("delta").load(source_path).alias("s")
        target.alias("t") \
            .merge(source, "t.id = s.id") \
            .whenMatchedUpdate(set = {"t.x": "t.x + 1"}) \
            .whenNotMatchedInsertAll() \
            .execute()
        spark.read.format("delta").load(target_path).show()
        assert spark.read.format("delta").load(target_path).count() == 20

    pool = ThreadPool(1)
    pool.starmap(f, [(spark,)])
```

Run as:
This is because a new thread is created in the JVM process and that thread doesn't inherit any ThreadLocal variables, including the active session reference. I think a better fix (if we want to minimize changes) is wrapping the execution. A more fundamental fix would be to remove all references to the active session and ensure we only reference the same SparkSession throughout the operation (a rough sketch of that shape follows the Scala repro below).

Here's a Scala repro:

```scala
withTempDir { dir =>
  val r = dir.getCanonicalPath
  val sourcePath = s"$r/source"
  val targetPath = s"$r/target"
  val spark2 = spark.newSession
  spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")
  spark.range(20)
    .withColumn("x", $"id")
    .withColumn("y", $"id")
    .write.mode("overwrite").format("delta").save(sourcePath)
  spark.range(1)
    .withColumn("x", $"id")
    .write.mode("overwrite").format("delta").save(targetPath)
  val target = io.delta.tables.DeltaTable.forPath(spark, targetPath)
  val source = spark.read.format("delta").load(sourcePath).alias("s")
  val merge = target.alias("t")
    .merge(source, "t.id = s.id")
    .whenMatched.updateExpr(Map("t.x" -> "t.x + 1"))
    .whenNotMatched.insertAll()
  SparkSession.setActiveSession(spark2)
  merge.execute()
  assert(spark.read.format("delta").load(targetPath).count() == 20)
}
```
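For reference, this is a rough sketch of the "wrap the execution" shape mentioned above, not the actual patch; the helper name `withActiveSession` and the call site are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper: run `body` with `spark` as the thread's active session,
// restoring whatever was active before. That way SQLConf.get inside the command
// resolves against the same session the DeltaTable was built with.
def withActiveSession[T](spark: SparkSession)(body: => T): T = {
  val previous = SparkSession.getActiveSession
  SparkSession.setActiveSession(spark)
  try body
  finally previous match {
    case Some(s) => SparkSession.setActiveSession(s)
    case None    => SparkSession.clearActiveSession()
  }
}

// e.g. (hypothetical call site) inside DeltaMergeBuilder.execute():
// withActiveSession(targetTableSparkSession) { runMergeCommand() }
```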
cc @scottsand-db for 3.0
Rephrasing your findings @clee704, let me know if you disagree: the issue is essentially that in DeltaMergeBuilder.execute(), we use the spark session from the target table, which gets propagated to the analysis step and is ultimately the one used to get the schema evolution conf value, but in MergeIntoCommand we get the value from SQLConf.get, which follows the current thread's active session.

I agree with wrapping the execution. Other commands may also suffer from the same issue; we should review whether UPDATE/INSERT, DELETE, or utility commands need the same fix.
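In code terms, the two read paths could be summarized roughly like this (a simplified sketch, not the actual Delta implementation; the function names are illustrative only):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.internal.SQLConf

val autoMergeKey = "spark.databricks.delta.schema.autoMerge.enabled"

// During analysis (PreprocessTableMerge): the conf comes from the table's own session.
def confSeenByAnalysis(tableSession: SparkSession): String =
  tableSession.conf.get(autoMergeKey, "false")

// During execution (MergeIntoCommand): the conf comes from whichever session is
// active on the executing thread at that moment.
def confSeenByExecution(): String =
  SQLConf.get.getConfString(autoMergeKey, "false")
```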
Signed-off-by: Eunjin Song <sezruby@gmail.com>
Co-authored-by: Chungmin Lee <lee@chungmin.dev>
@johanl-db Yes, that's a good summary. For other potential issues, we can do a follow-up.
@johanl-db Any delivery plan from Databricks by any chance?
@sezruby The change looks good; we should move forward with it and cover all Delta command executions that we identify.
Having a test for each of them would be nice if you can, but the tests you already have highlight the issue well enough to justify this fix.
AFAICT, the issue is present in all Delta versions, but for MERGE specifically it no longer silently produces incorrect results and instead fails, starting with ead1ff0, which will be included in the next release.
I'm not very familiar with our backport and release policy for older Delta branches; I'll need to check. You may want to target master with this PR first, and then we can work on backporting as needed.
Files with resolved (outdated) review threads:
- core/src/main/scala/io/delta/tables/execution/DeltaTableOperations.scala
- core/src/test/scala/org/apache/spark/sql/delta/MergeIntoSuiteBase.scala
- core/src/main/scala/org/apache/spark/sql/delta/DeltaTable.scala
- core/src/main/scala/io/delta/tables/execution/DeltaConvert.scala
LGTM!
@scottsand-db This is a bug that impacts all Delta versions; what's the process for getting the fix into existing branches?
I created cherry-picks to all 2.x branches and master:
@sezruby do you mind quickly going over them to check that I didn't mess anything up?
@sezruby I updated the changes against all branches except 1.0 to cover executeDetails, executeRestore, executeGenerate and optimize. Can you update this PR against 1.0 before we merge it?
I updated it for executeGenerate. No other commands are available in 1.0.
@johanl-db, is this PR, #2154, related to this issue?
Same category of issues but these are actually unrelated:
Short update from my side on this PR since it has been stalled for a while: it's still on my radar, but things have been a bit slow due to the 3.0 release. The fix was merged in 3.0 before the release; I still need to cover the other branches.
This is a cherry-pick of #2128 to the master branch.

#### Which Delta project/connector is this regarding?

- [x] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

## Description

Fix a data loss bug in MergeIntoCommand. It's caused by using a different spark session config object for PreprocessTableMerge and MergeIntoCommand, which is possible when multiple spark sessions are running concurrently.

If the source dataframe has more columns than the target table, the auto schema merge feature adds the additional columns as nullable to the target table schema. The updated output projection is built in PreprocessTableMerge, so `matchedClauses` and `notMatchedClauses` contain the additional columns, but the target table schema in MergeIntoCommand doesn't have them. As a result, the following index no longer points at the delete flag column, which is at `numFields - 2`:

```
def shouldDeleteRow(row: InternalRow): Boolean =
  row.getBoolean(outputRowEncoder.schema.fields.size)
```

row.getBoolean returns `getByte() != 0`, which causes rows to be dropped randomly:
- matched rows in the target table are lost

Also, as autoMerge doesn't work:
- newly added column data from the source dataframe is lost

The fix makes sure MergeIntoCommand uses the same spark session / config object.

Fixes #2104

I confirmed that #2104 is fixed with the change. I confirmed the following by debug log messages without the change:
1. matchedClauses has more columns after processRow
2. row.getBoolean(outputRowEncoder.schema.fields.size) refers to a random column value (it's an Unsafe read)
3. canMergeSchema in MergeIntoCommand is false, while it was true in PreprocessTableMerge

## Does this PR introduce _any_ user-facing changes?

Yes, it fixes the data loss issue.

Closes #2162

Co-authored-by: Chungmin Lee <lee@chungmin.dev>
Signed-off-by: Johan Lasperas <johan.lasperas@databricks.com>

GitOrigin-RevId: 49acacf8ff1c71d7e6bcb2dc2f709c325211430a
Which Delta project/connector is this regarding?
Description
Fix a data loss bug in MergeIntoCommand.
It's caused by using a different spark session config object in PreprocessTableMerge and MergeIntoCommand. In PreprocessTableMerge, the config value comes from the spark session that was given to the DeltaTable. In MergeIntoCommand, it refers to SQLConf.get, which comes from the active session of the current thread. These can differ when the user sets another new spark session, or when the active session just isn't set properly before execution.
If the source dataframe has more columns than the target table, the auto schema merge feature adds the additional columns as nullable to the target table schema. The updated output projection is built in PreprocessTableMerge, so `matchedClauses` and `notMatchedClauses` contain the additional columns, but the target table schema in MergeIntoCommand doesn't have them. As a result, the index used by shouldDeleteRow, `row.getBoolean(outputRowEncoder.schema.fields.size)`, no longer points at the delete flag column, which is at `numFields - 2`. row.getBoolean returns `getByte() != 0`, which causes rows to be dropped randomly, so matched rows in the target table are lost. Also, as autoMerge doesn't work, newly added column data from the source dataframe is lost.
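To illustrate the mis-read, here is a standalone sketch using Spark's UnsafeRow APIs (not Delta code; the column name `_row_dropped_` is illustrative): reading a boolean at the wrong ordinal reinterprets whatever byte happens to be stored there, so the flag effectively depends on unrelated column data.

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.UnsafeProjection
import org.apache.spark.sql.types._

// Two-column row: a payload long and the real delete-flag boolean.
val schema = StructType(Seq(
  StructField("x", LongType),
  StructField("_row_dropped_", BooleanType)))
val row = UnsafeProjection.create(schema)(InternalRow(42L, false))

println(row.getBoolean(1)) // correct ordinal: false -> the row is kept
// Reading the flag at the wrong ordinal reinterprets a byte of the long value
// (42 != 0 -> true on a typical little-endian machine), so the row would be dropped.
println(row.getBoolean(0))
```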
The fix makes sure the active session is set to the spark session that was given for the target table. The PR applies the fix to other commands as well, to avoid any inconsistent config behavior.
Fixes #2104
How was this patch tested?
I confirmed that #2104 is fixed with the change.
I confirmed the following by debug log messages without the change:
1. matchedClauses has more columns after processRow
2. row.getBoolean(outputRowEncoder.schema.fields.size) refers to a random column value (it's an Unsafe read)
3. canMergeSchema in MergeIntoCommand is false, while it was true in PreprocessTableMerge
Does this PR introduce any user-facing changes?
Yes, it fixes the data loss issue.