
Conversation

@szehon-ho
Member

@szehon-ho szehon-ho commented Nov 4, 2025

What changes were proposed in this pull request?

Change the scope of MERGE INTO schema evolution: limit it to only adding columns/nested fields that exist in the source and are directly assigned from the corresponding source column without transformation.

For example:

UPDATE SET new_col = source.new_col 
UPDATE SET struct.new_field = source.struct.new_field
INSERT (old_col, new_col) VALUES (s.old_col, s.new_col)
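
For illustration, a minimal sketch of such a statement (the `target`/`source` table names and `new_col`/`extra_col` columns are hypothetical, and the `WITH SCHEMA EVOLUTION` clause follows the syntax added in #51698):

```scala
// Hypothetical schemas: target(pk, dep), source(pk, dep, new_col, extra_col).
// Under this PR, only `new_col` is added to the target: it is the only missing
// column directly assigned from the same-named source column. `extra_col` is
// never referenced by an assignment, so it is not added.
spark.sql(
  """MERGE WITH SCHEMA EVOLUTION INTO target t
    |USING source s
    |ON t.pk = s.pk
    |WHEN MATCHED THEN
    |  UPDATE SET new_col = s.new_col
    |WHEN NOT MATCHED THEN
    |  INSERT (pk, dep, new_col) VALUES (s.pk, s.dep, s.new_col)
    |""".stripMargin)
```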

Why are the changes needed?

#51698 added schema evolution support for MERGE INTO statements. However, it is a bit too broad: in some instances the source table may have many more fields than the target table, but the user may only need a few new ones added to the target for the MERGE INTO statement.

Does this PR introduce any user-facing change?

No, MERGE INTO schema evolution is not yet released in Spark 4.1.

How was this patch tested?

Added many unit tests in MergeIntoTableSuiteBase.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Nov 4, 2025
@szehon-ho szehon-ho force-pushed the merge_schema_evolution_limit_cols branch 3 times, most recently from 41731d2 to 6c6de51 Compare November 4, 2025 20:02
@szehon-ho
Member Author

@cloud-fan @aokolnychyi can you take a look? I think this is an important improvement to get in before we release the MERGE INTO WITH SCHEMA EVOLUTION feature in Spark 4.1, thanks!

@szehon-ho szehon-ho force-pushed the merge_schema_evolution_limit_cols branch from 6c6de51 to 24b1a51 Compare November 4, 2025 20:06
|USING source s
|ON t.pk = s.pk
|WHEN MATCHED THEN
| UPDATE SET dep='software'
Contributor

This test is weird: dep is an existing column in the target table, and we certainly do not need to do schema evolution. What was the behavior before this PR?

Member Author

Oh, it's because the source table has more columns, but they are not used.
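
Restated as a sketch (schemas are hypothetical): the source is wider than the target, but none of the extra columns is assigned, so the target schema should stay unchanged:

```scala
// target: (pk, salary, dep); source: (pk, salary, dep, extra1, extra2).
// The extra source columns are never assigned, so even with schema evolution
// enabled the target schema must not change.
spark.sql(
  """MERGE WITH SCHEMA EVOLUTION INTO target t
    |USING source s
    |ON t.pk = s.pk
    |WHEN MATCHED THEN
    |  UPDATE SET dep = 'software'
    |""".stripMargin)
```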

|ON t.pk = s.pk
|WHEN NOT MATCHED THEN
| INSERT (pk, info, dep) VALUES (s.pk,
| named_struct('salary', s.info.salary, 'status', 'active'), 'marketing')
Contributor

Why do we trigger schema evolution for this case?

Member Author

Discussed offline; refined the logic to be more selective (only direct assignment from a source column).
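
To illustrate the refined rule (hypothetical names; `new_col` does not exist in the target): a plain column-to-column assignment evolves the schema, while a transformed value does not:

```scala
// Triggers evolution: `new_col` is missing from the target and is assigned
// directly from the same-named source column.
spark.sql(
  """MERGE WITH SCHEMA EVOLUTION INTO target t USING source s ON t.pk = s.pk
    |WHEN MATCHED THEN UPDATE SET new_col = s.new_col""".stripMargin)

// Does not trigger evolution: the value is a transformed expression, so
// `new_col` is never added and this statement fails analysis.
spark.sql(
  """MERGE WITH SCHEMA EVOLUTION INTO target t USING source s ON t.pk = s.pk
    |WHEN MATCHED THEN UPDATE SET new_col = upper(s.new_col)""".stripMargin)
```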

@szehon-ho szehon-ho force-pushed the merge_schema_evolution_limit_cols branch from 448bfdf to 8ecc4ad Compare November 6, 2025 22:59
@szehon-ho szehon-ho force-pushed the merge_schema_evolution_limit_cols branch from 8ecc4ad to abbeb1e Compare November 6, 2025 23:02
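// Source schema restricted to the columns/nested fields that schema evolution
// will add to the target, per the refined condition in this PR.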
private lazy val sourceSchemaForEvolution: StructType =
MergeIntoTable.sourceSchemaForSchemaEvolution(this)

lazy val needSchemaEvolution: Boolean = {
Contributor

@cloud-fan cloud-fan Nov 7, 2025

I think the rule ResolveMergeIntoSchemaEvolution should be triggered as long as MergeIntoTable#schemaEvolutionEnabled is true. This complicated logic should be moved into ResolveMergeIntoSchemaEvolution, and the rule should return the merge command unchanged if schema evolution is not needed.

To make ResolveMergeIntoSchemaEvolution more reliable with respect to rule ordering, we should wait for the merge assignment values to be resolved before entering the rule. At the beginning of the rule, resolve the merge assignment keys again so that rule order does not matter. We can stop early if the assignment values are not pure field references and there is no star.
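
A rough sketch of the suggested shape (only ResolveMergeIntoSchemaEvolution, MergeIntoTable, and schemaEvolutionEnabled come from the discussion; `needsEvolution`/`evolve` are hypothetical placeholders):

```scala
object ResolveMergeIntoSchemaEvolution extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators {
    // Always match when the user asked for schema evolution...
    case m: MergeIntoTable if m.schemaEvolutionEnabled =>
      // ...re-resolve the assignment keys here so rule order does not matter,
      // and return the command unchanged when no evolution is needed.
      if (needsEvolution(m)) evolve(m) else m
  }
}
```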

val assignmentValueExpr = extractFieldPath(assignment.value)
// Valid assignments are: col = s.col or col.nestedField = s.col.nestedField
assignmentKeyExpr.length == path.length && isPrefix(assignmentKeyExpr, path) &&
  isSuffix(path, assignmentValueExpr)
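
For context, a minimal sketch of what these path helpers might look like (the names isPrefix/isSuffix/extractFieldPath appear in the diff; the implementations below are assumptions, using the analyzer's case-sensitivity-aware resolver):

```scala
// Does `prefix` match the leading name parts of `path`?
private def isPrefix(prefix: Seq[String], path: Seq[String]): Boolean =
  prefix.length <= path.length &&
    prefix.zip(path).forall { case (a, b) => conf.resolver(a, b) }

// Does `suffix` match the trailing name parts of `path`?
private def isSuffix(path: Seq[String], suffix: Seq[String]): Boolean =
  isPrefix(suffix.reverse, path.reverse)
```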
Contributor

Is this only to skip the source table qualifier? It seems wrong to trigger schema evolution for col = wrong_table.col, which should fail analysis without schema evolution.

@szehon-ho szehon-ho force-pushed the merge_schema_evolution_limit_cols branch from 1a782ff to f23a985 Compare November 9, 2025 03:16
@szehon-ho szehon-ho force-pushed the merge_schema_evolution_limit_cols branch from fb77395 to 218b1c9 Compare November 10, 2025 06:07
@szehon-ho
Member Author

szehon-ho commented Nov 10, 2025

Discussed offline with @cloud-fan. The new condition unfortunately forces us to move ResolveMergeIntoSchemaEvolution to run after an initial pass of ResolveReferences. Because we are checking assignment values (from the source table), it is important that these are resolved, so we can be sure the assignment really comes from the source. The workflow is now a bit more complex:

Overall the logic is:

  1. ResolveReferences resolves all columns it can, but leaves an assignment unresolved when its key may be a column that does not yet exist in the target schema and will be added later by schema evolution.
  2. ResolveMergeIntoSchemaEvolution now runs after ResolveReferences. It must unresolve all expressions, because they were resolved against the old table; this triggers another run of ResolveReferences.
  3. The final run of ResolveReferences resolves all references based on the new target table.

Many changes:

  1. Change ResolveReferences to no longer eagerly throw an exception on the first run, but to throw on the second run, after schema evolution has been evaluated.
  2. Change ResolveReferences to expand UPDATE SET * and INSERT * to fill in missing assignments for columns that are in the source but not the target (to trigger the ResolveMergeIntoSchemaEvolution condition).
  3. ResolveMergeIntoSchemaEvolution: add a guard MergeIntoTable.canEvaluateSchemaEvolution so the rule does not trigger until ResolveReferences has run the first time (it checks that all assignments are either resolved or potentially resolvable by schema evolution); see the sketch after this list.
  4. ResolveMergeIntoSchemaEvolution: calculate sourceSchemaForEvolution, the subset of the source schema that will be added to the target, keeping only columns/nested fields that are the direct subject of an assignment whose key does not exist in the target but is assigned from a source column/field of the same name.
  5. ResolveMergeIntoSchemaEvolution: unresolve everything. Because the rule now runs after ResolveReferences, everything must be re-resolved once the target table changes. This triggers another run of ResolveReferences.
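
For item 3, a rough sketch of what the guard might look like (the exact predicate in the PR may differ; `resolvableBySchemaEvolution` is a hypothetical helper):

```scala
// Sketch: schema evolution may only be evaluated once ResolveReferences has
// had a first pass, i.e. every assignment is either fully resolved, or
// unresolved only because its key names a column/nested field that schema
// evolution could add to the target.
def canEvaluateSchemaEvolution(m: MergeIntoTable): Boolean = {
  val assignments = (m.matchedActions ++ m.notMatchedActions).flatMap {
    case u: UpdateAction => u.assignments
    case i: InsertAction => i.assignments
    case _               => Nil
  }
  assignments.forall(a => a.resolved || resolvableBySchemaEvolution(a, m))
}
```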

@szehon-ho szehon-ho force-pushed the merge_schema_evolution_limit_cols branch from 218b1c9 to 4050f94 Compare November 10, 2025 06:31
}
m.copy(targetTable = newTarget)

// Unresolve all references based on old target output
Contributor

This is tricky. I think we can just update their attribute ids? See QueryPlan#transformUpWithNewOutput.
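
A minimal sketch of the suggested approach (the relation-matching guard and helper below are assumptions):

```scala
m transformUpWithNewOutput {
  case r: DataSourceV2Relation if isEvolvedTarget(r) => // hypothetical guard
    val newRelation = relationWithEvolvedSchema(r)      // hypothetical helper
    // Return the new plan together with an old -> new attribute mapping;
    // Catalyst then rewrites references to the old ids throughout the plan.
    newRelation -> r.output.zip(newRelation.output.take(r.output.length))
}
```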

Member Author

ok done

Member Author

@szehon-ho szehon-ho Nov 11, 2025

Actually, originally I selectively targeted the DataSourceV2Relation under the targetTable field:

m.targetTable transform { 
  case r: DataSourceV2Relation => ... 
}

But now I need to run the rule on the top object (m), because the attributes to rewrite are in children of m.

I could not figure out how to get it to rewrite attributes if I match m itself, i.e.:

m transformWithNewOutput {
  case m: MergeIntoTable =>
    m.targetTable transform { case r: DataSourceV2Relation => ... }
}

because I think this method only populates attributeMap if the rule targets a child and not m itself? https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala#L349 unless I missed it. So instead I match the relation directly:

m transformWithNewOutput {
  case r @ DataSourceV2Relation(_: SupportsRowLevelOperationTable, ...) => ...
}

So now I make the assumption that the DataSourceV2Relation is the target table. Currently that is the case, because only the target table has a SupportsRowLevelOperationTable object, but just calling this out.

// This allows unresolved assignment keys a chance to be resolved in a second
// pass by new columns/nested fields added by schema evolution.
// If schema evolution has already had a chance to run, this will be the final pass.
val throws = !m.schemaEvolutionEnabled ||
Contributor

"Not throw" is not a big deal, CheckAnalysis will fail it correctly at the end. I think it's OK to swallow the error if the target table supports schema evolution.

Member Author

OK, but I have to keep it to preserve the behavior for the non-schema-evolution case.

- schemaEvolutionEnabled &&
-   MergeIntoTable.schemaChanges(targetTable.schema, sourceTable.schema).nonEmpty
+ canEvaluateSchemaEvolution &&
+   schemaChangesNonEmpty
Contributor

Do we really need this condition? If there is no schema change, the rule ResolveMergeIntoSchemaEvolution returns the original merge command. And if we do need a schema change, we calculate it twice with the current code.

Member Author

Moved. Actually, later rules still need this condition in order to run only when no schema evolution is needed, so I kept the flag for those, but simplified the ResolveMergeIntoSchemaEvolution condition.

@szehon-ho szehon-ho force-pushed the merge_schema_evolution_limit_cols branch from dada2b9 to ebe84ac Compare November 11, 2025 01:17