[SPARK-54157][SQL] Fix refresh of DSv2 tables in Dataset #52920
base: master
Conversation
    }
  }

  // refresh table versions before looking up cache
I believe we need a new stage because this must be done after analysis but before we normalize the plan for cache lookup. It would be great to use FinishAnalysis in the optimizer, but I feel like that is too late.
I also worry about just doing this refresh all the time. We could remember the analysis finish time and only refresh if it is older than 100ms or so, to avoid unnecessary refreshes.
Thoughts?
+1
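For illustration, a minimal sketch of the guard described above, assuming the analysis finish time is recorded on the query execution; the class name, method names, and the 100ms default are hypothetical and not code from this PR:

```scala
import java.util.concurrent.TimeUnit

// Hypothetical helper (not part of this PR): remember when analysis finished
// and only allow a table-version refresh once that timestamp is older than a
// small threshold, so analyze-then-execute-immediately flows skip the refresh.
class RefreshGuard(thresholdMs: Long = 100L) {

  @volatile private var analysisFinishNanos: Long = System.nanoTime()

  // Called when analysis completes.
  def markAnalysisFinished(): Unit = {
    analysisFinishNanos = System.nanoTime()
  }

  // Called right before normalizing the plan for the cache lookup.
  def shouldRefresh(): Boolean = {
    val elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - analysisFinishNanos)
    elapsedMs > thresholdMs
  }
}
```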
          errors += s"`${originCol.name}` type has changed from $oldType to $newType"
        }
      case None =>
        errors += s"${formatColumn(originCol)} is missing"
should it be s"${formatColumn(originCol)} has been deleted" to be consistent with the other error message?
  }

  // refresh table versions before looking up cache
  private val lazyTableVersionsPinned = LazyTry {
Suggested change:
-  private val lazyTableVersionsPinned = LazyTry {
+  private val lazyTableVersionsRefreshed = LazyTry {
| "tableName" -> "`testcat`.`ns1`.`ns2`.`tbl`", | ||
| "errors" -> | ||
| ("\n- `person` type has changed from STRUCT<name: STRING, age: INT> " + | ||
| "to STRUCT<name: STRING, age: INT, city: STRING>"))) |
the error message will be hard to read with super wide or deeply nested struct types. I think we should perform the check recursively and point to the exact nested fields in the error message.
I agree but it would be error-prone to iterate field by field. I would probably address this in a separate PR as this one is already pretty large and tricky.
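For illustration only, a rough sketch of what a recursive comparison could look like, reporting a dotted path to the exact nested field that changed; this is not code from this PR, and integration with the existing `formatColumn`/error plumbing is omitted:

```scala
import scala.collection.mutable.ArrayBuffer

import org.apache.spark.sql.types.{DataType, StructType}

// Hypothetical sketch (not part of this PR): walk struct types recursively and
// report the dotted path of the exact nested field that changed or disappeared.
object TypeChangeChecker {
  def collectTypeChanges(
      path: String,
      oldType: DataType,
      newType: DataType,
      errors: ArrayBuffer[String]): Unit = (oldType, newType) match {
    case (o: StructType, n: StructType) =>
      val newFields = n.fields.map(f => f.name -> f).toMap
      o.fields.foreach { oldField =>
        val fieldPath = s"$path.${oldField.name}"
        newFields.get(oldField.name) match {
          case Some(newField) =>
            collectTypeChanges(fieldPath, oldField.dataType, newField.dataType, errors)
          case None =>
            errors += s"`$fieldPath` has been deleted"
        }
      }
    case (o, n) if o != n =>
      errors += s"`$path` type has changed from ${o.sql} to ${n.sql}"
    case _ => // types match, nothing to report
  }
}
```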
Force-pushed from 7b1fade to 0036037.
  // refresh table versions before cache lookup
  private val lazyTableVersionsRefreshed = LazyTry {
    if (QueryExecution.lastExecutionId != id || TableRefreshUtil.shouldRefresh(commandExecuted)) {
what does QueryExecution.lastExecutionId != id indicate here?
It triggers a refresh if there were any query executions between this Dataset's analysis and its execution. For instance, we must always refresh if there was an ALTER in between.
The logic here is to always refresh unless the Dataset is created and executed immediately, without any intermediate steps in between.
Does this make sense, @cloud-fan?
It kind of makes sense, but I don't fully agree. I think the chance is low that we need to refresh the tables after a new execution: it may be a scan query, or it may be altering other tables. This hurts perf a lot for a busy cluster serving many short queries at the same time.
I think a simple time-based refresh policy is good enough.
      val freshTable = cache.getOrElseUpdate((catalog, ident), {
        val tableName = V2TableUtil.toQualifiedName(catalog, ident)
        logDebug(s"Refreshing table metadata for $tableName")
        catalog.loadTable(ident)
If any table needs a refresh, we refresh all the tables in the plan. Is that intentional? Shall we record the tables that need to be refreshed?
Yep, this was intentional to be safe in case there are dependent operations that modify multiple tables at the same time. It is safer to always refresh everything (keep in mind this is only for versioned tables where refresh is cheap).
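For context, a stripped-down sketch of the pattern in the snippet above: one cache per refresh pass, keyed by (catalog, identifier), so each table is reloaded at most once even though every versioned table in the plan gets refreshed; the class name and shape are illustrative, not the PR's actual code:

```scala
import scala.collection.mutable

import org.apache.spark.sql.connector.catalog.{Identifier, Table, TableCatalog}

// Hypothetical sketch (not part of this PR): reload each (catalog, ident) pair
// at most once per refresh pass, even if the table appears several times in
// the plan, while still refreshing every versioned table for safety.
class RefreshPass {
  private val cache = mutable.Map.empty[(TableCatalog, Identifier), Table]

  def freshTable(catalog: TableCatalog, ident: Identifier): Table =
    cache.getOrElseUpdate((catalog, ident), catalog.loadTable(ident))
}
```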
Force-pushed from 0036037 to e85e600.
What changes were proposed in this pull request?
This PR fixes refresh of DSv2 tables in Dataset.
Why are the changes needed?
Prior to this change, Spark would pin the version of DSv2 tables at load/resolution time. Any changes made within the session would not be propagated to an analyzed but not yet executed Dataset, breaking the behavior compared to DSv1 tables.
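For illustration, a minimal sequence showing the kind of behavior being fixed (the catalog and table names here are made up for the example):

```scala
// The DSv2 table is resolved and its version pinned when the Dataset is analyzed...
val df = spark.table("testcat.ns.tbl")

// ...the table is then modified within the same session...
spark.sql("INSERT INTO testcat.ns.tbl VALUES (1, 'a')")

// ...and, as with DSv1 tables, the already-analyzed Dataset is expected to see
// the current state of the table when it is finally executed.
df.collect()
```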
Does this PR introduce any user-facing change?
Yes, but this PR makes DSv2 Table behavior match the expected Spark semantics.
How was this patch tested?
This PR comes with tests.
Was this patch authored or co-authored using generative AI tooling?
No.