@viirya
Member

@viirya viirya commented Dec 8, 2025

What changes were proposed in this pull request?

This patch adds a safety check to ObservationManager.tryComplete so that Observation.get cannot block indefinitely.

Why are the changes needed?

We received reports that, in some corner cases, Observation.get blocks forever. Investigation showed that this is not a deadlock. If the CollectMetricsExec operator is optimized away, e.g., when empty relation propagation is applied on top of the part of the plan tree that contains CollectMetricsExec, Spark never fulfills the promise in Observation, and get calls block forever.
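The failure mode and the kind of safety check described here can be illustrated with a small, self-contained sketch. This is not Spark's implementation: the `Observation` class, the `try_complete` function, and the `collected` dictionary below are hypothetical stand-ins written in Python, assuming only that an observation wraps a promise that is normally fulfilled by the metrics-collecting operator.

```python
from concurrent.futures import Future
from typing import Dict, Optional


class Observation:
    """Toy stand-in for an observation backed by a promise (hypothetical, not Spark's class)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._promise: Future = Future()

    def set_metrics(self, metrics: Dict[str, int]) -> bool:
        # Fulfil the promise at most once; later calls are ignored.
        if self._promise.done():
            return False
        self._promise.set_result(dict(metrics))
        return True

    def get(self, timeout: Optional[float] = None) -> Dict[str, int]:
        # Blocks until the promise is fulfilled. If nothing ever fulfils it
        # and no timeout is given, this call never returns.
        return self._promise.result(timeout=timeout)


def try_complete(observation: Observation,
                 collected: Dict[str, Dict[str, int]]) -> bool:
    """Hypothetical completion hook, run once query execution has finished.

    `collected` maps observation names to the metrics reported by operators
    that survived into the executed plan.
    """
    metrics = collected.get(observation.name)
    if metrics is None:
        # Safety check: the collecting operator was optimized away, so no
        # metrics exist. Complete with an empty map instead of leaving the
        # promise pending forever.
        metrics = {}
    return observation.set_metrics(metrics)
```

Without the `metrics is None` branch, an observation whose operator never ran would stay pending, and a `get()` call without a timeout would hang, which matches the reported symptom.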

Does this PR introduce any user-facing change?

Yes. Previously, in some corner cases, an Observation.get call would block forever. After this change, get returns an empty map in those cases.

How was this patch tested?

Unit tests

Was this patch authored or co-authored using generative AI tooling?

No

@viirya viirya marked this pull request as draft December 8, 2025 07:25
@github-actions github-actions bot added the SQL label Dec 8, 2025
@viirya
Member Author

viirya commented Dec 8, 2025

I don't know why, but I cannot run the sbt tests locally, so I want to run the test in the branch-3.5 CI first.

@viirya
Member Author

viirya commented Dec 8, 2025

> I don't know why, but I cannot run the sbt tests locally, so I want to run the test in the branch-3.5 CI first.

Okay. CI confirmed that the issue is also in branch-3.5.

https://github.com/viirya/spark-1/actions/runs/20020082078/job/57405065589

[info] *** Test still running after 5 hours, 16 minutes, 5 seconds: suite name: DatasetSuite, test name: SPARK-54620: Observation should not blocking forever. 

@viirya viirya force-pushed the fix_observation_blocking_3.5 branch from 3ef18f4 to 00dc353 Compare December 8, 2025 23:05
@viirya viirya marked this pull request as ready for review December 8, 2025 23:41