
Conversation

@Abacn
Contributor

@Abacn Abacn commented Nov 6, 2025

Fix long-running tryClaim() blocking a concurrent getProgress call, which leads to runner lease expiry

This fix changes the synchronized {} blocks to a reentrant lock and, in particular, removes synchronized {} from the getProgress call. Instead, getProgress is executed on a best-effort basis (a minimal sketch follows below):

  • if the lock is not currently held, evaluate getProgress directly

  • if the lock is currently held, report the most recently evaluated progress, and

    • evaluate progress when tryClaim or trySplit (the methods most likely to be slow) finishes, and store the result in the cache

internal tracker: b/440435833
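
A minimal sketch of this best-effort strategy (illustrative only, not the actual RestrictionTrackers code; the class and field names are made up, and it assumes Beam's RestrictionTracker.HasProgress and RestrictionTracker.Progress types):

import java.util.concurrent.locks.ReentrantLock;
import org.apache.beam.sdk.transforms.splittabledofn.RestrictionTracker;
import org.apache.beam.sdk.transforms.splittabledofn.RestrictionTracker.HasProgress;
import org.apache.beam.sdk.transforms.splittabledofn.RestrictionTracker.Progress;

// Wraps a tracker so that progress reporting never waits behind a slow tryClaim.
class BestEffortProgressObserver<RestrictionT, PositionT> {
  private final RestrictionTracker<RestrictionT, PositionT> delegate;
  private final ReentrantLock lock = new ReentrantLock();
  private volatile Progress lastProgress; // progress cached the last time it was evaluated

  BestEffortProgressObserver(RestrictionTracker<RestrictionT, PositionT> delegate) {
    this.delegate = delegate;
  }

  boolean tryClaim(PositionT position) {
    lock.lock();
    try {
      boolean claimed = delegate.tryClaim(position);
      // Refresh the cache while still holding the lock, so a concurrent
      // getProgress that cannot acquire the lock reports something recent.
      lastProgress = ((HasProgress) delegate).getProgress();
      return claimed;
    } finally {
      lock.unlock();
    }
  }

  Progress getProgress() {
    if (lock.tryLock()) { // never wait behind a slow tryClaim/trySplit
      try {
        lastProgress = ((HasProgress) delegate).getProgress();
      } finally {
        lock.unlock();
      }
    }
    return lastProgress; // possibly stale if another method holds the lock
  }
}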



Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests
Go tests

See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.

@github-actions github-actions bot added the java label Nov 6, 2025
@gemini-code-assist
Contributor

Summary of Changes

Hello @Abacn, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a concurrency issue within the RestrictionTrackers component, specifically preventing getProgress calls from being blocked by ongoing tryClaim operations. By refactoring the synchronization mechanism to use a ReentrantLock and implementing a non-blocking strategy for progress reporting, the system can now provide timely progress updates without waiting for potentially lengthy internal operations to complete. This change enhances the robustness and responsiveness of the runner, mitigating issues like lease expirations.

Highlights

  • Concurrency Fix: Addressed a critical issue where long-running tryClaim() calls could block getProgress() calls, potentially causing runner lease expirations.
  • Non-Blocking Progress: The getProgress() method in RestrictionTrackerObserverWithProgress is now non-blocking, returning the last known progress immediately if the internal lock is held by another operation.
  • Thread Safety Refactor: Replaced method-level synchronized keywords with an explicit ReentrantLock for finer-grained control over synchronization in tryClaim, currentRestriction, trySplit, and checkDone methods.
  • Progress Evaluation Logic: Introduced volatile flags (pendingProgress, lastProgress) and an evalueteProgress() helper to manage progress updates efficiently and asynchronously.
  • New Test Cases: Added specific unit tests to verify that getProgress() remains non-blocking even when tryClaim() or trySplit() operations are intentionally blocked.
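
A self-contained, hypothetical sketch of that kind of test (toy classes and names; this is not the test code added in this PR): block tryClaim behind a latch and check that a concurrent getProgress still returns promptly.

import static org.junit.Assert.assertNotNull;

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;
import org.junit.Test;

public class NonBlockingProgressSketchTest {

  // Toy tracker: tryClaim holds the lock until released; getProgress uses tryLock.
  static final class ToyTracker {
    private final ReentrantLock lock = new ReentrantLock();
    private volatile double lastProgress = 0.0;

    double blockingTryClaim(CountDownLatch release) throws InterruptedException {
      lock.lock();
      try {
        release.await(); // simulate a slow claim, e.g. a slow reader advance
        lastProgress = 0.5;
        return lastProgress;
      } finally {
        lock.unlock();
      }
    }

    Double getProgress() {
      if (lock.tryLock()) {
        try {
          return lastProgress;
        } finally {
          lock.unlock();
        }
      }
      return lastProgress; // possibly stale, but returned immediately
    }
  }

  @Test
  public void getProgressReturnsWhileTryClaimIsBlocked() throws Exception {
    ToyTracker tracker = new ToyTracker();
    CountDownLatch release = new CountDownLatch(1);
    ExecutorService executor = Executors.newSingleThreadExecutor();
    Future<?> claim = executor.submit(() -> tracker.blockingTryClaim(release));
    Thread.sleep(100); // crude, but lets the claim thread take the lock first
    // If getProgress blocked behind the held lock, this call would hang here.
    assertNotNull(tracker.getProgress());
    release.countDown();
    claim.get(10, TimeUnit.SECONDS);
    executor.shutdownNow();
  }
}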

@Abacn
Contributor Author

Abacn commented Nov 6, 2025

/gemini review

@Abacn
Contributor Author

Abacn commented Nov 6, 2025

Tested with Dataflow job: 2025-11-06_13_29_21-757930991131699684

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request aims to make RestrictionTrackers.getProgress non-blocking to prevent lease expiration issues. The approach of replacing synchronized blocks with ReentrantLock and using tryLock in getProgress is sound and correctly implemented. My review focuses on a couple of improvements: a typo correction and visibility change for a helper method, and fixing a potential data race in the test code to ensure its reliability. Overall, the changes are good and address the stated problem effectively.

@Abacn
Contributor Author

Abacn commented Nov 6, 2025

R: @scwhittle @stankiewicz

@github-actions
Contributor

github-actions bot commented Nov 6, 2025

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment assign set of reviewers

return ((HasProgress) delegate).getProgress();
public Progress getProgress() {
pendingProgress = true;
if (lock.tryLock()) {
Contributor


What about using tryLock with a timeout to wait for up to X seconds? That seems responsive enough and would help make sure contention doesn't prevent observing the most recent progress.

We could also consider logging that we're going to return a stale progress.

Contributor Author

@Abacn Abacn Nov 7, 2025


Added a 1 min timeout. On the Dataflow side, the lease timeout is 10 min.
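
For reference, a sketch of that bounded wait (illustrative, not the exact code in this PR; it assumes the lock, lastProgress, and delegate fields of a wrapper like the one sketched in the PR description, plus java.util.concurrent.TimeUnit):

public Progress getProgress() {
  try {
    // Wait for up to one minute; the Dataflow lease timeout being guarded
    // against is 10 minutes, so this stays well within that budget.
    if (lock.tryLock(60, TimeUnit.SECONDS)) {
      try {
        lastProgress = ((HasProgress) delegate).getProgress();
      } finally {
        lock.unlock();
      }
    }
  } catch (InterruptedException e) {
    Thread.currentThread().interrupt(); // keep the interrupt; fall back to the cached value
  }
  return lastProgress; // stale only if the lock stayed contended for the full minute
}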

@Override
public synchronized Progress getProgress() {
return ((HasProgress) delegate).getProgress();
public Progress getProgress() {
Contributor


If the caller has ordering (like being on the same thread) that guarantees that this is called after a trySplit or tryClaim succeeds, it may be unexpected that this returns a stale progress, and that could lead to weird failures.

Another approach to fixing this would be to make RestrictionTracker.getProgress() quick. In general restriction trackers are lightweight; the ones in Read.java are odd in that they abuse the restriction tracker tryClaim return payloads. I think we could instead change them to not do so beneath RestrictionTracker.tryClaim, as another way to fix this.

Hacky example, needs more thought on trySplit:
#36758

Contributor Author

@Abacn Abacn Nov 7, 2025


Yeah, it would help the Read transform. This change is a bit more generic, though, and applies to all SDFs. We also have other IOs built on SDFs directly.

Contributor


What about the first concern, that the caller may observe stale progress in cases where it is itself calling in order (via being on the same thread or some other happens-before relationship)?
For example, if the split processing thread calls:

  • trySplit
  • getProgress
    then the getProgress response might be stale from before the split. This could possibly confuse the runner. I'm not sure if this is the case or not but I think we'd have to follow the code to make sure this doesn't cause some subtle corruption.

@kennknowles might know

The reason I was suggesting fixing Read itself is that the RestrictionTrackers used by other SDFs don't do expensive work in the RestrictionTracker.tryClaim implementation; it is generally just recording the offset, etc. Requiring that getProgress is responsive doesn't seem too onerous a requirement for a RestrictionTracker if we document it, and then we don't have to worry about these possibly stale responses mixing things up.
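
To illustrate the lightweight-tracker point, a typical offset-based getProgress is just arithmetic over fields the tracker already maintains, so there is nothing slow for a concurrent progress call to wait on (a toy sketch, not Beam's OffsetRangeTracker; it assumes the RestrictionTracker.Progress.from(workCompleted, workRemaining) factory):

import org.apache.beam.sdk.transforms.splittabledofn.RestrictionTracker.Progress;

// Toy offset tracker: cheap, non-blocking progress from in-memory state only.
final class CheapOffsetProgress {
  private final long from;
  private final long to;
  private volatile long lastClaimed; // last successfully claimed offset

  CheapOffsetProgress(long from, long to) {
    this.from = from;
    this.to = to;
    this.lastClaimed = from - 1; // nothing claimed yet
  }

  void recordClaim(long offset) {
    lastClaimed = offset; // called after a successful claim
  }

  Progress getProgress() {
    double completed = lastClaimed - from + 1;
    double remaining = Math.max(0, to - (lastClaimed + 1));
    return Progress.from(completed, remaining); // no I/O, no blocking
  }
}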

Contributor Author

@Abacn Abacn Nov 12, 2025


There were cases where inaccurate progress confused the runner: prior to #26352, BoundedSource as a restriction tracker (the scenario of Dataflow Runner v2 evaluating a BoundedSource) always reported zero progress, and the effect was slowness and the runner not splitting sources, per that PR's description. At least if the stale progress is zero it should be of minimal concern (it has happened before).

If the stale progress is a positive number, this is similar to the long-running DoFn.process we often observe (in the case of Dataflow, there is a straggler alert), but the pipeline can still work as intended. Let me write a pipeline and test this scenario to see whether a stale positive progress would make the runner eagerly issue trySplit or cause other unexpected behavior.

Member


Following the existing code to check for breakage is good, but we also need a super clear spec here. The Progress object has to be associated with a restriction to have meaning. So the caller needs to know what restriction it is relative to. How should the caller be sure which restriction the Progress is for?

Contributor Author


So the caller needs to know what restriction it is relative to. How should the caller be sure which restriction the Progress is for?

The caller stack trace, in the SDK harness, is:

org.apache.beam.fn.harness.control.ProcessBundleHandler#progress
org.apache.beam.fn.harness.control.ProcessBundleHandler#intermediateMonitoringData
org.apache.beam.fn.harness.control.BundleProgressReporter.InMemory#updateIntermediateMonitoringData
org.apache.beam.fn.harness.FnApiDoFnRunner.<anonymous BundleProgressReporter, under case PTransformTranslation.SPLITTABLE_PROCESS_SIZED_ELEMENTS_AND_RESTRICTIONS_URN>.updateIntermediateMonitoringData
org.apache.beam.fn.harness.FnApiDoFnRunner#getProgress

It has a member currentTracker that is a restriction tracker:

((HasProgress) currentTracker).getProgress(), windowCurrentIndex, windowStopIndex);

The caller's restriction tracker is known (it is assigned in processElementForWindowObservingSizedElementAndRestriction). The restriction tracker's getProgress call simply calls the delegate's getProgress:

where delegate in this case is a BoundedSourceAsSDFRestrictionTracker

Its getProgress call invokes currentReader.getFractionConsumed(). I understand that at this point the restriction tracker relies on the underlying reader to know the progress. After this change, when getProgress gets stuck, it reports the last claimed progress, which I consider still a valid value.

Is this information sufficient to answer the concern here?
