
Conversation

@Abacn
Contributor

@Abacn Abacn commented Nov 6, 2025

Fix long-running tryClaim() blocking a concurrent getProgress call, which leads to runner lease expiry

This fix changes the synchronized {} blocks to a reentrant lock and, in particular, removes synchronized {} from the getProgress call. Instead, getProgress is executed on a best-effort basis (a minimal sketch follows below):

  • if the lock is not currently held, evaluate getProgress directly

  • if the lock is currently held, report the most recently evaluated progress, and

    • evaluate progress when tryClaim or trySplit (the methods most likely to be slow) finishes, and store the result in the cache

internal tracker: b/440435833
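
A minimal sketch of this best-effort strategy (illustrative only, not the actual RestrictionTrackers code; the class and field names are made up, and it assumes Beam's RestrictionTracker.HasProgress and RestrictionTracker.Progress types):

import java.util.concurrent.locks.ReentrantLock;
import org.apache.beam.sdk.transforms.splittabledofn.RestrictionTracker;
import org.apache.beam.sdk.transforms.splittabledofn.RestrictionTracker.HasProgress;
import org.apache.beam.sdk.transforms.splittabledofn.RestrictionTracker.Progress;

// Wraps a tracker so that progress reporting never waits behind a slow tryClaim.
class BestEffortProgressObserver<RestrictionT, PositionT> {
  private final RestrictionTracker<RestrictionT, PositionT> delegate;
  private final ReentrantLock lock = new ReentrantLock();
  private volatile Progress lastProgress; // progress cached the last time it was evaluated

  BestEffortProgressObserver(RestrictionTracker<RestrictionT, PositionT> delegate) {
    this.delegate = delegate;
  }

  boolean tryClaim(PositionT position) {
    lock.lock();
    try {
      boolean claimed = delegate.tryClaim(position);
      // Refresh the cache while still holding the lock, so a concurrent
      // getProgress that cannot acquire the lock reports something recent.
      lastProgress = ((HasProgress) delegate).getProgress();
      return claimed;
    } finally {
      lock.unlock();
    }
  }

  Progress getProgress() {
    if (lock.tryLock()) { // never wait behind a slow tryClaim/trySplit
      try {
        lastProgress = ((HasProgress) delegate).getProgress();
      } finally {
        lock.unlock();
      }
    }
    return lastProgress; // possibly stale if another method holds the lock
  }
}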



Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests
Go tests

See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.

@github-actions github-actions bot added the java label Nov 6, 2025
@gemini-code-assist
Contributor

Summary of Changes

Hello @Abacn, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a concurrency issue within the RestrictionTrackers component, specifically preventing getProgress calls from being blocked by ongoing tryClaim operations. By refactoring the synchronization mechanism to use a ReentrantLock and implementing a non-blocking strategy for progress reporting, the system can now provide timely progress updates without waiting for potentially lengthy internal operations to complete. This change enhances the robustness and responsiveness of the runner, mitigating issues like lease expirations.

Highlights

  • Concurrency Fix: Addressed a critical issue where long-running tryClaim() calls could block getProgress() calls, potentially causing runner lease expirations.
  • Non-Blocking Progress: The getProgress() method in RestrictionTrackerObserverWithProgress is now non-blocking, returning the last known progress immediately if the internal lock is held by another operation.
  • Thread Safety Refactor: Replaced method-level synchronized keywords with an explicit ReentrantLock for finer-grained control over synchronization in tryClaim, currentRestriction, trySplit, and checkDone methods.
  • Progress Evaluation Logic: Introduced volatile flags (pendingProgress, lastProgress) and an evalueteProgress() helper to manage progress updates efficiently and asynchronously.
  • New Test Cases: Added specific unit tests to verify that getProgress() remains non-blocking even when tryClaim() or trySplit() operations are intentionally blocked.
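
A self-contained, hypothetical sketch of that kind of test (toy classes and names; this is not the test code added in this PR): block tryClaim behind a latch and check that a concurrent getProgress still returns promptly.

import static org.junit.Assert.assertNotNull;

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;
import org.junit.Test;

public class NonBlockingProgressSketchTest {

  // Toy tracker: tryClaim holds the lock until released; getProgress uses tryLock.
  static final class ToyTracker {
    private final ReentrantLock lock = new ReentrantLock();
    private volatile double lastProgress = 0.0;

    double blockingTryClaim(CountDownLatch release) throws InterruptedException {
      lock.lock();
      try {
        release.await(); // simulate a slow claim, e.g. a slow reader advance
        lastProgress = 0.5;
        return lastProgress;
      } finally {
        lock.unlock();
      }
    }

    Double getProgress() {
      if (lock.tryLock()) {
        try {
          return lastProgress;
        } finally {
          lock.unlock();
        }
      }
      return lastProgress; // possibly stale, but returned immediately
    }
  }

  @Test
  public void getProgressReturnsWhileTryClaimIsBlocked() throws Exception {
    ToyTracker tracker = new ToyTracker();
    CountDownLatch release = new CountDownLatch(1);
    ExecutorService executor = Executors.newSingleThreadExecutor();
    Future<?> claim = executor.submit(() -> tracker.blockingTryClaim(release));
    Thread.sleep(100); // crude, but lets the claim thread take the lock first
    // If getProgress blocked behind the held lock, this call would hang here.
    assertNotNull(tracker.getProgress());
    release.countDown();
    claim.get(10, TimeUnit.SECONDS);
    executor.shutdownNow();
  }
}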

@Abacn
Contributor Author

Abacn commented Nov 6, 2025

/gemini review

@Abacn
Contributor Author

Abacn commented Nov 6, 2025

Tested with Dataflow job: 2025-11-06_13_29_21-757930991131699684

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request aims to make RestrictionTrackers.getProgress non-blocking to prevent lease expiration issues. The approach of replacing synchronized blocks with ReentrantLock and using tryLock in getProgress is sound and correctly implemented. My review focuses on a couple of improvements: a typo correction and visibility change for a helper method, and fixing a potential data race in the test code to ensure its reliability. Overall, the changes are good and address the stated problem effectively.

@Abacn
Contributor Author

Abacn commented Nov 6, 2025

R: @scwhittle @stankiewicz

@github-actions
Contributor

github-actions bot commented Nov 6, 2025

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment assign set of reviewers

return ((HasProgress) delegate).getProgress();
public Progress getProgress() {
pendingProgress = true;
if (lock.tryLock()) {
Contributor


What about using tryLock with a timeout to wait for up to X seconds? That seems responsive enough and would help make sure contention doesn't prevent observing the most recent progress.

We could also consider logging that we're going to return a stale progress.

Contributor Author

@Abacn Abacn Nov 7, 2025


Added a 1 min timeout. On the Dataflow side, the lease timeout is 10 min.
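
For reference, a sketch of that bounded wait (illustrative, not the exact code in this PR; it assumes the lock, lastProgress, and delegate fields of a wrapper like the one sketched in the PR description, plus java.util.concurrent.TimeUnit):

public Progress getProgress() {
  try {
    // Wait for up to one minute; the Dataflow lease timeout being guarded
    // against is 10 minutes, so this stays well within that budget.
    if (lock.tryLock(60, TimeUnit.SECONDS)) {
      try {
        lastProgress = ((HasProgress) delegate).getProgress();
      } finally {
        lock.unlock();
      }
    }
  } catch (InterruptedException e) {
    Thread.currentThread().interrupt(); // keep the interrupt; fall back to the cached value
  }
  return lastProgress; // stale only if the lock stayed contended for the full minute
}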

@Override
public synchronized Progress getProgress() {
return ((HasProgress) delegate).getProgress();
public Progress getProgress() {
Contributor


If the caller has ordering (like being on the same thread) that guarantees that this is called after a trySplit or tryClaim succeeds, it may be unexpected that this returns a stale progress, and that could lead to weird failures.

Another approach to fixing this would be to make RestrictionTracker.getProgress() quick. In general restriction trackers are lightweight; the ones in Read.java are odd in that they abuse the restriction tracker tryClaim return payloads. I think we could instead change them to not do so beneath RestrictionTracker.tryClaim, as another way to fix this.

Hacky example, needs more thought on trySplit:
#36758

Contributor Author

@Abacn Abacn Nov 7, 2025


Yeah, it would help the Read transform. This change is a bit more generic, though, and applies to all SDFs. We also have other IOs built on SDFs directly.

Contributor


What about the first concern, that the caller may observe stale progress in cases where it is itself calling in order (via being on the same thread or some other happens-before relationship)?
For example, if the split processing thread calls:

  • trySplit
  • getProgress
    then the getProgress response might be stale from before the split. This could possibly confuse the runner. I'm not sure if this is the case or not but I think we'd have to follow the code to make sure this doesn't cause some subtle corruption.

@kennknowles might know

The reason I was suggesting fixing Read itself is that the RestrictionTrackers used by other SDFs don't do expensive work in the RestrictionTracker.tryClaim implementation; it is generally just recording the offset, etc. Requiring that getProgress is responsive doesn't seem too onerous a requirement for a RestrictionTracker if we document it, and then we don't have to worry about these possibly stale responses mixing things up.
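
To illustrate the lightweight-tracker point, a typical offset-based getProgress is just arithmetic over fields the tracker already maintains, so there is nothing slow for a concurrent progress call to wait on (a toy sketch, not Beam's OffsetRangeTracker; it assumes the RestrictionTracker.Progress.from(workCompleted, workRemaining) factory):

import org.apache.beam.sdk.transforms.splittabledofn.RestrictionTracker.Progress;

// Toy offset tracker: cheap, non-blocking progress from in-memory state only.
final class CheapOffsetProgress {
  private final long from;
  private final long to;
  private volatile long lastClaimed; // last successfully claimed offset

  CheapOffsetProgress(long from, long to) {
    this.from = from;
    this.to = to;
    this.lastClaimed = from - 1; // nothing claimed yet
  }

  void recordClaim(long offset) {
    lastClaimed = offset; // called after a successful claim
  }

  Progress getProgress() {
    double completed = lastClaimed - from + 1;
    double remaining = Math.max(0, to - (lastClaimed + 1));
    return Progress.from(completed, remaining); // no I/O, no blocking
  }
}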

Contributor Author

@Abacn Abacn Nov 12, 2025


There were cases where inaccurate progress confused the runner: prior to #26352, BoundedSource as a restriction tracker (the scenario of Dataflow Runner v2 evaluating a BoundedSource) always reported zero progress, and the effect was slowness and the runner not splitting sources, per that PR's description. At least if the stale progress is zero it should be of minimal concern (it has happened before).

If the stale progress is a positive number, this is similar to the long-running DoFn.process we often observe (in the case of Dataflow, there is a straggler alert), but the pipeline can still work as intended. Let me write a pipeline and test this scenario to see whether a stale positive progress would make the runner eagerly issue trySplit or cause other unexpected behavior.

Member


Following the existing code to check for breakage is good, but we also need a super clear spec here. The Progress object has to be associated with a restriction to have meaning. So the caller needs to know what restriction it is relative to. How should the caller be sure which restriction the Progress is for?

Contributor Author


So the caller needs to know what restriction it is relative to. How should the caller be sure which restriction the Progress is for?

The caller stack trace, in the SDK harness, is:

org.apache.beam.fn.harness.control.ProcessBundleHandler#progress
org.apache.beam.fn.harness.control.ProcessBundleHandler#intermediateMonitoringData
org.apache.beam.fn.harness.control.BundleProgressReporter.InMemory#updateIntermediateMonitoringData
org.apache.beam.fn.harness.FnApiDoFnRunner.<anonymous BundleProgressReporter, under case PTransformTranslation.SPLITTABLE_PROCESS_SIZED_ELEMENTS_AND_RESTRICTIONS_URN>.updateIntermediateMonitoringData
org.apache.beam.fn.harness.FnApiDoFnRunner#getProgress

It has a member currentTracker that is a restriction tracker:

((HasProgress) currentTracker).getProgress(), windowCurrentIndex, windowStopIndex);

The caller's restriction tracker is known (it is assigned in processElementForWindowObservingSizedElementAndRestriction). The restriction tracker's getProgress call simply calls the delegate's getProgress:

where delegate in this case is a BoundedSourceAsSDFRestrictionTracker

Its getProgress call invokes currentReader.getFractionConsumed(). I understand that at this point the restriction tracker relies on the underlying reader to know the progress. After this change, when getProgress gets stuck, it reports the last claimed progress, which I consider still a valid value.

Is this information sufficient to answer the concern here?
