Enhanced Off-Policy Async Rollout with Staleness Control and Partial Rollout Support #1781
Open
huang3eng wants to merge 8 commits into THUDM:main from
added 6 commits
March 19, 2026 23:48
…ndow-evict policies
Contributor
Author
@zhuzilin @Zhuohao-Li Please help review it~
added 2 commits
March 30, 2026 16:24
Contributor
Can we break this PR into smaller ones? For example, staleness version control and lifecycle hooks can be separate PRs.
Contributor
Author
@Zhuohao-Li For sure
Background
The existing off-policy modes in slime (`one_step_off` and `fully_async`) have significant limitations:
- `train_async.py` did not have `before_weight_update` / `after_weight_update` hooks to coordinate rollout and training weight synchronization

What's New
This PR introduces two new buffer policies for fully async rollout with comprehensive staleness control and partial rollout support:
1. `legacy_backpressure` (staleness_partial mode)
Inspired by VERL's fully async implementation, this policy:

2. `window_evict` (window_partial mode)
Inspired by MiniMax Forge's sliding-version-window eviction, this policy:
`[current_version - W, current_version]`

Feature Comparison
`legacy_backpressure` | `window_evict`

Partial Rollout & Off-Policy Masking
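A minimal sketch of off-policy masking for a partial rollout, assuming each response token records the policy version that generated it and that a token counts as off-policy when that version is older than the one being trained. The names `build_loss_mask` and `masked_mean_loss` are hypothetical and not taken from slime; they only illustrate the effect `--mask-offpolicy-in-partial-rollout` is described as having on the loss.

```python
from typing import List


def build_loss_mask(token_versions: List[int], current_version: int,
                    mask_offpolicy: bool) -> List[int]:
    """Return a 0/1 mask over response tokens.

    Assumption (not confirmed by the PR text): a token is off-policy when it
    was generated by a policy version older than `current_version`.
    """
    if not mask_offpolicy:
        # Masking disabled: every token contributes to the loss.
        return [1] * len(token_versions)
    return [1 if v >= current_version else 0 for v in token_versions]


def masked_mean_loss(per_token_loss: List[float], mask: List[int]) -> float:
    """Average the per-token loss over the tokens that the mask keeps."""
    kept = sum(mask)
    if kept == 0:
        return 0.0
    return sum(l * m for l, m in zip(per_token_loss, mask)) / kept


# A partial rollout resumed across one weight update:
# three tokens generated under version 4, two under version 5.
token_versions = [4, 4, 4, 5, 5]
mask = build_loss_mask(token_versions, current_version=5, mask_offpolicy=True)
loss = masked_mean_loss([0.5, 0.5, 0.5, 1.0, 3.0], mask)
```

In this toy case only the two tokens produced under the current version contribute to the averaged loss.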
When `--partial-rollout` is enabled:
- `--mask-offpolicy-in-partial-rollout`: off-policy tokens are masked during training loss computation

Lifecycle Hooks
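A minimal sketch of how a pair of `before_weight_update` / `after_weight_update` hooks might bracket a weight sync. The `AsyncRolloutWorker` class, its pause/resume behavior, and the `update_weights_with_hooks` helper are assumptions made for illustration; they are not slime's actual implementation.

```python
import threading
from typing import Callable


class AsyncRolloutWorker:
    """Hypothetical stand-in for an async rollout worker."""

    def __init__(self) -> None:
        self.current_version = 0
        self._can_generate = threading.Event()
        self._can_generate.set()

    def before_weight_update(self) -> None:
        # Assumed behavior: stop dispatching new generation requests
        # while the weights are being replaced.
        self._can_generate.clear()

    def after_weight_update(self) -> None:
        # Assumed behavior: advance the policy version, then resume.
        self.current_version += 1
        self._can_generate.set()

    def is_generating(self) -> bool:
        return self._can_generate.is_set()


def update_weights_with_hooks(worker: AsyncRolloutWorker,
                              sync_weights: Callable[[], None]) -> None:
    """Run one weight sync bracketed by the two hooks."""
    worker.before_weight_update()
    sync_weights()
    worker.after_weight_update()


worker = AsyncRolloutWorker()
observed = []
# Record whether generation is active while the (fake) sync is running.
update_weights_with_hooks(worker, lambda: observed.append(worker.is_generating()))
```

An event flag is used here only because it is the simplest way to show a pause and resume; a real worker would also need to decide what happens to requests already in flight.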
Added `before_weight_update` / `after_weight_update` hooks to `train_async.py` and `RolloutManager`, enabling the async worker to:

New CLI Arguments
`--staleness-threshold`
`--fully-async-buffer-policy` | `legacy_backpressure` | `legacy_backpressure` or `window_evict`.
`--fully-async-version-window` | `window_evict`.
`--fully-async-max-completed-samples`
`--fully-async-eviction-policy` | `drop_oldest_version` | `window_evict`.
`--fully-async-debug-version-tracking`

Benchmark Results
📊 Experiment Dashboard: wandb.ai/huang3eng-alibaba/slime-async-release
The benchmark script (`run-qwen3.5-4b-off-policy-benchmark.sh`) demonstrates:
- `staleness_partial` and `window_partial` modes show performance parity with the base `fully_async` mode while providing staleness/version control

Quick Start
Wandb Metrics
When enabled, the following metric groups are logged under a dedicated `fully_async/step` axis:
- `fully_async/count/*`: stale samples processed, consumed, recycled, dropped
- `fully_async/partial/*`: partial rollout ratio and max version span
- `fully_async/window/*`: completed store size, eligible samples, eviction counts

Files Changed
- `examples/fully_async/fully_async_rollout.py`: Core async worker implementation with buffer policies
- `examples/fully_async/run-qwen3.5-4b-off-policy-benchmark.sh`: Multi-mode benchmark script
- `examples/fully_async/README.md`: Comprehensive documentation
- `train_async.py`: Added lifecycle hooks integration
- `slime/ray/rollout.py`: Added `before_weight_update` / `after_weight_update` hook forwarding in `RolloutManager`
- `tests/test_fully_async_rollout.py`: Unit tests for staleness control and version tracking
- `tests/test_rollout_manager_fully_async_metrics.py`: Tests for metrics logging

This PR provides a flexible foundation for off-policy RL training with proper staleness control, enabling users to choose between backpressure-style control (VERL-inspired) and window-eviction-style control (Forge-inspired) based on their specific requirements.
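The sliding version window `[current_version - W, current_version]` mentioned for `window_evict` can be illustrated with a short sketch. It assumes each completed sample carries the policy version it was generated under; the `Sample` type and `split_by_window` helper are hypothetical names, and the real eviction logic in `fully_async_rollout.py` may differ.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Sample:
    sample_id: int
    version: int  # policy version the sample was generated under (assumed field)


def split_by_window(samples: List[Sample], current_version: int,
                    window: int) -> Tuple[List[Sample], List[Sample]]:
    """Partition samples into (eligible, evicted) for a window of size W.

    Eligible samples have a version in [current_version - W, current_version];
    samples older than the lower bound are treated as evicted.
    """
    low = current_version - window
    eligible = [s for s in samples if low <= s.version <= current_version]
    evicted = [s for s in samples if s.version < low]
    return eligible, evicted


store = [Sample(0, 2), Sample(1, 5), Sample(2, 7), Sample(3, 8)]
eligible, evicted = split_by_window(store, current_version=8, window=2)
```

With `current_version=8` and `window=2`, the lower bound is version 6, so the samples from versions 7 and 8 stay eligible and the two older ones fall out of the window.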