Enhanced Off-Policy Async Rollout with Staleness Control and Partial Rollout Support by huang3eng · Pull Request #1781 · THUDM/slime

huang3eng · 2026-03-30T08:15:48Z

Background

The existing off-policy modes in slime (one_step_off and fully_async) have significant limitations:

No staleness/version control: Neither mode tracks policy versions or controls how stale samples can be before they become unusable
No partial rollout support: Both modes drain in-flight rollout tasks to completion before weight updates, making partial rollout recycling impossible
Lack of lifecycle hooks: The original train_async.py did not have before_weight_update / after_weight_update hooks to coordinate rollout and training weight synchronization

What's New

This PR introduces two new buffer policies for fully async rollout with comprehensive staleness control and partial rollout support:

1. `legacy_backpressure` (staleness_partial mode)

Inspired by VERL's fully async implementation, this policy:

Pauses scheduling when the number of stale samples reaches a configurable budget:

budget = rollout_batch_size × update_weights_interval × (1 + staleness_threshold)

Resumes after the trainer consumes enough samples to bring the stale count below the budget
Adds policy-version tracking and stale-sample accounting
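
To make the budget formula above concrete, here is a minimal sketch of a backpressure gate. The class `BackpressureGate` and its method names are hypothetical, and the way stale samples are counted is an assumption; only the formula and the pause/resume rule come from the description.

```python
from dataclasses import dataclass


@dataclass
class BackpressureGate:
    """Hypothetical helper (not from the PR): decides whether scheduling may proceed."""

    rollout_batch_size: int
    update_weights_interval: int
    staleness_threshold: float
    stale_samples: int = 0  # assumption: samples produced but not yet consumed

    @property
    def budget(self) -> int:
        # budget = rollout_batch_size * update_weights_interval * (1 + staleness_threshold)
        return int(
            self.rollout_batch_size
            * self.update_weights_interval
            * (1 + self.staleness_threshold)
        )

    def can_schedule(self) -> bool:
        # Scheduling pauses once the stale count reaches the budget.
        return self.stale_samples < self.budget

    def on_produced(self, n: int) -> None:
        self.stale_samples += n

    def on_consumed(self, n: int) -> None:
        # Trainer consumption lowers the stale count, which may resume scheduling.
        self.stale_samples = max(0, self.stale_samples - n)


gate = BackpressureGate(rollout_batch_size=32, update_weights_interval=2, staleness_threshold=0.5)
print(gate.budget)  # 32 * 2 * (1 + 0.5) = 96
```

With these example values, scheduling would pause once 96 stale samples are outstanding and resume as soon as the trainer consumes at least one of them.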

2. `window_evict` (window_partial mode)

Inspired by MiniMax Forge's sliding-version-window eviction, this policy:

Keeps rollout scheduling always active — never pauses
Actively evicts completed samples whose policy version falls outside a sliding window [current_version - W, current_version]
Trades sample efficiency for higher GPU utilization and stricter version lag control
Bounds per-trajectory version span to ≤ W+1 versions under partial rollout
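
The eviction rule can be sketched as below. This is only an illustration of the window check [current_version - W, current_version]; `CompletedSample` and `evict_out_of_window` are hypothetical names, and tagging each sample with a single policy version is an assumption.

```python
from dataclasses import dataclass


@dataclass
class CompletedSample:
    sample_id: int
    policy_version: int  # assumption: one version tag per completed sample


def evict_out_of_window(samples, current_version, window):
    """Split samples into (kept, evicted) by the window [current_version - window, current_version]."""
    lo = current_version - window
    kept, evicted = [], []
    for s in samples:
        if lo <= s.policy_version <= current_version:
            kept.append(s)
        else:
            evicted.append(s)
    return kept, evicted


store = [CompletedSample(i, v) for i, v in enumerate([3, 4, 5, 5])]
kept, evicted = evict_out_of_window(store, current_version=5, window=1)
print([s.sample_id for s in kept], [s.sample_id for s in evicted])  # [1, 2, 3] [0]
```

With W = 1 and the current version at 5, only samples tagged with version 4 or 5 remain eligible; the version-3 sample is dropped.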

Feature Comparison

| Feature | `legacy_backpressure` | `window_evict` |
| --- | --- | --- |
| Scheduling | Pauses when stale budget reached | Never pauses, always scheduling |
| Sample Eviction | No eviction | Actively evicts out-of-window samples |
| GPU Utilization | May have idle periods | Always high utilization |
| Version Lag Control | Soft control (backlog ratio) | Hard control (window width W) |
| Partial Rollout Span | May span many versions | Bounded to ≤ W+1 versions |

Partial Rollout & Off-Policy Masking

When --partial-rollout is enabled:

In-flight rollout tasks are aborted before each weight update (rather than drained to completion)
Partially generated samples are returned to the data buffer and re-scheduled under the new policy
Combined with --mask-offpolicy-in-partial-rollout, off-policy tokens are masked during training loss computation
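
One way to picture the masking step is sketched below. It assumes each token carries the policy version that generated it and that "off-policy" means "generated under a version other than the current one"; the function name `offpolicy_loss_mask` and this data layout are hypothetical, not the PR's actual implementation.

```python
def offpolicy_loss_mask(token_versions, loss_mask, current_version):
    """Zero out loss-mask entries for tokens generated under an older policy version.

    token_versions: per-token policy version (assumed layout)
    loss_mask:      existing 0/1 mask (e.g. 0 for prompt or padding tokens)
    """
    if len(token_versions) != len(loss_mask):
        raise ValueError("token_versions and loss_mask must have the same length")
    return [
        m if v == current_version else 0
        for v, m in zip(token_versions, loss_mask)
    ]


# A sample that was aborted at version 4 and resumed at version 5.
token_versions = [4, 4, 5, 5, 5]
loss_mask = [1, 1, 1, 1, 0]
print(offpolicy_loss_mask(token_versions, loss_mask, current_version=5))  # [0, 0, 1, 1, 0]
```

Only the tokens produced after the sample was re-scheduled under the new policy contribute to the loss; entries that were already masked stay masked.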

Lifecycle Hooks

Added before_weight_update / after_weight_update hooks to train_async.py and RolloutManager, enabling the async worker to:

Pause scheduling and drain/abort in-flight tasks before weights change
Update internal policy version, evict out-of-window samples, and resume after weights are synced
Report per-interval staleness and eviction metrics to wandb
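
The hook sequence described above might look roughly like the following. The hook names match the PR, but the class `AsyncWorkerSketch`, its fields, the hook signatures, and the returned metrics dict are assumptions made for illustration (the real worker reports metrics to wandb).

```python
class AsyncWorkerSketch:
    """Hypothetical stand-in for the async rollout worker."""

    def __init__(self, window):
        self.policy_version = 0
        self.window = window
        self.scheduling_paused = False
        self.in_flight = []   # ids of samples currently being generated
        self.buffer = []      # aborted partial samples waiting to be re-scheduled
        self.completed = []   # (sample_id, policy_version) pairs

    def before_weight_update(self, partial_rollout=True):
        # Pause scheduling, then abort or drain whatever is still in flight.
        self.scheduling_paused = True
        if partial_rollout:
            self.buffer.extend(self.in_flight)  # abort: recycle partial samples
        else:
            # drain: treat in-flight samples as completed under the current version
            self.completed.extend((sid, self.policy_version) for sid in self.in_flight)
        self.in_flight.clear()

    def after_weight_update(self):
        # Bump the version, evict out-of-window samples, then resume scheduling.
        self.policy_version += 1
        lo = self.policy_version - self.window
        kept = [(sid, v) for sid, v in self.completed if v >= lo]
        evicted = len(self.completed) - len(kept)
        self.completed = kept
        self.scheduling_paused = False
        return {"policy_version": self.policy_version, "evicted": evicted}


worker = AsyncWorkerSketch(window=1)
worker.completed = [(0, 0)]
worker.in_flight = [1, 2]
worker.before_weight_update(partial_rollout=True)
print(worker.after_weight_update())  # {'policy_version': 1, 'evicted': 0}
```

In this sketch the training loop would call `before_weight_update` immediately before syncing weights and `after_weight_update` immediately after, so the worker never schedules new tasks against weights that are mid-update.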

New CLI Arguments

| Argument | Type | Default | Description |
| --- | --- | --- | --- |
| `--staleness-threshold` | float | None | Max stale backlog ratio. Enables backpressure when set. |
| `--fully-async-buffer-policy` | str | `legacy_backpressure` | Buffer policy: `legacy_backpressure` or `window_evict`. |
| `--fully-async-version-window` | int | 1 | Policy-version window width for `window_evict`. |
| `--fully-async-max-completed-samples` | int | auto | Hard cap on completed samples in memory. |
| `--fully-async-eviction-policy` | str | `drop_oldest_version` | Overflow eviction strategy for `window_evict`. |
| `--fully-async-debug-version-tracking` | flag | False | Print per-batch version summaries for debugging. |
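
For readers who want to see how these flags parse, the following is a standalone `argparse` sketch built only from the table above. The function `add_fully_async_arguments` is hypothetical and is not how slime registers its arguments; using `None` to stand in for the "auto" default is also an assumption.

```python
import argparse


def add_fully_async_arguments(parser):
    """Register the flags from the table on a plain argparse parser (illustrative only)."""
    parser.add_argument("--staleness-threshold", type=float, default=None)
    parser.add_argument(
        "--fully-async-buffer-policy",
        type=str,
        default="legacy_backpressure",
        choices=["legacy_backpressure", "window_evict"],
    )
    parser.add_argument("--fully-async-version-window", type=int, default=1)
    # Assumption: None stands in for the "auto" default listed in the table.
    parser.add_argument("--fully-async-max-completed-samples", type=int, default=None)
    parser.add_argument("--fully-async-eviction-policy", type=str, default="drop_oldest_version")
    parser.add_argument("--fully-async-debug-version-tracking", action="store_true")
    return parser


args = add_fully_async_arguments(argparse.ArgumentParser()).parse_args(
    ["--fully-async-buffer-policy", "window_evict", "--fully-async-version-window", "2"]
)
print(args.fully_async_buffer_policy, args.fully_async_version_window)  # window_evict 2
```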

Benchmark Results

📊 Experiment Dashboard: wandb.ai/huang3eng-alibaba/slime-async-release

The benchmark script (run-qwen3.5-4b-off-policy-benchmark.sh) demonstrates:

A throughput improvement of roughly 70% or more for the async modes compared to the sync baseline

Consistent reward convergence — async training matches sync training quality

Tests for staleness_partial and window_partial modes show performance parity with the base fully_async mode while providing staleness/version control

Quick Start

# One-step off-policy async baseline (default rollout, no fully async worker)
MODE=one_step_off      bash examples/fully_async/run-qwen3.5-4b-off-policy-benchmark.sh

# Fully async, no staleness control
MODE=fully_async       bash examples/fully_async/run-qwen3.5-4b-off-policy-benchmark.sh

# Fully async + staleness backpressure + partial rollout
MODE=staleness_partial bash examples/fully_async/run-qwen3.5-4b-off-policy-benchmark.sh

# Fully async + version-window eviction + partial rollout
MODE=window_partial    bash examples/fully_async/run-qwen3.5-4b-off-policy-benchmark.sh

Wandb Metrics

When enabled, the following metric groups are logged under a dedicated fully_async/step axis:

fully_async/count/*: stale samples processed, consumed, recycled, dropped
fully_async/partial/*: partial rollout ratio and max version span
fully_async/window/*: completed store size, eligible samples, eviction counts

Files Changed

examples/fully_async/fully_async_rollout.py: Core async worker implementation with buffer policies
examples/fully_async/run-qwen3.5-4b-off-policy-benchmark.sh: Multi-mode benchmark script
examples/fully_async/README.md: Comprehensive documentation
train_async.py: Added lifecycle hooks integration
slime/ray/rollout.py: Added before_weight_update/after_weight_update hook forwarding in RolloutManager
tests/test_fully_async_rollout.py: Unit tests for staleness control and version tracking
tests/test_rollout_manager_fully_async_metrics.py: Tests for metrics logging

This PR provides a flexible foundation for off-policy RL training with proper staleness control, enabling users to choose between backpressure-style control (VERL-inspired) and window-eviction-style control (Forge-inspired) based on their specific requirements.


huang3eng · 2026-03-30T08:20:24Z

@zhuzilin @Zhuohao-Li Please help review it～


Zhuohao-Li · 2026-03-31T22:20:26Z

Can we break this PR into smaller ones? E.g., staleness version control and lifecycle hooks can be separate PRs.
And please draft a detailed plan in an issue for discussion with @zhuzilin before requesting a large PR :)

huang3eng · 2026-04-01T05:48:27Z

@Zhuohao-Li For sure

benyi added 6 commits March 19, 2026 23:48

Support qwen3.5 loss mask for multi-turn SFT

b4e7661

style: format qwen3.5 loss mask code

843a5ba

Merge branch 'main' of https://github.com/THUDM/slime

70ab505

Merge branch 'main' of https://github.com/THUDM/slime

8d158d2

Merge branch 'main' of https://github.com/THUDM/slime

0128ad2

feat: add off-policy async rollout with staleness backpressure and window-evict policies

a83862c

benyi added 2 commits March 30, 2026 16:24

Merge branch 'main' of https://github.com/THUDM/slime into feat/async-off-policy

6643e3a

style: format code

f3e2562

huang3eng wants to merge 8 commits into THUDM:main from huang3eng:feat/async-off-policy
