Enhanced Off-Policy Async Rollout with Staleness Control and Partial Rollout Support #1781

Open
huang3eng wants to merge 8 commits into THUDM:main from huang3eng:feat/async-off-policy

@huang3eng (Contributor) commented Mar 30, 2026

Background

The existing off-policy modes in slime (one_step_off and fully_async) have significant limitations:

  • No staleness/version control: Neither mode tracks policy versions or controls how stale samples can be before they become unusable
  • No partial rollout support: Both modes drain in-flight rollout tasks to completion before weight updates, making partial rollout recycling impossible
  • Lack of lifecycle hooks: The original train_async.py did not have before_weight_update / after_weight_update hooks to coordinate rollout and training weight synchronization

What's New

This PR introduces two new buffer policies for fully async rollout with comprehensive staleness control and partial rollout support:

1. legacy_backpressure (staleness_partial mode)

Inspired by VERL's fully async implementation, this policy:

  • Pauses scheduling when the number of stale samples reaches a configurable budget:
    budget = rollout_batch_size × update_weights_interval × (1 + staleness_threshold)
    
  • Resumes after the trainer consumes enough samples to bring the stale count below the budget
  • Adds policy-version tracking and stale-sample accounting
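The budget formula and the pause/resume rule above can be illustrated with a small standalone sketch. `BackpressureGate` and its method names are hypothetical and not taken from the PR. The sketch also assumes that the "stale count" is simply the number of samples scheduled but not yet consumed by the trainer; the PR's actual accounting may differ.

```python
from dataclasses import dataclass


@dataclass
class BackpressureGate:
    """Hypothetical helper that decides whether rollout scheduling may continue."""

    rollout_batch_size: int
    update_weights_interval: int
    staleness_threshold: float
    stale_samples: int = 0  # assumption: scheduled but not yet consumed by the trainer

    @property
    def budget(self) -> int:
        # budget = rollout_batch_size * update_weights_interval * (1 + staleness_threshold)
        return int(
            self.rollout_batch_size
            * self.update_weights_interval
            * (1 + self.staleness_threshold)
        )

    def can_schedule(self) -> bool:
        # Scheduling pauses once the stale count reaches the budget.
        return self.stale_samples < self.budget

    def on_scheduled(self, n: int = 1) -> None:
        self.stale_samples += n

    def on_consumed(self, n: int) -> None:
        # The trainer consuming samples brings the count back under the budget.
        self.stale_samples = max(0, self.stale_samples - n)


gate = BackpressureGate(rollout_batch_size=32, update_weights_interval=2, staleness_threshold=0.5)
print(gate.budget)          # 96
gate.on_scheduled(96)
print(gate.can_schedule())  # False
gate.on_consumed(32)
print(gate.can_schedule())  # True
```

With these example values the budget is 32 × 2 × 1.5 = 96 samples, so scheduling stops at 96 outstanding samples and resumes after the trainer consumes one batch.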

2. window_evict (window_partial mode)

Inspired by MiniMax Forge's sliding-version-window eviction, this policy:

  • Keeps rollout scheduling always active — never pauses
  • Actively evicts completed samples whose policy version falls outside a sliding window [current_version - W, current_version]
  • Trades sample efficiency for higher GPU utilization and stricter version lag control
  • Bounds per-trajectory version span to ≤ W+1 versions under partial rollout
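As a rough illustration of the sliding-window rule, the sketch below splits completed samples into kept and evicted sets using the window [current_version - W, current_version]. `CompletedSample` and `evict_out_of_window` are hypothetical names. The sketch also assumes each sample carries a single policy version, which sidesteps how the PR assigns a version to a trajectory that spans several versions.

```python
from dataclasses import dataclass


@dataclass
class CompletedSample:
    sample_id: int
    policy_version: int  # assumption: one version per completed sample


def evict_out_of_window(samples, current_version, window):
    """Return (kept, evicted) for the window [current_version - window, current_version]."""
    lowest_allowed = current_version - window
    kept, evicted = [], []
    for sample in samples:
        in_window = lowest_allowed <= sample.policy_version <= current_version
        (kept if in_window else evicted).append(sample)
    return kept, evicted


store = [CompletedSample(i, v) for i, v in enumerate([3, 4, 5, 5])]
kept, evicted = evict_out_of_window(store, current_version=5, window=1)
print([s.policy_version for s in kept])     # [4, 5, 5]
print([s.policy_version for s in evicted])  # [3]
```

With W = 1 and a current version of 5, only samples from versions 4 and 5 survive, which is the hard lag bound this policy is meant to provide.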

Feature Comparison

| Feature | legacy_backpressure | window_evict |
| --- | --- | --- |
| Scheduling | Pauses when stale budget reached | Never pauses, always scheduling |
| Sample Eviction | No eviction | Actively evicts out-of-window samples |
| GPU Utilization | May have idle periods | Always high utilization |
| Version Lag Control | Soft control (backlog ratio) | Hard control (window width W) |
| Partial Rollout Span | May span many versions | Bounded to ≤ W+1 versions |

Partial Rollout & Off-Policy Masking

When --partial-rollout is enabled:

  • In-flight rollout tasks are aborted before each weight update (rather than drained to completion)
  • Partially generated samples are returned to the data buffer and re-scheduled under the new policy
  • Combined with --mask-offpolicy-in-partial-rollout, off-policy tokens are masked during training loss computation
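One way to picture off-policy masking is as a per-token loss mask. The sketch below is not the PR's implementation. It assumes the rollout records, for each generated token, the policy version that produced it, and it treats any token from a version other than the one being trained as off-policy. The function names are hypothetical.

```python
def offpolicy_loss_mask(token_versions, train_version):
    """Return 1.0 for tokens generated by train_version, 0.0 for all others."""
    return [1.0 if v == train_version else 0.0 for v in token_versions]


def masked_mean(per_token_loss, mask):
    """Average the loss over unmasked tokens only."""
    kept = sum(mask)
    if kept == 0:
        return 0.0  # every token was off-policy; nothing contributes
    return sum(l * m for l, m in zip(per_token_loss, mask)) / kept


# A sample whose first two tokens were generated before a weight update
# (version 6) and whose last two were generated after it (version 7).
token_versions = [6, 6, 7, 7]
per_token_loss = [0.5, 0.5, 0.25, 0.75]

mask = offpolicy_loss_mask(token_versions, train_version=7)
print(mask)                               # [0.0, 0.0, 1.0, 1.0]
print(masked_mean(per_token_loss, mask))  # 0.5
```

Only the tokens produced after the resume contribute to the loss in this sketch; the recycled prefix is still part of the context but carries no gradient signal.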

Lifecycle Hooks

Added before_weight_update / after_weight_update hooks to train_async.py and RolloutManager, enabling the async worker to:

  1. Pause scheduling and drain/abort in-flight tasks before weights change
  2. Update internal policy version, evict out-of-window samples, and resume after weights are synced
  3. Report per-interval staleness and eviction metrics to wandb
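The three steps above can be sketched as a toy worker. Only the hook names `before_weight_update` and `after_weight_update` come from this PR; the class, its fields, and the simplified drain/abort behavior are assumptions made for illustration, and metrics are collected in a list rather than sent to wandb.

```python
class AsyncWorkerSketch:
    """Toy stand-in for an async rollout worker (hypothetical, not the PR's class)."""

    def __init__(self, window):
        self.window = window
        self.policy_version = 0
        self.scheduling_paused = False
        self.in_flight = []   # ids of tasks still generating
        self.recycled = []    # partial samples returned to the data buffer
        self.completed = []   # (sample_id, policy_version) pairs
        self.metrics = []     # one dict per weight-update interval

    def before_weight_update(self, partial_rollout):
        # Step 1: pause scheduling, then abort or drain in-flight tasks.
        self.scheduling_paused = True
        if partial_rollout:
            self.recycled.extend(self.in_flight)
        else:
            # Draining is modeled as every task finishing under the current version.
            self.completed.extend((task, self.policy_version) for task in self.in_flight)
        self.in_flight.clear()

    def after_weight_update(self):
        # Step 2: bump the version, evict out-of-window samples, resume.
        self.policy_version += 1
        lowest_allowed = self.policy_version - self.window
        before = len(self.completed)
        self.completed = [(s, v) for s, v in self.completed if v >= lowest_allowed]
        # Step 3: record per-interval metrics.
        self.metrics.append(
            {"policy_version": self.policy_version, "evicted": before - len(self.completed)}
        )
        self.scheduling_paused = False


worker = AsyncWorkerSketch(window=1)
worker.completed = [("a", 0)]
worker.in_flight = ["b"]
worker.before_weight_update(partial_rollout=True)
worker.after_weight_update()
print(worker.recycled, worker.metrics[-1])
```

Keeping eviction inside `after_weight_update` means the window is always evaluated against the new version, so a sample is never judged against weights that are about to be replaced.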

New CLI Arguments

| Argument | Type | Default | Description |
| --- | --- | --- | --- |
| --staleness-threshold | float | None | Max stale backlog ratio. Enables backpressure when set. |
| --fully-async-buffer-policy | str | legacy_backpressure | Buffer policy: legacy_backpressure or window_evict. |
| --fully-async-version-window | int | 1 | Policy-version window width for window_evict. |
| --fully-async-max-completed-samples | int | auto | Hard cap on completed samples in memory. |
| --fully-async-eviction-policy | str | drop_oldest_version | Overflow eviction strategy for window_evict. |
| --fully-async-debug-version-tracking | flag | False | Print per-batch version summaries for debugging. |
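For readers who want to see how these flags fit together, here is a standalone `argparse` sketch that mirrors the table. It is not how slime registers its arguments; in particular, using `None` to stand in for the `auto` default of `--fully-async-max-completed-samples` is an assumption of this sketch.

```python
import argparse


def build_parser():
    """Standalone parser mirroring the argument table (illustrative only)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--staleness-threshold", type=float, default=None)
    parser.add_argument(
        "--fully-async-buffer-policy",
        type=str,
        default="legacy_backpressure",
        choices=["legacy_backpressure", "window_evict"],
    )
    parser.add_argument("--fully-async-version-window", type=int, default=1)
    # Assumption: None represents the "auto" default from the table.
    parser.add_argument("--fully-async-max-completed-samples", type=int, default=None)
    parser.add_argument(
        "--fully-async-eviction-policy", type=str, default="drop_oldest_version"
    )
    parser.add_argument("--fully-async-debug-version-tracking", action="store_true")
    return parser


args = build_parser().parse_args(
    ["--fully-async-buffer-policy", "window_evict", "--fully-async-version-window", "2"]
)
print(args.fully_async_buffer_policy)   # window_evict
print(args.fully_async_version_window)  # 2
print(args.staleness_threshold)         # None
```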

Benchmark Results

📊 Experiment Dashboard: wandb.ai/huang3eng-alibaba/slime-async-release

The benchmark script (run-qwen3.5-4b-off-policy-benchmark.sh) demonstrates:

  • Throughput improvement of roughly 70% or more for async modes compared to the sync baseline
    [Image: W&B chart, 2026_3_30 17_27_12]
  • Consistent reward convergence — async training matches sync training quality
    [Image: W&B chart, 2026_3_30 17_27_41]
  • Tests for staleness_partial and window_partial modes show performance parity with the base fully_async mode while providing staleness/version control

Quick Start

# One-step off-policy async baseline (default rollout, no fully async worker)
MODE=one_step_off      bash examples/fully_async/run-qwen3.5-4b-off-policy-benchmark.sh

# Fully async, no staleness control
MODE=fully_async       bash examples/fully_async/run-qwen3.5-4b-off-policy-benchmark.sh

# Fully async + staleness backpressure + partial rollout
MODE=staleness_partial bash examples/fully_async/run-qwen3.5-4b-off-policy-benchmark.sh

# Fully async + version-window eviction + partial rollout
MODE=window_partial    bash examples/fully_async/run-qwen3.5-4b-off-policy-benchmark.sh

Wandb Metrics

When enabled, the following metric groups are logged under a dedicated fully_async/step axis:

  • fully_async/count/*: stale samples processed, consumed, recycled, dropped
  • fully_async/partial/*: partial rollout ratio and max version span
  • fully_async/window/*: completed store size, eligible samples, eviction counts

Files Changed

  • examples/fully_async/fully_async_rollout.py: Core async worker implementation with buffer policies
  • examples/fully_async/run-qwen3.5-4b-off-policy-benchmark.sh: Multi-mode benchmark script
  • examples/fully_async/README.md: Comprehensive documentation
  • train_async.py: Added lifecycle hooks integration
  • slime/ray/rollout.py: Added before_weight_update/after_weight_update hook forwarding in RolloutManager
  • tests/test_fully_async_rollout.py: Unit tests for staleness control and version tracking
  • tests/test_rollout_manager_fully_async_metrics.py: Tests for metrics logging

This PR provides a flexible foundation for off-policy RL training with proper staleness control, enabling users to choose between backpressure-style control (VERL-inspired) and window-eviction-style control (Forge-inspired) based on their specific requirements.

@huang3eng (Contributor Author)

@zhuzilin @Zhuohao-Li Please help review it~

@Zhuohao-Li (Contributor)

Can we break this PR into smaller ones? For example, staleness/version control and lifecycle hooks can be separate PRs.
Also, please draft a detailed plan in an issue for discussion with @zhuzilin before requesting review of a large PR :)

@huang3eng (Contributor Author)

@Zhuohao-Li For sure
