Conversation

@zhihaofang1017
Contributor

What does this PR do?

Adds NPU GRPO training scripts for Qwen3-VL-8B (FSDP and vLLM backends). The reward curves for this scenario are also shown below.

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, veomni, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward
    • If this PR involves multiple modules, separate them with commas, e.g. [megatron, fsdp, doc]
    • {type} is one of feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

[Screenshot: reward curve from run 20260209-144805]

For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment and show results such as training curve plots or evaluation results.

API and Usage Example

Demonstrate how the API changes, if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this
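
A minimal launch sketch (the script path below is an illustrative placeholder, not necessarily the exact file added in this PR; it assumes an Ascend NPU host with verl and its NPU dependencies installed):

# Illustrative invocation only; substitute the actual recipe script from this PR.
bash recipe/qwen3_vl/run_qwen3_vl_8b_grpo_npu.sh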

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all of the following items before requesting a review; otherwise, the reviewer may deprioritize this PR.


@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a new training script for GRPO on Qwen3-VL-8B with NPU backends. The script is comprehensive, but I've identified a couple of critical issues with the Hydra configuration that would likely cause the script to fail at startup. Specifically, an invalid value is used for rollout_rs, and an undefined parameter rollout_token_veto_threshold is passed to the trainer. My review provides suggestions to fix these configuration errors.

rollout_is_batch_normalize=true
rollout_rs=token
rollout_rs_threshold=0.6_1.6
rollout_token_veto_threshold=1e-4

critical

The variable rollout_token_veto_threshold sets a configuration parameter that is not defined in the RolloutCorrectionConfig dataclass. This will cause a ValidationError at startup. This line and its usage on line 85 should be removed.
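
With the suggested removal applied, the variable block would read (a sketch; only the undefined parameter is dropped):

rollout_is_batch_normalize=true
rollout_rs=token  # the review summary separately questions this value
rollout_rs_threshold=0.6_1.6
# rollout_token_veto_threshold removed: not defined in RolloutCorrectionConfig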

algorithm.rollout_correction.rollout_is_batch_normalize=${rollout_is_batch_normalize} \
algorithm.rollout_correction.rollout_rs=${rollout_rs} \
algorithm.rollout_correction.rollout_rs_threshold=${rollout_rs_threshold} \
algorithm.rollout_correction.rollout_token_veto_threshold=${rollout_token_veto_threshold} \

critical

The configuration parameter algorithm.rollout_correction.rollout_token_veto_threshold is not defined in the RolloutCorrectionConfig dataclass (verl/trainer/config/algorithm.py). This will cause a ValidationError and crash the script. This line and the variable definition on line 20 should be removed.
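
Applying the suggested fix, the override block would read as follows, with the rollout_token_veto_threshold override dropped (sketch):

algorithm.rollout_correction.rollout_is_batch_normalize=${rollout_is_batch_normalize} \
algorithm.rollout_correction.rollout_rs=${rollout_rs} \
algorithm.rollout_correction.rollout_rs_threshold=${rollout_rs_threshold} \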

@zhihaofang1017 marked this pull request as draft on February 11, 2026 at 08:38
@zhihaofang1017 marked this pull request as ready for review on February 11, 2026 at 08:39