
Conversation

@jQizhang
Contributor

@jQizhang commented Feb 9, 2026

What does this PR do?

This PR adds support for NVFP4 (W4A16) Quantization-Aware Training (QAT) in verl's Megatron training pipeline, with quantized weight transfer to the vLLM rollout engine for inference. The implementation leverages NVIDIA ModelOpt for quantization.

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, veomni, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

This feature has been validated through end-to-end QAT training experiments:

  • Model: Qwen3-8B (dense) and Qwen3-30B-A3B (MoE)
  • Setup: Training with Megatron, inference with vLLM
  • Quantization: NVFP4 W4A16 via ModelOpt
  • Verification: Confirmed that QAT-calibrated quantized weights are correctly transferred to vLLM and produce valid inference outputs (matching ModelOpt standalone export baselines)

API and Usage Example

Enable NVFP4 QAT by adding the following to the actor Megatron and rollout configs:

actor:
  megatron:
    quantization: nvfp4      # Quantization method
    enable_qat: true          # Enable QAT during training

rollout:
  quantization: nvfp4         # Tell vLLM to expect NVFP4 quantized weights

The training loop automatically:

  1. Applies NVFP4 QAT quantizers to the Megatron actor model at initialization (a minimal sketch follows this list)
  2. Trains with simulated quantization (forward pass uses fake-quantized weights)
  3. Before rollout, converts bf16 weights to NVFP4 format (packed uint8 + scales) and streams them to vLLM (see the packing sketch further below)
  4. vLLM loads the quantized weights using patched ModelOptNvFp4LinearMethod / ModelOptNvFp4FusedMoE for inference
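
For intuition, step 1 boils down to a ModelOpt quantize call on the actor module. The sketch below is illustrative rather than the exact verl code path: apply_nvfp4_qat, calib_dataloader, and the NVFP4_DEFAULT_CFG config name are assumptions, and a W4A16 recipe may additionally need activation quantizers disabled.

# Minimal sketch of step 1 (not the exact verl integration): attach NVFP4
# fake-quantizers to the actor model with NVIDIA ModelOpt, then keep training.
import modelopt.torch.quantization as mtq

def apply_nvfp4_qat(model, calib_dataloader):
    """Hypothetical helper; the real hook lives in verl's Megatron worker."""

    def forward_loop(m):
        # A few forward passes so ModelOpt can calibrate amax statistics.
        for batch in calib_dataloader:
            m(**batch)

    # NVFP4_DEFAULT_CFG is assumed; a W4A16 setup would keep activations in
    # bf16, e.g. by disabling the "*input_quantizer" entries in the config.
    model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)
    return model  # subsequent forward passes use fake-quantized weights (step 2)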

No changes to the training script are required beyond the configuration above.
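
To make step 3 concrete, below is a deliberately simplified sketch of NVFP4-style weight packing. It assumes the last weight dimension is divisible by the 16-element micro-block size and keeps block scales in fp32 (the actual interchange format read by vLLM's ModelOptNvFp4LinearMethod stores FP8 E4M3 block scales plus a global scale); quantize_nvfp4_sketch is a made-up name, not a function in this PR.

# Simplified sketch of step 3: quantize a bf16 weight to FP4 (E2M1) codes
# packed two per uint8, with one scale per 16-element block.
import torch

E2M1_VALUES = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_nvfp4_sketch(weight: torch.Tensor, block_size: int = 16):
    w = weight.float().reshape(-1, block_size)            # group into micro-blocks
    scales = w.abs().amax(dim=1, keepdim=True) / 6.0      # 6.0 = largest E2M1 magnitude
    scales = torch.clamp(scales, min=1e-8)
    scaled = w / scales                                    # map each block into [-6, 6]
    # Round every element to the nearest representable E2M1 magnitude.
    idx = (scaled.abs().unsqueeze(-1) - E2M1_VALUES).abs().argmin(dim=-1)
    sign = (scaled < 0).to(torch.uint8)
    codes = (sign << 3) | idx.to(torch.uint8)             # 1 sign bit + 3-bit index
    codes = codes.reshape(weight.shape[0], -1)
    # Pack two 4-bit codes per byte (nibble order here is illustrative only).
    packed = (codes[:, 1::2] << 4) | codes[:, 0::2]
    return packed, scales.reshape(weight.shape[0], -1)

On the inference side, each 4-bit code dequantizes back to sign * E2M1_VALUES[index] * block_scale.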

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


root seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request introduces support for NVFP4 Quantization-Aware Training (QAT) by integrating with NVIDIA's ModelOpt. The changes are comprehensive: they touch configuration and the Megatron training worker, and add utility modules for QAT processing and vLLM patching. The implementation appears well thought out, especially in handling the complexities of distributed training environments (TP, PP, EP). My main feedback is on a piece of duplicated code that should be refactored to improve maintainability. Overall, this is a solid feature addition.

Comment on lines +144 to +157
def _create_param_from_subclass_attributes(custom_data, custom_weight):
    param = Parameter(custom_data, requires_grad=False)
    base_param_dir = dir(torch.nn.Parameter)
    custom_weight_dir = dir(custom_weight)
    # Find the attributes that are unique to the custom parameter
    custom_attributes = [
        attr for attr in custom_weight_dir if attr not in base_param_dir and not attr.startswith("__")
    ]
    # Set the custom attributes into the base parameter object
    for attr in custom_attributes:
        setattr(param, attr, getattr(custom_weight, attr))

    return param

Contributor

high

This helper function _create_param_from_subclass_attributes is a duplicate of the one defined at the top level of this file (line 107). To avoid code duplication and improve maintainability, please remove this inner function and use the top-level one instead. The other function process_weights_after_loading_moe in this file already uses the top-level helper.
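
For context, here is a self-contained illustration of what this helper does, assuming the definition quoted above is in scope; the toy subclass and attribute names below (_ToyVllmParameter, weight_loader) are hypothetical.

# The helper flattens a Parameter subclass into a plain torch.nn.Parameter
# holding new data, while copying over the subclass-specific attributes
# (such as a weight_loader callback) that vLLM relies on.
import torch
from torch.nn import Parameter

class _ToyVllmParameter(Parameter):
    def __new__(cls, data, weight_loader=None):
        p = super().__new__(cls, data, requires_grad=False)
        p.weight_loader = weight_loader  # subclass-specific attribute
        return p

custom = _ToyVllmParameter(torch.randn(4, 16), weight_loader=lambda *a: None)
packed = torch.zeros(4, 8, dtype=torch.uint8)  # e.g. repacked NVFP4 data

new_param = _create_param_from_subclass_attributes(packed, custom)
assert type(new_param) is Parameter                 # plain Parameter again
assert hasattr(new_param, "weight_loader")          # attribute carried over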

Comment on lines +76 to +79
quantization: null

# Whether to enable Quantization-Aware Training (QAT). Default False.
enable_qat: False
Collaborator

better to use the same configs as in FSDP to avoid confusing the users
