[None][chore] Enable the weight and weight scale padding for NVFP4 TRTLLMGenFusedMoE BaseMethod #12031

Open
leslie-fang25 wants to merge 3 commits into NVIDIA:main from leslie-fang25:leslie/fix_nvfp4_padding

Conversation

@leslie-fang25
Collaborator

@leslie-fang25 leslie-fang25 commented Mar 9, 2026

Summary by CodeRabbit

Release Notes

  • Bug Fixes
    • Improved NVFP4 quantization support for Fused MoE models with enhanced alignment handling for weights and scales.
    • Enhanced stability when deploying quantized expert models with tensor parallelism configuration.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

…TLLMGenFusedMoE BaseMethod

Signed-off-by: leslie-fang25 <leslief@nvidia.com>
@leslie-fang25 leslie-fang25 requested a review from a team as a code owner March 9, 2026 09:56
@leslie-fang25 leslie-fang25 requested a review from QiJune March 9, 2026 09:56
@leslie-fang25
Collaborator Author

/bot run --disable-fail-fast

@leslie-fang25 leslie-fang25 requested a review from xxi-nv March 9, 2026 09:57
@tensorrt-cicd
Collaborator

PR_Github #38244 [ run ] triggered by Bot. Commit: 606d171 Link to invocation

@coderabbitai
Contributor

coderabbitai bot commented Mar 9, 2026

📝 Walkthrough

This pull request introduces alignment-aware padding for NVFP4 TRTLLMGen Fused MoE weights, adding helper constants and methods to compute aligned shapes, and modifying weight-loading methods to pad weights and scales according to hardware alignment requirements. A test utility condition guarding NVFP4 quantization is also removed.

Changes

  • Alignment-aware weight padding for NVFP4 Fused MoE
    (tensorrt_llm/_torch/modules/fused_moe/quantization.py)
    Adds alignment constants (_scale_m_alignment, _scale_k_alignment) and a _round_up helper; introduces a get_weights_shapes method to compute padded intermediate and weight shapes; modifies load_expert_w3_w1_weight, load_expert_w2_weight, and the corresponding scale-loading methods to zero buffers, handle padding, and apply alignment shuffling/interleaving logic.
  • Test alignment guard removal
    (tests/unittest/_torch/modules/moe/moe_test_utils.py)
    Removes QuantAlgo.NVFP4 from the crash-prone quantization algorithms set, narrowing the alignment-related test gating condition for TP sharding scenarios.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Description check: ⚠️ Warning. The pull request description is incomplete; only the template structure is provided, with sections left blank. Resolution: fill in the 'Description' and 'Test Coverage' sections with concrete details about what was changed and why, and list the relevant test cases that validate the changes.

✅ Passed checks (1 passed)

  • Title check: ✅ Passed. The title clearly summarizes the main change: enabling weight and weight scale padding for NVFP4 in the TRTLLMGenFusedMoE BaseMethod, which matches the core modifications in the changeset.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tensorrt_llm/_torch/modules/fused_moe/quantization.py (2)

2658-2689: ⚠️ Potential issue | 🔴 Critical

Pad the non-gated w1 shard before copying into the aligned buffer.

Once get_weights_shapes() returns a padded w3_w1 tensor, the non-gated branch still copies the unpadded shard into the full destination tensor. copy_ will raise for any ReLU2/Nemotron-H expert whose intermediate size rounds up here, and the same method is reused for padded biases.

🐛 Minimal fix
         else:
             # Non-gated activation (e.g., ReLU2): buffer only contains w1
+            w1_weight_shard = _pad_tensor_to_shape(
+                w1_weight_shard, dst_w3_w1_weight_gpu.shape)
             dst_w3_w1_weight_gpu.copy_(
                 w1_weight_shard.view(dst_w3_w1_weight_gpu.dtype))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/modules/fused_moe/quantization.py` around lines 2658 -
2689, The non-gated branch currently copies the raw w1_weight_shard into
dst_w3_w1_weight_gpu which can be larger due to alignment padding from
get_weights_shapes(); instead pad or expand w1_weight_shard to match
dst_w3_w1_weight_gpu.shape (keeping values at the start and zeros for the
padding) before calling dst_w3_w1_weight_gpu.copy_; locate the non-gated branch
that checks module.is_gated_activation and modify the else block so it
constructs a padded tensor (same dtype/device as dst_w3_w1_weight_gpu) sized to
dst_w3_w1_weight_gpu.shape[0] (or uses the same padded_half logic used above
when needed) and then copy_ that padded shard; apply the same padding approach
for the corresponding bias code paths that reuse this method.

2772-2805: ⚠️ Potential issue | 🔴 Critical

Pad the non-gated scale shard before the aligned copy.

This branch has the same shape assumption as the weight loader. After w3_w1_weight_scale is padded for the 128-row requirement, copying the raw w1_weight_scale tensor into the full destination will fail on non-gated experts before shuffle/interleave runs.

🐛 Minimal fix
         else:
             # Non-gated activation (e.g., ReLU2): buffer only contains w1 scale
+            w1_weight_scale = _pad_tensor_to_shape(
+                w1_weight_scale, dst_w3_w1_weight_scale_gpu.shape)
             dst_w3_w1_weight_scale_gpu.copy_(
                 w1_weight_scale.view(dst_w3_w1_weight_scale_gpu.dtype))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/modules/fused_moe/quantization.py` around lines 2772 -
2805, The non-gated branch currently copies w1_weight_scale directly into
dst_w3_w1_weight_scale_gpu but doesn't pad the source to the padded row count,
causing shape mismatch for non-gated experts; before calling
dst_w3_w1_weight_scale_gpu.copy_(...), create a padded version of
w1_weight_scale matching dst_w3_w1_weight_scale_gpu.shape[0] (and dtype) — e.g.,
allocate a temp tensor of the destination length, zero it, copy the original
w1_weight_scale.view(dst_dtype) into the start, then copy that padded tensor
into dst_w3_w1_weight_scale_gpu; update the code around
module.is_gated_activation else branch and use dst_w3_w1_weight_scale_gpu,
w1_weight_scale, and dst dtype references to locate the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/_torch/modules/fused_moe/quantization.py`:
- Around line 2574-2615: get_weights_shapes currently pads
module.expand_intermediate_size_per_partition and
module.intermediate_size_per_partition independently which can make w3_w1 and w2
disagree on the padded intermediate extent; instead compute a single
padded_intermediate_extent by rounding both values to their required alignments
and taking the max (e.g., expand_padded =
self._round_up(module.expand_intermediate_size_per_partition,
self._scale_m_alignment), inter_padded =
self._round_up(module.intermediate_size_per_partition,
module.scaling_vector_size * block_scales_vec_size * self._scale_k_alignment),
padded_intermediate = max(expand_padded, inter_padded)) and then use
padded_intermediate for the intermediate dimension in w3_w1_weight_shape,
w3_w1_weight_scale_shape, w3_w1_bias_shape and use
inter_padded/padded_intermediate consistently for w2_weight_shape,
w2_weight_scale_shape and w2_bias_shape so both GEMM1 and GEMM2 derive their
intermediate size from the same padded extent.
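The unification this comment suggests can be sketched as below. The function name and all default parameter values are illustrative assumptions; the real values come from the module's configuration (`scaling_vector_size`, the block-scales vector size, and the class's alignment constants).

```python
# Sketch of deriving one shared padded intermediate extent so that
# GEMM1 (w3_w1) and GEMM2 (w2) agree on the padded dimension.
# All default values below are illustrative assumptions only.

def round_up(value: int, alignment: int) -> int:
    return (value + alignment - 1) // alignment * alignment


def padded_intermediate_extent(expand_intermediate_size: int,
                               intermediate_size: int,
                               scale_m_alignment: int = 128,
                               scaling_vector_size: int = 16,
                               block_scales_vec_size: int = 4,
                               scale_k_alignment: int = 4) -> int:
    # Round each requirement up to its own alignment, then take the max
    # so both GEMMs derive their intermediate size from the same extent.
    expand_padded = round_up(expand_intermediate_size, scale_m_alignment)
    inter_padded = round_up(
        intermediate_size,
        scaling_vector_size * block_scales_vec_size * scale_k_alignment)
    return max(expand_padded, inter_padded)
```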


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7538ec66-e937-4c4f-b856-618f434bde32

📥 Commits

Reviewing files that changed from the base of the PR and between 5b4ff40 and 606d171.

📒 Files selected for processing (2)
  • tensorrt_llm/_torch/modules/fused_moe/quantization.py
  • tests/unittest/_torch/modules/moe/moe_test_utils.py
💤 Files with no reviewable changes (1)
  • tests/unittest/_torch/modules/moe/moe_test_utils.py

@tensorrt-cicd
Collaborator

PR_Github #38244 [ run ] completed with state SUCCESS. Commit: 606d171
/LLM/main/L0_MergeRequest_PR pipeline #29630 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@leslie-fang25
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #38332 [ run ] triggered by Bot. Commit: 606d171 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #38332 [ run ] completed with state SUCCESS. Commit: 606d171
/LLM/main/L0_MergeRequest_PR pipeline #29708 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@leslie-fang25
Collaborator Author

/bot run --disable-fail-fast --stage-list "DGX_B200-4_GPUs-PyTorch-1"

@tensorrt-cicd
Collaborator

PR_Github #38355 [ run ] triggered by Bot. Commit: 606d171 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #38355 [ run ] completed with state FAILURE. Commit: 606d171
/LLM/main/L0_MergeRequest_PR pipeline #29726 (Partly Tested) completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@leslie-fang25
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #38393 [ run ] triggered by Bot. Commit: 8e75a45 Link to invocation
