[None][chore] Enable the weight and weight scale padding for NVFP4 TRTLLMGenFusedMoE BaseMethod #12031
leslie-fang25 wants to merge 3 commits into NVIDIA:main
Conversation
…TLLMGenFusedMoE BaseMethod
Signed-off-by: leslie-fang25 <leslief@nvidia.com>

/bot run --disable-fail-fast

PR_Github #38244 [ run ] triggered by Bot. Commit:
📝 Walkthrough

This pull request introduces alignment-aware padding for NVFP4 TRT Gen Fused MoE weights, adding helper constants and methods to compute aligned shapes, and modifying the weight-loading methods to pad weights and scales according to hardware alignment requirements. A test utility condition guarding NVFP4 quantization is also removed.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks: ❌ 2 failed (2 warnings), ✅ 1 passed
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tensorrt_llm/_torch/modules/fused_moe/quantization.py (2)
2658-2689: ⚠️ Potential issue | 🔴 Critical

Pad the non-gated `w1` shard before copying into the aligned buffer. Once `get_weights_shapes()` returns a padded `w3_w1` tensor, the non-gated branch still copies the unpadded shard into the full destination tensor. `copy_` will raise for any ReLU2/Nemotron-H expert whose intermediate size rounds up here, and the same method is reused for padded biases.

🐛 Minimal fix

```diff
         else:
             # Non-gated activation (e.g., ReLU2): buffer only contains w1
+            w1_weight_shard = _pad_tensor_to_shape(
+                w1_weight_shard, dst_w3_w1_weight_gpu.shape)
             dst_w3_w1_weight_gpu.copy_(
                 w1_weight_shard.view(dst_w3_w1_weight_gpu.dtype))
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `tensorrt_llm/_torch/modules/fused_moe/quantization.py` around lines 2658-2689: the non-gated branch currently copies the raw `w1_weight_shard` into `dst_w3_w1_weight_gpu`, which can be larger due to alignment padding from `get_weights_shapes()`. Instead, pad or expand `w1_weight_shard` to match `dst_w3_w1_weight_gpu.shape` (keeping values at the start and zeros in the padding) before calling `dst_w3_w1_weight_gpu.copy_`. Locate the non-gated branch that checks `module.is_gated_activation` and modify the `else` block so it constructs a padded tensor (same dtype/device as `dst_w3_w1_weight_gpu`) sized to `dst_w3_w1_weight_gpu.shape[0]` (or uses the same `padded_half` logic used above when needed) and then `copy_` that padded shard. Apply the same padding approach to the corresponding bias code paths that reuse this method.
2772-2805: ⚠️ Potential issue | 🔴 Critical

Pad the non-gated scale shard before the aligned copy. This branch has the same shape assumption as the weight loader. After `w3_w1_weight_scale` is padded for the 128-row requirement, copying the raw `w1_weight_scale` tensor into the full destination will fail on non-gated experts before shuffle/interleave runs.

🐛 Minimal fix

```diff
         else:
             # Non-gated activation (e.g., ReLU2): buffer only contains w1 scale
+            w1_weight_scale = _pad_tensor_to_shape(
+                w1_weight_scale, dst_w3_w1_weight_scale_gpu.shape)
             dst_w3_w1_weight_scale_gpu.copy_(
                 w1_weight_scale.view(dst_w3_w1_weight_scale_gpu.dtype))
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `tensorrt_llm/_torch/modules/fused_moe/quantization.py` around lines 2772-2805: the non-gated branch currently copies `w1_weight_scale` directly into `dst_w3_w1_weight_scale_gpu` but does not pad the source to the padded row count, causing a shape mismatch for non-gated experts. Before calling `dst_w3_w1_weight_scale_gpu.copy_(...)`, create a padded version of `w1_weight_scale` matching `dst_w3_w1_weight_scale_gpu.shape[0]` (and dtype); e.g., allocate a temp tensor of the destination length, zero it, copy the original `w1_weight_scale.view(dst_dtype)` into the start, then copy that padded tensor into `dst_w3_w1_weight_scale_gpu`. Update the code around the `module.is_gated_activation` else branch, using the `dst_w3_w1_weight_scale_gpu`, `w1_weight_scale`, and destination dtype references to locate the change.
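The failure mode both findings describe can be reproduced in isolation: `Tensor.copy_` only accepts a source that broadcasts to the destination shape, so a padded destination paired with an unpadded source raises at load time (the sizes below are illustrative placeholders, not the module's real dimensions):

```python
import torch

# Destination padded up to the 128-row alignment; source shard left unpadded.
dst = torch.zeros(128, 64)
src = torch.randn(100, 64)

try:
    dst.copy_(src)  # 100 rows cannot broadcast to 128 rows
except RuntimeError:
    print("copy_ raised RuntimeError on the shape mismatch")
```

Zero-padding the source to the destination shape first, as the suggested diffs do, avoids the error and leaves the alignment rows zeroed.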
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tensorrt_llm/_torch/modules/fused_moe/quantization.py`:
- Around lines 2574-2615: `get_weights_shapes` currently pads `module.expand_intermediate_size_per_partition` and `module.intermediate_size_per_partition` independently, which can make `w3_w1` and `w2` disagree on the padded intermediate extent. Instead, compute a single padded intermediate extent by rounding both values to their required alignments and taking the max (e.g., `expand_padded = self._round_up(module.expand_intermediate_size_per_partition, self._scale_m_alignment)`, `inter_padded = self._round_up(module.intermediate_size_per_partition, module.scaling_vector_size * block_scales_vec_size * self._scale_k_alignment)`, `padded_intermediate = max(expand_padded, inter_padded)`). Then use `padded_intermediate` for the intermediate dimension in `w3_w1_weight_shape`, `w3_w1_weight_scale_shape`, and `w3_w1_bias_shape`, and use it consistently for `w2_weight_shape`, `w2_weight_scale_shape`, and `w2_bias_shape`, so both GEMM1 and GEMM2 derive their intermediate size from the same padded extent.
ℹ️ Review info
⚙️ Run configuration
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 7538ec66-e937-4c4f-b856-618f434bde32
📒 Files selected for processing (2)
- tensorrt_llm/_torch/modules/fused_moe/quantization.py
- tests/unittest/_torch/modules/moe/moe_test_utils.py
💤 Files with no reviewable changes (1)
- tests/unittest/_torch/modules/moe/moe_test_utils.py
PR_Github #38244 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #38332 [ run ] triggered by Bot. Commit:

PR_Github #38332 [ run ] completed with state

/bot run --disable-fail-fast --stage-list "DGX_B200-4_GPUs-PyTorch-1"

PR_Github #38355 [ run ] triggered by Bot. Commit:

PR_Github #38355 [ run ] completed with state

Signed-off-by: leslie-fang25 <leslief@nvidia.com>

/bot run --disable-fail-fast

PR_Github #38393 [ run ] triggered by Bot. Commit:
Summary by CodeRabbit
Release Notes
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment `/bot help`.