[None][chore] Enable the weight and weight scale padding for NVFP4 TRTLLMGenFusedMoE BaseMethod #12031
leslie-fang25 wants to merge 3 commits into NVIDIA:main
Conversation
…TLLMGenFusedMoE BaseMethod
Signed-off-by: leslie-fang25 <leslief@nvidia.com>

/bot run --disable-fail-fast

PR_Github #38244 [ run ] triggered by Bot. Commit:
📝 Walkthrough

This pull request introduces alignment-aware padding for NVFP4 TRT Gen Fused MoE weights, adding helper constants and methods to compute aligned shapes, and modifying the weight-loading methods to pad weights and scales according to hardware alignment requirements. A test utility condition guarding NVFP4 quantization is also removed.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks: ❌ 2 failed (2 warnings), ✅ 1 passed
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tensorrt_llm/_torch/modules/fused_moe/quantization.py (2)
2658-2689: ⚠️ Potential issue | 🔴 Critical

Pad the non-gated `w1` shard before copying into the aligned buffer. Once `get_weights_shapes()` returns a padded `w3_w1` tensor, the non-gated branch still copies the unpadded shard into the full destination tensor. `copy_` will raise for any ReLU2/Nemotron-H expert whose intermediate size rounds up here, and the same method is reused for padded biases.

🐛 Minimal fix

```diff
         else:
             # Non-gated activation (e.g., ReLU2): buffer only contains w1
+            w1_weight_shard = _pad_tensor_to_shape(
+                w1_weight_shard, dst_w3_w1_weight_gpu.shape)
             dst_w3_w1_weight_gpu.copy_(
                 w1_weight_shard.view(dst_w3_w1_weight_gpu.dtype))
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `tensorrt_llm/_torch/modules/fused_moe/quantization.py` around lines 2658-2689: the non-gated branch currently copies the raw `w1_weight_shard` into `dst_w3_w1_weight_gpu`, which can be larger due to alignment padding from `get_weights_shapes()`. Instead, pad or expand `w1_weight_shard` to match `dst_w3_w1_weight_gpu.shape` (keeping values at the start and zeros in the padding) before calling `dst_w3_w1_weight_gpu.copy_`. Locate the non-gated branch that checks `module.is_gated_activation` and modify the `else` block so it constructs a padded tensor (same dtype/device as `dst_w3_w1_weight_gpu`) sized to `dst_w3_w1_weight_gpu.shape[0]` (or uses the same `padded_half` logic used above when needed) and then `copy_` that padded shard. Apply the same padding approach to the corresponding bias code paths that reuse this method.
2772-2805: ⚠️ Potential issue | 🔴 Critical

Pad the non-gated scale shard before the aligned copy. This branch has the same shape assumption as the weight loader. After `w3_w1_weight_scale` is padded for the 128-row requirement, copying the raw `w1_weight_scale` tensor into the full destination will fail on non-gated experts before shuffle/interleave runs.

🐛 Minimal fix

```diff
         else:
             # Non-gated activation (e.g., ReLU2): buffer only contains w1 scale
+            w1_weight_scale = _pad_tensor_to_shape(
+                w1_weight_scale, dst_w3_w1_weight_scale_gpu.shape)
             dst_w3_w1_weight_scale_gpu.copy_(
                 w1_weight_scale.view(dst_w3_w1_weight_scale_gpu.dtype))
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `tensorrt_llm/_torch/modules/fused_moe/quantization.py` around lines 2772-2805: the non-gated branch currently copies `w1_weight_scale` directly into `dst_w3_w1_weight_scale_gpu` but does not pad the source to the padded row count, causing a shape mismatch for non-gated experts. Before calling `dst_w3_w1_weight_scale_gpu.copy_(...)`, create a padded version of `w1_weight_scale` matching `dst_w3_w1_weight_scale_gpu.shape[0]` (and dtype); e.g., allocate a temp tensor of the destination length, zero it, copy the original `w1_weight_scale.view(dst_dtype)` into the start, then copy that padded tensor into `dst_w3_w1_weight_scale_gpu`. Update the code around the `module.is_gated_activation` else branch, using the `dst_w3_w1_weight_scale_gpu`, `w1_weight_scale`, and destination dtype references to locate the change.
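The failure mode both findings describe can be reproduced in isolation: `Tensor.copy_` only accepts a source that broadcasts to the destination shape, so a padded destination paired with an unpadded source raises at load time (the sizes below are illustrative placeholders, not the module's real dimensions):

```python
import torch

# Destination padded up to the 128-row alignment; source shard left unpadded.
dst = torch.zeros(128, 64)
src = torch.randn(100, 64)

try:
    dst.copy_(src)  # 100 rows cannot broadcast to 128 rows
except RuntimeError:
    print("copy_ raised RuntimeError on the shape mismatch")
```

Zero-padding the source to the destination shape first, as the suggested diffs do, avoids the error and leaves the alignment rows zeroed.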
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tensorrt_llm/_torch/modules/fused_moe/quantization.py`:
- Around lines 2574-2615: `get_weights_shapes` currently pads `module.expand_intermediate_size_per_partition` and `module.intermediate_size_per_partition` independently, which can make `w3_w1` and `w2` disagree on the padded intermediate extent. Instead, compute a single padded intermediate extent by rounding both values to their required alignments and taking the max (e.g., `expand_padded = self._round_up(module.expand_intermediate_size_per_partition, self._scale_m_alignment)`, `inter_padded = self._round_up(module.intermediate_size_per_partition, module.scaling_vector_size * block_scales_vec_size * self._scale_k_alignment)`, `padded_intermediate = max(expand_padded, inter_padded)`). Then use `padded_intermediate` for the intermediate dimension in `w3_w1_weight_shape`, `w3_w1_weight_scale_shape`, and `w3_w1_bias_shape`, and use it consistently for `w2_weight_shape`, `w2_weight_scale_shape`, and `w2_bias_shape`, so both GEMM1 and GEMM2 derive their intermediate size from the same padded extent.
ℹ️ Review info
⚙️ Run configuration
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 7538ec66-e937-4c4f-b856-618f434bde32
📒 Files selected for processing (2)
- tensorrt_llm/_torch/modules/fused_moe/quantization.py
- tests/unittest/_torch/modules/moe/moe_test_utils.py
💤 Files with no reviewable changes (1)
- tests/unittest/_torch/modules/moe/moe_test_utils.py
PR_Github #38244 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #38332 [ run ] triggered by Bot. Commit:

PR_Github #38332 [ run ] completed with state

/bot run --disable-fail-fast --stage-list "DGX_B200-4_GPUs-PyTorch-1"

PR_Github #38355 [ run ] triggered by Bot. Commit:

PR_Github #38355 [ run ] completed with state

Signed-off-by: leslie-fang25 <leslief@nvidia.com>

/bot run --disable-fail-fast

PR_Github #38393 [ run ] triggered by Bot. Commit:
Summary by CodeRabbit
Release Notes
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment `/bot help`.