
Conversation

@ZhiyuLi-Nvidia (Contributor) commented Nov 21, 2025

What does this PR do?

Successful run after the fix, with TP=2 and sequence parallelism enabled on a Qwen model:
https://wandb.ai/nvidia/grpo-dev-zhiyul/runs/nyq6n98w/overview?nw=nwuserzhiyul

Issues

List issues that this PR closes:

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this
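
A hypothetical launch for the run linked above (TP=2 with sequence parallelism on a Qwen model). The entry-point script and the policy.dtensor_cfg.* override keys are assumptions and should be checked against the example configs in this repo:

# Assumed entry point and config keys; verify against examples/configs/ before use
uv run python examples/run_grpo_math.py \
    policy.dtensor_cfg.enabled=true \
    policy.dtensor_cfg.tensor_parallel_size=2 \
    policy.dtensor_cfg.sequence_parallel=true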

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

  • New Features

    • Enabled sequence parallelism to work in combination with tensor parallelism for improved distributed training performance.
    • Enhanced parallelization strategies for Llama, Qwen, and Gemma3 models with optimized activation handling in parallel configurations.
  • Bug Fixes

    • Removed runtime error that previously blocked sequence parallelism when using tensor parallelism, enabling more flexible model parallelization options.


Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
@ZhiyuLi-Nvidia ZhiyuLi-Nvidia requested a review from a team as a code owner November 21, 2025 03:54
@github-actions (bot) commented:

⚠️ File Consistency Check

Check based on commit: 8c58b54 (PR #1557 from zhiyul/fix_sharding_qwen)

⚠️ Parallel Plans Synchronization Warning

The file nemo_rl/models/dtensor/parallelize.py was modified in this PR, but neither 3rdparty/Automodel-workspace/Automodel/nemo_automodel/components/distributed/optimized_tp_plans.py nor 3rdparty/Automodel-workspace/Automodel/nemo_automodel/components/distributed/parallelizer.py was updated.

Why this matters:
These files contain similar parallel plan implementations that should be kept synchronized to ensure consistency across the codebase.

Action required:

  • Please review if the changes in nemo_rl/models/dtensor/parallelize.py should also be applied to 3rdparty/Automodel-workspace/Automodel/nemo_automodel/components/distributed/optimized_tp_plans.py or 3rdparty/Automodel-workspace/Automodel/nemo_automodel/components/distributed/parallelizer.py
  • Update the appropriate related file(s) if necessary to maintain functional consistency
  • Request access to the NVIDIA-NeMo/Automodel repository, create a PR against the nemo-rl-submodule branch, and update the Automodel submodule in the nemo-rl index
  • Add @ffrujeri as a reviewer of this PR if you have any questions about the consistency requirements
  • If the files are intentionally different, please add a comment in the PR explaining why

Files to check:

  • Modified: nemo_rl/models/dtensor/parallelize.py
  • Not modified: 3rdparty/Automodel-workspace/Automodel/nemo_automodel/components/distributed/optimized_tp_plans.py
  • Not modified: 3rdparty/Automodel-workspace/Automodel/nemo_automodel/components/distributed/parallelizer.py

⚠️ DTensor Policy Worker Synchronization Warning

The file nemo_rl/models/policy/dtensor_policy_worker.py was modified in this PR, but nemo_rl/models/policy/dtensor_policy_worker_v2.py was not updated.

Why this matters:
These files contain related DTensor policy worker implementations that should be kept synchronized to ensure consistency across different versions.

Action required:

  • Please review if the changes in nemo_rl/models/policy/dtensor_policy_worker.py should also be applied to nemo_rl/models/policy/dtensor_policy_worker_v2.py
  • Update nemo_rl/models/policy/dtensor_policy_worker_v2.py if necessary to maintain consistency
  • If the files are intentionally different, please add a comment in the PR explaining why

Files to check:

  • Modified: nemo_rl/models/policy/dtensor_policy_worker.py
  • Not modified: nemo_rl/models/policy/dtensor_policy_worker_v2.py

This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.

coderabbitai bot commented Nov 21, 2025

📝 Walkthrough

Introduces SequenceParallelAllGatherActivation class to redistribute sequence-parallel DTensor activations from Shard to Replicate placements across normalization and attention layers in multiple model architectures. Also removes the runtime restriction preventing sequence parallelism when tensor parallelism is enabled.

Changes

  • SequenceParallelAllGatherActivation implementation (nemo_rl/models/dtensor/parallelize.py): adds a new SequenceParallelAllGatherActivation class extending SequenceParallel with a _prepare_output_fn static method that redistributes DTensors with Shard placements to Replicate before delegating to the parent class (see the sketch below).
  • Model parallelization strategy updates (nemo_rl/models/dtensor/parallelize.py): updates the model-parallel mappings for Llama, Qwen, and Gemma3 to use SequenceParallelAllGatherActivation for the input_layernorm and post_attention_layernorm layers; adds gate_up_proj projections to the parallelization plans; configures use_local_output=False where applicable.
  • Tensor parallelism restriction removal (nemo_rl/models/policy/dtensor_policy_worker.py): removes the runtime error that blocked sequence parallelism when tp_size > 1; retains the warning for the tp_size == 1 case.
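
A minimal sketch of the new style described in the first item above, assuming the stock _prepare_output_fn(use_local_output, mod, outputs, device_mesh) signature of torch.distributed.tensor.parallel.SequenceParallel; the PR's actual code may differ in details (for instance, it raises ValueError where this sketch raises TypeError, per the Ruff hint further down):

from torch.distributed.tensor import DTensor, Replicate, Shard
from torch.distributed.tensor.parallel import SequenceParallel

class SequenceParallelAllGatherActivation(SequenceParallel):
    """SequenceParallel variant that all-gathers the layer's output activation."""

    @staticmethod
    def _prepare_output_fn(use_local_output, mod, outputs, device_mesh):
        if isinstance(outputs, DTensor):
            # If the activation is still sharded along the sequence dimension,
            # redistribute it so every rank holds the full (replicated) tensor.
            if any(isinstance(p, Shard) for p in outputs.placements):
                outputs = outputs.redistribute(
                    placements=[Replicate()] * len(outputs.placements)
                )
        else:
            raise TypeError(f"Expected output to be a DTensor, but got {type(outputs)}")
        # Delegate the rest (e.g. to_local() when use_local_output=True) to the parent class.
        return SequenceParallel._prepare_output_fn(
            use_local_output, mod, outputs, device_mesh
        )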

Sequence Diagram(s)

sequenceDiagram
    participant Input as Input Activation<br/>(Shard placement)
    participant SPAGA as SequenceParallelAllGatherActivation
    participant Parent as SequenceParallel._prepare_output_fn
    participant Output as Output Activation<br/>(Replicate placement)
    
    Input->>SPAGA: outputs (DTensor with Shard)
    activate SPAGA
    SPAGA->>SPAGA: Check for Shard placement
    alt Has Shard placement
        SPAGA->>SPAGA: Redistribute to Replicate
    end
    SPAGA->>Parent: Call parent _prepare_output_fn
    activate Parent
    Parent->>Output: Apply parent logic
    deactivate Parent
    SPAGA->>Output: Return redistributed output
    deactivate SPAGA

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • SequenceParallelAllGatherActivation class logic: Verify DTensor redistribution from Shard to Replicate is correct and compatible with parent class delegation
  • Model architecture consistency: Ensure Llama, Qwen, and Gemma3 updates follow the same pattern for normalization and projection layers
  • Gate_up_proj addition: Confirm the new model.layers.*.mlp.gate_up_proj (ColwiseParallel) entry is correctly integrated
  • tp_size restriction removal: Validate that removing the tp_size > 1 guard doesn't violate other assumptions in state loading or model initialization
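
As context for the last bullet, a hypothetical shape of the relaxed check in dtensor_policy_worker.py (names and messages are assumptions, not the actual diff):

import warnings

def check_sequence_parallel_config(sequence_parallel: bool, tp_size: int) -> None:
    # Before this PR (sketch): sequence parallelism combined with TP was rejected outright.
    # if sequence_parallel and tp_size > 1:
    #     raise ValueError("sequence_parallel is not supported with tensor_parallel_size > 1")
    # After this PR (sketch): only warn when sequence_parallel is set but TP == 1,
    # where it has no effect.
    if sequence_parallel and tp_size == 1:
        warnings.warn("sequence_parallel has no effect when tensor_parallel_size == 1")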

Suggested labels

r0.4.0

Suggested reviewers

  • yaoyu-33
  • terrykong

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Test Results For Major Changes: ⚠️ Warning. The PR introduces major distributed-training changes without test results, performance benchmarks, convergence validation, or CI/CD documentation. Resolution: add comprehensive testing information, including numerical correctness validation, performance benchmarks, distributed training verification, and CI/CD test run references.
✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title describes fixing a DTensor sharding error related to the PyTorch version bump, which is partially related to the changes. However, the actual changes introduce SequenceParallelAllGatherActivation and remove the sequence-parallel restriction under tensor parallelism, which is broader than just fixing a sharding error.
  • Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage; skipping the check.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch zhiyul/fix_sharding_qwen


coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nemo_rl/models/dtensor/parallelize.py (1)

156-166: Update Gemma3 parallelization to match Llama and Qwen fixes.

The _parallelize_gemma3 function uses SequenceParallel() for normalization layers (lines 156, 160, 161, 165), while _parallelize_llama and _parallelize_qwen have been updated to use SequenceParallelAllGatherActivation() for the same layers. Since Gemma3 has identical layer structure and is architecturally related to Qwen2 (see line 91 comment), it likely encounters the same PyTorch 2.8.0 DTensor sharding issue and should apply the same fix:

  • Line 156: input_layernorm should use SequenceParallelAllGatherActivation(use_local_output=False)
  • Line 160: post_attention_layernorm should use SequenceParallelAllGatherActivation(use_local_output=False)
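
In plan-dict form, the two suggested Gemma3 entries would mirror the Llama/Qwen updates (a sketch of the reviewer's suggestion, not the actual diff):

    "model.layers.*.input_layernorm": SequenceParallelAllGatherActivation(use_local_output=False),
    "model.layers.*.post_attention_layernorm": SequenceParallelAllGatherActivation(use_local_output=False),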
🧹 Nitpick comments (1)
nemo_rl/models/dtensor/parallelize.py (1)

72-87: Good addition, but consider using TypeError for type validation.

The SequenceParallelAllGatherActivation class correctly implements all-gather behavior for sharded DTensor activations. However, consider the following improvements:

  1. Use TypeError instead of ValueError: When validating types, TypeError is more semantically appropriate.
  2. Handle edge cases: Consider what happens if outputs is a tuple/list of DTensors or other composite structures.

Apply this diff to use TypeError:

-        else:
-            raise ValueError(f"Expected output to be a DTensor, but got {type(outputs)}")
+        else:
+            raise TypeError(f"Expected output to be a DTensor, but got {type(outputs)}")

Based on static analysis hints.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1c371a9 and 8c58b54.

📒 Files selected for processing (2)
  • nemo_rl/models/dtensor/parallelize.py (3 hunks)
  • nemo_rl/models/policy/dtensor_policy_worker.py (0 hunks)
💤 Files with no reviewable changes (1)
  • nemo_rl/models/policy/dtensor_policy_worker.py
🧰 Additional context used
🪛 Ruff (0.14.5)
nemo_rl/models/dtensor/parallelize.py

84-84: Prefer TypeError exception for invalid type

(TRY004)


84-84: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Lint check
  • GitHub Check: Post submodule check comment / Comment on PR
🔇 Additional comments (2)
nemo_rl/models/dtensor/parallelize.py (2)

264-264: Good addition of fused gate_up_proj support.

Adding support for the fused gate_up_proj projection is a sensible optimization that some Qwen model variants use instead of separate gate_proj and up_proj layers.
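
For orientation, the fused entry would sit alongside the usual per-projection MLP entries in the Qwen plan; the gate_proj/up_proj/down_proj lines below show the typical pattern and are assumptions, not a quote of the diff:

    "model.layers.*.mlp.gate_proj": ColwiseParallel(),
    "model.layers.*.mlp.up_proj": ColwiseParallel(),
    "model.layers.*.mlp.down_proj": RowwiseParallel(output_layouts=Shard(1)),
    # Fused variant added in this PR for models that merge gate and up projections
    "model.layers.*.mlp.gate_up_proj": ColwiseParallel(),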


201-203: Approved: Correct application of all-gather for normalization layers.

The use of SequenceParallelAllGatherActivation(use_local_output=False) at lines 201 and 203 ensures proper activation gathering before attention and MLP operations. The pattern is consistent within the Llama configuration.

Comment on lines +253 to +261
"model.layers.*.input_layernorm": SequenceParallelAllGatherActivation(),
"model.layers.*.self_attn.q_proj": ColwiseParallel(),
"model.layers.*.self_attn.k_proj": ColwiseParallel(),
"model.layers.*.self_attn.v_proj": ColwiseParallel(),
"model.layers.*.self_attn.qkv_proj": ColwiseParallel(),
"model.layers.*.self_attn.o_proj": RowwiseParallel(output_layouts=Shard(1)),
"model.layers.*.self_attn.q_norm": Qwen3QKNorm(),
"model.layers.*.self_attn.k_norm": Qwen3QKNorm(),
"model.layers.*.post_attention_layernorm": SequenceParallel(),
"model.layers.*.post_attention_layernorm": SequenceParallelAllGatherActivation(),
coderabbitai bot commented:

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

Verify the omission of use_local_output parameter.

In _parallelize_llama, the SequenceParallelAllGatherActivation is instantiated with use_local_output=False (lines 201, 203), but here in _parallelize_qwen, this parameter is omitted for both input_layernorm (line 253) and post_attention_layernorm (line 261). This inconsistency could lead to different behavior between model architectures.

Please verify whether this difference is intentional or if use_local_output=False should be added for consistency:


🏁 Script executed:

#!/bin/bash
# Check the default value of use_local_output in SequenceParallel
ast-grep --pattern $'class SequenceParallel($$$) {
  $$$
  def __init__(self, $$$use_local_output$$$) {
    $$$
  }
  $$$
}'

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Search for SequenceParallelAllGatherActivation class definition
rg -n "class SequenceParallelAllGatherActivation" --type python

Length of output: 30


🏁 Script executed:

#!/bin/bash
# Search for SequenceParallelAllGatherActivation usages including __init__ or def
rg -n "SequenceParallelAllGatherActivation" --type python -C 3

Length of output: 30


🏁 Script executed:

#!/bin/bash
# Look at the actual code around lines 253-261 in parallelize.py
sed -n '250,265p' nemo_rl/models/dtensor/parallelize.py

Length of output: 1035


🏁 Script executed:

#!/bin/bash
# Look at the actual code around lines 201-203 in parallelize.py
sed -n '198,210p' nemo_rl/models/dtensor/parallelize.py

Length of output: 662


🏁 Script executed:

#!/bin/bash
# Search for SequenceParallelAllGatherActivation class definition without type flag
rg -n "class SequenceParallelAllGatherActivation"

Length of output: 101


🏁 Script executed:

#!/bin/bash
# Check imports at the top of parallelize.py
head -50 nemo_rl/models/dtensor/parallelize.py | grep -n "SequenceParallel"

Length of output: 24


🏁 Script executed:

#!/bin/bash
# Find where SequenceParallelAllGatherActivation is defined or imported
rg -n "SequenceParallelAllGatherActivation" -A 5 | head -100

Length of output: 2759


🏁 Script executed:

#!/bin/bash
# Get the full definition of SequenceParallelAllGatherActivation class
sed -n '72,120p' nemo_rl/models/dtensor/parallelize.py

Length of output: 2246


🏁 Script executed:

#!/bin/bash
# Search for the parent SequenceParallel class or its import source
rg -n "class SequenceParallel" --type-add py:*.py -t py

Length of output: 101


🏁 Script executed:

#!/bin/bash
# Check what SequenceParallel is imported from
head -40 nemo_rl/models/dtensor/parallelize.py

Length of output: 1399




🌐 Web query:

PyTorch SequenceParallel use_local_output parameter default value

💡 Result:

The default value is False — SequenceParallel(..., use_local_output=False). [1][2]

Sources:
[1] PyTorch docs — torch.distributed.tensor.parallel (SequenceParallel signature).
[2] PyTorch tutorial / docs examples showing use_local_output=False for SequenceParallel.


🏁 Script executed:

#!/bin/bash
# Find all _parallelize functions and check their SequenceParallelAllGatherActivation usage
rg -n "def _parallelize" nemo_rl/models/dtensor/parallelize.py

Length of output: 139


🏁 Script executed:

#!/bin/bash
# Check all SequenceParallelAllGatherActivation usages in the file
rg -n "SequenceParallelAllGatherActivation" nemo_rl/models/dtensor/parallelize.py

Length of output: 473


🏁 Script executed:

#!/bin/bash
# Check _parallelize_gemma3 for SequenceParallelAllGatherActivation usage
sed -n '128,178p' nemo_rl/models/dtensor/parallelize.py

Length of output: 2426


🏁 Script executed:

#!/bin/bash
# Check _parallelize_nm5_h for SequenceParallelAllGatherActivation usage
sed -n '423,492p' nemo_rl/models/dtensor/parallelize.py

Length of output: 2203


Add use_local_output=False parameter for consistency with _parallelize_llama.

Lines 253 and 261 in _parallelize_qwen omit the use_local_output=False parameter in SequenceParallelAllGatherActivation(), while _parallelize_llama (lines 201, 203) explicitly specifies it. Although PyTorch's SequenceParallel defaults to use_local_output=False, making this parameter explicit in Qwen maintains consistency and clarity across model configurations. Update lines 253 and 261 to:

"model.layers.*.input_layernorm": SequenceParallelAllGatherActivation(use_local_output=False),
"model.layers.*.post_attention_layernorm": SequenceParallelAllGatherActivation(use_local_output=False),
🤖 Prompt for AI Agents
In nemo_rl/models/dtensor/parallelize.py around lines 253 to 261, the
SequenceParallelAllGatherActivation calls for "model.layers.*.input_layernorm"
and "model.layers.*.post_attention_layernorm" omit the explicit
use_local_output=False parameter; update those two invocations to include
use_local_output=False to match _parallelize_llama and make behavior explicit
and consistent across configurations.

