
Conversation

@ZhiyuLi-Nvidia (Contributor) commented Nov 21, 2025

What does this PR do?

Successful run after the fix, with TP=2 and sequence parallelism enabled on a Qwen model:
https://wandb.ai/nvidia/grpo-dev-zhiyul/runs/nyq6n98w/overview?nw=nwuserzhiyul

Issues

List issues that this PR closes:

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this
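
A hypothetical launch for the run linked above (TP=2 with sequence parallelism on a Qwen model). The entry-point script and the policy.dtensor_cfg.* override keys are assumptions and should be checked against the example configs in this repo:

# Assumed entry point and config keys; verify against examples/configs/ before use
uv run python examples/run_grpo_math.py \
    policy.dtensor_cfg.enabled=true \
    policy.dtensor_cfg.tensor_parallel_size=2 \
    policy.dtensor_cfg.sequence_parallel=true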

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

  • New Features

    • Enabled sequence parallelism to work in combination with tensor parallelism for improved distributed training performance.
    • Enhanced parallelization strategies for Llama, Qwen, and Gemma3 models with optimized activation handling in parallel configurations.
  • Bug Fixes

    • Removed runtime error that previously blocked sequence parallelism when using tensor parallelism, enabling more flexible model parallelization options.


Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
@ZhiyuLi-Nvidia ZhiyuLi-Nvidia requested a review from a team as a code owner November 21, 2025 03:54
@github-actions (bot) commented:

⚠️ File Consistency Check

Check based on commit: 8c58b54 (PR #1557 from zhiyul/fix_sharding_qwen)

⚠️ Parallel Plans Synchronization Warning

The file nemo_rl/models/dtensor/parallelize.py was modified in this PR, but neither 3rdparty/Automodel-workspace/Automodel/nemo_automodel/components/distributed/optimized_tp_plans.py nor 3rdparty/Automodel-workspace/Automodel/nemo_automodel/components/distributed/parallelizer.py was updated.

Why this matters:
These files contain similar parallel plan implementations that should be kept synchronized to ensure consistency across the codebase.

Action required:

  • Please review if the changes in nemo_rl/models/dtensor/parallelize.py should also be applied to 3rdparty/Automodel-workspace/Automodel/nemo_automodel/components/distributed/optimized_tp_plans.py or 3rdparty/Automodel-workspace/Automodel/nemo_automodel/components/distributed/parallelizer.py
  • Update the appropriate related file(s) if necessary to maintain functional consistency
  • Request access to the NVIDIA-NeMo/Automodel repository, create a PR against the nemo-rl-submodule branch, and update the Automodel submodule in the nemo-rl index
  • Add @ffrujeri as a reviewer of this PR if you have any questions about the consistency requirements
  • If the files are intentionally different, please add a comment in the PR explaining why

Files to check:

  • Modified: nemo_rl/models/dtensor/parallelize.py
  • Not modified: 3rdparty/Automodel-workspace/Automodel/nemo_automodel/components/distributed/optimized_tp_plans.py
  • Not modified: 3rdparty/Automodel-workspace/Automodel/nemo_automodel/components/distributed/parallelizer.py

⚠️ DTensor Policy Worker Synchronization Warning

The file nemo_rl/models/policy/dtensor_policy_worker.py was modified in this PR, but nemo_rl/models/policy/dtensor_policy_worker_v2.py was not updated.

Why this matters:
These files contain related DTensor policy worker implementations that should be kept synchronized to ensure consistency across different versions.

Action required:

  • Please review if the changes in nemo_rl/models/policy/dtensor_policy_worker.py should also be applied to nemo_rl/models/policy/dtensor_policy_worker_v2.py
  • Update nemo_rl/models/policy/dtensor_policy_worker_v2.py if necessary to maintain consistency
  • If the files are intentionally different, please add a comment in the PR explaining why

Files to check:

  • Modified: nemo_rl/models/policy/dtensor_policy_worker.py
  • Not modified: nemo_rl/models/policy/dtensor_policy_worker_v2.py

This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.

coderabbitai bot commented Nov 21, 2025

📝 Walkthrough

Introduces SequenceParallelAllGatherActivation class to redistribute sequence-parallel DTensor activations from Shard to Replicate placements across normalization and attention layers in multiple model architectures. Also removes the runtime restriction preventing sequence parallelism when tensor parallelism is enabled.

Changes

  • SequenceParallelAllGatherActivation implementation (nemo_rl/models/dtensor/parallelize.py): adds a new SequenceParallelAllGatherActivation class extending SequenceParallel with a _prepare_output_fn static method that redistributes DTensors with Shard placements to Replicate before delegating to the parent class (see the sketch below).
  • Model parallelization strategy updates (nemo_rl/models/dtensor/parallelize.py): updates the model-parallel mappings for Llama, Qwen, and Gemma3 to use SequenceParallelAllGatherActivation for the input_layernorm and post_attention_layernorm layers; adds gate_up_proj projections to the parallelization plans; configures use_local_output=False where applicable.
  • Tensor parallelism restriction removal (nemo_rl/models/policy/dtensor_policy_worker.py): removes the runtime error that blocked sequence parallelism when tp_size > 1; retains the warning for the tp_size == 1 case.
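
A minimal sketch of the new style described in the first item above, assuming the stock _prepare_output_fn(use_local_output, mod, outputs, device_mesh) signature of torch.distributed.tensor.parallel.SequenceParallel; the PR's actual code may differ in details (for instance, it raises ValueError where this sketch raises TypeError, per the Ruff hint further down):

from torch.distributed.tensor import DTensor, Replicate, Shard
from torch.distributed.tensor.parallel import SequenceParallel

class SequenceParallelAllGatherActivation(SequenceParallel):
    """SequenceParallel variant that all-gathers the layer's output activation."""

    @staticmethod
    def _prepare_output_fn(use_local_output, mod, outputs, device_mesh):
        if isinstance(outputs, DTensor):
            # If the activation is still sharded along the sequence dimension,
            # redistribute it so every rank holds the full (replicated) tensor.
            if any(isinstance(p, Shard) for p in outputs.placements):
                outputs = outputs.redistribute(
                    placements=[Replicate()] * len(outputs.placements)
                )
        else:
            raise TypeError(f"Expected output to be a DTensor, but got {type(outputs)}")
        # Delegate the rest (e.g. to_local() when use_local_output=True) to the parent class.
        return SequenceParallel._prepare_output_fn(
            use_local_output, mod, outputs, device_mesh
        )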

Sequence Diagram(s)

sequenceDiagram
    participant Input as Input Activation<br/>(Shard placement)
    participant SPAGA as SequenceParallelAllGatherActivation
    participant Parent as SequenceParallel._prepare_output_fn
    participant Output as Output Activation<br/>(Replicate placement)
    
    Input->>SPAGA: outputs (DTensor with Shard)
    activate SPAGA
    SPAGA->>SPAGA: Check for Shard placement
    alt Has Shard placement
        SPAGA->>SPAGA: Redistribute to Replicate
    end
    SPAGA->>Parent: Call parent _prepare_output_fn
    activate Parent
    Parent->>Output: Apply parent logic
    deactivate Parent
    SPAGA->>Output: Return redistributed output
    deactivate SPAGA

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • SequenceParallelAllGatherActivation class logic: Verify DTensor redistribution from Shard to Replicate is correct and compatible with parent class delegation
  • Model architecture consistency: Ensure Llama, Qwen, and Gemma3 updates follow the same pattern for normalization and projection layers
  • Gate_up_proj addition: Confirm the new model.layers.*.mlp.gate_up_proj (ColwiseParallel) entry is correctly integrated
  • tp_size restriction removal: Validate that removing the tp_size > 1 guard doesn't violate other assumptions in state loading or model initialization
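
As context for the last bullet, a hypothetical shape of the relaxed check in dtensor_policy_worker.py (names and messages are assumptions, not the actual diff):

import warnings

def check_sequence_parallel_config(sequence_parallel: bool, tp_size: int) -> None:
    # Before this PR (sketch): sequence parallelism combined with TP was rejected outright.
    # if sequence_parallel and tp_size > 1:
    #     raise ValueError("sequence_parallel is not supported with tensor_parallel_size > 1")
    # After this PR (sketch): only warn when sequence_parallel is set but TP == 1,
    # where it has no effect.
    if sequence_parallel and tp_size == 1:
        warnings.warn("sequence_parallel has no effect when tensor_parallel_size == 1")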

Suggested labels

r0.4.0

Suggested reviewers

  • yaoyu-33
  • terrykong

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Test Results For Major Changes: ⚠️ Warning. The PR introduces major distributed-training changes without test results, performance benchmarks, convergence validation, or CI/CD documentation. Resolution: add comprehensive testing information, including numerical correctness validation, performance benchmarks, distributed training verification, and CI/CD test run references.
✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title describes fixing a DTensor sharding error related to the PyTorch version bump, which is partially related to the changes. However, the actual changes introduce SequenceParallelAllGatherActivation and remove the sequence-parallel restriction under tensor parallelism, which is broader than just fixing a sharding error.
  • Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage; skipping the check.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch zhiyul/fix_sharding_qwen


coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nemo_rl/models/dtensor/parallelize.py (1)

156-166: Update Gemma3 parallelization to match Llama and Qwen fixes.

The _parallelize_gemma3 function uses SequenceParallel() for normalization layers (lines 156, 160, 161, 165), while _parallelize_llama and _parallelize_qwen have been updated to use SequenceParallelAllGatherActivation() for the same layers. Since Gemma3 has identical layer structure and is architecturally related to Qwen2 (see line 91 comment), it likely encounters the same PyTorch 2.8.0 DTensor sharding issue and should apply the same fix:

  • Line 156: input_layernorm should use SequenceParallelAllGatherActivation(use_local_output=False)
  • Line 160: post_attention_layernorm should use SequenceParallelAllGatherActivation(use_local_output=False)
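
In plan-dict form, the two suggested Gemma3 entries would mirror the Llama/Qwen updates (a sketch of the reviewer's suggestion, not the actual diff):

    "model.layers.*.input_layernorm": SequenceParallelAllGatherActivation(use_local_output=False),
    "model.layers.*.post_attention_layernorm": SequenceParallelAllGatherActivation(use_local_output=False),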
🧹 Nitpick comments (1)
nemo_rl/models/dtensor/parallelize.py (1)

72-87: Good addition, but consider using TypeError for type validation.

The SequenceParallelAllGatherActivation class correctly implements all-gather behavior for sharded DTensor activations. However, consider the following improvements:

  1. Use TypeError instead of ValueError: When validating types, TypeError is more semantically appropriate.
  2. Handle edge cases: Consider what happens if outputs is a tuple/list of DTensors or other composite structures.

Apply this diff to use TypeError:

-        else:
-            raise ValueError(f"Expected output to be a DTensor, but got {type(outputs)}")
+        else:
+            raise TypeError(f"Expected output to be a DTensor, but got {type(outputs)}")

Based on static analysis hints.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1c371a9 and 8c58b54.

📒 Files selected for processing (2)
  • nemo_rl/models/dtensor/parallelize.py (3 hunks)
  • nemo_rl/models/policy/dtensor_policy_worker.py (0 hunks)
💤 Files with no reviewable changes (1)
  • nemo_rl/models/policy/dtensor_policy_worker.py
🧰 Additional context used
🪛 Ruff (0.14.5)
nemo_rl/models/dtensor/parallelize.py

84-84: Prefer TypeError exception for invalid type

(TRY004)


84-84: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Lint check
  • GitHub Check: Post submodule check comment / Comment on PR
🔇 Additional comments (2)
nemo_rl/models/dtensor/parallelize.py (2)

264-264: Good addition of fused gate_up_proj support.

Adding support for the fused gate_up_proj projection is a sensible optimization that some Qwen model variants use instead of separate gate_proj and up_proj layers.
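
For orientation, the fused entry would sit alongside the usual per-projection MLP entries in the Qwen plan; the gate_proj/up_proj/down_proj lines below show the typical pattern and are assumptions, not a quote of the diff:

    "model.layers.*.mlp.gate_proj": ColwiseParallel(),
    "model.layers.*.mlp.up_proj": ColwiseParallel(),
    "model.layers.*.mlp.down_proj": RowwiseParallel(output_layouts=Shard(1)),
    # Fused variant added in this PR for models that merge gate and up projections
    "model.layers.*.mlp.gate_up_proj": ColwiseParallel(),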


201-203: Approved: Correct application of all-gather for normalization layers.

The use of SequenceParallelAllGatherActivation(use_local_output=False) at lines 201 and 203 ensures proper activation gathering before attention and MLP operations. The pattern is consistent within the Llama configuration.

Comment on lines +253 to +261
"model.layers.*.input_layernorm": SequenceParallelAllGatherActivation(),
"model.layers.*.self_attn.q_proj": ColwiseParallel(),
"model.layers.*.self_attn.k_proj": ColwiseParallel(),
"model.layers.*.self_attn.v_proj": ColwiseParallel(),
"model.layers.*.self_attn.qkv_proj": ColwiseParallel(),
"model.layers.*.self_attn.o_proj": RowwiseParallel(output_layouts=Shard(1)),
"model.layers.*.self_attn.q_norm": Qwen3QKNorm(),
"model.layers.*.self_attn.k_norm": Qwen3QKNorm(),
"model.layers.*.post_attention_layernorm": SequenceParallel(),
"model.layers.*.post_attention_layernorm": SequenceParallelAllGatherActivation(),
coderabbitai bot commented:

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

Verify the omission of use_local_output parameter.

In _parallelize_llama, the SequenceParallelAllGatherActivation is instantiated with use_local_output=False (lines 201, 203), but here in _parallelize_qwen, this parameter is omitted for both input_layernorm (line 253) and post_attention_layernorm (line 261). This inconsistency could lead to different behavior between model architectures.

Please verify whether this difference is intentional or if use_local_output=False should be added for consistency:


🏁 Script executed:

#!/bin/bash
# Check the default value of use_local_output in SequenceParallel
ast-grep --pattern $'class SequenceParallel($$$) {
  $$$
  def __init__(self, $$$use_local_output$$$) {
    $$$
  }
  $$$
}'

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Search for SequenceParallelAllGatherActivation class definition
rg -n "class SequenceParallelAllGatherActivation" --type python

Length of output: 30


🏁 Script executed:

#!/bin/bash
# Search for SequenceParallelAllGatherActivation usages including __init__ or def
rg -n "SequenceParallelAllGatherActivation" --type python -C 3

Length of output: 30


🏁 Script executed:

#!/bin/bash
# Look at the actual code around lines 253-261 in parallelize.py
sed -n '250,265p' nemo_rl/models/dtensor/parallelize.py

Length of output: 1035


🏁 Script executed:

#!/bin/bash
# Look at the actual code around lines 201-203 in parallelize.py
sed -n '198,210p' nemo_rl/models/dtensor/parallelize.py

Length of output: 662


🏁 Script executed:

#!/bin/bash
# Search for SequenceParallelAllGatherActivation class definition without type flag
rg -n "class SequenceParallelAllGatherActivation"

Length of output: 101


🏁 Script executed:

#!/bin/bash
# Check imports at the top of parallelize.py
head -50 nemo_rl/models/dtensor/parallelize.py | grep -n "SequenceParallel"

Length of output: 24


🏁 Script executed:

#!/bin/bash
# Find where SequenceParallelAllGatherActivation is defined or imported
rg -n "SequenceParallelAllGatherActivation" -A 5 | head -100

Length of output: 2759


🏁 Script executed:

#!/bin/bash
# Get the full definition of SequenceParallelAllGatherActivation class
sed -n '72,120p' nemo_rl/models/dtensor/parallelize.py

Length of output: 2246


🏁 Script executed:

#!/bin/bash
# Search for the parent SequenceParallel class or its import source
rg -n "class SequenceParallel" --type-add py:*.py -t py

Length of output: 101


🏁 Script executed:

#!/bin/bash
# Check what SequenceParallel is imported from
head -40 nemo_rl/models/dtensor/parallelize.py

Length of output: 1399




🌐 Web query:

PyTorch SequenceParallel use_local_output parameter default value

💡 Result:

The default value is False — SequenceParallel(..., use_local_output=False). [1][2]

Sources:
[1] PyTorch docs — torch.distributed.tensor.parallel (SequenceParallel signature).
[2] PyTorch tutorial / docs examples showing use_local_output=False for SequenceParallel.


🏁 Script executed:

#!/bin/bash
# Find all _parallelize functions and check their SequenceParallelAllGatherActivation usage
rg -n "def _parallelize" nemo_rl/models/dtensor/parallelize.py

Length of output: 139


🏁 Script executed:

#!/bin/bash
# Check all SequenceParallelAllGatherActivation usages in the file
rg -n "SequenceParallelAllGatherActivation" nemo_rl/models/dtensor/parallelize.py

Length of output: 473


🏁 Script executed:

#!/bin/bash
# Check _parallelize_gemma3 for SequenceParallelAllGatherActivation usage
sed -n '128,178p' nemo_rl/models/dtensor/parallelize.py

Length of output: 2426


🏁 Script executed:

#!/bin/bash
# Check _parallelize_nm5_h for SequenceParallelAllGatherActivation usage
sed -n '423,492p' nemo_rl/models/dtensor/parallelize.py

Length of output: 2203


Add use_local_output=False parameter for consistency with _parallelize_llama.

Lines 253 and 261 in _parallelize_qwen omit the use_local_output=False parameter in SequenceParallelAllGatherActivation(), while _parallelize_llama (lines 201, 203) explicitly specifies it. Although PyTorch's SequenceParallel defaults to use_local_output=False, making this parameter explicit in Qwen maintains consistency and clarity across model configurations. Update lines 253 and 261 to:

"model.layers.*.input_layernorm": SequenceParallelAllGatherActivation(use_local_output=False),
"model.layers.*.post_attention_layernorm": SequenceParallelAllGatherActivation(use_local_output=False),
🤖 Prompt for AI Agents
In nemo_rl/models/dtensor/parallelize.py around lines 253 to 261, the
SequenceParallelAllGatherActivation calls for "model.layers.*.input_layernorm"
and "model.layers.*.post_attention_layernorm" omit the explicit
use_local_output=False parameter; update those two invocations to include
use_local_output=False to match _parallelize_llama and make behavior explicit
and consistent across configurations.

