
Conversation


@adil-a adil-a commented Nov 12, 2025

What does this PR do?

Uses Automodel's FSDP2 manager for initializing the v2 worker.

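
As context for reviewers, here is a rough sketch of the init flow this PR moves to. The `FSDP2Manager` import path and constructor arguments below are assumptions for illustration, not the exact Automodel API; only `init_process_group` and its composite backend string are standard PyTorch.

```python
# Hedged sketch of the manager-driven init path; FSDP2Manager's import path and
# kwargs are assumptions here, not the real Automodel interface.
import torch.distributed as dist

def init_v2_worker(model, cfg):
    # With CPU offload enabled, a composite backend provides both a CUDA (NCCL)
    # and a CPU (Gloo) process group, which FSDP2 CPU offload relies on.
    backend = "cuda:nccl,cpu:gloo" if cfg["cpu_offload"] else "nccl"
    if not dist.is_initialized():
        dist.init_process_group(backend=backend)

    from nemo_automodel.components.distributed.fsdp2 import FSDP2Manager  # assumed path

    manager = FSDP2Manager(                      # assumed kwarg names
        dp_size=cfg["data_parallel_size"],
        tp_size=cfg["tensor_parallel_size"],
        cp_size=cfg["context_parallel_size"],
        cpu_offload=cfg["cpu_offload"],
    )
    model = manager.parallelize(model)

    # The worker keeps the manager-derived meshes around for data sharding and
    # grad-norm reductions (attribute names taken from the walkthrough below).
    return model, manager.dp_mesh, manager.dp_shard_cp_mesh
```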

Sharding on current main (logged 2025-11-12 16:03:44):

(DTensorPolicyWorkerV2 pid=1247213) ================================================================================
(DTensorPolicyWorkerV2 pid=1247213) [PARALLELISM CONFIG]
(DTensorPolicyWorkerV2 pid=1247213)   world_size = 16
(DTensorPolicyWorkerV2 pid=1247213)   tensor_parallel_size (TP) = 1
(DTensorPolicyWorkerV2 pid=1247213)   context_parallel_size (CP) = 1
(DTensorPolicyWorkerV2 pid=1247213)   data_parallel_size (DP/FSDP) = 16
(DTensorPolicyWorkerV2 pid=1247213)   data_parallel_replicate_size = 1
(DTensorPolicyWorkerV2 pid=1247213)   sequence_parallel = False
(DTensorPolicyWorkerV2 pid=1247213)   FSDP shards model across 16 workers
(DTensorPolicyWorkerV2 pid=1247213)   Each worker has ~1/16 of model parameters
(DTensorPolicyWorkerV2 pid=1247213) ================================================================================
(DTensorPolicyWorkerV2 pid=1247213) ================================================================================
(DTensorPolicyWorkerV2 pid=1247213) [MODEL SHARDING DIAGNOSTICS - Rank 0]
(DTensorPolicyWorkerV2 pid=1247213)   Total parameters: 1,498,482,688
(DTensorPolicyWorkerV2 pid=1247213)   DTensor parameters: 1,498,482,688 (100.0%)
(DTensorPolicyWorkerV2 pid=1247213)   Regular parameters: 0 (0.0%)
(DTensorPolicyWorkerV2 pid=1247213)   Local storage (this worker): 0.37 GB
(DTensorPolicyWorkerV2 pid=1247213)   Global storage (full model): 5.99 GB
(DTensorPolicyWorkerV2 pid=1247213)   Shard ratio: 1/16.0 (this worker has 1/16 of model)
(DTensorPolicyWorkerV2 pid=1247213)
(DTensorPolicyWorkerV2 pid=1247213)   Sample DTensor placements:
(DTensorPolicyWorkerV2 pid=1247213)     model.embed_tokens.weight:
(DTensorPolicyWorkerV2 pid=1247213)       Global shape: torch.Size([128256, 2048])
(DTensorPolicyWorkerV2 pid=1247213)       Local shape: torch.Size([8016, 2048])
(DTensorPolicyWorkerV2 pid=1247213)       Placements: (Replicate(), Shard(dim=0))
(DTensorPolicyWorkerV2 pid=1247213)       Device mesh: DeviceMesh('cuda', [[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]], mesh_dim_names=('dp_replicate', 'dp_shard_cp'))
(DTensorPolicyWorkerV2 pid=1247213)       Device mesh shape: (1, 16)
(DTensorPolicyWorkerV2 pid=1247213)     model.layers.0.self_attn.q_proj.weight:
(DTensorPolicyWorkerV2 pid=1247213)       Global shape: torch.Size([2048, 2048])
(DTensorPolicyWorkerV2 pid=1247213)       Local shape: torch.Size([128, 2048])
(DTensorPolicyWorkerV2 pid=1247213)       Placements: (Replicate(), Shard(dim=0))
(DTensorPolicyWorkerV2 pid=1247213)       Device mesh: DeviceMesh('cuda', [[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]], mesh_dim_names=('dp_replicate', 'dp_shard_cp'))
(DTensorPolicyWorkerV2 pid=1247213)       Device mesh shape: (1, 16)
(DTensorPolicyWorkerV2 pid=1247213)     model.layers.0.self_attn.k_proj.weight:
(DTensorPolicyWorkerV2 pid=1247213)       Global shape: torch.Size([512, 2048])
(DTensorPolicyWorkerV2 pid=1247213)       Local shape: torch.Size([32, 2048])
(DTensorPolicyWorkerV2 pid=1247213)       Placements: (Replicate(), Shard(dim=0))

Sharding on this branch (logged 2025-11-12 16:32:43):

(DTensorPolicyWorkerV2 pid=1921604) ================================================================================
(DTensorPolicyWorkerV2 pid=1921604) [PARALLELISM CONFIG]
(DTensorPolicyWorkerV2 pid=1921604)   world_size = 16
(DTensorPolicyWorkerV2 pid=1921604)   tensor_parallel_size (TP) = 1
(DTensorPolicyWorkerV2 pid=1921604)   context_parallel_size (CP) = 1
(DTensorPolicyWorkerV2 pid=1921604)   data_parallel_size (DP/FSDP) = None
(DTensorPolicyWorkerV2 pid=1921604)   data_parallel_replicate_size = 2
(DTensorPolicyWorkerV2 pid=1921604)   sequence_parallel = False
(DTensorPolicyWorkerV2 pid=1921604)   FSDP shards model across None workers
(DTensorPolicyWorkerV2 pid=1921604)   Each worker has ~1/None of model parameters
(DTensorPolicyWorkerV2 pid=1921604) ================================================================================
(DTensorPolicyWorkerV2 pid=1921604) ================================================================================
(DTensorPolicyWorkerV2 pid=1921604) [MODEL SHARDING DIAGNOSTICS - Rank 0]
(DTensorPolicyWorkerV2 pid=1921604)   Total parameters: 1,498,482,688
(DTensorPolicyWorkerV2 pid=1921604)   DTensor parameters: 1,498,482,688 (100.0%)
(DTensorPolicyWorkerV2 pid=1921604)   Regular parameters: 0 (0.0%)
(DTensorPolicyWorkerV2 pid=1921604)   Local storage (this worker): 0.75 GB
(DTensorPolicyWorkerV2 pid=1921604)   Global storage (full model): 5.99 GB
(DTensorPolicyWorkerV2 pid=1921604)   Shard ratio: 1/8.0 (this worker has 1/8 of model)
(DTensorPolicyWorkerV2 pid=1921604)
(DTensorPolicyWorkerV2 pid=1921604)       Global shape: torch.Size([128256, 2048])
(DTensorPolicyWorkerV2 pid=1921604)       Local shape: torch.Size([16032, 2048])
(DTensorPolicyWorkerV2 pid=1921604)       Placements: (Replicate(), Shard(dim=0))
(DTensorPolicyWorkerV2 pid=1921604)       Device mesh: DeviceMesh('cuda', [[0, 1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13, 14, 15]], mesh_dim_names=('dp_replicate', 'dp_shard_cp'))
(DTensorPolicyWorkerV2 pid=1921604)       Device mesh shape: (2, 8)
(DTensorPolicyWorkerV2 pid=1921604)     model.layers.0.self_attn.q_proj.weight:
(DTensorPolicyWorkerV2 pid=1921604)       Global shape: torch.Size([2048, 2048])
(DTensorPolicyWorkerV2 pid=1921604)       Local shape: torch.Size([256, 2048])
(DTensorPolicyWorkerV2 pid=1921604)       Placements: (Replicate(), Shard(dim=0))
(DTensorPolicyWorkerV2 pid=1921604)       Device mesh: DeviceMesh('cuda', [[0, 1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13, 14, 15]], mesh_dim_names=('dp_replicate', 'dp_shard_cp'))
(DTensorPolicyWorkerV2 pid=1921604)       Device mesh shape: (2, 8)
(DTensorPolicyWorkerV2 pid=1921604)     model.layers.0.self_attn.k_proj.weight:
(DTensorPolicyWorkerV2 pid=1921604)       Global shape: torch.Size([512, 2048])
(DTensorPolicyWorkerV2 pid=1921604)       Local shape: torch.Size([64, 2048])
(DTensorPolicyWorkerV2 pid=1921604)       Placements: (Replicate(), Shard(dim=0))
(DTensorPolicyWorkerV2 pid=1921604)       Device mesh: DeviceMesh('cuda', [[0, 1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13, 14, 15]], mesh_dim_names=('dp_replicate', 'dp_shard_cp'))
(DTensorPolicyWorkerV2 pid=1921604)       Device mesh shape: (2, 8)
(DTensorPolicyWorkerV2 pid=1921604) ================================================================================
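
The diagnostics above can be reproduced from plain DTensor metadata. A minimal sketch follows; it is illustrative only, not the worker's actual logging code, and the import path assumes a recent PyTorch (older versions expose DTensor under torch.distributed._tensor).

```python
# Illustrative sketch: derive the sharding diagnostics above from DTensor metadata.
import torch
from torch.distributed.tensor import DTensor

def sharding_diagnostics(model: torch.nn.Module) -> None:
    total = dtensor_numel = local_bytes = global_bytes = 0
    for _, p in model.named_parameters():
        total += p.numel()  # DTensor.numel() is the global element count
        global_bytes += p.numel() * p.element_size()
        if isinstance(p, DTensor):
            dtensor_numel += p.numel()
            shard = p.to_local()  # just this rank's shard
            local_bytes += shard.numel() * shard.element_size()
        else:
            local_bytes += p.numel() * p.element_size()
    print(f"Total parameters: {total:,}")
    print(f"DTensor parameters: {dtensor_numel:,} ({100.0 * dtensor_numel / total:.1f}%)")
    print(f"Local storage (this worker): {local_bytes / 1e9:.2f} GB")
    print(f"Global storage (full model): {global_bytes / 1e9:.2f} GB")
    print(f"Shard ratio: 1/{global_bytes / local_bytes:.1f}")
```

The key difference between the two runs is the mesh shape: with data_parallel_replicate_size = 2 the branch builds a (2, 8) mesh, so each rank holds 1/8 of the model (0.75 GB) rather than 1/16 (0.37 GB), which matches the doubled local shapes in the log.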

Summary by CodeRabbit

  • Refactor
    • Improved internal training infrastructure with updated distributed model parallelization setup.
    • Enhanced CPU offload support for distributed training scenarios.
    • Optimized attention mechanism selection during distributed training based on configuration parameters.

adil-a and others added 6 commits November 11, 2025 21:04
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Signed-off-by: root <root@pool0-01523.cm.cluster>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Signed-off-by: root <root@pool0-01749.cm.cluster>
@adil-a adil-a requested a review from a team as a code owner November 12, 2025 05:24
@github-actions

⚠️ File Consistency Check

Check based on commit: 128fb6f (PR #1509 from adil/fsdp-manager)

⚠️ DTensor Policy Worker Synchronization Warning

The file nemo_rl/models/policy/dtensor_policy_worker_v2.py was modified in this PR, but nemo_rl/models/policy/dtensor_policy_worker.py was not updated.

Why this matters:
These files contain related DTensor policy worker implementations that should be kept synchronized to ensure consistency across different versions.

Action required:

  • Please review if the changes in nemo_rl/models/policy/dtensor_policy_worker_v2.py should also be applied to nemo_rl/models/policy/dtensor_policy_worker.py
  • Update nemo_rl/models/policy/dtensor_policy_worker.py if necessary to maintain consistency
  • If the files are intentionally different, please add a comment in the PR explaining why

Files to check:

  • Modified: nemo_rl/models/policy/dtensor_policy_worker_v2.py
  • Not modified: nemo_rl/models/policy/dtensor_policy_worker.py

This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.

@adil-a adil-a added the CI:L2 Run doctests, unit tests, functional tests, and convergence tests label Nov 12, 2025

coderabbitai bot commented Nov 12, 2025

📝 Walkthrough

Refactors FSDP2 initialization in the policy worker to use FSDP2Manager instead of manual device-mesh setup, adds cpu_offload handling, exposes manager-derived mesh attributes, implements dynamic attention implementation selection based on configuration, and updates gradient norm computation paths.

Changes

  • FSDP2 Manager Integration & Initialization (nemo_rl/models/policy/dtensor_policy_worker_v2.py): Replaces manual FSDP2 device-mesh setup and parallelization with an FSDP2Manager-based approach; exposes manager-derived mesh attributes (dp_mesh, dp_shard_cp_mesh, tp_mesh, cp_mesh) and size attributes; integrates cpu_offload handling with a CUDA+CPU backend for init_process_group.
  • Attention Implementation & Configuration (nemo_rl/models/policy/dtensor_policy_worker_v2.py): Adds dynamic attn_implementation selection logic to choose between flash_attention_2 and sdpa based on seq_packing and context_parallel_size; injects the computed value into model_config (a sketch follows after this list).
  • Training & Gradient Computation (nemo_rl/models/policy/dtensor_policy_worker_v2.py): Updates grad norm computation to reference dp_shard_cp_mesh instead of dp_cp_mesh; removes the legacy OffloadPolicy import; restructures model wiring to use manager.parallelize.
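
The attention-selection item above boils down to a small conditional. The exact rule is not spelled out in this summary, so the version below is an assumption for illustration only:

```python
# Hypothetical sketch of the dynamic attn_implementation selection; the actual
# condition in dtensor_policy_worker_v2.py may differ.
def select_attn_implementation(sequence_packing: bool, context_parallel_size: int) -> str:
    # Assumption: context parallelism is routed through SDPA, while packed
    # sequences without CP use FlashAttention's varlen kernels.
    if context_parallel_size > 1:
        return "sdpa"
    return "flash_attention_2" if sequence_packing else "sdpa"
```

The computed value is then injected into the model config before construction, e.g. AutoModelForCausalLM.from_pretrained(model_name, attn_implementation=attn_impl), which is standard transformers usage rather than worker-specific code.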

Sequence Diagram(s)

sequenceDiagram
    participant Worker as Policy Worker
    participant Manager as FSDP2Manager
    participant Model
    participant Training

    Worker->>Manager: Initialize with cpu_offload config
    activate Manager
    Manager->>Manager: Create device mesh<br/>(CUDA+CPU backend if offloading)
    Manager->>Manager: Configure offload policy
    Manager-->>Worker: Return mesh attributes<br/>(dp_mesh, tp_mesh, cp_mesh)
    deactivate Manager

    Worker->>Worker: Select attn_implementation<br/>(flash_attention_2 vs sdpa)<br/>based on seq_packing & context_parallel

    Worker->>Model: Inject attn_implementation<br/>into model_config

    Worker->>Manager: parallelize(model)
    activate Manager
    Manager->>Model: Apply FSDP2 sharding
    Manager-->>Worker: Model parallelized
    deactivate Manager

    Worker->>Training: Store mesh references<br/>(dp_shard_cp_mesh for grad norm)
    Training->>Training: Use dp_shard_cp_mesh<br/>for gradient computation
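
The meshes the manager returns in the diagram correspond to named dimensions of a single 2D DeviceMesh, matching the sharding logs above. A small sketch with plain PyTorch APIs, using the (2, 8) layout from the branch run:

```python
# Build and slice the 2D mesh shown in the logs; this mirrors the log output,
# not the manager's internal code. Assumes torch.distributed is initialized
# across 16 ranks.
from torch.distributed.device_mesh import init_device_mesh

mesh = init_device_mesh(
    "cuda",
    (2, 8),  # dp_replicate x dp_shard_cp, as in the branch logs
    mesh_dim_names=("dp_replicate", "dp_shard_cp"),
)

dp_replicate_mesh = mesh["dp_replicate"]  # replication groups of size 2
dp_shard_cp_mesh = mesh["dp_shard_cp"]    # sharding groups of size 8

# fully_shard(model, mesh=mesh) over a 2D mesh yields HSDP: parameters are
# sharded along 'dp_shard_cp' and replicated along 'dp_replicate', which is
# why each rank stores 1/8 of the model in the branch run.
```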

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60–75 minutes

  • FSDP2Manager integration: Verify correct mesh creation, device placement, and offload policy configuration with cpu_offload handling
  • Attention implementation selection logic: Validate conditional logic for choosing between flash_attention_2 and sdpa; ensure proper config injection
  • Gradient norm computation updates: Confirm dp_shard_cp_mesh correctly replaces previous dp_cp_mesh references in training paths (a sketch follows after this list)
  • Mesh attribute exposure and downstream references: Check all usages of newly exposed mesh attributes (dp_shard_cp_mesh, tp_mesh, etc.) are consistent and correctly applied
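
For the gradient-norm point, here is a sketch of how a total norm reduced over the dp_shard_cp mesh could look. This is an assumed helper for illustration, not the worker's actual implementation:

```python
# Assumed helper: total grad norm when gradients are DTensor shards whose
# sharded dimension lives on the 'dp_shard_cp' mesh.
import torch
import torch.distributed as dist
from torch.distributed.tensor import DTensor

def grad_norm(parameters, dp_shard_cp_mesh) -> float:
    local_sq = torch.zeros(1, device="cuda")
    for p in parameters:
        if p.grad is None:
            continue
        g = p.grad.to_local() if isinstance(p.grad, DTensor) else p.grad
        local_sq += g.float().pow(2).sum()
    # Sum squared norms of local shards across the sharding group only; ranks
    # along 'dp_replicate' already hold identical gradients, so including them
    # would double-count.
    dist.all_reduce(local_sq, op=dist.ReduceOp.SUM, group=dp_shard_cp_mesh.get_group())
    return local_sq.sqrt().item()
```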

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

  • Test Results For Major Changes: ⚠️ Warning. The PR introduces major changes to DTensorPolicyV2 initialization, including the FSDP2Manager switch, gradient norm computation modifications, and attention implementation changes that directly affect training numerics, but no test results or convergence verification is documented. Resolution: add test results demonstrating that convergence behavior is unchanged, that gradient norms are correct across DP/CP configurations, performance metrics, and verification that the unresolved gradient norm issue is resolved.

✅ Passed checks (3 passed)

  • Title check: ✅ Passed. The title accurately summarizes the main change: introducing Automodel initialization for DTensorPolicyV2 through FSDP2 manager integration.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which meets the required threshold of 80.00%.
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6a40247 and 128fb6f.

📒 Files selected for processing (1)
  • nemo_rl/models/policy/dtensor_policy_worker_v2.py (8 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Follow the Google Python Style Guide for all Python code
Target Python 3.12+ for all Python code in NeMo-RL
Indent Python code with 4 spaces; do not use tabs
Python filenames should be snake_case (e.g., some_file.py)
Class names should be PascalCase
Function and method names should be snake_case
Local variable names should be snake_case; if starting with a number, prefix with k (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE and prefixed with G_ (e.g., G_MY_GLOBAL)
Constants should be UPPER_SNAKE_CASE
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
For public interfaces used outside a file, prefer docstrings over comments
Use comments mainly for code within a function or interfaces local to a file
Commented-out code must include a nearby comment explaining usage and why it is commented out; otherwise remove before merging
Use Google-style docstrings for classes and functions (Sphinx-parseable)
Avoid using reflection when functionality can be easily achieved without it
Limit except clauses to the smallest specific set of exceptions possible
For duck-typing via try/except, keep the try body minimal and use else for main logic
Add the NVIDIA copyright header (with current year) at the top of all Python files, excluding tests/ and test-only scripts

Files:

  • nemo_rl/models/policy/dtensor_policy_worker_v2.py
nemo_rl/**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

nemo_rl/**/*.py: Do not set non-None configuration defaults in code; YAML is the single source of truth for defaults
Access required config attributes directly (e.g., policy_cfg["precision"]) and assume presence; do not introduce hidden defaults
Express configuration optionality via TypedDict using typing.NotRequired
When adding a new config key to a TypedDict subclass, document the key’s purpose, valid values/types, and recommended default in code
For any class or function decorated with @ray.remote, add '# pragma: no cover' on the class/def line (and on remote functions)

Files:

  • nemo_rl/models/policy/dtensor_policy_worker_v2.py
🧠 Learnings (1)
📚 Learning: 2025-10-30T20:50:44.126Z
Learnt from: adil-a
Repo: NVIDIA-NeMo/RL PR: 1440
File: examples/configs/sft_automodel.yaml:48-58
Timestamp: 2025-10-30T20:50:44.126Z
Learning: In DTensor configurations for MoE (Mixture of Experts) models, expert_parallel_size and data_parallel_size can be applied together without multiplying the GPU requirements. Expert Parallelism (EP) only applies to MoE layers, while Data Parallelism/FSDP applies to non-MoE layers. Therefore, configurations like expert_parallel_size: 8 and data_parallel_size: 8 are valid on an 8-GPU cluster for MoE models.

Applied to files:

  • nemo_rl/models/policy/dtensor_policy_worker_v2.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Lint check
  • GitHub Check: Post submodule check comment / Comment on PR

@joyang-nv joyang-nv (Member) previously approved these changes Nov 12, 2025
Thanks for refactoring!

@terrykong terrykong (Contributor)

@adil-a do you mind adding the total step time before and after as well to the PR description?

@github-actions

⚠️ File Consistency Check

Check based on commit: d5cc915 (PR #1509 from adil/fsdp-manager): same DTensor Policy Worker synchronization warning as above (nemo_rl/models/policy/dtensor_policy_worker_v2.py modified without updating nemo_rl/models/policy/dtensor_policy_worker.py).

@github-actions

❌ Submodule Fast-Forward Check Failed

Check based on commit: d5cc915 (PR #1509 from adil/fsdp-manager)

❌ Submodules that need attention:

Automodel: ❌ Commits have DIVERGED from a common ancestor
TARGET (main branch): https://github.com/NVIDIA-NeMo/Automodel/commits/a2db048383cd54b3fafc928df4c30bf7bbf7c430/
CURRENT (PR #1509 from adil/fsdp-manager): https://github.com/NVIDIA-NeMo/Automodel/commits/8134b0c039802fb3f6161571400ef7085dd1e9cb/

Please ensure all submodule commits are fast-forwards of the main branch before merging.

@github-actions

⚠️ File Consistency Check

Check based on commit: 0577c10 (PR #1509 from adil/fsdp-manager): same DTensor Policy Worker synchronization warning as above.

@github-actions

❌ Submodule Fast-Forward Check Failed

Check based on commit: 0577c10 (PR #1509 from adil/fsdp-manager): the Automodel submodule has again diverged from the main-branch commit (same TARGET and CURRENT commits as above).

@adil-a adil-a requested a review from a team as a code owner November 12, 2025 15:41
@github-actions

⚠️ File Consistency Check

Check based on commit: 733529b (PR #1509 from adil/fsdp-manager): same DTensor Policy Worker synchronization warning as above.

@github-actions

❌ Submodule Fast-Forward Check Failed

Check based on commit: 733529b (PR #1509 from adil/fsdp-manager): the Automodel submodule has again diverged from the main-branch commit (same TARGET and CURRENT commits as above).

@github-actions

⚠️ File Consistency Check

Check based on commit: 128fb6f (PR #1509 from adil/fsdp-manager): same DTensor Policy Worker synchronization warning as above.

Signed-off-by: adil-a <adil.asif2000@hotmail.com>
@adil-a adil-a requested a review from a team as a code owner November 12, 2025 16:41
@github-actions

⚠️ File Consistency Check

Check based on commit: 20d8bbc (PR #1509 from adil/fsdp-manager): same DTensor Policy Worker synchronization warning as above.

@adil-a adil-a added CI:L2 Run doctests, unit tests, functional tests, and convergence tests and removed CI:L2 Run doctests, unit tests, functional tests, and convergence tests labels Nov 12, 2025