
Conversation

guyueh1
Contributor

@guyueh1 guyueh1 commented Oct 11, 2025

What does this PR do ?

When using a unified placement group, fix the worker placement so that every contiguous group of num_gpus_per_node workers is placed on the same node.

Relevant issues: ray-project/ray#51117
Refers to the implementation in NovaSky-AI/SkyRL#338

Issues

closes #945
closes #895

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

  • New Features

    • Deterministic worker placement across GPUs/nodes for more predictable runs.
    • Consistent master address/port selection to reduce variability at startup.
    • Enhanced single placement-group handling to bind workers to grouped bundles, improving resource affinity.
  • Bug Fixes

    • Reduced flakiness during cluster initialization on multi-GPU/multi-node setups.
    • More reliable worker-to-GPU mapping, minimizing unexpected placement shifts.
  • Performance

    • Smoother, more consistent startup behavior under Ray-based execution.

Signed-off-by: Guyue Huang <guyueh@nvidia.com>
@guyueh1 guyueh1 requested review from a team as code owners October 11, 2025 00:11
@guyueh1 guyueh1 changed the title from "Fix policy worker placement" to "feat: Fix policy worker placement when using unified placement group" Oct 11, 2025
Contributor

coderabbitai bot commented Oct 11, 2025

📝 Walkthrough

Adds GPU-aware bundle ordering to RayVirtualCluster via a new actor and internal sorting, updates master address/port selection to use the sorted bundles, and adjusts LM policy worker-group construction to map workers to bundle indices when a single placement group is used.

Changes

Cohort / File(s) Summary of Changes
Distributed cluster: GPU-aware bundle ordering and PG introspection
nemo_rl/distributed/virtual_cluster.py
- Added GetGPUIDActor (@ray.remote(num_gpus=1)) with get_gpu_id to retrieve a GPU id per bundle
- Imported placement_group_table to inspect bundle-to-node mapping
- Introduced _get_sorted_bundle_indices to deterministically order bundles by node and GPU id (see the sketch after this table)
- Stored self._sorted_bundle_indices during placement-group init
- get_master_address_and_port now prefers address/port from sorted bundle indices when available
LM policy: worker-group bundle assignment
nemo_rl/models/policy/lm_policy.py
- Modified RayWorkerGroup construction: when a single placement group is used, derives bundle_indices_list by grouping every 8 bundles under one group index; otherwise retains the previous workers_per_node behavior
- No public API changes
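
As a rough illustration of the bundle-ordering idea summarized above, here is a minimal sketch under assumptions, not the PR's actual code: the names `_BundleProbe` and `sorted_bundle_indices` are hypothetical stand-ins, only standard Ray APIs are used, and the real implementation also consults placement_group_table (omitted here).

```python
import ray
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy


@ray.remote(num_gpus=1)  # pragma: no cover
class _BundleProbe:  # hypothetical stand-in for the PR's GetGPUIDActor
    """Reports the node IP and GPU id of the bundle it is scheduled into."""

    def probe(self):
        return ray.util.get_node_ip_address(), int(ray.get_gpu_ids()[0])


def sorted_bundle_indices(pg):
    """Order bundle indices so that consecutive indices share a node (then a GPU id)."""
    probes = [
        _BundleProbe.options(
            scheduling_strategy=PlacementGroupSchedulingStrategy(
                placement_group=pg, placement_group_bundle_index=i
            )
        ).remote()
        for i in range(len(pg.bundle_specs))
    ]
    locations = ray.get([p.probe.remote() for p in probes])
    for p in probes:
        ray.kill(p)
    # Sort the original bundle indices by (node_ip, gpu_id): any contiguous slice of
    # num_gpus_per_node sorted indices then maps to GPUs on a single node.
    return sorted(range(len(locations)), key=lambda i: locations[i])
```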

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant User
  participant RayVirtualCluster
  participant Ray as Ray API
  participant PG as PlacementGroups
  participant GPUActor as GetGPUIDActor

  User->>RayVirtualCluster: initialize()
  RayVirtualCluster->>Ray: create_placement_groups()
  Ray-->>RayVirtualCluster: handles to PGs
  RayVirtualCluster->>Ray: placement_group_table(PG)
  Ray-->>RayVirtualCluster: bundle->node mapping
  loop per bundle
    RayVirtualCluster->>GPUActor: create remote (num_gpus=1)
    RayVirtualCluster->>GPUActor: get_gpu_id()
    GPUActor-->>RayVirtualCluster: gpu_id
  end
  RayVirtualCluster->>RayVirtualCluster: _get_sorted_bundle_indices()
  RayVirtualCluster->>RayVirtualCluster: self._sorted_bundle_indices = result
  note over RayVirtualCluster: Later:
  User->>RayVirtualCluster: get_master_address_and_port()
  alt sorted indices available
    RayVirtualCluster->>PG: select first addr/port by sorted bundle order
  else fallback
    RayVirtualCluster->>PG: select first available addr/port (default)
  end
  RayVirtualCluster-->>User: (address, port)
sequenceDiagram
  autonumber
  participant Policy as LMPolicy
  participant RWG as RayWorkerGroup
  participant Cluster as RayVirtualCluster

  Policy->>Cluster: query placement groups
  alt Single placement group
    Policy->>Policy: build bundle_indices_list (tie every 8 bundles per group index)
    Policy->>RWG: construct(bundle_indices_list=..., sharding_annotations=...)
  else Multiple placement groups
    Policy->>RWG: construct(workers_per_node=..., sharding_annotations=...)
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Pre-merge checks and finishing touches

❌ Failed checks (4 warnings)
Check name Status Explanation Resolution
Linked Issues Check ⚠️ Warning The pull request implements deterministic GPU bundle ordering and updates worker grouping logic but does not change the default placement_group_strategy to PACK as #895 requires nor address performance refit objectives from #945, so it does not meet the coding requirements of the linked issues. Update the implementation to include the change of the default placement_group_strategy to PACK and address refit performance improvements for large Mixture of Experts models or adjust the linked issues to match the actual functionality delivered.
Out of Scope Changes Check ⚠️ Warning All introduced changes focus on GPU bundle sorting and policy worker placement adjustments which are unrelated to changing the default placement_group_strategy or improving MoE model performance as specified by the linked issues, indicating they fall outside the stated objectives. Limit the changes to those directly required by the linked issues or separate the GPU bundle ordering enhancements into a different pull request aligned with its own objectives.
Docstring Coverage ⚠️ Warning Docstring coverage is 71.43% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
Test Results For Major Changes ⚠️ Warning The PR introduces significant changes to worker placement logic and cluster bundle ordering, which directly affects runtime behavior and performance characteristics of training jobs, yet the PR description does not report any tests, convergence checks, or before/after performance numbers. Because major behavioral changes require corresponding testing evidence per the check instructions, the documentation is insufficient to verify the impact. Please update the PR description with the relevant test outcomes, including any convergence or performance measurements that demonstrate the new placement logic does not regress existing workloads, and then rerun this check.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title succinctly describes the primary change—correcting policy worker placement under a unified placement group—and uses clear, specific phrasing relevant to the PR’s objective.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b19d669 and 2b1b893.

📒 Files selected for processing (2)
  • nemo_rl/distributed/virtual_cluster.py (4 hunks)
  • nemo_rl/models/policy/lm_policy.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Follow the Google Python Style Guide for all Python code
Target Python 3.12+ for all Python code in NeMo-RL
Indent Python code with 4 spaces; do not use tabs
Python filenames should be snake_case (e.g., some_file.py)
Class names should be PascalCase
Function and method names should be snake_case
Local variable names should be snake_case; if starting with a number, prefix with k (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE and prefixed with G_ (e.g., G_MY_GLOBAL)
Constants should be UPPER_SNAKE_CASE
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
For public interfaces used outside a file, prefer docstrings over comments
Use comments mainly for code within a function or interfaces local to a file
Commented-out code must include a nearby comment explaining usage and why it is commented out; otherwise remove before merging
Use Google-style docstrings for classes and functions (Sphinx-parseable)
Avoid using reflection when functionality can be easily achieved without it
Limit except clauses to the smallest specific set of exceptions possible
For duck-typing via try/except, keep the try body minimal and use else for main logic
Add the NVIDIA copyright header (with current year) at the top of all Python files, excluding tests/ and test-only scripts

Files:

  • nemo_rl/distributed/virtual_cluster.py
  • nemo_rl/models/policy/lm_policy.py
nemo_rl/**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

nemo_rl/**/*.py: Do not set non-None configuration defaults in code; YAML is the single source of truth for defaults
Access required config attributes directly (e.g., policy_cfg["precision"]) and assume presence; do not introduce hidden defaults
Express configuration optionality via TypedDict using typing.NotRequired
When adding a new config key to a TypedDict subclass, document the key’s purpose, valid values/types, and recommended default in code
For any class or function decorated with @ray.remote, add '# pragma: no cover' on the class/def line (and on remote functions)

Files:

  • nemo_rl/distributed/virtual_cluster.py
  • nemo_rl/models/policy/lm_policy.py
🧬 Code graph analysis (1)
nemo_rl/models/policy/lm_policy.py (4)
tests/unit/models/generation/test_vllm_generation.py (1)
  • cluster (221-232)
tests/unit/utils/test_native_checkpoint.py (1)
  • cluster (96-109)
nemo_rl/distributed/virtual_cluster.py (1)
  • get_placement_groups (354-362)
nemo_rl/distributed/worker_groups.py (1)
  • RayWorkerGroup (303-1004)
🪛 Ruff (0.13.3)
nemo_rl/distributed/virtual_cluster.py

419-419: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Post automodel integration comment / Comment on PR
  • GitHub Check: Post submodule check comment / Comment on PR

@guyueh1
Contributor Author

guyueh1 commented Oct 11, 2025

Note that simply setting the placement strategy to "PACK" is not sufficient; see the table below showing the node IP of each worker in different setups:

worker_id | node_ip (No UPG) | node_ip (UPG w/o fix) | node_ip (UPG, strategy = PACK) | node_ip (UPG w/ fix)
0 28.67 15.1 14.17 17.223
1 28.67 18.1 11.143 17.223
2 28.67 21.159 11.143 17.223
3 28.67 11.133 11.143 17.223
4 28.67 14.209 11.143 17.223
5 28.67 11.139 11.143 17.223
6 28.67 20.81 11.143 17.223
7 28.67 12.9 11.143 17.223
8 20.195 12.87 21.79 18.145
9 20.195 28.67 21.79 18.145
10 20.195 31.17 21.79 18.145
11 20.195 4.85 21.79 18.145
12 20.195 11.201 21.79 18.145
13 20.195 13.151 21.79 18.145
14 20.195 21.79 21.79 18.145
15 20.195 18.145 21.79 18.145
16 21.159 20.195 11.143 11.139
17 21.159 11.143 4.137 11.139
18 21.159 4.137 4.137 11.139
19 21.159 30.195 4.137 11.139
20 21.159 26.205 4.137 11.139
21 21.159 15.26 4.137 11.139
22 21.159 16.155 4.137 11.139
23 21.159 12.29 4.137 11.139
24 18.1 14.17 18.145 31.17
25 18.1 26.65 18.145 31.17
26 18.1 14.69 18.145 31.17
27 18.1 31.27 18.145 31.17
28 18.1 20.141 18.145 31.17
29 18.1 6.91 18.145 31.17
30 18.1 17.223 18.145 31.17
31 18.1 11.195 18.145 31.17
32 11.133 15.1 26.205 26.65
33 11.133 18.1 16.155 26.65
34 11.133 21.159 16.155 26.65
35 11.133 11.133 16.155 26.65
36 11.133 14.209 16.155 26.65
37 11.133 11.139 16.155 26.65
38 11.133 20.81 16.155 26.65
39 11.133 12.9 16.155 26.65
40 11.195 12.87 15.26 14.17
41 11.195 28.67 15.26 14.17
42 11.195 31.17 15.26 14.17
43 11.195 4.85 15.26 14.17
44 11.195 11.201 15.26 14.17
45 11.195 13.151 15.26 14.17
46 11.195 21.79 15.26 14.17
47 11.195 18.145 15.26 14.17
48 14.209 20.195 16.155 20.81
49 14.209 11.143 26.205 20.81
50 14.209 4.137 26.205 20.81
51 14.209 30.195 26.205 20.81
52 14.209 26.205 26.205 20.81
53 14.209 15.26 26.205 20.81
54 14.209 16.155 26.205 20.81
55 14.209 12.29 26.205 20.81
56 11.139 14.17 15.1 16.155
57 11.139 26.65 15.1 16.155
58 11.139 14.69 15.1 16.155
59 11.139 31.27 15.1 16.155
60 11.139 20.141 15.1 16.155
61 11.139 6.91 15.1 16.155
62 11.139 17.223 15.1 16.155
63 11.139 11.195 15.1 16.155
64 12.87 15.1 4.137 4.137
65 12.87 18.1 20.141 4.137
66 12.87 21.159 20.141 4.137
67 12.87 11.133 20.141 4.137
68 12.87 14.209 20.141 4.137
69 12.87 11.139 20.141 4.137
70 12.87 20.81 20.141 4.137
71 12.87 12.9 20.141 4.137
72 12.9 12.87 6.91 31.27
73 12.9 28.67 6.91 31.27
74 12.9 31.17 6.91 31.27
75 12.9 4.85 6.91 31.27
76 12.9 11.201 6.91 31.27
77 12.9 13.151 6.91 31.27
78 12.9 21.79 6.91 31.27
79 12.9 18.145 6.91 31.27
80 20.81 20.195 20.141 30.195
81 20.81 11.143 14.17 30.195
82 20.81 4.137 14.17 30.195
83 20.81 30.195 14.17 30.195
84 20.81 26.205 14.17 30.195
85 20.81 15.26 14.17 30.195
86 20.81 16.155 14.17 30.195
87 20.81 12.29 14.17 30.195
88 6.91 14.17 26.65 12.9
89 6.91 26.65 26.65 12.9
90 6.91 14.69 26.65 12.9
91 6.91 31.27 26.65 12.9
92 6.91 20.141 26.65 12.9
93 6.91 6.91 26.65 12.9
94 6.91 17.223 26.65 12.9
95 6.91 11.195 26.65 12.9
96 31.17 15.1 31.27 14.209
97 31.17 18.1 11.195 14.209
98 31.17 21.159 11.195 14.209
99 31.17 11.133 11.195 14.209
100 31.17 14.209 11.195 14.209
101 31.17 11.139 11.195 14.209
102 31.17 20.81 11.195 14.209
103 31.17 12.9 11.195 14.209
104 17.223 12.87 17.223 12.87
105 17.223 28.67 17.223 12.87
106 17.223 31.17 17.223 12.87
107 17.223 4.85 17.223 12.87
108 17.223 11.201 17.223 12.87
109 17.223 13.151 17.223 12.87
110 17.223 21.79 17.223 12.87
111 17.223 18.145 17.223 12.87
112 4.85 20.195 11.195 20.195
113 4.85 11.143 31.27 20.195
114 4.85 4.137 31.27 20.195
115 4.85 30.195 31.27 20.195
116 4.85 26.205 31.27 20.195
117 4.85 15.26 31.27 20.195
118 4.85 16.155 31.27 20.195
119 4.85 12.29 31.27 20.195
120 11.201 14.17 14.69 21.79
121 11.201 26.65 14.69 21.79
122 11.201 14.69 14.69 21.79
123 11.201 31.27 14.69 21.79
124 11.201 20.141 14.69 21.79
125 11.201 6.91 14.69 21.79
126 11.201 17.223 14.69 21.79
127 11.201 11.195 14.69 21.79
128 13.151 15.1 12.29 15.26
129 13.151 18.1 14.209 15.26
130 13.151 21.159 14.209 15.26
131 13.151 11.133 14.209 15.26
132 13.151 14.209 14.209 15.26
133 13.151 11.139 14.209 15.26
134 13.151 20.81 14.209 15.26
135 13.151 12.9 14.209 15.26
136 21.79 12.87 11.139 15.1
137 21.79 28.67 11.139 15.1
138 21.79 31.17 11.139 15.1
139 21.79 4.85 11.139 15.1
140 21.79 11.201 11.139 15.1
141 21.79 13.151 11.139 15.1
142 21.79 21.79 11.139 15.1
143 21.79 18.145 11.139 15.1
144 18.145 20.195 14.209 13.151
145 18.145 11.143 20.195 13.151
146 18.145 4.137 20.195 13.151
147 18.145 30.195 20.195 13.151
148 18.145 26.205 20.195 13.151
149 18.145 15.26 20.195 13.151
150 18.145 16.155 20.195 13.151
151 18.145 12.29 20.195 13.151
152 15.1 14.17 21.159 20.141
153 15.1 26.65 21.159 20.141
154 15.1 14.69 21.159 20.141
155 15.1 31.27 21.159 20.141
156 15.1 20.141 21.159 20.141
157 15.1 6.91 21.159 20.141
158 15.1 17.223 21.159 20.141
159 15.1 11.195 21.159 20.141
160 14.17 26.205 12.9 6.91
161 14.17 12.87 28.67 6.91
162 14.17 12.9 28.67 6.91
163 14.17 20.81 28.67 6.91
164 14.17 14.209 28.67 6.91
165 14.17 11.139 28.67 6.91
166 14.17 20.195 28.67 6.91
167 14.17 21.159 28.67 6.91
168 26.65 11.133 12.87 14.69
169 26.65 18.1 12.87 14.69
170 26.65 11.201 12.87 14.69
171 26.65 4.85 12.87 14.69
172 26.65 13.151 12.87 14.69
173 26.65 31.17 12.87 14.69
174 26.65 12.29 12.87 14.69
175 26.65 30.195 12.87 14.69
176 31.27 28.67 28.67 11.201
177 31.27 15.26 12.9 11.201
178 31.27 16.155 12.9 11.201
179 31.27 15.1 12.9 11.201
180 31.27 11.143 12.9 11.201
181 31.27 4.137 12.9 11.201
182 31.27 21.79 12.9 11.201
183 31.27 18.145 12.9 11.201
184 14.69 11.195 20.81 11.133
185 14.69 17.223 20.81 11.133
186 14.69 14.69 20.81 11.133
187 14.69 6.91 20.81 11.133
188 14.69 20.141 20.81 11.133
189 14.69 31.27 20.81 11.133
190 14.69 14.17 20.81 11.133
191 14.69 26.65 20.81 11.133
192 11.143 15.1 20.195 11.195
193 11.143 18.1 13.151 11.195
194 11.143 21.159 13.151 11.195
195 11.143 11.133 13.151 11.195
196 11.143 14.209 13.151 11.195
197 11.143 11.139 13.151 11.195
198 11.143 20.81 13.151 11.195
199 11.143 12.9 13.151 11.195
200 20.141 12.87 31.17 26.205
201 20.141 28.67 31.17 26.205
202 20.141 31.17 31.17 26.205
203 20.141 4.85 31.17 26.205
204 20.141 11.201 31.17 26.205
205 20.141 13.151 31.17 26.205
206 20.141 21.79 31.17 26.205
207 20.141 18.145 31.17 26.205
208 15.26 20.195 13.151 4.85
209 15.26 11.143 12.29 4.85
210 15.26 4.137 12.29 4.85
211 15.26 30.195 12.29 4.85
212 15.26 26.205 12.29 4.85
213 15.26 15.26 12.29 4.85
214 15.26 16.155 12.29 4.85
215 15.26 12.29 12.29 4.85
216 4.137 14.17 30.195 21.159
217 4.137 26.65 30.195 21.159
218 4.137 14.69 30.195 21.159
219 4.137 31.27 30.195 21.159
220 4.137 20.141 30.195 21.159
221 4.137 6.91 30.195 21.159
222 4.137 17.223 30.195 21.159
223 4.137 11.195 30.195 21.159
224 12.29 15.1 11.201 12.29
225 12.29 18.1 11.133 12.29
226 12.29 21.159 11.133 12.29
227 12.29 11.133 11.133 12.29
228 12.29 14.209 11.133 12.29
229 12.29 11.139 11.133 12.29
230 12.29 20.81 11.133 12.29
231 12.29 12.9 11.133 12.29
232 30.195 12.87 18.1 28.67
233 30.195 28.67 18.1 28.67
234 30.195 31.17 18.1 28.67
235 30.195 4.85 18.1 28.67
236 30.195 11.201 18.1 28.67
237 30.195 13.151 18.1 28.67
238 30.195 21.79 18.1 28.67
239 30.195 18.145 18.1 28.67
240 16.155 20.195 11.133 18.1
241 16.155 11.143 11.201 18.1
242 16.155 4.137 11.201 18.1
243 16.155 30.195 11.201 18.1
244 16.155 26.205 11.201 18.1
245 16.155 15.26 11.201 18.1
246 16.155 16.155 11.201 18.1
247 16.155 12.29 11.201 18.1
248 26.205 14.17 4.85 11.143
249 26.205 26.65 4.85 11.143
250 26.205 14.69 4.85 11.143
251 26.205 31.27 4.85 11.143
252 26.205 20.141 4.85 11.143
253 26.205 6.91 4.85 11.143
254 26.205 17.223 4.85 11.143
255 26.205 11.195 4.85 11.143

@guyueh1
Contributor Author

guyueh1 commented Oct 11, 2025

  • Before optimization: refit time 42~45s
  • After optimization: refit time 36~37s
    A comparison of refit-time CUDA activities from the trainer side (top: with optimization; bottom: without optimization). We can observe that the NCCL time on the trainer side is much shorter; the bottleneck now becomes the IPC time and the vLLM update time.
[profiler screenshot: refit-time CUDA activity comparison]

@ZhiyuLi-Nvidia
Contributor

Thank you @guyueh1! I might need some help to understand the PR.
Is the performance gain mainly from fixing the device mesh assignment to replace slow inter-node communication with fast intra-node communication, which wasn't yet covered by the "PACK" strategy?
And if we follow the new device mesh ordering with your sorting implementation, then the device mesh should be optimal; for example, it will correctly prioritize mapping the Tensor Parallel (TP) dimension to devices within the same node.

@guyueh1
Contributor Author

guyueh1 commented Oct 11, 2025

Is the performance gain mainly from fixing the device mesh assignment to replace slow inter-node communication with fast intra-node communication, which wasn't yet covered by the "PACK" strategy?

Yes. This PR makes sure that every contiguous group of 8 Megatron workers (0-7, 8-15, ...) is on the same node, so in the EP/PP mapping the EP group stays within NVLink and maximizes the EP all-gather bandwidth. Previously we were using the "SPREAD" strategy for the unified placement-group code path (link); according to Ray issue ray-project/ray#51117, there is no way (even with strategy="PACK") to guarantee that bundle indices 0-7, 8-15, ... land on the same node. Thus, my approach is to first allocate the resource bundles with any strategy, then sort the bundle indices by the retrieved node IP and GPU ID. This gives a new list of bundle indices that is not in numerical order, but whose underlying resources are packed; we then allocate the Megatron workers based on this new bundle index list.
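
A toy example of the re-ordering described above, with made-up node IPs and GPU ids:

```python
# bundle index -> (node_ip, gpu_id) probed after allocating bundles with any strategy
locations = {0: ("10.0.0.2", 1), 1: ("10.0.0.1", 0), 2: ("10.0.0.2", 0), 3: ("10.0.0.1", 1)}

sorted_indices = sorted(locations, key=lambda i: locations[i])
print(sorted_indices)  # [1, 3, 2, 0]: indices 1,3 sit on node 10.0.0.1; 2,0 on 10.0.0.2
```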

@guyueh1
Contributor Author

guyueh1 commented Oct 11, 2025

And if we follow the new device mesh ordering with your sorting implementation, then the device mesh should be optimal; for example, it will correctly prioritize mapping the Tensor Parallel (TP) dimension to devices within the same node.

Correct; this PR just ensures that the Megatron worker IDs are node-wise packed, i.e. worker ID n is placed on node_rank = n // gpus_per_node, gpu_id = n % gpus_per_node. That's the same assumption as when we use Megatron outside Ray, for example with torchrun. So as long as this assumption holds, Megatron can perform an optimal mapping of all parallelism dimensions, usually in the order TP-CP-EP-DP-PP.
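
A minimal sketch of that layout and of the per-node grouping; the gpus_per_node = 8 value and the helper names are assumptions for illustration, not code from the PR:

```python
gpus_per_node = 8  # assumed node size, matching the "every 8 bundles" grouping

def worker_placement(worker_id: int) -> tuple[int, int]:
    """Return (node_rank, local_gpu) for a worker, mirroring a torchrun-style layout."""
    return worker_id // gpus_per_node, worker_id % gpus_per_node

def group_bundles(sorted_bundle_indices: list[int]) -> list[list[int]]:
    """Chunk the node/GPU-sorted bundle indices so each chunk lives on one node."""
    return [
        sorted_bundle_indices[i : i + gpus_per_node]
        for i in range(0, len(sorted_bundle_indices), gpus_per_node)
    ]
```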

@guyueh1 guyueh1 added Performance Related to improving performance and removed Performance Related to improving performance labels Oct 11, 2025
ZhiyuLi-Nvidia
ZhiyuLi-Nvidia previously approved these changes Oct 11, 2025
Contributor

@ZhiyuLi-Nvidia ZhiyuLi-Nvidia left a comment


Awesome change with great improvement.
Thank you @guyueh1 for the change and explanation. LGTM!

guyueh1 and others added 2 commits October 11, 2025 15:46
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Signed-off-by: Guyue Huang <140554423+guyueh1@users.noreply.github.com>
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
@guyueh1 guyueh1 changed the title from "feat: Fix policy worker placement when using unified placement group" to "fix: Fix policy worker placement when using unified placement group" Oct 11, 2025
@guyueh1 guyueh1 added the CI:L0 Run doctests and unit tests label Oct 12, 2025
@terrykong
Contributor

Hi @richardliaw. Would love to hear your opinion here. What do you think about this approach from the lens of Ray? Do you think PACK should do this?


⚠️ File Consistency Check

Check based on commit: 7fbfb2a (PR #1341 from fix_policy_worker_placement)

⚠️ DTensor Policy Worker Synchronization Warning

The file nemo_rl/models/policy/dtensor_policy_worker_v2.py was modified in this PR, but nemo_rl/models/policy/dtensor_policy_worker.py was not updated.

Why this matters:
These files contain related DTensor policy worker implementations that should be kept synchronized to ensure consistency across different versions.

Action required:

  • Please review if the changes in nemo_rl/models/policy/dtensor_policy_worker_v2.py should also be applied to nemo_rl/models/policy/dtensor_policy_worker.py
  • Update nemo_rl/models/policy/dtensor_policy_worker.py if necessary to maintain consistency
  • If the files are intentionally different, please add a comment in the PR explaining why

Files to check:

  • Modified: nemo_rl/models/policy/dtensor_policy_worker_v2.py
  • Not modified: nemo_rl/models/policy/dtensor_policy_worker.py

This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.

@guyueh1 guyueh1 self-assigned this Oct 15, 2025
@guyueh1 guyueh1 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L0 Run doctests and unit tests labels Oct 15, 2025
Signed-off-by: Guyue Huang <guyueh@nvidia.com>

ℹ️ File Consistency Check

Check based on commit: 79a4090 (PR #1341 from fix_policy_worker_placement)

✅ DTensor Policy Worker Synchronization Check

Both DTensor policy worker files were modified in this PR:

  • nemo_rl/models/policy/dtensor_policy_worker.py
  • nemo_rl/models/policy/dtensor_policy_worker_v2.py

Please ensure that the changes are consistent between both files where applicable.


This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.

@guyueh1 guyueh1 added CI:L0 Run doctests and unit tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Oct 15, 2025
yuki-97
yuki-97 previously approved these changes Oct 16, 2025
…mock cluster test

Signed-off-by: Guyue Huang <guyueh@nvidia.com>

ℹ️ File Consistency Check

Check based on commit: 7ab2263 (PR #1341 from fix_policy_worker_placement)

✅ DTensor Policy Worker Synchronization Check

Both DTensor policy worker files were modified in this PR:

  • nemo_rl/models/policy/dtensor_policy_worker.py
  • nemo_rl/models/policy/dtensor_policy_worker_v2.py

Please ensure that the changes are consistent between both files where applicable.


This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.

@guyueh1 guyueh1 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L0 Run doctests and unit tests labels Oct 16, 2025
@guyueh1 guyueh1 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Oct 16, 2025

ℹ️ File Consistency Check

Check based on commit: bf8712d (PR #1341 from fix_policy_worker_placement)

✅ DTensor Policy Worker Synchronization Check

Both DTensor policy worker files were modified in this PR:

  • nemo_rl/models/policy/dtensor_policy_worker.py
  • nemo_rl/models/policy/dtensor_policy_worker_v2.py

Please ensure that the changes are consistent between both files where applicable.


This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.


Labels

  • CI:L1 Run doctests, unit tests, and functional tests
  • r0.4.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

  • Refit Performance for Large MoE Models
  • Change default placement_group_strategy to PACK

7 participants