fix: Fix policy worker placement when using unified placement group #1341

guyueh1 · 2025-10-11T00:11:25Z

What does this PR do ?

When using unified placement group, fix the worker placement so that every contiguous group of num_gpus_per_node workers are placed in the same node.

Relevant issues: ray-project/ray#51117
Referred to implementation in NovaSky-AI/SkyRL#338

Issues

closes #945
closes #895

Usage

You can potentially add a usage example below

# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

...

Summary by CodeRabbit

New Features
- Deterministic worker placement across GPUs/nodes for more predictable runs.
- Consistent master address/port selection to reduce variability at startup.
- Enhanced single placement-group handling to bind workers to grouped bundles, improving resource affinity.
Bug Fixes
- Reduced flakiness during cluster initialization on multi-GPU/multi-node setups.
- More reliable worker-to-GPU mapping, minimizing unexpected placement shifts.
Performance
- Smoother, more consistent startup behavior under Ray-based execution.

Signed-off-by: Guyue Huang <guyueh@nvidia.com>

coderabbitai · 2025-10-11T00:16:06Z

📝 Walkthrough

Walkthrough

Adds GPU-aware bundle ordering to RayVirtualCluster via a new actor and internal sorting, updates master address/port selection to use the sorted bundles, and adjusts LM policy worker-group construction to map workers to bundle indices when a single placement group is used.

Changes

Cohort / File(s)	Summary of Changes
Distributed cluster: GPU-aware bundle ordering and PG introspection `nemo_rl/distributed/virtual_cluster.py`	- Added `GetGPUIDActor` (`@ray.remote(num_gpus=1)`) with `get_gpu_id` to retrieve a GPU id per bundle - Imported `placement_group_table` to inspect bundle-to-node mapping - Introduced `_get_sorted_bundle_indices` to deterministically order bundles by node and GPU id - Stored `self._sorted_bundle_indices` during placement-group init - `get_master_address_and_port` now prefers address/port from sorted bundle indices when available
LM policy: worker-group bundle assignment `nemo_rl/models/policy/lm_policy.py`	- Modified RayWorkerGroup construction: if a single placement group, derives `bundle_indices_list` by grouping every 8 bundles under one group index; otherwise retains previous `workers_per_node` behavior - No public API changes

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant User
  participant RayVirtualCluster
  participant Ray as Ray API
  participant PG as PlacementGroups
  participant GPUActor as GetGPUIDActor

  User->>RayVirtualCluster: initialize()
  RayVirtualCluster->>Ray: create_placement_groups()
  Ray-->>RayVirtualCluster: handles to PGs
  RayVirtualCluster->>Ray: placement_group_table(PG)
  Ray-->>RayVirtualCluster: bundle->node mapping
  loop per bundle
    RayVirtualCluster->>GPUActor: create remote (num_gpus=1)
    RayVirtualCluster->>GPUActor: get_gpu_id()
    GPUActor-->>RayVirtualCluster: gpu_id
  end
  RayVirtualCluster->>RayVirtualCluster: _get_sorted_bundle_indices()
  RayVirtualCluster->>RayVirtualCluster: self._sorted_bundle_indices = result
  note over RayVirtualCluster: Later:
  User->>RayVirtualCluster: get_master_address_and_port()
  alt sorted indices available
    RayVirtualCluster->>PG: select first addr/port by sorted bundle order
  else fallback
    RayVirtualCluster->>PG: select first available addr/port (default)
  end
  RayVirtualCluster-->>User: (address, port)

sequenceDiagram
  autonumber
  participant Policy as LMPolicy
  participant RWG as RayWorkerGroup
  participant Cluster as RayVirtualCluster

  Policy->>Cluster: query placement groups
  alt Single placement group
    Policy->>Policy: build bundle_indices_list (tie every 8 bundles per group index)
    Policy->>RWG: construct(bundle_indices_list=..., sharding_annotations=...)
  else Multiple placement groups
    Policy->>RWG: construct(workers_per_node=..., sharding_annotations=...)
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Pre-merge checks and finishing touches

❌ Failed checks (4 warnings)

Check name	Status	Explanation	Resolution
Linked Issues Check	⚠️ Warning	The pull request implements deterministic GPU bundle ordering and updates worker grouping logic but does not change the default placement_group_strategy to PACK as #895 requires nor address performance refit objectives from #945, so it does not meet the coding requirements of the linked issues.	Update the implementation to include the change of the default placement_group_strategy to PACK and address refit performance improvements for large Mixture of Experts models or adjust the linked issues to match the actual functionality delivered.
Out of Scope Changes Check	⚠️ Warning	All introduced changes focus on GPU bundle sorting and policy worker placement adjustments which are unrelated to changing the default placement_group_strategy or improving MoE model performance as specified by the linked issues, indicating they fall outside the stated objectives.	Limit the changes to those directly required by the linked issues or separate the GPU bundle ordering enhancements into a different pull request aligned with its own objectives.
Docstring Coverage	⚠️ Warning	Docstring coverage is 71.43% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.
Test Results For Major Changes	⚠️ Warning	The PR introduces significant changes to worker placement logic and cluster bundle ordering, which directly affects runtime behavior and performance characteristics of training jobs, yet the PR description does not report any tests, convergence checks, or before/after performance numbers. Because major behavioral changes require corresponding testing evidence per the check instructions, the documentation is insufficient to verify the impact.	Please update the PR description with the relevant test outcomes, including any convergence or performance measurements that demonstrate the new placement logic does not regress existing workloads, and then rerun this check.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check	✅ Passed	The title succinctly describes the primary change—correcting policy worker placement under a unified placement group—and uses clear, specific phrasing relevant to the PR’s objective.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 4

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b19d669 and 2b1b893.

📒 Files selected for processing (2)

nemo_rl/distributed/virtual_cluster.py (4 hunks)
nemo_rl/models/policy/lm_policy.py (1 hunks)

🧰 Additional context used

📓 Path-based instructions (2)

**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Follow the Google Python Style Guide for all Python code
Target Python 3.12+ for all Python code in NeMo-RL
Indent Python code with 4 spaces; do not use tabs
Python filenames should be snake_case (e.g., some_file.py)
Class names should be PascalCase
Function and method names should be snake_case
Local variable names should be snake_case; if starting with a number, prefix with k (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE and prefixed with G_ (e.g., G_MY_GLOBAL)
Constants should be UPPER_SNAKE_CASE
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
For public interfaces used outside a file, prefer docstrings over comments
Use comments mainly for code within a function or interfaces local to a file
Commented-out code must include a nearby comment explaining usage and why it is commented out; otherwise remove before merging
Use Google-style docstrings for classes and functions (Sphinx-parseable)
Avoid using reflection when functionality can be easily achieved without it
Limit except clauses to the smallest specific set of exceptions possible
For duck-typing via try/except, keep the try body minimal and use else for main logic
Add the NVIDIA copyright header (with current year) at the top of all Python files, excluding tests/ and test-only scripts

Files:

nemo_rl/distributed/virtual_cluster.py
nemo_rl/models/policy/lm_policy.py

nemo_rl/**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

nemo_rl/**/*.py: Do not set non-None configuration defaults in code; YAML is the single source of truth for defaults
Access required config attributes directly (e.g., policy_cfg["precision"]) and assume presence; do not introduce hidden defaults
Express configuration optionality via TypedDict using typing.NotRequired
When adding a new config key to a TypedDict subclass, document the key’s purpose, valid values/types, and recommended default in code
For any class or function decorated with @ray.remote, add '# pragma: no cover' on the class/def line (and on remote functions)

Files:

nemo_rl/distributed/virtual_cluster.py
nemo_rl/models/policy/lm_policy.py

🧬 Code graph analysis (1)

nemo_rl/models/policy/lm_policy.py (4)

tests/unit/models/generation/test_vllm_generation.py (1)

cluster (221-232)

tests/unit/utils/test_native_checkpoint.py (1)

cluster (96-109)

nemo_rl/distributed/virtual_cluster.py (1)

get_placement_groups (354-362)

nemo_rl/distributed/worker_groups.py (1)

RayWorkerGroup (303-1004)

🪛 Ruff (0.13.3)

nemo_rl/distributed/virtual_cluster.py

419-419: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: Post automodel integration comment / Comment on PR
GitHub Check: Post submodule check comment / Comment on PR

nemo_rl/distributed/virtual_cluster.py

nemo_rl/models/policy/lm_policy.py

guyueh1 · 2025-10-11T00:16:48Z

Note that simply setting placement strategy to "PACK" does not apply, refer to this table showing node IP of workers in different setups:

	No UPG	UPG w/o fix	UPG strategy = PACK	UPG w/ fix
worker_id	node_ip	node_ip	node_ip	node_ip
0	28.67	15.1	14.17	17.223
1	28.67	18.1	11.143	17.223
2	28.67	21.159	11.143	17.223
3	28.67	11.133	11.143	17.223
4	28.67	14.209	11.143	17.223
5	28.67	11.139	11.143	17.223
6	28.67	20.81	11.143	17.223
7	28.67	12.9	11.143	17.223
8	20.195	12.87	21.79	18.145
9	20.195	28.67	21.79	18.145
10	20.195	31.17	21.79	18.145
11	20.195	4.85	21.79	18.145
12	20.195	11.201	21.79	18.145
13	20.195	13.151	21.79	18.145
14	20.195	21.79	21.79	18.145
15	20.195	18.145	21.79	18.145
16	21.159	20.195	11.143	11.139
17	21.159	11.143	4.137	11.139
18	21.159	4.137	4.137	11.139
19	21.159	30.195	4.137	11.139
20	21.159	26.205	4.137	11.139
21	21.159	15.26	4.137	11.139
22	21.159	16.155	4.137	11.139
23	21.159	12.29	4.137	11.139
24	18.1	14.17	18.145	31.17
25	18.1	26.65	18.145	31.17
26	18.1	14.69	18.145	31.17
27	18.1	31.27	18.145	31.17
28	18.1	20.141	18.145	31.17
29	18.1	6.91	18.145	31.17
30	18.1	17.223	18.145	31.17
31	18.1	11.195	18.145	31.17
32	11.133	15.1	26.205	26.65
33	11.133	18.1	16.155	26.65
34	11.133	21.159	16.155	26.65
35	11.133	11.133	16.155	26.65
36	11.133	14.209	16.155	26.65
37	11.133	11.139	16.155	26.65
38	11.133	20.81	16.155	26.65
39	11.133	12.9	16.155	26.65
40	11.195	12.87	15.26	14.17
41	11.195	28.67	15.26	14.17
42	11.195	31.17	15.26	14.17
43	11.195	4.85	15.26	14.17
44	11.195	11.201	15.26	14.17
45	11.195	13.151	15.26	14.17
46	11.195	21.79	15.26	14.17
47	11.195	18.145	15.26	14.17
48	14.209	20.195	16.155	20.81
49	14.209	11.143	26.205	20.81
50	14.209	4.137	26.205	20.81
51	14.209	30.195	26.205	20.81
52	14.209	26.205	26.205	20.81
53	14.209	15.26	26.205	20.81
54	14.209	16.155	26.205	20.81
55	14.209	12.29	26.205	20.81
56	11.139	14.17	15.1	16.155
57	11.139	26.65	15.1	16.155
58	11.139	14.69	15.1	16.155
59	11.139	31.27	15.1	16.155
60	11.139	20.141	15.1	16.155
61	11.139	6.91	15.1	16.155
62	11.139	17.223	15.1	16.155
63	11.139	11.195	15.1	16.155
64	12.87	15.1	4.137	4.137
65	12.87	18.1	20.141	4.137
66	12.87	21.159	20.141	4.137
67	12.87	11.133	20.141	4.137
68	12.87	14.209	20.141	4.137
69	12.87	11.139	20.141	4.137
70	12.87	20.81	20.141	4.137
71	12.87	12.9	20.141	4.137
72	12.9	12.87	6.91	31.27
73	12.9	28.67	6.91	31.27
74	12.9	31.17	6.91	31.27
75	12.9	4.85	6.91	31.27
76	12.9	11.201	6.91	31.27
77	12.9	13.151	6.91	31.27
78	12.9	21.79	6.91	31.27
79	12.9	18.145	6.91	31.27
80	20.81	20.195	20.141	30.195
81	20.81	11.143	14.17	30.195
82	20.81	4.137	14.17	30.195
83	20.81	30.195	14.17	30.195
84	20.81	26.205	14.17	30.195
85	20.81	15.26	14.17	30.195
86	20.81	16.155	14.17	30.195
87	20.81	12.29	14.17	30.195
88	6.91	14.17	26.65	12.9
89	6.91	26.65	26.65	12.9
90	6.91	14.69	26.65	12.9
91	6.91	31.27	26.65	12.9
92	6.91	20.141	26.65	12.9
93	6.91	6.91	26.65	12.9
94	6.91	17.223	26.65	12.9
95	6.91	11.195	26.65	12.9
96	31.17	15.1	31.27	14.209
97	31.17	18.1	11.195	14.209
98	31.17	21.159	11.195	14.209
99	31.17	11.133	11.195	14.209
100	31.17	14.209	11.195	14.209
101	31.17	11.139	11.195	14.209
102	31.17	20.81	11.195	14.209
103	31.17	12.9	11.195	14.209
104	17.223	12.87	17.223	12.87
105	17.223	28.67	17.223	12.87
106	17.223	31.17	17.223	12.87
107	17.223	4.85	17.223	12.87
108	17.223	11.201	17.223	12.87
109	17.223	13.151	17.223	12.87
110	17.223	21.79	17.223	12.87
111	17.223	18.145	17.223	12.87
112	4.85	20.195	11.195	20.195
113	4.85	11.143	31.27	20.195
114	4.85	4.137	31.27	20.195
115	4.85	30.195	31.27	20.195
116	4.85	26.205	31.27	20.195
117	4.85	15.26	31.27	20.195
118	4.85	16.155	31.27	20.195
119	4.85	12.29	31.27	20.195
120	11.201	14.17	14.69	21.79
121	11.201	26.65	14.69	21.79
122	11.201	14.69	14.69	21.79
123	11.201	31.27	14.69	21.79
124	11.201	20.141	14.69	21.79
125	11.201	6.91	14.69	21.79
126	11.201	17.223	14.69	21.79
127	11.201	11.195	14.69	21.79
128	13.151	15.1	12.29	15.26
129	13.151	18.1	14.209	15.26
130	13.151	21.159	14.209	15.26
131	13.151	11.133	14.209	15.26
132	13.151	14.209	14.209	15.26
133	13.151	11.139	14.209	15.26
134	13.151	20.81	14.209	15.26
135	13.151	12.9	14.209	15.26
136	21.79	12.87	11.139	15.1
137	21.79	28.67	11.139	15.1
138	21.79	31.17	11.139	15.1
139	21.79	4.85	11.139	15.1
140	21.79	11.201	11.139	15.1
141	21.79	13.151	11.139	15.1
142	21.79	21.79	11.139	15.1
143	21.79	18.145	11.139	15.1
144	18.145	20.195	14.209	13.151
145	18.145	11.143	20.195	13.151
146	18.145	4.137	20.195	13.151
147	18.145	30.195	20.195	13.151
148	18.145	26.205	20.195	13.151
149	18.145	15.26	20.195	13.151
150	18.145	16.155	20.195	13.151
151	18.145	12.29	20.195	13.151
152	15.1	14.17	21.159	20.141
153	15.1	26.65	21.159	20.141
154	15.1	14.69	21.159	20.141
155	15.1	31.27	21.159	20.141
156	15.1	20.141	21.159	20.141
157	15.1	6.91	21.159	20.141
158	15.1	17.223	21.159	20.141
159	15.1	11.195	21.159	20.141
160	14.17	26.205	12.9	6.91
161	14.17	12.87	28.67	6.91
162	14.17	12.9	28.67	6.91
163	14.17	20.81	28.67	6.91
164	14.17	14.209	28.67	6.91
165	14.17	11.139	28.67	6.91
166	14.17	20.195	28.67	6.91
167	14.17	21.159	28.67	6.91
168	26.65	11.133	12.87	14.69
169	26.65	18.1	12.87	14.69
170	26.65	11.201	12.87	14.69
171	26.65	4.85	12.87	14.69
172	26.65	13.151	12.87	14.69
173	26.65	31.17	12.87	14.69
174	26.65	12.29	12.87	14.69
175	26.65	30.195	12.87	14.69
176	31.27	28.67	28.67	11.201
177	31.27	15.26	12.9	11.201
178	31.27	16.155	12.9	11.201
179	31.27	15.1	12.9	11.201
180	31.27	11.143	12.9	11.201
181	31.27	4.137	12.9	11.201
182	31.27	21.79	12.9	11.201
183	31.27	18.145	12.9	11.201
184	14.69	11.195	20.81	11.133
185	14.69	17.223	20.81	11.133
186	14.69	14.69	20.81	11.133
187	14.69	6.91	20.81	11.133
188	14.69	20.141	20.81	11.133
189	14.69	31.27	20.81	11.133
190	14.69	14.17	20.81	11.133
191	14.69	26.65	20.81	11.133
192	11.143	15.1	20.195	11.195
193	11.143	18.1	13.151	11.195
194	11.143	21.159	13.151	11.195
195	11.143	11.133	13.151	11.195
196	11.143	14.209	13.151	11.195
197	11.143	11.139	13.151	11.195
198	11.143	20.81	13.151	11.195
199	11.143	12.9	13.151	11.195
200	20.141	12.87	31.17	26.205
201	20.141	28.67	31.17	26.205
202	20.141	31.17	31.17	26.205
203	20.141	4.85	31.17	26.205
204	20.141	11.201	31.17	26.205
205	20.141	13.151	31.17	26.205
206	20.141	21.79	31.17	26.205
207	20.141	18.145	31.17	26.205
208	15.26	20.195	13.151	4.85
209	15.26	11.143	12.29	4.85
210	15.26	4.137	12.29	4.85
211	15.26	30.195	12.29	4.85
212	15.26	26.205	12.29	4.85
213	15.26	15.26	12.29	4.85
214	15.26	16.155	12.29	4.85
215	15.26	12.29	12.29	4.85
216	4.137	14.17	30.195	21.159
217	4.137	26.65	30.195	21.159
218	4.137	14.69	30.195	21.159
219	4.137	31.27	30.195	21.159
220	4.137	20.141	30.195	21.159
221	4.137	6.91	30.195	21.159
222	4.137	17.223	30.195	21.159
223	4.137	11.195	30.195	21.159
224	12.29	15.1	11.201	12.29
225	12.29	18.1	11.133	12.29
226	12.29	21.159	11.133	12.29
227	12.29	11.133	11.133	12.29
228	12.29	14.209	11.133	12.29
229	12.29	11.139	11.133	12.29
230	12.29	20.81	11.133	12.29
231	12.29	12.9	11.133	12.29
232	30.195	12.87	18.1	28.67
233	30.195	28.67	18.1	28.67
234	30.195	31.17	18.1	28.67
235	30.195	4.85	18.1	28.67
236	30.195	11.201	18.1	28.67
237	30.195	13.151	18.1	28.67
238	30.195	21.79	18.1	28.67
239	30.195	18.145	18.1	28.67
240	16.155	20.195	11.133	18.1
241	16.155	11.143	11.201	18.1
242	16.155	4.137	11.201	18.1
243	16.155	30.195	11.201	18.1
244	16.155	26.205	11.201	18.1
245	16.155	15.26	11.201	18.1
246	16.155	16.155	11.201	18.1
247	16.155	12.29	11.201	18.1
248	26.205	14.17	4.85	11.143
249	26.205	26.65	4.85	11.143
250	26.205	14.69	4.85	11.143
251	26.205	31.27	4.85	11.143
252	26.205	20.141	4.85	11.143
253	26.205	6.91	4.85	11.143
254	26.205	17.223	4.85	11.143
255	26.205	11.195	4.85	11.143

guyueh1 · 2025-10-11T00:35:56Z

Before optimization: refit time 42~45s
After optimization: refit time 36~37s
A comparison of refit time CUDA activities (from trainer side), up: with optimization; bottom: no optimization. We can observe that the NCCL time at trainer side is much shorter. Now the bottleneck becomes the IPC time and vllm update time

ZhiyuLi-Nvidia · 2025-10-11T04:15:28Z

Thank you @guyueh1! Might need some help to under the PR.
Is the performance gain mainly from fixing the device mesh assignment to replace slow inter-node communication with fast intra-node communication which wasn't yet covered by strategy="PACK" strategy?
And if we follow the new device mesh ordering with your sorting implementation, then the device mesh should be optimal, for example it will correctly prioritize mapping the Tensor Parallel (TP) dimension to devices within the same node.

guyueh1 · 2025-10-11T22:34:03Z

Is the performance gain mainly from fixing the device mesh assignment to replace slow inter-node communication with fast intra-node communication which wasn't yet covered by strategy="PACK" strategy?

yes, this PR will make sure that every contiguous 8 megatron workers (0-7, 8-15, ...) will be in the same node, so in EP/PP mapping, this will make sure EP group contains NVLink and maximize the EP all-gather bandwidth. Previously we were using "SPREAD" strategy for unified placement-group code path (link), according to Ray issue ray-project/ray#51117, there is no way (even if we use strategy="PACK") to guarantee that bundle indices 0~~7, 8~~15, ... will be in the same node. Thus, my way is to first use any strategy to allocate resource bundles, then sort the bundle indices based on the retrieved node-IP and gpu ID, so we get a new list of bundle indices that is not in order, but their underlying resources are packed; then we allocate megatron workers based on the new bundle indices list.

guyueh1 · 2025-10-11T22:38:06Z

And if we follow the new device mesh ordering with your sorting implementation, then the device mesh should be optimal, for example it will correctly prioritize mapping the Tensor Parallel (TP) dimension to devices within the same node.

Correct; this PR just ensures that megatron worker IDs is node-wise packed, i.e. worker ID n is placed on node_rank=n//gpus_per_node, gpu_id=n%gpus_per_node. That's the same assumption when we use megatron outside Ray, for example with torchrun. So if we make sure this assumption is correct, then megatron can perform optimal mapping of all parallelism, usually in order of TP-CP-EP-DP-PP.

ZhiyuLi-Nvidia

Awesome change with great improvement
Thank you @guyueh1 for the change and explanation. LGTM!

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Signed-off-by: Guyue Huang <140554423+guyueh1@users.noreply.github.com>

Signed-off-by: Guyue Huang <guyueh@nvidia.com>

terrykong · 2025-10-12T05:24:18Z

Hi @richardliaw. Would love to hear your opinion here. What do you think about this approach from the lens of ray? Do you think PACK should do this?

github-actions · 2025-10-15T16:23:56Z

⚠️ File Consistency Check

Check based on commit: 7fbfb2a (PR #1341 from fix_policy_worker_placement)

⚠️ DTensor Policy Worker Synchronization Warning

The file nemo_rl/models/policy/dtensor_policy_worker_v2.py was modified in this PR, but nemo_rl/models/policy/dtensor_policy_worker.py was not updated.

Why this matters:
These files contain related DTensor policy worker implementations that should be kept synchronized to ensure consistency across different versions.

Action required:

Please review if the changes in nemo_rl/models/policy/dtensor_policy_worker_v2.py should also be applied to nemo_rl/models/policy/dtensor_policy_worker.py
Update nemo_rl/models/policy/dtensor_policy_worker.py if necessary to maintain consistency
If the files are intentionally different, please add a comment in the PR explaining why

Files to check:

Modified: nemo_rl/models/policy/dtensor_policy_worker_v2.py
Not modified: nemo_rl/models/policy/dtensor_policy_worker.py

_{This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.}

Signed-off-by: Guyue Huang <guyueh@nvidia.com>

github-actions · 2025-10-15T21:59:48Z

ℹ️ File Consistency Check

Check based on commit: 79a4090 (PR #1341 from fix_policy_worker_placement)

✅ DTensor Policy Worker Synchronization Check

Both DTensor policy worker files were modified in this PR:

nemo_rl/models/policy/dtensor_policy_worker.py
nemo_rl/models/policy/dtensor_policy_worker_v2.py

Please ensure that the changes are consistent between both files where applicable.

_{This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.}

…mock cluster test Signed-off-by: Guyue Huang <guyueh@nvidia.com>

github-actions · 2025-10-16T18:41:23Z

ℹ️ File Consistency Check

Check based on commit: 7ab2263 (PR #1341 from fix_policy_worker_placement)

✅ DTensor Policy Worker Synchronization Check

Both DTensor policy worker files were modified in this PR:

nemo_rl/models/policy/dtensor_policy_worker.py
nemo_rl/models/policy/dtensor_policy_worker_v2.py

Please ensure that the changes are consistent between both files where applicable.

_{This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.}

github-actions · 2025-10-16T21:47:45Z

ℹ️ File Consistency Check

Check based on commit: bf8712d (PR #1341 from fix_policy_worker_placement)

✅ DTensor Policy Worker Synchronization Check

Both DTensor policy worker files were modified in this PR:

nemo_rl/models/policy/dtensor_policy_worker.py
nemo_rl/models/policy/dtensor_policy_worker_v2.py

Please ensure that the changes are consistent between both files where applicable.

_{This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.}

guyueh1 added 3 commits October 9, 2025 11:02

save

6be8f59

Signed-off-by: Guyue Huang <guyueh@nvidia.com>

attempt to fix

11628f6

Signed-off-by: Guyue Huang <guyueh@nvidia.com>

fix

2b1b893

Signed-off-by: Guyue Huang <guyueh@nvidia.com>

guyueh1 requested review from a team as code owners October 11, 2025 00:11

guyueh1 changed the title ~~Fix policy worker placement~~ feat: Fix policy worker placement when using unified placement group Oct 11, 2025

coderabbitai bot reviewed Oct 11, 2025

View reviewed changes

nemo_rl/distributed/virtual_cluster.py Show resolved Hide resolved

nemo_rl/distributed/virtual_cluster.py Show resolved Hide resolved

nemo_rl/distributed/virtual_cluster.py Show resolved Hide resolved

nemo_rl/models/policy/lm_policy.py Outdated Show resolved Hide resolved

guyueh1 requested review from ZhiyuLi-Nvidia and yuki-97 October 11, 2025 00:36

guyueh1 mentioned this pull request Oct 11, 2025

Refit Performance for Large MoE Models #945

Open

Merge branch 'main' into fix_policy_worker_placement

6a0b9c7

guyueh1 added Performance Related to improving performance and removed Performance Related to improving performance labels Oct 11, 2025

ZhiyuLi-Nvidia previously approved these changes Oct 11, 2025

View reviewed changes

guyueh1 dismissed ZhiyuLi-Nvidia’s stale review via f6c4046 October 11, 2025 22:46

guyueh1 and others added 2 commits October 11, 2025 15:46

Apply suggestions from code review

f6c4046

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Signed-off-by: Guyue Huang <140554423+guyueh1@users.noreply.github.com>

Address comments

e036943

Signed-off-by: Guyue Huang <guyueh@nvidia.com>

guyueh1 changed the title ~~feat: Fix policy worker placement when using unified placement group~~ fix: Fix policy worker placement when using unified placement group Oct 11, 2025

guyueh1 added the CI:L0 Run doctests and unit tests label Oct 12, 2025

guyueh1 temporarily deployed to nemo-ci October 12, 2025 02:23 — with GitHub Actions Inactive

euronymous-aithal added the r0.4.0 label Oct 12, 2025

guyueh1 self-assigned this Oct 15, 2025

guyueh1 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L0 Run doctests and unit tests labels Oct 15, 2025

guyueh1 temporarily deployed to nemo-ci October 15, 2025 16:24 — with GitHub Actions Inactive

guyueh1 temporarily deployed to nemo-ci October 15, 2025 16:39 — with GitHub Actions Inactive

Fix UT

79a4090

Signed-off-by: Guyue Huang <guyueh@nvidia.com>

guyueh1 added CI:L0 Run doctests and unit tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Oct 15, 2025

guyueh1 temporarily deployed to nemo-ci October 15, 2025 22:00 — with GitHub Actions Inactive

guyueh1 temporarily deployed to nemo-ci October 15, 2025 23:15 — with GitHub Actions Inactive

yuki-97 previously approved these changes Oct 16, 2025

View reviewed changes

Move print_node_ip_and_gpu_id outside init function of policy to fix …

7ab2263

…mock cluster test Signed-off-by: Guyue Huang <guyueh@nvidia.com>

guyueh1 dismissed yuki-97’s stale review via 7ab2263 October 16, 2025 18:40

guyueh1 requested review from a team as code owners October 16, 2025 18:40

guyueh1 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L0 Run doctests and unit tests labels Oct 16, 2025

guyueh1 temporarily deployed to nemo-ci October 16, 2025 19:54 — with GitHub Actions Inactive

guyueh1 temporarily deployed to nemo-ci October 16, 2025 19:56 — with GitHub Actions Inactive

Merge branch 'main' into fix_policy_worker_placement

bf8712d

guyueh1 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Oct 16, 2025

guyueh1 temporarily deployed to nemo-ci October 16, 2025 21:48 — with GitHub Actions Inactive

guyueh1 temporarily deployed to nemo-ci October 16, 2025 21:58 — with GitHub Actions Inactive

guyueh1 temporarily deployed to nemo-ci October 17, 2025 01:41 — with GitHub Actions Inactive

guyueh1 mentioned this pull request Oct 20, 2025

feat: Printing worker id to node.bundle_idx mapping #1366

Draft

4 tasks

fix: Fix policy worker placement when using unified placement group #1341

Are you sure you want to change the base?

fix: Fix policy worker placement when using unified placement group #1341

Uh oh!

Conversation

guyueh1 commented Oct 11, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do ?

Issues

Usage

Before your PR is "Ready for review"

Additional Information

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Oct 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Pre-merge checks and finishing touches

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

guyueh1 commented Oct 11, 2025

Uh oh!

guyueh1 commented Oct 11, 2025

Uh oh!

ZhiyuLi-Nvidia commented Oct 11, 2025

Uh oh!

guyueh1 commented Oct 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

guyueh1 commented Oct 11, 2025

Uh oh!

ZhiyuLi-Nvidia left a comment

Choose a reason for hiding this comment

Uh oh!

terrykong commented Oct 12, 2025

Uh oh!

github-actions bot commented Oct 15, 2025

⚠️ File Consistency Check

⚠️ DTensor Policy Worker Synchronization Warning

Uh oh!

github-actions bot commented Oct 15, 2025

ℹ️ File Consistency Check

✅ DTensor Policy Worker Synchronization Check

Uh oh!

github-actions bot commented Oct 16, 2025

ℹ️ File Consistency Check

✅ DTensor Policy Worker Synchronization Check

Uh oh!

github-actions bot commented Oct 16, 2025

ℹ️ File Consistency Check

✅ DTensor Policy Worker Synchronization Check

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

guyueh1 commented Oct 11, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Oct 11, 2025 •

edited

Loading

guyueh1 commented Oct 11, 2025 •

edited

Loading