feat: Printing worker id to node.bundle_idx mapping #1366

youngeunkwon0405 · 2025-10-15T18:27:33Z

What does this PR do ?

Issues

List issues that this PR closes (syntax):

Usage

You can potentially add a usage example below

# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

...

Summary by CodeRabbit

New Features
- Added detailed logging that maps workers to their physical nodes and GPUs, providing an organized, tree-like view per node.
- Enhanced worker identification in logs with bundle indices and global ranks for easier troubleshooting and monitoring.
- Automatically infers and displays worker roles for clearer cluster overviews.

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

coderabbitai · 2025-10-15T18:27:43Z

📝 Walkthrough

Walkthrough

Adds bundle_idx to worker metadata during creation and introduces a private method to print a worker-to-node/GPU mapping. The mapping is invoked right after workers are created, deriving node and IP details from Ray placement group tables and node info, and formatting workers by node with global_rank and bundle_idx.

Changes

Cohort / File(s)	Summary of Changes
Worker metadata + diagnostics `nemo_rl/distributed/worker_groups.py`	- Include bundle_idx in worker creation metadata and per-worker info - Invoke a new private method after worker creation to print worker→node/GPU mapping - Add `_print_worker_to_node_gpu_mapping` that resolves placement-group→node/IP, infers role, groups workers by node, and prints ordered tree; exceptions are safely handled

Sequence Diagram(s)

sequenceDiagram
  actor Dev as Caller
  participant WG as RayWorkerGroup
  participant Ray as Ray / Cluster
  participant PG as Placement Group Table
  participant Nodes as Ray Node Info
  participant Out as Console

  Dev->>WG: _create_workers_from_bundle_indices(bundle_indices)
  WG->>Ray: Create workers (actors) with metadata (global_rank, bundle_idx, ...)
  Ray-->>WG: Worker handles + placement group refs
  WG->>WG: _print_worker_to_node_gpu_mapping()
  WG->>PG: Query placement_group_table()
  PG-->>WG: PG index → node IDs
  WG->>Nodes: Fetch node list (IDs → IPs)
  Nodes-->>WG: Node metadata (IPs, resources)
  WG->>Out: Print workers grouped by node with global_rank and bundle_idx

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check	✅ Passed	The title succinctly conveys the primary addition of printing a mapping between worker IDs and their bundle indices on each node. It aligns with the core change of introducing the new printing method without extraneous detail or noise.
Test Results For Major Changes	✅ Passed	The PR introduces a logging feature for displaying worker‐to‐node mappings without altering numeric algorithms or performance‐critical code, and it does not include any regressions in convergence or performance, so no test results are strictly required for this minor change.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

youngeunkwon0405 · 2025-10-15T18:28:49Z

Hi @guyueh1, do you think having this kind of debug print would be useful?

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (2)

nemo_rl/distributed/worker_groups.py (2)
597-656: Consider logging exceptions instead of silently suppressing them.

The method has three try-except-pass blocks (lines 608-609, 618-619, 655-656) that silently suppress all exceptions. While acceptable for a debug logging function, this makes troubleshooting difficult if the mapping fails to print. Consider logging exceptions at the debug level to aid future debugging.

Apply this pattern for better observability:
+import logging
+
+logger = logging.getLogger(__name__)
+
 def _print_worker_to_node_gpu_mapping(self) -> None:
     """Print mapping from workers to physical nodes and GPU indices."""
     try:
         ...
         for i, pg in enumerate(placement_groups):
             try:
                 pg_table = ray.util.placement_group_table(pg)
                 pg_bundle_to_node[i] = pg_table.get("bundles_to_node_id", {})
-            except Exception:
+            except Exception as e:
+                logger.debug(f"Failed to get placement group table for PG {i}: {e}")
                 pg_bundle_to_node[i] = {}
         ...
         try:
             for node in ray.nodes():
                 ...
-        except Exception:
+        except Exception as e:
+            logger.debug(f"Failed to retrieve node information: {e}")
             pass
         ...
-    except Exception:
+    except Exception as e:
+        logger.debug(f"Failed to print worker-to-node GPU mapping: {e}")
         pass
Based on coding guidelines.

633-633: Simplify fallback logic for bundle_idx.

The fallback meta.get("bundle_idx", meta.get("local_rank", 0)) may be unnecessary now that bundle_idx is explicitly added to metadata at lines 557 and 585. Consider simplifying to:
-bundle_idx = int(meta.get("bundle_idx", meta.get("local_rank", 0)))
+bundle_idx = int(meta["bundle_idx"])
The defensive fallback was reasonable before this PR, but with bundle_idx now guaranteed to be present in the metadata, direct access would be clearer.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0a769cc and 01a120b.

📒 Files selected for processing (1)

nemo_rl/distributed/worker_groups.py (2 hunks)

🧰 Additional context used

📓 Path-based instructions (2)

**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Follow the Google Python Style Guide for all Python code
Target Python 3.12+ for all Python code in NeMo-RL
Indent Python code with 4 spaces; do not use tabs
Python filenames should be snake_case (e.g., some_file.py)
Class names should be PascalCase
Function and method names should be snake_case
Local variable names should be snake_case; if starting with a number, prefix with k (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE and prefixed with G_ (e.g., G_MY_GLOBAL)
Constants should be UPPER_SNAKE_CASE
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
For public interfaces used outside a file, prefer docstrings over comments
Use comments mainly for code within a function or interfaces local to a file
Commented-out code must include a nearby comment explaining usage and why it is commented out; otherwise remove before merging
Use Google-style docstrings for classes and functions (Sphinx-parseable)
Avoid using reflection when functionality can be easily achieved without it
Limit except clauses to the smallest specific set of exceptions possible
For duck-typing via try/except, keep the try body minimal and use else for main logic
Add the NVIDIA copyright header (with current year) at the top of all Python files, excluding tests/ and test-only scripts

Files:

nemo_rl/distributed/worker_groups.py

nemo_rl/**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

nemo_rl/**/*.py: Do not set non-None configuration defaults in code; YAML is the single source of truth for defaults
Access required config attributes directly (e.g., policy_cfg["precision"]) and assume presence; do not introduce hidden defaults
Express configuration optionality via TypedDict using typing.NotRequired
When adding a new config key to a TypedDict subclass, document the key’s purpose, valid values/types, and recommended default in code
For any class or function decorated with @ray.remote, add '# pragma: no cover' on the class/def line (and on remote functions)

Files:

nemo_rl/distributed/worker_groups.py

🧬 Code graph analysis (1)

nemo_rl/distributed/worker_groups.py (2)

tests/unit/models/generation/test_vllm_generation.py (1)

cluster (224-235)

nemo_rl/distributed/virtual_cluster.py (1)

get_placement_groups (350-358)

🪛 Ruff (0.14.0)

nemo_rl/distributed/worker_groups.py

608-608: Do not catch blind exception: Exception

(BLE001)

618-619: try-except-pass detected, consider logging the exception

(S110)

618-618: Do not catch blind exception: Exception

(BLE001)

655-656: try-except-pass detected, consider logging the exception

(S110)

655-655: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)

GitHub Check: Lint check
GitHub Check: Post automodel integration comment / Comment on PR
GitHub Check: Post submodule check comment / Comment on PR

🔇 Additional comments (3)

nemo_rl/distributed/worker_groups.py (3)

557-557: LGTM! Clean addition of bundle_idx to metadata.

The bundle_idx field is properly added to both the temporary worker_info during creation and the persistent _worker_metadata. This enables the new mapping functionality.

Also applies to: 585-585

593-593: LGTM! Appropriate placement of the mapping printout.

Invoking the print method after all workers are initialized ensures complete metadata is available for the mapping display.

606-607: Ray placement_group_table usage is correct
ray.util.placement_group_table accepts a PlacementGroup and returns a dict containing "bundles_to_node_id", matching the code’s usage.

guyueh1 · 2025-10-20T16:19:10Z

@youngeunkwon0405 I also tried to improve the observability of worker placement but in my PR it was something like this
#1341 (comment)

Do you think my print covers your need, or you specifically need to log the bundle_idx/worker mapping? It's fine to have them all in my opinion.

youngeunkwon0405 · 2025-10-20T18:15:16Z

eed to log the bundle_idx/worker mapping? It's fine to have them all in my opinion.

Hi @guyueh1, your print looks great! Could you please check your PR works fine also for the Generation workers and non-colocated setups? I think then I can close my PR.

print worker -> bundle idx

01a120b

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

youngeunkwon0405 requested a review from a team as a code owner October 15, 2025 18:27

youngeunkwon0405 marked this pull request as draft October 15, 2025 18:28

youngeunkwon0405 requested a review from guyueh1 October 15, 2025 18:28

coderabbitai bot reviewed Oct 15, 2025

View reviewed changes

youngeunkwon0405 self-assigned this Oct 15, 2025

euronymous-aithal requested review from guyueh1 and removed request for guyueh1 October 18, 2025 18:25

euronymous-aithal added Performance Related to improving performance r0.4.0 and removed Performance Related to improving performance labels Oct 18, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: Printing worker id to node.bundle_idx mapping #1366

feat: Printing worker id to node.bundle_idx mapping #1366

Uh oh!

youngeunkwon0405 commented Oct 15, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Oct 15, 2025 •

edited

Loading

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Uh oh!

youngeunkwon0405 commented Oct 15, 2025

Uh oh!

coderabbitai bot left a comment

Uh oh!

guyueh1 commented Oct 20, 2025

Uh oh!

youngeunkwon0405 commented Oct 20, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

feat: Printing worker id to node.bundle_idx mapping #1366

Are you sure you want to change the base?

feat: Printing worker id to node.bundle_idx mapping #1366

Uh oh!

Conversation

youngeunkwon0405 commented Oct 15, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do ?

Issues

Usage

Before your PR is "Ready for review"

Additional Information

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Oct 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Pre-merge checks and finishing touches

Uh oh!

youngeunkwon0405 commented Oct 15, 2025

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

guyueh1 commented Oct 20, 2025

Uh oh!

youngeunkwon0405 commented Oct 20, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

youngeunkwon0405 commented Oct 15, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Oct 15, 2025 •

edited

Loading