Skip to content

Conversation

youngeunkwon0405
Copy link
Contributor

@youngeunkwon0405 youngeunkwon0405 commented Oct 15, 2025

What does this PR do ?

image

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

  • New Features
    • Added detailed logging that maps workers to their physical nodes and GPUs, providing an organized, tree-like view per node.
    • Enhanced worker identification in logs with bundle indices and global ranks for easier troubleshooting and monitoring.
    • Automatically infers and displays worker roles for clearer cluster overviews.

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
@youngeunkwon0405 youngeunkwon0405 requested a review from a team as a code owner October 15, 2025 18:27
Copy link
Contributor

coderabbitai bot commented Oct 15, 2025

📝 Walkthrough

Walkthrough

Adds bundle_idx to worker metadata during creation and introduces a private method to print a worker-to-node/GPU mapping. The mapping is invoked right after workers are created, deriving node and IP details from Ray placement group tables and node info, and formatting workers by node with global_rank and bundle_idx.

Changes

Cohort / File(s) Summary of Changes
Worker metadata + diagnostics
nemo_rl/distributed/worker_groups.py
- Include bundle_idx in worker creation metadata and per-worker info
- Invoke a new private method after worker creation to print worker→node/GPU mapping
- Add _print_worker_to_node_gpu_mapping that resolves placement-group→node/IP, infers role, groups workers by node, and prints ordered tree; exceptions are safely handled

Sequence Diagram(s)

sequenceDiagram
  actor Dev as Caller
  participant WG as RayWorkerGroup
  participant Ray as Ray / Cluster
  participant PG as Placement Group Table
  participant Nodes as Ray Node Info
  participant Out as Console

  Dev->>WG: _create_workers_from_bundle_indices(bundle_indices)
  WG->>Ray: Create workers (actors) with metadata (global_rank, bundle_idx, ...)
  Ray-->>WG: Worker handles + placement group refs
  WG->>WG: _print_worker_to_node_gpu_mapping()
  WG->>PG: Query placement_group_table()
  PG-->>WG: PG index → node IDs
  WG->>Nodes: Fetch node list (IDs → IPs)
  Nodes-->>WG: Node metadata (IPs, resources)
  WG->>Out: Print workers grouped by node with global_rank and bundle_idx
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title succinctly conveys the primary addition of printing a mapping between worker IDs and their bundle indices on each node. It aligns with the core change of introducing the new printing method without extraneous detail or noise.
Test Results For Major Changes ✅ Passed The PR introduces a logging feature for displaying worker‐to‐node mappings without altering numeric algorithms or performance‐critical code, and it does not include any regressions in convergence or performance, so no test results are strictly required for this minor change.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@youngeunkwon0405 youngeunkwon0405 marked this pull request as draft October 15, 2025 18:28
@youngeunkwon0405
Copy link
Contributor Author

Hi @guyueh1, do you think having this kind of debug print would be useful?

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
nemo_rl/distributed/worker_groups.py (2)

597-656: Consider logging exceptions instead of silently suppressing them.

The method has three try-except-pass blocks (lines 608-609, 618-619, 655-656) that silently suppress all exceptions. While acceptable for a debug logging function, this makes troubleshooting difficult if the mapping fails to print. Consider logging exceptions at the debug level to aid future debugging.

Apply this pattern for better observability:

+import logging
+
+logger = logging.getLogger(__name__)
+
 def _print_worker_to_node_gpu_mapping(self) -> None:
     """Print mapping from workers to physical nodes and GPU indices."""
     try:
         ...
         for i, pg in enumerate(placement_groups):
             try:
                 pg_table = ray.util.placement_group_table(pg)
                 pg_bundle_to_node[i] = pg_table.get("bundles_to_node_id", {})
-            except Exception:
+            except Exception as e:
+                logger.debug(f"Failed to get placement group table for PG {i}: {e}")
                 pg_bundle_to_node[i] = {}
         ...
         try:
             for node in ray.nodes():
                 ...
-        except Exception:
+        except Exception as e:
+            logger.debug(f"Failed to retrieve node information: {e}")
             pass
         ...
-    except Exception:
+    except Exception as e:
+        logger.debug(f"Failed to print worker-to-node GPU mapping: {e}")
         pass

Based on coding guidelines.


633-633: Simplify fallback logic for bundle_idx.

The fallback meta.get("bundle_idx", meta.get("local_rank", 0)) may be unnecessary now that bundle_idx is explicitly added to metadata at lines 557 and 585. Consider simplifying to:

-bundle_idx = int(meta.get("bundle_idx", meta.get("local_rank", 0)))
+bundle_idx = int(meta["bundle_idx"])

The defensive fallback was reasonable before this PR, but with bundle_idx now guaranteed to be present in the metadata, direct access would be clearer.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0a769cc and 01a120b.

📒 Files selected for processing (1)
  • nemo_rl/distributed/worker_groups.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Follow the Google Python Style Guide for all Python code
Target Python 3.12+ for all Python code in NeMo-RL
Indent Python code with 4 spaces; do not use tabs
Python filenames should be snake_case (e.g., some_file.py)
Class names should be PascalCase
Function and method names should be snake_case
Local variable names should be snake_case; if starting with a number, prefix with k (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE and prefixed with G_ (e.g., G_MY_GLOBAL)
Constants should be UPPER_SNAKE_CASE
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
For public interfaces used outside a file, prefer docstrings over comments
Use comments mainly for code within a function or interfaces local to a file
Commented-out code must include a nearby comment explaining usage and why it is commented out; otherwise remove before merging
Use Google-style docstrings for classes and functions (Sphinx-parseable)
Avoid using reflection when functionality can be easily achieved without it
Limit except clauses to the smallest specific set of exceptions possible
For duck-typing via try/except, keep the try body minimal and use else for main logic
Add the NVIDIA copyright header (with current year) at the top of all Python files, excluding tests/ and test-only scripts

Files:

  • nemo_rl/distributed/worker_groups.py
nemo_rl/**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

nemo_rl/**/*.py: Do not set non-None configuration defaults in code; YAML is the single source of truth for defaults
Access required config attributes directly (e.g., policy_cfg["precision"]) and assume presence; do not introduce hidden defaults
Express configuration optionality via TypedDict using typing.NotRequired
When adding a new config key to a TypedDict subclass, document the key’s purpose, valid values/types, and recommended default in code
For any class or function decorated with @ray.remote, add '# pragma: no cover' on the class/def line (and on remote functions)

Files:

  • nemo_rl/distributed/worker_groups.py
🧬 Code graph analysis (1)
nemo_rl/distributed/worker_groups.py (2)
tests/unit/models/generation/test_vllm_generation.py (1)
  • cluster (224-235)
nemo_rl/distributed/virtual_cluster.py (1)
  • get_placement_groups (350-358)
🪛 Ruff (0.14.0)
nemo_rl/distributed/worker_groups.py

608-608: Do not catch blind exception: Exception

(BLE001)


618-619: try-except-pass detected, consider logging the exception

(S110)


618-618: Do not catch blind exception: Exception

(BLE001)


655-656: try-except-pass detected, consider logging the exception

(S110)


655-655: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Lint check
  • GitHub Check: Post automodel integration comment / Comment on PR
  • GitHub Check: Post submodule check comment / Comment on PR
🔇 Additional comments (3)
nemo_rl/distributed/worker_groups.py (3)

557-557: LGTM! Clean addition of bundle_idx to metadata.

The bundle_idx field is properly added to both the temporary worker_info during creation and the persistent _worker_metadata. This enables the new mapping functionality.

Also applies to: 585-585


593-593: LGTM! Appropriate placement of the mapping printout.

Invoking the print method after all workers are initialized ensures complete metadata is available for the mapping display.


606-607: Ray placement_group_table usage is correct
ray.util.placement_group_table accepts a PlacementGroup and returns a dict containing "bundles_to_node_id", matching the code’s usage.

@youngeunkwon0405 youngeunkwon0405 self-assigned this Oct 15, 2025
@euronymous-aithal euronymous-aithal requested review from guyueh1 and removed request for guyueh1 October 18, 2025 18:25
@euronymous-aithal euronymous-aithal added Performance Related to improving performance r0.4.0 and removed Performance Related to improving performance labels Oct 18, 2025
@guyueh1
Copy link
Contributor

guyueh1 commented Oct 20, 2025

@youngeunkwon0405 I also tried to improve the observability of worker placement but in my PR it was something like this
#1341 (comment)

Do you think my print covers your need, or you specifically need to log the bundle_idx/worker mapping? It's fine to have them all in my opinion.

@youngeunkwon0405
Copy link
Contributor Author

eed to log the bundle_idx/worker mapping? It's fine to have them all in my opinion.

Hi @guyueh1, your print looks great! Could you please check your PR works fine also for the Generation workers and non-colocated setups? I think then I can close my PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants