-
Notifications
You must be signed in to change notification settings - Fork 159
feat: Printing worker id to node.bundle_idx mapping #1366
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat: Printing worker id to node.bundle_idx mapping #1366
Conversation
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
📝 WalkthroughWalkthroughAdds bundle_idx to worker metadata during creation and introduces a private method to print a worker-to-node/GPU mapping. The mapping is invoked right after workers are created, deriving node and IP details from Ray placement group tables and node info, and formatting workers by node with global_rank and bundle_idx. Changes
Sequence Diagram(s)sequenceDiagram
actor Dev as Caller
participant WG as RayWorkerGroup
participant Ray as Ray / Cluster
participant PG as Placement Group Table
participant Nodes as Ray Node Info
participant Out as Console
Dev->>WG: _create_workers_from_bundle_indices(bundle_indices)
WG->>Ray: Create workers (actors) with metadata (global_rank, bundle_idx, ...)
Ray-->>WG: Worker handles + placement group refs
WG->>WG: _print_worker_to_node_gpu_mapping()
WG->>PG: Query placement_group_table()
PG-->>WG: PG index → node IDs
WG->>Nodes: Fetch node list (IDs → IPs)
Nodes-->>WG: Node metadata (IPs, resources)
WG->>Out: Print workers grouped by node with global_rank and bundle_idx
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Hi @guyueh1, do you think having this kind of debug print would be useful? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
nemo_rl/distributed/worker_groups.py (2)
597-656
: Consider logging exceptions instead of silently suppressing them.The method has three
try
-except
-pass
blocks (lines 608-609, 618-619, 655-656) that silently suppress all exceptions. While acceptable for a debug logging function, this makes troubleshooting difficult if the mapping fails to print. Consider logging exceptions at the debug level to aid future debugging.Apply this pattern for better observability:
+import logging + +logger = logging.getLogger(__name__) + def _print_worker_to_node_gpu_mapping(self) -> None: """Print mapping from workers to physical nodes and GPU indices.""" try: ... for i, pg in enumerate(placement_groups): try: pg_table = ray.util.placement_group_table(pg) pg_bundle_to_node[i] = pg_table.get("bundles_to_node_id", {}) - except Exception: + except Exception as e: + logger.debug(f"Failed to get placement group table for PG {i}: {e}") pg_bundle_to_node[i] = {} ... try: for node in ray.nodes(): ... - except Exception: + except Exception as e: + logger.debug(f"Failed to retrieve node information: {e}") pass ... - except Exception: + except Exception as e: + logger.debug(f"Failed to print worker-to-node GPU mapping: {e}") passBased on coding guidelines.
633-633
: Simplify fallback logic for bundle_idx.The fallback
meta.get("bundle_idx", meta.get("local_rank", 0))
may be unnecessary now thatbundle_idx
is explicitly added to metadata at lines 557 and 585. Consider simplifying to:-bundle_idx = int(meta.get("bundle_idx", meta.get("local_rank", 0))) +bundle_idx = int(meta["bundle_idx"])The defensive fallback was reasonable before this PR, but with
bundle_idx
now guaranteed to be present in the metadata, direct access would be clearer.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
nemo_rl/distributed/worker_groups.py
(2 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Follow the Google Python Style Guide for all Python code
Target Python 3.12+ for all Python code in NeMo-RL
Indent Python code with 4 spaces; do not use tabs
Python filenames should be snake_case (e.g., some_file.py)
Class names should be PascalCase
Function and method names should be snake_case
Local variable names should be snake_case; if starting with a number, prefix with k (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE and prefixed with G_ (e.g., G_MY_GLOBAL)
Constants should be UPPER_SNAKE_CASE
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
For public interfaces used outside a file, prefer docstrings over comments
Use comments mainly for code within a function or interfaces local to a file
Commented-out code must include a nearby comment explaining usage and why it is commented out; otherwise remove before merging
Use Google-style docstrings for classes and functions (Sphinx-parseable)
Avoid using reflection when functionality can be easily achieved without it
Limit except clauses to the smallest specific set of exceptions possible
For duck-typing via try/except, keep the try body minimal and use else for main logic
Add the NVIDIA copyright header (with current year) at the top of all Python files, excluding tests/ and test-only scripts
Files:
nemo_rl/distributed/worker_groups.py
nemo_rl/**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
nemo_rl/**/*.py
: Do not set non-None configuration defaults in code; YAML is the single source of truth for defaults
Access required config attributes directly (e.g., policy_cfg["precision"]) and assume presence; do not introduce hidden defaults
Express configuration optionality via TypedDict using typing.NotRequired
When adding a new config key to a TypedDict subclass, document the key’s purpose, valid values/types, and recommended default in code
For any class or function decorated with @ray.remote, add '# pragma: no cover' on the class/def line (and on remote functions)
Files:
nemo_rl/distributed/worker_groups.py
🧬 Code graph analysis (1)
nemo_rl/distributed/worker_groups.py (2)
tests/unit/models/generation/test_vllm_generation.py (1)
cluster
(224-235)nemo_rl/distributed/virtual_cluster.py (1)
get_placement_groups
(350-358)
🪛 Ruff (0.14.0)
nemo_rl/distributed/worker_groups.py
608-608: Do not catch blind exception: Exception
(BLE001)
618-619: try
-except
-pass
detected, consider logging the exception
(S110)
618-618: Do not catch blind exception: Exception
(BLE001)
655-656: try
-except
-pass
detected, consider logging the exception
(S110)
655-655: Do not catch blind exception: Exception
(BLE001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Lint check
- GitHub Check: Post automodel integration comment / Comment on PR
- GitHub Check: Post submodule check comment / Comment on PR
🔇 Additional comments (3)
nemo_rl/distributed/worker_groups.py (3)
557-557
: LGTM! Clean addition of bundle_idx to metadata.The
bundle_idx
field is properly added to both the temporaryworker_info
during creation and the persistent_worker_metadata
. This enables the new mapping functionality.Also applies to: 585-585
593-593
: LGTM! Appropriate placement of the mapping printout.Invoking the print method after all workers are initialized ensures complete metadata is available for the mapping display.
606-607
: Ray placement_group_table usage is correct
ray.util.placement_group_table
accepts aPlacementGroup
and returns a dict containing"bundles_to_node_id"
, matching the code’s usage.
@youngeunkwon0405 I also tried to improve the observability of worker placement but in my PR it was something like this Do you think my print covers your need, or you specifically need to log the bundle_idx/worker mapping? It's fine to have them all in my opinion. |
Hi @guyueh1, your print looks great! Could you please check your PR works fine also for the Generation workers and non-colocated setups? I think then I can close my PR. |
What does this PR do ?
Issues
List issues that this PR closes (syntax):
Usage
# Add a code snippet demonstrating how to use this
Before your PR is "Ready for review"
Pre checks:
Additional Information
Summary by CodeRabbit