-
Notifications
You must be signed in to change notification settings - Fork 7.1k
[serve] Static Placement Group RFC Implementation #59912
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[serve] Static Placement Group RFC Implementation #59912
Conversation
c3d473f to
c17eb30
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a significant new feature: static placement groups for Ray Serve. The implementation is comprehensive, touching the necessary components from the public API down to the deployment scheduler and state management. The new StaticPlacementConfig dataclass is well-designed with robust validation. The logic for handling static placement during scheduling and controller recovery is also well-thought-out.
I've identified one critical issue concerning rank management that could lead to a resource leak, and I've provided a detailed suggestion for a fix. I also pointed out a minor redundancy for code cleanup. Overall, this is a solid contribution that adds valuable functionality to Ray Serve. Addressing the critical issue is essential before merging.
| # For static placement, node_id may be None at rank assignment time | ||
| # since the node is determined by the placement group bundle. | ||
| # In this case, we skip local rank assignment and use placeholder values. | ||
| if node_id is None: | ||
| # Static placement: node_rank and local_rank are not meaningful | ||
| # since placement is determined by bundle indices, not node affinity | ||
| return ReplicaRank(rank=rank, node_rank=-1, local_rank=-1) | ||
|
|
||
| # Track the replica-to-node mapping | ||
| self._replica_to_node[replica_id] = node_id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a potential bug here for static placement replicas. When node_id is None, self._replica_to_node is not populated for the replica_id. This will cause self.has_replica_rank(replica_id) to return False later, because it checks for replica_id in self._replica_to_node.
As a result, when a static placement replica is stopped, self._rank_manager.release_rank(replica_id) will not be called, leading to a rank leak.
To fix this, self._replica_to_node[replica_id] = node_id should be set even when node_id is None. This will require follow-up changes in has_replica_rank, release_rank, and recover_rank to correctly handle cases where node_id is None from self._replica_to_node.
I've suggested a change for this block below. You'll also need to update release_rank to handle node_id being None.
| # For static placement, node_id may be None at rank assignment time | |
| # since the node is determined by the placement group bundle. | |
| # In this case, we skip local rank assignment and use placeholder values. | |
| if node_id is None: | |
| # Static placement: node_rank and local_rank are not meaningful | |
| # since placement is determined by bundle indices, not node affinity | |
| return ReplicaRank(rank=rank, node_rank=-1, local_rank=-1) | |
| # Track the replica-to-node mapping | |
| self._replica_to_node[replica_id] = node_id | |
| # Track the replica-to-node mapping. For static placement, node_id will be | |
| # None initially. | |
| self._replica_to_node[replica_id] = node_id | |
| # For static placement, node_id may be None at rank assignment time | |
| # since the node is determined by the placement group bundle. | |
| # In this case, we skip local rank assignment and use placeholder values. | |
| if node_id is None: | |
| # Static placement: node_rank and local_rank are not meaningful | |
| # since placement is determined by bundle indices, not node affinity | |
| return ReplicaRank(rank=rank, node_rank=-1, local_rank=-1) |
| if static_placement_config is None: | ||
| return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
c17eb30 to
f9a8b1f
Compare
Implements the Static Placement Group RFC (ray-project#59857) to enable external placement groups with explicit replica-to-bundle mapping for Ray Serve deployments. Key changes: - Add StaticPlacementConfig dataclass in config.py - Add _placement_info parameter to deployment decorator - Update scheduler for static placement groups - Add bundle_indices to ReplicaContext - Implement recovery for static placement - Add unit tests for StaticPlacementConfig
f9a8b1f to
7b6d922
Compare
Summary
Implements the Static Placement Group RFC (#59857) to enable external placement groups with explicit replica-to-bundle mapping for Ray Serve deployments.
Key features:
StaticPlacementConfigdataclass for external placement group configuration_placement_infoparameter on@serve.deploymentdecoratorbundle_indicesexposed viaserve.get_replica_context()Use case: GPU colocation between Serve deployments and other Ray components (e.g., RL training workflows requiring zero-copy weight sync via CUDA IPC).
Example Usage
Test plan
StaticPlacementConfigvalidation🤖 Generated with Claude Code