@nrghosh nrghosh commented Jan 7, 2026

Summary

Implements the Static Placement Group RFC (#59857) to enable external placement groups with explicit replica-to-bundle mapping for Ray Serve deployments.

Key features:

  • New StaticPlacementConfig dataclass for external placement group configuration
  • _placement_info parameter on @serve.deployment decorator
  • bundle_indices exposed via serve.get_replica_context()
  • Recovery support: replicas restart on identical bundles

Use case: GPU colocation between Serve deployments and other Ray components (e.g., RL training workflows requiring zero-copy weight sync via CUDA IPC).

Example Usage

import ray
from ray import serve
from ray.serve.config import StaticPlacementConfig
from ray.util.placement_group import placement_group

# Create external placement group
pg = placement_group([{"GPU": 1, "CPU": 1}] * 4)
ray.get(pg.ready())

@serve.deployment(
    _placement_info=StaticPlacementConfig(
        placement_group=pg,
        replica_bundle_mapping={
            0: [0, 1],  # Replica 0 uses bundles 0 and 1
            1: [2, 3],  # Replica 1 uses bundles 2 and 3
        },
    ),
)
class MyLLMServer:
    def __init__(self):
        ctx = serve.get_replica_context()
        print(f"Replica {ctx.rank} using bundles: {ctx.bundle_indices}")
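The PR text does not show how StaticPlacementConfig validates replica_bundle_mapping, so the following is only a rough sketch of the kind of checks such a mapping might need. BundleMappingSketch, its num_bundles field, and every rule it enforces (contiguous ranks, in-range indices, no bundle shared between replicas) are assumptions made for illustration, not the PR's implementation.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class BundleMappingSketch:
    """Hypothetical stand-in for the mapping half of StaticPlacementConfig."""

    num_bundles: int  # assumed: bundle count of the external placement group
    replica_bundle_mapping: Dict[int, List[int]]  # replica rank -> bundle indices

    def __post_init__(self) -> None:
        if not self.replica_bundle_mapping:
            raise ValueError("replica_bundle_mapping must not be empty")

        # Assumed rule: replica ranks are contiguous and start at 0.
        ranks = sorted(self.replica_bundle_mapping)
        if ranks != list(range(len(ranks))):
            raise ValueError(f"replica ranks must be 0..N-1, got {ranks}")

        owner: Dict[int, int] = {}  # bundle index -> replica rank that claimed it
        for rank in ranks:
            bundles = self.replica_bundle_mapping[rank]
            if not bundles:
                raise ValueError(f"replica {rank} has no bundles")
            for idx in bundles:
                # Assumed rule: every index must exist in the placement group.
                if not 0 <= idx < self.num_bundles:
                    raise ValueError(f"bundle index {idx} is out of range")
                # Assumed rule: a bundle belongs to at most one replica.
                if idx in owner:
                    raise ValueError(
                        f"bundle {idx} is mapped to replicas {owner[idx]} and {rank}"
                    )
                owner[idx] = rank

    @property
    def num_replicas(self) -> int:
        # With a static mapping, the replica count follows from the mapping.
        return len(self.replica_bundle_mapping)
```

Under these assumed rules, the mapping from the example above, {0: [0, 1], 1: [2, 3]} over four bundles, is accepted, while a mapping that reuses bundle 1 for both replicas raises ValueError.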

Test plan

  • Unit tests for StaticPlacementConfig validation
  • Integration tests with actual placement groups
  • Recovery test: controller restart with live replicas
  • Verify mutual exclusivity validation with autoscaling
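As a hedged illustration of the last item, a mutual-exclusivity check could look roughly like the sketch below. The function name, its parameters, and the num_replicas consistency rule are hypothetical; the PR text only states that static placement and autoscaling are validated as mutually exclusive.

```python
from typing import Dict, List, Optional


def validate_static_placement_options(
    replica_bundle_mapping: Optional[Dict[int, List[int]]],
    autoscaling_config: Optional[dict],
    num_replicas: Optional[int],
) -> None:
    """Hypothetical check: reject option combinations that cannot coexist."""
    if replica_bundle_mapping is None:
        # No static placement requested, so there is nothing to check.
        return

    # A fixed replica-to-bundle mapping pins the replica count, so an
    # autoscaler would have no room to add or remove replicas.
    if autoscaling_config is not None:
        raise ValueError("static placement cannot be combined with autoscaling")

    # Assumed rule: an explicit num_replicas must agree with the mapping.
    if num_replicas is not None and num_replicas != len(replica_bundle_mapping):
        raise ValueError(
            f"num_replicas={num_replicas} does not match "
            f"{len(replica_bundle_mapping)} mapped replicas"
        )
```

A unit test along these lines would call the function once with both a mapping and an autoscaling config and assert that ValueError is raised.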

🤖 Generated with Claude Code

@nrghosh nrghosh force-pushed the nrghosh/static-placement-group-rfc branch from c3d473f to c17eb30 Compare January 7, 2026 03:25

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a significant new feature: static placement groups for Ray Serve. The implementation is comprehensive, touching the necessary components from the public API down to the deployment scheduler and state management. The new StaticPlacementConfig dataclass is well-designed with robust validation. The logic for handling static placement during scheduling and controller recovery is also well-thought-out.

I've identified one critical issue concerning rank management that could lead to a resource leak, and I've provided a detailed suggestion for a fix. I also pointed out a minor redundancy for code cleanup. Overall, this is a solid contribution that adds valuable functionality to Ray Serve. Addressing the critical issue is essential before merging.

Comment on lines +1790 to +1799
# For static placement, node_id may be None at rank assignment time
# since the node is determined by the placement group bundle.
# In this case, we skip local rank assignment and use placeholder values.
if node_id is None:
    # Static placement: node_rank and local_rank are not meaningful
    # since placement is determined by bundle indices, not node affinity
    return ReplicaRank(rank=rank, node_rank=-1, local_rank=-1)

# Track the replica-to-node mapping
self._replica_to_node[replica_id] = node_id

critical

There's a potential bug here for static placement replicas. When node_id is None, self._replica_to_node is not populated for the replica_id. This will cause self.has_replica_rank(replica_id) to return False later, because it checks for replica_id in self._replica_to_node.

As a result, when a static placement replica is stopped, self._rank_manager.release_rank(replica_id) will not be called, leading to a rank leak.

To fix this, self._replica_to_node[replica_id] = node_id should be set even when node_id is None. This requires follow-up changes in has_replica_rank, release_rank, and recover_rank so that each of them handles a None node_id read back from self._replica_to_node.

I've suggested a change for this block below. You'll also need to update release_rank to handle node_id being None.

Suggested change
- # For static placement, node_id may be None at rank assignment time
- # since the node is determined by the placement group bundle.
- # In this case, we skip local rank assignment and use placeholder values.
- if node_id is None:
-     # Static placement: node_rank and local_rank are not meaningful
-     # since placement is determined by bundle indices, not node affinity
-     return ReplicaRank(rank=rank, node_rank=-1, local_rank=-1)
- # Track the replica-to-node mapping
- self._replica_to_node[replica_id] = node_id
+ # Track the replica-to-node mapping. For static placement, node_id will be
+ # None initially.
+ self._replica_to_node[replica_id] = node_id
+ # For static placement, node_id may be None at rank assignment time
+ # since the node is determined by the placement group bundle.
+ # In this case, we skip local rank assignment and use placeholder values.
+ if node_id is None:
+     # Static placement: node_rank and local_rank are not meaningful
+     # since placement is determined by bundle indices, not node affinity
+     return ReplicaRank(rank=rank, node_rank=-1, local_rank=-1)
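The rank leak described in this comment can be illustrated with a toy model. The sketch below is not the Ray Serve rank manager: ToyRankTracker, its track_static flag, and stop_replica are hypothetical names that imitate only the ordering issue, namely that returning before the replica is recorded in _replica_to_node makes has_replica_rank return False, so the rank is never released.

```python
from typing import Dict, Optional, Set


class ToyRankTracker:
    """Hypothetical, simplified model of the rank bookkeeping under review."""

    def __init__(self, track_static: bool) -> None:
        # track_static=False imitates the original ordering (early return);
        # track_static=True imitates the reviewer's suggested ordering.
        self._track_static = track_static
        self._replica_to_node: Dict[str, Optional[str]] = {}
        self._ranks: Dict[str, int] = {}

    def assign_rank(self, replica_id: str, node_id: Optional[str]) -> int:
        rank = self._next_free_rank()
        self._ranks[replica_id] = rank
        if node_id is None and not self._track_static:
            # Original ordering: return before the replica is recorded.
            return rank
        # Suggested ordering: record the replica even when node_id is None.
        self._replica_to_node[replica_id] = node_id
        return rank

    def has_replica_rank(self, replica_id: str) -> bool:
        return replica_id in self._replica_to_node

    def release_rank(self, replica_id: str) -> None:
        self._replica_to_node.pop(replica_id)
        self._ranks.pop(replica_id)

    def stop_replica(self, replica_id: str) -> None:
        # Imitates the caller in the review: release only if the rank is tracked.
        if self.has_replica_rank(replica_id):
            self.release_rank(replica_id)

    def ranks_in_use(self) -> Set[int]:
        return set(self._ranks.values())

    def _next_free_rank(self) -> int:
        used = self.ranks_in_use()
        rank = 0
        while rank in used:
            rank += 1
        return rank


leaky = ToyRankTracker(track_static=False)
leaky.assign_rank("replica-0", node_id=None)
leaky.stop_replica("replica-0")
print(leaky.ranks_in_use())  # {0}: the rank is still held after the stop

fixed = ToyRankTracker(track_static=True)
fixed.assign_rank("replica-0", node_id=None)
fixed.stop_replica("replica-0")
print(fixed.ranks_in_use())  # set(): the rank was released
```

In the toy model, the only difference between the two trackers is whether the mapping entry is written before the early return, which is the reordering the suggested change makes.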

Comment on lines +719 to +720
if static_placement_config is None:
    return

medium

This None check is redundant. The type hint for static_placement_config is non-optional, and the only caller in _check_startup_replicas already ensures it's not None before calling this method. You can remove these lines for cleaner code.

@nrghosh nrghosh force-pushed the nrghosh/static-placement-group-rfc branch from c17eb30 to f9a8b1f Compare January 7, 2026 03:26
Implements the Static Placement Group RFC (ray-project#59857) to enable external
placement groups with explicit replica-to-bundle mapping for Ray Serve
deployments.

Key changes:
- Add StaticPlacementConfig dataclass in config.py
- Add _placement_info parameter to deployment decorator
- Update scheduler for static placement groups
- Add bundle_indices to ReplicaContext
- Implement recovery for static placement
- Add unit tests for StaticPlacementConfig
@nrghosh nrghosh force-pushed the nrghosh/static-placement-group-rfc branch from f9a8b1f to 7b6d922 Compare January 8, 2026 21:52
@nrghosh nrghosh closed this Jan 8, 2026