
[RFC]: [P/D] A pluggable module for load-balanced & SLO-aware request routing and KV Cache Transfer #2


(Open question: should the RFC and its PR be submitted at the same time?)

Motivation.

In prefill-decode (PD) disaggregated serving architectures, existing routing strategies offer different advantages. A sequential strategy, which routes a request to a prefill instance first and selects a decode instance only after prefill completes, allows optimal, load-aware placement across decode instances. In contrast, a pre-bind strategy, which selects both the prefill and decode instances up front, enables overlapping KV Cache transfer with computation and significantly reduces Time to First Token (TTFT).

Similarly, the choice of KV Cache transfer pathway involves a performance-scalability trade-off. Point-to-point transfer (e.g., via NIXL or Transfer Engine) minimizes latency, while centralized store-based transfer (e.g., via LMCache or Mooncake Store) improves scalability and raises prefix-caching hit rates.

| Decode Instance Scheduling Time | Extra Data Copy | Load Balancing | Time Latency | Compute/KV Transfer Overlap |
|---|---|---|---|---|
| Before prefill | No | Poor | Low | Yes |
| After prefill | Yes | Good | High | No |

Table 1. Scheduling Time Trade-offs

However, current deployments typically force a static, system-wide choice between mutually exclusive P/D binding timings and KV Cache transfer pathways, applied uniformly to all requests. This inflexibility prevents dynamic adaptation to diverse workloads, such as low-latency interactive chat, high-throughput AI agent tasks, or variable-depth research queries, and hinders the system's ability to meet fine-grained Service Level Objective (SLO) targets across task types.

Proposed Change.

In this RFC, we propose a lightweight (open question), pluggable module that introduces per-request orchestration of routing and KV transfer decisions, consisting of:

  1. TransferDockController, a central controller that replaces the default proxy and dynamically determines:
    a). the prefill/decode instance pairing timing and the routing policy to use
    b). the KV Cache transfer pathway and timing
    Decisions are based on real-time instance workload and utilization, as well as the task type and SLO requirement of each request. (A minimal sketch of such a per-request decision follows the overview figure below.)

  2. An extension of the existing MultiConnector, a counterpart on each instance that:
    a). manages hybrid transfer backends concurrently and selects a transfer backend according to the metadata bound to each request

(Figure: module overview)
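
To make the controller's role concrete, here is a minimal Python sketch of one per-request decision, under stated assumptions: RequestProfile, TransferDecision, and decide() are hypothetical names invented for illustration, and the policy logic is only one plausible mapping; the bound field names (kvtransfer_start_timing, connector_type) come from Milestone 1 below.

```python
from dataclasses import dataclass

@dataclass
class RequestProfile:             # hypothetical per-request input
    request_type: str             # e.g. "simple_chat", "agent_background_task"
    prompt_length: int
    ttft_slo_ms: float            # per-request TTFT target

@dataclass
class TransferDecision:           # hypothetical controller output
    pair_timing: str              # "pre-bind" or "sequential"
    kvtransfer_start_timing: str  # e.g. "prefill-completed", "layerwise-async"
    connector_type: str           # e.g. "nixl", "mooncake_store"

def decide(profile: RequestProfile, decode_load_balanced: bool) -> TransferDecision:
    """One plausible per-request policy; the real logic is an open design point."""
    if profile.ttft_slo_ms < 500:
        # Tight TTFT SLO: pre-bind decode and overlap KV transfer with compute.
        return TransferDecision("pre-bind", "layerwise-async", "nixl")
    if not decode_load_balanced:
        # Skewed load: defer decode selection until prefill has completed.
        return TransferDecision("sequential", "prefill-completed", "mooncake_store")
    return TransferDecision("pre-bind", "prefill-completed-async", "transfer_engine")

# The decision is then bound onto the request metadata, as in Milestone 1:
d = decide(RequestProfile("simple_chat", 128, 300.0), decode_load_balanced=True)
original_request_data = {"kvtransfer_start_timing": d.kvtransfer_start_timing}
prefill_request = {"connector_type": d.connector_type}
```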

Key features:

  • Load-balanced & SLO-aware request routing
  • Hybrid & flexible KV Cache Transfer per request

Milestones for core feature implementation:

  1. Hybrid KV transfer backend selection at the request level: involves use of the MultiConnector and scheduler adaptation
    a). TransferDockController:

    • Bind KV Cache transfer timing and connector type info to each request: original_request_data["kvtransfer_start_timing"] = "prefill-completed" / "prefill-completed-async" / "layerwise-async" (layerwise support depends on a separate PR), prefill_request["connector_type"] = "mooncake_store" / "transfer_engine" / "hccs" / "nixl". (open question: final field names and values)

    b). MultiConnector:

    • Receive the pre-bound info from TransferDockController into MultiConnector's self._requests_to_connector: dict[str, KVConnectorBase_V1], a mapping from request id to the chosen connector. The connector info reaches MultiConnector through the scheduler via a "kvconnector_info" field bound to the Request class.
    • Bind the metadata to each connector in the order of the connectors in the MultiConnectorMetaData through bind_connector_metadata() (called by the model_runner).

    c). Enable the decoder to receive request metadata from prefill instances directly via the ZMQ DEALER-ROUTER pattern (a minimal sketch of this handshake follows this milestone)

    • For P2P connectors: on the prefill side, a zmq.ROUTER socket is deployed to listen for requests from decode instances pulling KV Cache metadata (using zmq.Poller to avoid busy-waiting on the CPU); on the decode side, a zmq.DEALER socket is employed to listen for the prefill completion signal (potentially layerwise). (open question: perhaps PUSH/PULL mode could be used instead?)
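
Below is a minimal runnable sketch of the proposed DEALER-ROUTER handshake using pyzmq. The endpoint, message framing, and payload contents are assumptions made for illustration; only the socket pattern and the use of zmq.Poller to avoid busy-waiting come from the item above.

```python
import threading
import zmq

ENDPOINT = "tcp://127.0.0.1:5555"  # hypothetical address

def prefill_side_once() -> None:
    """Prefill instance: a ROUTER socket answers one metadata-pull request."""
    ctx = zmq.Context.instance()
    router = ctx.socket(zmq.ROUTER)
    router.bind(ENDPOINT)
    poller = zmq.Poller()
    poller.register(router, zmq.POLLIN)
    # Poll with a timeout instead of spinning, so the CPU is not busy-waited.
    if dict(poller.poll(timeout=2000)):
        identity, request_id = router.recv_multipart()
        # Reply with (hypothetical) KV Cache metadata / completion signal.
        router.send_multipart([identity, request_id, b"kv-cache-metadata"])
    router.close()

def decode_side_once() -> None:
    """Decode instance: a DEALER socket pulls metadata for one request id."""
    ctx = zmq.Context.instance()
    dealer = ctx.socket(zmq.DEALER)
    dealer.connect(ENDPOINT)
    dealer.send_multipart([b"req-42"])              # ask about a request id
    request_id, metadata = dealer.recv_multipart()  # ROUTER strips the identity
    print(request_id, metadata)
    dealer.close()

if __name__ == "__main__":
    t = threading.Thread(target=prefill_side_once)
    t.start()
    decode_side_once()
    t.join()
```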
  2. Workload & resource utilization reporting and monitoring: metrics are collected and reported with heartbeat signals by a plugin reporter inserted into the scheduler; this involves TransferDockController plus scheduler, llm_engine, and api_server adaptation (a sketch of the heartbeat payload follows this milestone)
    a). Request metrics from gateway/scheduler/TransferDockController:

    • request_length: length of the prompt
    • request_type: e.g. 'agent_background_task', 'multi-conversation', or 'simple_chat'
    • prompt_difficulty: reflects how difficult the prompt is to answer, and thus indicates the likely response length; a ranking model may be incorporated to classify prompts
    • step_response_length: the average number of tokens already generated (stepped) by requests in the running queue

    b). Workload metrics from the scheduler:

    • num_free_blocks: number of free KV Cache blocks that can be assigned to ongoing or new requests
    • step_tokens_to_be_calc: number of tokens still to be computed across the waiting and running queues
    • latency_per_token: the per-token latency measured over one step

    c). Cached KVBlock info published by kv_events:
    Subscribe to kv events in TransferDockController and maintain a global_cached_kv_map
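
A minimal sketch of the heartbeat payload such a plugin reporter might emit, using the metric names listed above; the HeartbeatReport class, the scheduler accessors, and the stub are illustrative assumptions, not existing interfaces.

```python
import time
from dataclasses import dataclass, asdict

@dataclass
class HeartbeatReport:
    """Hypothetical per-instance heartbeat carrying the workload metrics above."""
    instance_id: str
    timestamp: float
    num_free_blocks: int
    step_tokens_to_be_calc: int
    latency_per_token: float

def collect_report(scheduler, instance_id: str) -> dict:
    # The attribute names mirror the metrics above; the real adaptation
    # points in scheduler/llm_engine/api_server are to be defined in the PR.
    return asdict(HeartbeatReport(
        instance_id=instance_id,
        timestamp=time.time(),
        num_free_blocks=scheduler.num_free_blocks,
        step_tokens_to_be_calc=scheduler.step_tokens_to_be_calc,
        latency_per_token=scheduler.latency_per_token,
    ))

class _StubScheduler:  # stand-in for the real scheduler, for illustration only
    num_free_blocks = 1024
    step_tokens_to_be_calc = 4096
    latency_per_token = 0.012

print(collect_report(_StubScheduler(), "prefill-0"))
```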

  3. Built-in and user-registrable routing policies integrated into TransferDockController (a registry sketch follows this list)
    a). Scheduling policies: (open question: final policy set)

    • Random / RoundRobin: among all healthy instances
    • CacheMatched: when the system load is balanced, requests are preferentially routed to the instances with the highest prefix-caching hit rate based on historical request data; when the load is imbalanced, requests are dynamically routed to the prefill instances with the smallest total number of pending request tokens, based on real-time load conditions (open question)
    • LengthAware: during decoding, schedule requests of similar lengths to the same instance to prevent throughput degradation caused by disparities in KV Cache loading times, thereby reducing TBT (beneficial in RL scenarios)
    • LongTailSpecific: reserve dedicated long-sequence decode instances and reduce their batch size to deliver lower TBT (for large-scale industrial deployment)

    b). Choice of scheduling policy according to request type

    • Simple chat / brief conclusion -> Random / RoundRobin / CacheMatched
    • Multi-round conversation / AI agent background task -> LengthAware / CacheMatched / LongTailSpecific
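
A minimal sketch of how built-in and user-registrable policies might plug into TransferDockController; the decorator-based registry, the policy signature, and the request-type mapping are illustrative assumptions rather than a settled interface.

```python
import random
from typing import Callable

# Hypothetical registry: policy name -> callable picking an instance id.
_POLICIES: dict[str, Callable[[list[str]], str]] = {}

def register_policy(name: str):
    """Decorator so users can register custom routing policies by name."""
    def wrap(fn: Callable[[list[str]], str]):
        _POLICIES[name] = fn
        return fn
    return wrap

@register_policy("random")
def random_policy(healthy_instances: list[str]) -> str:
    return random.choice(healthy_instances)

_rr_counter = {"i": 0}

@register_policy("round_robin")
def round_robin_policy(healthy_instances: list[str]) -> str:
    choice = healthy_instances[_rr_counter["i"] % len(healthy_instances)]
    _rr_counter["i"] += 1
    return choice

# Hypothetical mapping from request type to preferred policy (table b above);
# "length_aware" etc. would be registered the same way.
POLICY_BY_REQUEST_TYPE = {
    "simple_chat": "round_robin",
    "agent_background_task": "length_aware",
}

def route(request_type: str, healthy: list[str]) -> str:
    name = POLICY_BY_REQUEST_TYPE.get(request_type, "random")
    return _POLICIES.get(name, _POLICIES["random"])(healthy)

print(route("simple_chat", ["decode-0", "decode-1"]))  # -> decode-0
```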

Feedback Period.

One week

CC List.

No response

Any Other Things.

No response

