Description
Motivation.
In current Prefill-Decode (PD) disaggregated serving architectures, the two methods for transferring KV Cache from prefill instances to decode instances—Device-to-Device (D2D) and centralized store-based transfer (e.g., via LMCache or Mooncake Store)—are typically mutually exclusive within a single deployment. This rigidity makes it hard to optimally meet diverse Service Level Objective (SLO) requirements.
A hybrid approach, where the transfer mechanism is dynamically selected per request according to its properties and priority, would provide significant benefits:
- For SLO-sensitive requests (e.g., interactive chat), low-latency D2D transfer is ideal for minimizing Time to First Token (TTFT).
- For SLO-insensitive requests (e.g., background tasks of AI agents), store-based transfer enables better load balancing and resource utilization within a sub-cluster.
The timing of decode-instance binding involves a similar trade-off:

| Decode Instance Binding Timing | Extra Data Copy | Load Balancing | Transfer Latency | Compute/Transfer Overlap |
|---|---|---|---|---|
| Before prefill | No | Poor | Low | Yes |
| After prefill | Yes | Good | High | No |
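To make this concrete, here is a minimal policy sketch (all names are hypothetical, not an existing API) that maps a request's SLO sensitivity to both a transfer method and a decode-binding timing, following the trade-offs above:

```python
from dataclasses import dataclass
from enum import Enum, auto


class TransferMethod(Enum):
    D2D = auto()    # direct device-to-device transfer
    STORE = auto()  # via a centralized store (e.g., Mooncake Store)


class BindTiming(Enum):
    BEFORE_PREFILL = auto()  # pre-bind the decode instance; enables overlap
    AFTER_PREFILL = auto()   # late binding; enables better load balancing


@dataclass
class TransferPolicy:
    method: TransferMethod
    bind_timing: BindTiming


def select_policy(slo_sensitive: bool) -> TransferPolicy:
    """Map a request's SLO sensitivity to a transfer policy, per the table above."""
    if slo_sensitive:
        # Interactive requests: minimize TTFT via D2D plus early binding.
        return TransferPolicy(TransferMethod.D2D, BindTiming.BEFORE_PREFILL)
    # Background requests: favor load balancing via the store plus late binding.
    return TransferPolicy(TransferMethod.STORE, BindTiming.AFTER_PREFILL)
```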
Furthermore, the transfer method can be integrated with the proxy's scheduling strategy to further optimize for SLO goals. In the current XpYd implementation, the proxy routes a request to a Prefill instance first and selects a Decode instance only after the prefill phase completes. Under this sequential process, KV Cache transfer cannot begin until prefill computation finishes, so transfer latency cannot be overlapped with the forward computation; it is added directly to the total end-to-end latency, making low TTFT hard to achieve. A better strategy for latency-sensitive requests is to pre-bind both a Prefill and a Decode instance before prefill begins and to overlap the layer-wise streaming of KV Cache blocks with the ongoing prefill computation. However, current KV connectors do not support this yet.
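As a rough illustration of the overlap idea—not the proposed implementation—the sketch below kicks off a non-blocking send as soon as each layer's KV block is produced, so transfer overlaps with the next layer's compute; `send_kv_async`, `layer.forward`, and the surrounding objects are all assumed for illustration:

```python
def prefill_with_layerwise_streaming(model, request, decode_instance):
    # The decode instance is already pre-bound before prefill starts.
    pending = []
    hidden = model.embed(request.tokens)
    for layer in model.layers:
        hidden, kv_block = layer.forward(hidden)  # compute this layer
        # Non-blocking send: this layer's KV block transfers while the
        # next layer's forward computation proceeds.
        pending.append(decode_instance.send_kv_async(layer.index, kv_block))
    for handle in pending:
        handle.wait()  # all KV blocks must arrive before decode handoff
    return hidden
```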
In this RFC, we propose TransferDockConnector, which enables hybrid kv_connector launching (store-based and D2D) within one serving process, together with request-level selection of both the KV Cache transfer timing and the connector. When the prefill and decode instances use different model-parallelism strategies, TransferDockConnector also auto-manages KV Cache resharding, performed either on the prefill instances or in the centralized store, according to each request's requirements.
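One possible shape for launching both paths in a single serving process is sketched below; the exact configuration schema is an open design question, and the values under `kv_connector_extra_config` are illustrative rather than existing options:

```python
# Illustrative only: a hybrid kv-transfer config that launches both a
# D2D connector and a store-based connector in one serving process.
hybrid_kv_transfer_config = {
    "kv_connector": "TransferDockConnector",
    "kv_role": "kv_producer",         # this instance serves prefill
    "kv_connector_extra_config": {
        "d2d_backend": "p2p",         # low-latency device-to-device path
        "store_backend": "mooncake",  # centralized-store path
        "default_method": "store",    # fallback when no SLO tag is present
    },
}
```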
Proposed Change.
Key features:
- Hybrid kv_connector launching
- Flexible request-level kv_connector and transfer-timing selection
- KV Cache resharding
Proposed change:
- In proxy: classify incoming requests to determine SLO sensitivity
- In scheduler: select the appropriate KV connector for each request
- In worker/model runner: enable asynchronous layer-wise KV Cache transfer when P/D instances are pre-bound for SLO-sensitive requests
- Resharding execution: perform KV Cache resharding on prefill instances (for D2D transfer) or on external stores (for store-based transfer); see the sketch after this list
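A minimal resharding sketch, assuming KV tensors are sharded over the KV-head dimension under tensor parallelism (the function name, shapes, and helper are illustrative, not the proposed implementation):

```python
import torch


def reshard_kv(shards: list, tp_dst: int) -> list:
    """Regroup per-rank KV shards of shape [heads_per_src_rank, tokens, head_dim]
    from a source tensor-parallel group into tp_dst destination shards.
    """
    full_kv = torch.cat(shards, dim=0)        # gather all KV heads
    assert full_kv.shape[0] % tp_dst == 0     # heads must divide evenly
    return list(torch.chunk(full_kv, tp_dst, dim=0))


# Example: prefill ran with TP=4 (2 KV heads per rank), decode runs with TP=2.
src_shards = [torch.randn(2, 16, 128) for _ in range(4)]
dst_shards = reshard_kv(src_shards, tp_dst=2)  # 2 shards of 4 heads each
```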
Feedback Period.
No response
CC List.
No response
Any Other Things.
No response