[RFC] [P/D] flexible request-level kvcache transfer timing and kv_connector selection #1

@MissFishY

Description

Motivation.

In current Prefill-Decode (PD) disaggregated serving architectures, the two methods for transferring KV cache from prefill instances to decode instances, Device-to-Device (D2D) and centralized store-based (e.g., via LMCache or Mooncake Store), are typically mutually exclusive and incompatible within a single deployment. This rigidity makes it hard to optimally meet diverse Service Level Objective (SLO) requirements.

A hybrid approach, where the transfer mechanism is dynamically selected per request according to its properties and priority, would provide significant benefits:

  • For SLO-sensitive requests (e.g., interactive chats), low-latency D2D transfer is ideal for minimizing Time to First Token (TTFT)
  • For SLO-insensitive requests (e.g., background tasks of AI Agents), the store-based transfer promotes better load balance and utilization within a sub-cluster
| Decode Instance Binding Timing | Extra Data Copy | Load Balancing | Latency | Compute/Transfer Overlap |
| --- | --- | --- | --- | --- |
| Before prefill | No | Poor | Low | Yes |
| After prefill | Yes | Good | High | No |
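The bullets and table above amount to a per-request routing policy. A minimal sketch, with hypothetical names (`SLOClass` and `select_connector` are illustrative, not part of any existing connector API):

```python
from enum import Enum

class SLOClass(Enum):
    SENSITIVE = "sensitive"      # e.g. interactive chat
    INSENSITIVE = "insensitive"  # e.g. background AI-agent tasks

def select_connector(slo: SLOClass) -> str:
    # SLO-sensitive requests take the low-latency D2D path to minimize TTFT;
    # everything else goes through the centralized store for better load balance.
    return "d2d" if slo is SLOClass.SENSITIVE else "store"
```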

Furthermore, the transfer method can be integrated with the proxy's scheduling strategy to better meet SLO goals. In the current XpYd implementation, the proxy first routes a request to a prefill instance and selects a decode instance only after the prefill phase completes. Because of this sequential process, KV cache transfer cannot begin until prefill computation finishes, so transfer latency cannot be overlapped with forward computation and is added directly to the end-to-end latency, making low TTFT hard to achieve. A better strategy for latency-sensitive requests is to pre-bind both a prefill and a decode instance before prefill begins and stream KV cache blocks layer by layer while the prefill computation is still running. However, current KV connectors do not support this yet.
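To make the overlap concrete, here is a minimal sketch of layer-wise streaming during prefill, assuming hypothetical `compute_layer` and `send_kv` callables (a real implementation would hook into the model runner instead):

```python
import queue
import threading

def prefill_with_layerwise_transfer(num_layers, compute_layer, send_kv):
    """Compute prefill layer by layer while a background thread streams
    each finished layer's KV block to the pre-bound decode instance."""
    pending: queue.Queue = queue.Queue()

    def sender():
        while True:
            item = pending.get()
            if item is None:                 # sentinel: prefill finished
                break
            layer_id, kv_block = item
            send_kv(layer_id, kv_block)      # transfer overlaps later layers' compute

    t = threading.Thread(target=sender)
    t.start()
    for layer_id in range(num_layers):
        kv_block = compute_layer(layer_id)   # forward pass for this layer
        pending.put((layer_id, kv_block))    # hand off without blocking compute
    pending.put(None)
    t.join()
```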

In this RFC, we propose TransferDockConnector, which enables hybrid kv_connector launching (store-based and D2D) in one serving process, along with request-level selection of the KV cache transfer timing and connector. When the prefill and decode instances use different model-parallelism strategies, TransferDockConnector also manages KV cache resharding automatically, either on the prefill instances or in the centralized store, according to each request's requirements.
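At its core, resharding regroups per-head KV blocks from the prefill tensor-parallel layout into the decode layout. A simplified sketch of the head-wise case only (real KV layouts and the TransferDockConnector internals will differ; `reshard_kv` is a hypothetical name):

```python
def reshard_kv(prefill_shards, decode_tp):
    """Regroup per-rank KV shards (each a list of per-head KV blocks in the
    prefill TP layout) into `decode_tp` shards for the decode instances."""
    heads = [h for shard in prefill_shards for h in shard]  # restore full head order
    assert len(heads) % decode_tp == 0, "head count must divide decode TP degree"
    per_rank = len(heads) // decode_tp
    return [heads[i * per_rank:(i + 1) * per_rank] for i in range(decode_tp)]
```

For example, going from prefill TP=2 (two shards of four heads) to decode TP=4 yields four shards of two heads each.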

Proposed Change.

Key features:

  • hybrid kv_connector launching
  • flexible request-level kv_connector selection and transfer timing selection
  • kv cache resharding

Proposed change:

  1. In proxy: classify incoming requests to determine SLO sensitivity
  2. In scheduler: select appropriate KV connector per request
  3. In worker/model runner: enable asynchronous layer-wise KV cache transfer when P/D instances are pre-bound for SLO-sensitive requests
  4. Resharding execution: perform KV Cache resharding on prefill instances (for D2D transfer) or on external stores (for store-based transfer)
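The steps above can be sketched end-to-end as a per-request transfer plan; `Request` and `plan_transfer` are hypothetical names used only for illustration:

```python
from dataclasses import dataclass

@dataclass
class Request:
    request_id: str
    slo_sensitive: bool  # set by the proxy's classifier (step 1)

def plan_transfer(req: Request) -> dict:
    # Steps 2-3: pick the connector and the transfer timing per request.
    if req.slo_sensitive:
        # Pre-bind P and D instances and stream KV layer-wise during prefill.
        return {"connector": "d2d", "timing": "during_prefill"}
    # Store-based path: write KV to the central store once prefill completes.
    return {"connector": "store", "timing": "after_prefill"}
```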

Feedback Period.

No response

CC List.

No response

Any Other Things.

No response

