
[RFC]: [P/D] A pluggable module for load-balanced & SLO-aware request routing and KV Cache Transfer #2


(Open question: should the RFC and its PR be submitted at the same time?)

Motivation.

In prefill-decode (PD) disaggregated serving architectures, existing routing strategies offer different advantages. A sequential strategy, which routes a request to a prefill instance first and selects a decode instance only after prefill completes, allows optimal, load-aware placement across decode instances. In contrast, a pre-bind strategy, which selects both the prefill and decode instances up front, enables overlapping KV Cache transfer with computation and significantly reduces Time to First Token (TTFT).

Similarly, the choice of KV Cache transfer pathway involves a performance-scalability trade-off. Point-to-point transfer (e.g., via NIXL or Transfer Engine) minimizes latency, while centralized store-based transfer (e.g., via LMCache or Mooncake Store) improves scalability and raises prefix-caching hit rates.

| Decode Instance Scheduling Time | Extra Data Copy | Load Balancing | Time Latency | Compute/KV Transfer Overlap |
|---|---|---|---|---|
| Before prefill | No | Poor | Low | Yes |
| After prefill | Yes | Good | High | No |

Table 1. Scheduling Time Trade-offs

However, current deployments typically force a static, system-wide choice between mutually exclusive P/D binding timings and KV Cache transfer pathways, applied uniformly to all requests. This inflexibility prevents dynamic adaptation to diverse workloads, such as low-latency interactive chat, high-throughput AI agent tasks, or variable-depth research queries, and hinders the system's ability to meet fine-grained Service Level Objective (SLO) targets across task types.

Proposed Change.

In this RFC, we propose a lightweight (open question), pluggable module that introduces per-request orchestration of routing and KV transfer decisions, consisting of:

  1. TransferDockController, a central controller that replaces the default proxy and dynamically determines:
    a). the prefill/decode instance pairing timing and the routing policy to use
    b). the KV Cache transfer pathway and timing
    Decisions are based on real-time instance workload and utilization, as well as the task type and SLO requirement of each request. (A minimal sketch of such a per-request decision follows the overview figure below.)

  2. An extension of the existing MultiConnector, a counterpart on each instance that:
    a). manages hybrid transfer backends concurrently and selects a transfer backend according to the metadata bound to each request

(Figure: module overview)
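
To make the controller's role concrete, here is a minimal Python sketch of one per-request decision, under stated assumptions: RequestProfile, TransferDecision, and decide() are hypothetical names invented for illustration, and the policy logic is only one plausible mapping; the bound field names (kvtransfer_start_timing, connector_type) come from Milestone 1 below.

```python
from dataclasses import dataclass

@dataclass
class RequestProfile:             # hypothetical per-request input
    request_type: str             # e.g. "simple_chat", "agent_background_task"
    prompt_length: int
    ttft_slo_ms: float            # per-request TTFT target

@dataclass
class TransferDecision:           # hypothetical controller output
    pair_timing: str              # "pre-bind" or "sequential"
    kvtransfer_start_timing: str  # e.g. "prefill-completed", "layerwise-async"
    connector_type: str           # e.g. "nixl", "mooncake_store"

def decide(profile: RequestProfile, decode_load_balanced: bool) -> TransferDecision:
    """One plausible per-request policy; the real logic is an open design point."""
    if profile.ttft_slo_ms < 500:
        # Tight TTFT SLO: pre-bind decode and overlap KV transfer with compute.
        return TransferDecision("pre-bind", "layerwise-async", "nixl")
    if not decode_load_balanced:
        # Skewed load: defer decode selection until prefill has completed.
        return TransferDecision("sequential", "prefill-completed", "mooncake_store")
    return TransferDecision("pre-bind", "prefill-completed-async", "transfer_engine")

# The decision is then bound onto the request metadata, as in Milestone 1:
d = decide(RequestProfile("simple_chat", 128, 300.0), decode_load_balanced=True)
original_request_data = {"kvtransfer_start_timing": d.kvtransfer_start_timing}
prefill_request = {"connector_type": d.connector_type}
```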

Key features:

  • Load-balanced & SLO-aware request routing
  • Hybrid & flexible KV Cache Transfer per request

Milestones for core feature implementation:

  1. Hybrid KV transfer backend selection at the request level: involves use of the MultiConnector and scheduler adaptation
    a). TransferDockController:

    • Bind KV Cache transfer timing and connector type info to each request: original_request_data["kvtransfer_start_timing"] = "prefill-completed" / "prefill-completed-async" / "layerwise-async" (layerwise support depends on a separate PR), prefill_request["connector_type"] = "mooncake_store" / "transfer_engine" / "hccs" / "nixl". (open question: final field names and values)

    b). MultiConnector:

    • Receive the pre-bound info from TransferDockController into MultiConnector's self._requests_to_connector: dict[str, KVConnectorBase_V1], a mapping from request id to the chosen connector. The connector info reaches MultiConnector through the scheduler via a "kvconnector_info" field bound to the Request class.
    • Bind the metadata to each connector in the order of the connectors in the MultiConnectorMetaData through bind_connector_metadata() (called by the model_runner).

    c). Enable the decoder to receive request metadata from prefill instances directly via the ZMQ DEALER-ROUTER pattern (a minimal sketch of this handshake follows this milestone)

    • For P2P connectors: on the prefill side, a zmq.ROUTER socket is deployed to listen for requests from decode instances pulling KV Cache metadata (using zmq.Poller to avoid busy-waiting on the CPU); on the decode side, a zmq.DEALER socket is employed to listen for the prefill completion signal (potentially layerwise). (open question: perhaps PUSH/PULL mode could be used instead?)
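
Below is a minimal runnable sketch of the proposed DEALER-ROUTER handshake using pyzmq. The endpoint, message framing, and payload contents are assumptions made for illustration; only the socket pattern and the use of zmq.Poller to avoid busy-waiting come from the item above.

```python
import threading
import zmq

ENDPOINT = "tcp://127.0.0.1:5555"  # hypothetical address

def prefill_side_once() -> None:
    """Prefill instance: a ROUTER socket answers one metadata-pull request."""
    ctx = zmq.Context.instance()
    router = ctx.socket(zmq.ROUTER)
    router.bind(ENDPOINT)
    poller = zmq.Poller()
    poller.register(router, zmq.POLLIN)
    # Poll with a timeout instead of spinning, so the CPU is not busy-waited.
    if dict(poller.poll(timeout=2000)):
        identity, request_id = router.recv_multipart()
        # Reply with (hypothetical) KV Cache metadata / completion signal.
        router.send_multipart([identity, request_id, b"kv-cache-metadata"])
    router.close()

def decode_side_once() -> None:
    """Decode instance: a DEALER socket pulls metadata for one request id."""
    ctx = zmq.Context.instance()
    dealer = ctx.socket(zmq.DEALER)
    dealer.connect(ENDPOINT)
    dealer.send_multipart([b"req-42"])              # ask about a request id
    request_id, metadata = dealer.recv_multipart()  # ROUTER strips the identity
    print(request_id, metadata)
    dealer.close()

if __name__ == "__main__":
    t = threading.Thread(target=prefill_side_once)
    t.start()
    decode_side_once()
    t.join()
```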
  2. Workload & resource utilization reporting and monitoring: metrics are collected and reported with heartbeat signals by a plugin reporter inserted into the scheduler; this involves TransferDockController plus scheduler, llm_engine, and api_server adaptation (a sketch of the heartbeat payload follows this milestone)
    a). Request metrics from gateway/scheduler/TransferDockController:

    • request_length: length of the prompt
    • request_type: e.g. 'agent_background_task', 'multi-conversation', or 'simple_chat'
    • prompt_difficulty: reflects how difficult the prompt is to answer, and thus indicates the likely response length; a ranking model may be incorporated to classify prompts
    • step_response_length: the average number of tokens already generated (stepped) by requests in the running queue

    b). Workload metrics from the scheduler:

    • num_free_blocks: number of free KV Cache blocks that can be assigned to ongoing or new requests
    • step_tokens_to_be_calc: number of tokens still to be computed across the waiting and running queues
    • latency_per_token: the per-token latency measured over one step

    c). Cached KVBlock info published by kv_events:
    Subscribe to kv events in TransferDockController and maintain a global_cached_kv_map
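
A minimal sketch of the heartbeat payload such a plugin reporter might emit, using the metric names listed above; the HeartbeatReport class, the scheduler accessors, and the stub are illustrative assumptions, not existing interfaces.

```python
import time
from dataclasses import dataclass, asdict

@dataclass
class HeartbeatReport:
    """Hypothetical per-instance heartbeat carrying the workload metrics above."""
    instance_id: str
    timestamp: float
    num_free_blocks: int
    step_tokens_to_be_calc: int
    latency_per_token: float

def collect_report(scheduler, instance_id: str) -> dict:
    # The attribute names mirror the metrics above; the real adaptation
    # points in scheduler/llm_engine/api_server are to be defined in the PR.
    return asdict(HeartbeatReport(
        instance_id=instance_id,
        timestamp=time.time(),
        num_free_blocks=scheduler.num_free_blocks,
        step_tokens_to_be_calc=scheduler.step_tokens_to_be_calc,
        latency_per_token=scheduler.latency_per_token,
    ))

class _StubScheduler:  # stand-in for the real scheduler, for illustration only
    num_free_blocks = 1024
    step_tokens_to_be_calc = 4096
    latency_per_token = 0.012

print(collect_report(_StubScheduler(), "prefill-0"))
```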

  3. Built-in and user-registrable routing policies integrated into TransferDockController (a registry sketch follows this list)
    a). Scheduling policies: (open question: final policy set)

    • Random / RoundRobin: among all healthy instances
    • CacheMatched: when the system load is balanced, requests are preferentially routed to the instances with the highest prefix-caching hit rate based on historical request data; when the load is imbalanced, requests are dynamically routed to the prefill instances with the smallest total number of pending request tokens, based on real-time load conditions (open question)
    • LengthAware: during decoding, schedule requests of similar lengths to the same instance to prevent throughput degradation caused by disparities in KV Cache loading times, thereby reducing TBT (beneficial in RL scenarios)
    • LongTailSpecific: reserve dedicated long-sequence decode instances and reduce their batch size to deliver lower TBT (for large-scale industrial deployment)

    b). Choice of scheduling policy according to request type

    • Simple chat / brief conclusion -> Random / RoundRobin / CacheMatched
    • Multi-round conversation / AI agent background task -> LengthAware / CacheMatched / LongTailSpecific
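
A minimal sketch of how built-in and user-registrable policies might plug into TransferDockController; the decorator-based registry, the policy signature, and the request-type mapping are illustrative assumptions rather than a settled interface.

```python
import random
from typing import Callable

# Hypothetical registry: policy name -> callable picking an instance id.
_POLICIES: dict[str, Callable[[list[str]], str]] = {}

def register_policy(name: str):
    """Decorator so users can register custom routing policies by name."""
    def wrap(fn: Callable[[list[str]], str]):
        _POLICIES[name] = fn
        return fn
    return wrap

@register_policy("random")
def random_policy(healthy_instances: list[str]) -> str:
    return random.choice(healthy_instances)

_rr_counter = {"i": 0}

@register_policy("round_robin")
def round_robin_policy(healthy_instances: list[str]) -> str:
    choice = healthy_instances[_rr_counter["i"] % len(healthy_instances)]
    _rr_counter["i"] += 1
    return choice

# Hypothetical mapping from request type to preferred policy (table b above);
# "length_aware" etc. would be registered the same way.
POLICY_BY_REQUEST_TYPE = {
    "simple_chat": "round_robin",
    "agent_background_task": "length_aware",
}

def route(request_type: str, healthy: list[str]) -> str:
    name = POLICY_BY_REQUEST_TYPE.get(request_type, "random")
    return _POLICIES.get(name, _POLICIES["random"])(healthy)

print(route("simple_chat", ["decode-0", "decode-1"]))  # -> decode-0
```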

Feedback Period.

One week

CC List.

No response

Any Other Things.

No response

