PegaFlow is a high-performance KV cache storage engine for LLM inference. It offloads KV cache from GPU memory to host memory or SSD, and shares it across nodes via RDMA.
- Decoupled from inference lifecycle — runs as an independent sidecar; KV cache survives engine restarts, scales independently, and is shared across instances
- Topology-aware, PCIe-saturating transfers — NUMA-aware pinned memory + layer-wise DMA to maximize hardware bandwidth
- GIL-free Rust core — zero Python overhead on the hot path; your inference engine keeps its threads
- Production-ready observability — built-in Prometheus metrics and OTLP export, not an afterthought
- Pluggable — works with vLLM and SGLang as a drop-in KV connector
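Because metrics are exposed in the standard Prometheus text format, a scrape can be checked with a few lines of Python. A minimal sketch of parsing that format — the metric names and values below are made-up placeholders, not PegaFlow's actual metric names (see the Metrics reference for those):

```python
def parse_prometheus(text: str) -> dict[str, float]:
    """Parse Prometheus text exposition format into {metric: value}."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank/HELP/TYPE lines
            continue
        name, _, value = line.rpartition(" ")
        try:
            metrics[name] = float(value)
        except ValueError:
            continue  # ignore lines that are not simple name/value pairs
    return metrics

# Hypothetical scrape output, for illustration only.
sample = """\
# HELP kv_cache_hits_total Hypothetical cache-hit counter
# TYPE kv_cache_hits_total counter
kv_cache_hits_total 1024
kv_cache_bytes_used 8.5e9
"""
print(parse_prometheus(sample)["kv_cache_hits_total"])  # -> 1024.0
```

This only handles unlabeled metrics; a real consumer would point Prometheus or an OTLP collector at the server instead.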
| Framework | Status | Link |
|---|---|---|
| vLLM | ✅ Ready | Quick Start |
| SGLang | 🚧 Under Review | PR #17221 |
```bash
uv pip install pegaflow-llm       # CUDA 12
uv pip install pegaflow-llm-cu13  # CUDA 13
```

Start the server:

```bash
pegaflow-server
```

vLLM (recommended):
```bash
vllm serve Qwen/Qwen3-0.6B \
    --kv-transfer-config '{"kv_connector": "PegaKVConnector", "kv_role": "kv_both", "kv_connector_module_path": "pegaflow.connector"}'
```

SGLang:
```bash
python3 -m sglang.launch_server \
    --model-path Qwen/Qwen3-0.6B \
    --enable-pegaflow
```

For full server options, multi-node setup, and advanced configuration, see Server Configuration.
```bash
export PYO3_PYTHON=$(which python)
export LD_LIBRARY_PATH=$(python -c "import sysconfig; print(sysconfig.get_config_var('LIBDIR'))"):$LD_LIBRARY_PATH
cargo run -r                     # start the server
cd python && maturin develop -r  # build the Python bindings
```

We use Conventional Commits; run `cz c` for an interactive commit prompt.
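The `LD_LIBRARY_PATH` export above uses `sysconfig` to find the directory holding the interpreter's `libpython` shared library, which the Rust bindings link against. The same lookup from a Python prompt, handy when debugging a build that fails to load:

```python
import sysconfig

# LIBDIR is where this interpreter's libpython*.so lives; the export line
# in the build instructions prepends it to LD_LIBRARY_PATH.
libdir = sysconfig.get_config_var("LIBDIR")
print(libdir)  # e.g. /usr/lib/x86_64-linux-gnu on a typical Linux install
```

If this prints a directory that is not on your `LD_LIBRARY_PATH`, the bindings may fail to resolve `libpython` at runtime.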
H800 reference numbers with Llama-3.1-8B (8 prompts, 10K-token prefill, 1-token decode, 4.0 req/s):
| Configuration | TTFT mean (ms) | TTFT p99 (ms) |
|---|---|---|
| PegaFlow (Cold) | 572.5 | 1113.7 |
| PegaFlow (Warm) | 61.5 | 77.0 |
The warm-start path cuts mean TTFT by roughly 9x relative to cold start (572.5 ms vs 61.5 ms), demonstrating effective KV cache sharing across requests.
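The speedup figures follow directly from the table; a quick worked check of both the mean and the p99 ratios:

```python
# Warm-vs-cold TTFT speedups, taken from the benchmark table above (ms).
cold_mean, warm_mean = 572.5, 61.5
cold_p99, warm_p99 = 1113.7, 77.0

mean_speedup = cold_mean / warm_mean
p99_speedup = cold_p99 / warm_p99

print(f"mean speedup: {mean_speedup:.1f}x")  # -> mean speedup: 9.3x
print(f"p99 speedup:  {p99_speedup:.1f}x")   # -> p99 speedup:  14.5x
```

The tail (p99) improves even more than the mean, since warm requests skip the prefill-sized transfer entirely.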
- Server Configuration — full CLI options, SSD cache, multi-node setup
- P/D Router — prefill/decode disaggregation
- vLLM I/O Patch — optional patch for better transfer throughput
- Metrics — Prometheus and OTLP metrics reference
- Goals & Non-Goals — project scope and design philosophy
