Commit 0f1aa07

minor edit.
1 parent 9dfd086 commit 0f1aa07

File tree

1 file changed: +17 -12 lines changed

_posts/2025-10-26-zero_reload_model_switching_with_vllm_sleep_mode.md

Lines changed: 17 additions & 12 deletions
@@ -29,8 +29,8 @@ Even with instant weight loading, every cold start pays hidden costs that Sleep
 | 1. VRAM load time | Copying weights to GPU | ✅ Optimized | ✅ Preserved |
 | 2. Memory allocator setup | CUDA allocator initialization | ❌ Every time | ✅ Preserved |
 | 3. CUDA graph capture | Record execution graphs | ❌ Every time | ✅ Preserved |
-| 4. GPU kernel JIT compilation | DeepGEMM, FlashInfer, TorchInductor | ❌ Every time | ⚡ Quick re-warm |
-| 5. Cache warm-up | First-request overhead | ❌ Every time | ⚡ Quick re-warm |
+| 4. GPU kernel JIT compilation | DeepGEMM, FlashInfer, TorchInductor | ❌ Every time | ✅ Preserved (after initial warmup) |
+| 5. Cache warm-up | First-request overhead | ❌ Every time | ✅ Preserved (after initial warmup) |

 By keeping the process alive, Sleep Mode preserves infrastructure (#2-3) and avoids expensive reinitialization. This is why benchmarks show **Sleep Mode inference is 61-88% faster** than cold starts.

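To make the preserved/re-paid costs above concrete, here is a minimal sketch of the Level 1 sleep/wake cycle using vLLM's offline `LLM` API. The model name and prompts are illustrative, and exact behavior can vary by vLLM version; treat the comments as a restatement of the cost breakdown above rather than a guarantee.

```python
# Minimal sketch of the Level 1 sleep/wake cycle with vLLM's offline API.
# Assumes a recent vLLM with sleep mode enabled; model name is illustrative.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", enable_sleep_mode=True)

# The first request pays the one-time costs: allocator setup, CUDA graph
# capture, kernel JIT compilation, and cache warm-up.
llm.generate(["warm-up prompt"], SamplingParams(max_tokens=1))

# Level 1: weights are offloaded to CPU RAM and the KV cache is freed, while
# the process, allocator, CUDA graphs, and compiled kernels stay alive.
llm.sleep(level=1)

# ... the freed VRAM can now host a different model ...

# Wake restores weights to VRAM without rebuilding any of the preserved state.
llm.wake_up()
outputs = llm.generate(["Hello!"], SamplingParams(max_tokens=16))
```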
@@ -120,6 +120,7 @@ Beyond faster model switching, Sleep Mode also delivers **faster inference times
 <div id="plotly-inference-comparison" style="width: 100%; height: 300px;"></div>
 <div style="text-align:center; color:#666; font-size:0.85rem; margin-top:0.75rem;">
 Inference time comparison showing wake mode (already warmed up) vs cold start (just loaded).<br>
+<strong>Inference time = prefill + decode (first request after wake/load).</strong> Each request uses a different question to avoid caching, limited to 100 tokens output.<br>
 Error bars show min/max variation across multiple runs. Values displayed on bars.<br>
 GPU: A100 | vLLM 0.11.0 | Sleep Level: 1 | Compilation: <code style="font-size:0.8rem;">cudagraph_mode: FULL_AND_PIECEWISE</code>
 </div>
@@ -137,7 +138,7 @@ The 61-88% inference speedup isn't from faster weight loading—it's from **pres
 | Memory allocator (CuMemAllocator) | ✅ Yes | ❌ Reinitialize every time |
 | CUDA graphs | ✅ Yes | ❌ Re-capture every time |
 | Process state (Python, CUDA context) | ✅ Yes | ❌ Restart every time |
-| GPU kernel JIT cache | ⚡ Quick re-warm | ❌ Recompile every time |
+| GPU kernel JIT cache | ✅ Yes (after initial warmup) | ❌ Recompile every time |

 **The Critical Difference:**

@@ -149,8 +150,7 @@ The 61-88% inference speedup isn't from faster weight loading—it's from **pres
 - **Result:** First inference is **4-7x slower** (see benchmarks: 0.92s wake vs 3.72s cold start)

 - **With Sleep Mode:** Process stays alive → **Pre-warm-up pays off**
-  - ✅ Allocator, graphs, and process state preserved
-  - ⚡ Only JIT kernels need quick re-warm
+  - ✅ Allocator, graphs, process state, and JIT kernels all preserved after initial warmup
 - **Result:** First inference stays fast (~1s), avoiding the 3-4s cold start penalty

 > [!NOTE]
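The same cycle can also be driven over HTTP when running the OpenAI-compatible server instead of the offline API. This sketch assumes a build that exposes the development-mode `/sleep`, `/wake_up`, and `/is_sleeping` endpoints (gated behind `VLLM_SERVER_DEV_MODE=1`); endpoint names and availability should be checked against your vLLM version.

```python
# Sketch: sleeping and waking a running vLLM OpenAI-compatible server over HTTP.
# Assumes the server was launched with VLLM_SERVER_DEV_MODE=1, which exposes
# /sleep, /wake_up, and /is_sleeping; availability may vary by vLLM version.
import requests

BASE = "http://localhost:8000"

# Put the server to sleep at Level 1 (weights to CPU RAM, KV cache freed).
requests.post(f"{BASE}/sleep", params={"level": "1"}).raise_for_status()
print(requests.get(f"{BASE}/is_sleeping").json())  # expect a sleeping=true flag

# ... route traffic to another model on the same GPU ...

requests.post(f"{BASE}/wake_up").raise_for_status()
# The first request after wake stays fast because the process was never killed.
```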
@@ -191,6 +191,7 @@ Sleep Mode benefits aren't limited to high-end GPUs. Here's the same workload on
 <div id="plotly-inference-a4000" style="width: 100%; height: 300px;"></div>
 <div style="text-align:center; color:#666; font-size:0.85rem; margin-top:0.75rem;">
 Inference time comparison on A4000: wake mode (already warmed up) vs cold start (just loaded).<br>
+<strong>Inference time = prefill + decode (first request after wake/load).</strong> Each request uses a different question to avoid caching, limited to 100 tokens output.<br>
 Error bars show min/max variation across multiple runs. Values displayed on bars.<br>
 GPU: A4000 (TP=1) | vLLM 0.11.0 | Sleep Level: 1 | Compilation: <code style="font-size:0.8rem;">cudagraph_mode: FULL_AND_PIECEWISE</code>
 </div>
@@ -270,12 +271,12 @@ When you reload a model without Sleep Mode, you pay all these costs:
 | 2. Process initialization |**Skipped** | ❌ Must pay |
 | 3. Memory allocator setup |**Skipped** | ❌ Must pay |
 | 4. CUDA graph capture |**Skipped** | ❌ Must pay |
-| 5. GPU kernel JIT compilation | ⚡ Lazy recompile on first inference | ❌ Full compilation + warm-up |
+| 5. GPU kernel JIT compilation | **Preserved (already compiled)** | ❌ Full compilation + warm-up |

 **Level 2 Strategy:**
 - Weight reload from SSD (same as No Sleep)
-- **Everything else preserved:** Process state, allocator instance, CUDA graphs all intact
-- **JIT kernels:** Lazily recompile on first inference (no explicit warm-up overhead)
+- **Everything else preserved:** Process state, allocator instance, CUDA graphs, and compiled JIT kernels all intact
+- **No recompilation needed:** Kernels were compiled during initial warmup and remain cached
 - **Average per switch: ~2.6s** (see benchmark data above)

 **No Sleep Mode Reality:**
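A rough sketch of the Level 2 flow described above, using the same offline API as before. How the weight contents are actually repopulated from SSD after waking depends on the serving harness, so that step is only indicated by a comment rather than a specific call.

```python
# Sketch of a Level 2 sleep/wake switch with vLLM's offline API.
# Level 2 discards weight contents (and the KV cache) instead of offloading
# them to CPU RAM, so they must be reloaded from disk after wake_up().
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", enable_sleep_mode=True)
llm.generate(["warm-up"], SamplingParams(max_tokens=1))  # compile kernels once

llm.sleep(level=2)  # free VRAM; keep process, allocator, CUDA graphs, kernels

# ... serve a different model in the freed VRAM ...

llm.wake_up()       # re-allocate GPU buffers; nothing is recompiled
# At this point the weight buffers exist but are empty: reload the checkpoint
# from SSD here (the exact mechanism depends on your setup) before serving.
```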
@@ -296,6 +297,7 @@ Even though both reload weights from SSD, Level 2 is **2.9x faster overall** bec
 <div id="plotly-level2-inference" style="width: 100%; height: 300px;"></div>
 <div style="text-align:center; color:#666; font-size:0.85rem; margin-top:0.75rem;">
 Inference time comparison with Sleep Level 2: wake mode vs cold start.<br>
+<strong>Inference time = prefill + decode (first request after wake/load).</strong> Each request uses a different question to avoid caching, limited to 100 tokens output.<br>
 Error bars show min/max variation across multiple runs. Values displayed on bars.<br>
 GPU: A100 (TP=1) | vLLM 0.11.0 | Sleep Level: 2 | Compilation: <code style="font-size:0.8rem;">cudagraph_mode: FULL_AND_PIECEWISE</code>
 </div>
@@ -363,9 +365,11 @@ Does skipping the warm-up phase affect performance? Warm-up pre-compiles CUDA gr
 | **Total Time (5 switches)** | 119.5s | 119.0s | Nearly identical |

 **Insights:**
-- **Warm-Up is Essential for First Inference:** Without warm-up, the first inference after wake is 5-7x slower (lazy CUDA graph compilation)
-- **Subsequent Inferences Are Fast:** After the first inference compiles the graphs, performance normalizes
-- **Trade Initial Load Time for User Experience:** The 8.4s warm-up cost is amortized across all subsequent fast inferences
+- **Warm-Up Compiles Kernels Once, Benefits All Wake Cycles:** With initial warmup, JIT compilation and CUDA graph capture happen once during load and are preserved across all subsequent sleep/wake cycles
+- **Without Warm-Up, Every Wake-Up Pays Compilation Cost:** The 5-7x slowdown happens on the first inference after **every single wake-up**, not just once
+- **Compiled Kernels Are Preserved Across Sleep/Wake:** After warmup during initial load (8.4s), all subsequent wake-ups have fast first inference (0.45s, 0.93s) proving kernels stay cached
+- **Minimal Warmup Sufficient:** A single 1-token inference is enough to trigger full JIT compilation and CUDA graph capture, making warmup very cheap
+- **Trade Initial Load Time for Consistent Performance:** The 8.4s warmup cost is paid once and amortized across all model switches
 - **Recommendation: Always Use Warm-Up** for production workloads where consistent, fast inference is expected

 ### Impact of Quantization on Sleep Mode
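As a sketch of how the minimal warm-up described above might be wired into a sleep/wake loop: one tiny request after the initial load, then timed first-inference measurements after each wake. Model, prompts, and cycle count are illustrative, and the printed timings depend entirely on your hardware.

```python
# Sketch: one 1-token warm-up request after load, then timed sleep/wake cycles.
# Model name, prompts, and cycle count are illustrative.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", enable_sleep_mode=True)

# Per the post, a single 1-token generation is enough to trigger JIT
# compilation and CUDA graph capture; the cost is paid once at load time.
llm.generate(["warm-up"], SamplingParams(max_tokens=1))

for i in range(3):
    llm.sleep(level=1)
    llm.wake_up()
    start = time.perf_counter()
    llm.generate([f"Question {i}: what does sleep mode preserve?"],
                 SamplingParams(max_tokens=100))
    # With the initial warm-up this first post-wake request stays fast;
    # skip the warm-up and every cycle pays the compilation cost again.
    print(f"first inference after wake #{i}: {time.perf_counter() - start:.2f}s")
```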
@@ -388,6 +392,7 @@ Does quantization (FP8) affect Sleep Mode performance? We tested the same worklo
 <div id="plotly-ablation-inference" style="width: 100%; height: 300px;"></div>
 <div style="text-align:center; color:#666; font-size:0.85rem; margin-top:0.75rem;">
 Inference time comparison: BF16 vs FP8 quantization with Sleep Mode.<br>
+<strong>Inference time = prefill + decode (first request after wake/load).</strong> Each request uses a different question to avoid caching, limited to 100 tokens output.<br>
 Error bars show min/max variation across multiple runs. Values displayed on bars.<br>
 GPU: A100 (TP=1) | vLLM 0.11.0 | Sleep Level: 1 | Compilation: <code style="font-size:0.8rem;">cudagraph_mode: FULL_AND_PIECEWISE</code>
 </div>
@@ -455,4 +460,4 @@ The future of LLM serving is multi-model. Sleep Mode makes it practical today.

 ## Acknowledgements

-Special thanks to **Vensen Mu**, **Jeff Aw**, **Jun Kang Chow**, **Tun Jian Tan**, **Pin Siang Tan**, **Amir Balwel**, and **Kaichao You** for developing the Sleep Mode feature and inspiring this blog post.
+Special thanks to **Vensen Mu**, **Jeff Aw**, **Jun Kang Chow**, **Tun Jian Tan**, **Pin Siang Tan**, **Amir Balwel**, **Ye Hur Cheong**, and **Kaichao You** for developing the Sleep Mode feature and inspiring this blog post.
