_posts/2025-10-26-zero_reload_model_switching_with_vllm_sleep_mode.md (17 additions, 12 deletions)
@@ -29,8 +29,8 @@ Even with instant weight loading, every cold start pays hidden costs that Sleep
 | 1. VRAM load time | Copying weights to GPU | ✅ Optimized | ✅ Preserved |
 | 2. Memory allocator setup | CUDA allocator initialization | ❌ Every time | ✅ Preserved |
 | 3. CUDA graph capture | Record execution graphs | ❌ Every time | ✅ Preserved |
-| 4. GPU kernel JIT compilation | DeepGEMM, FlashInfer, TorchInductor | ❌ Every time | ⚡ Quick re-warm |
-| 5. Cache warm-up | First-request overhead | ❌ Every time | ⚡ Quick re-warm |
+| 4. GPU kernel JIT compilation | DeepGEMM, FlashInfer, TorchInductor | ❌ Every time | ✅ Preserved (after initial warmup) |
+| 5. Cache warm-up | First-request overhead | ❌ Every time | ✅ Preserved (after initial warmup) |

 By keeping the process alive, Sleep Mode preserves infrastructure (#2-3) and avoids expensive reinitialization. This is why benchmarks show **Sleep Mode inference is 61-88% faster** than cold starts.
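To make the table above concrete, here is a minimal sketch of the sleep/wake cycle it describes, using vLLM's offline `LLM` API with `enable_sleep_mode`. The model name and prompts are placeholders, not the benchmark setup from the post.

```python
from vllm import LLM, SamplingParams

# Load once with Sleep Mode enabled (placeholder model; assumes a recent vLLM).
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", enable_sleep_mode=True)
params = SamplingParams(max_tokens=100)

print(llm.generate(["What does Sleep Mode preserve?"], params)[0].outputs[0].text)

# Sleep level 1: offload weights to CPU RAM and discard the KV cache,
# while the process, allocator state, and captured CUDA graphs stay alive.
llm.sleep(level=1)

# ... run a different model in the freed VRAM ...

# Wake up: weights are copied back; rows 2-3 of the table were never torn down.
llm.wake_up()
print(llm.generate(["And after waking up?"], params)[0].outputs[0].text)
```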
@@ -120,6 +120,7 @@ Beyond faster model switching, Sleep Mode also delivers **faster inference times
 Inference time comparison showing wake mode (already warmed up) vs cold start (just loaded).<br>
+<strong>Inference time = prefill + decode (first request after wake/load).</strong> Each request uses a different question to avoid caching, limited to 100 tokens output.<br>
 Error bars show min/max variation across multiple runs. Values displayed on bars.<br>

 Inference time comparison on A4000: wake mode (already warmed up) vs cold start (just loaded).<br>
+<strong>Inference time = prefill + decode (first request after wake/load).</strong> Each request uses a different question to avoid caching, limited to 100 tokens output.<br>
 Error bars show min/max variation across multiple runs. Values displayed on bars.<br>

 Inference time comparison with Sleep Level 2: wake mode vs cold start.<br>
+<strong>Inference time = prefill + decode (first request after wake/load).</strong> Each request uses a different question to avoid caching, limited to 100 tokens output.<br>
 Error bars show min/max variation across multiple runs. Values displayed on bars.<br>
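The caption added to each of these figures describes the measurement itself. Roughly, that first-request number could be collected as in the sketch below; this is only an illustration of the stated methodology (placeholder model and question, plain wall-clock timing), not the benchmark script behind the plots.

```python
import time
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", enable_sleep_mode=True)
params = SamplingParams(max_tokens=100)  # output capped at 100 tokens, as in the captions

llm.generate(["warmup"], SamplingParams(max_tokens=1))  # warm up before the sleep/wake cycle
llm.sleep(level=1)
llm.wake_up()

# "Inference time" = wall-clock time of the first request after wake,
# covering both prefill and decode. A fresh question avoids prefix-cache hits.
start = time.perf_counter()
llm.generate(["How does paged attention differ from a contiguous KV cache?"], params)
print(f"first inference after wake: {time.perf_counter() - start:.2f}s")
```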
-- **Warm-Up is Essential for First Inference:** Without warm-up, the first inference after wake is 5-7x slower (lazy CUDA graph compilation)
-- **Subsequent Inferences Are Fast:** After the first inference compiles the graphs, performance normalizes
-- **Trade Initial Load Time for User Experience:** The 8.4s warm-up cost is amortized across all subsequent fast inferences
+- **Warm-Up Compiles Kernels Once, Benefits All Wake Cycles:** With initial warmup, JIT compilation and CUDA graph capture happen once during load and are preserved across all subsequent sleep/wake cycles
+- **Without Warm-Up, Every Wake-Up Pays Compilation Cost:** The 5-7x slowdown happens on the first inference after **every single wake-up**, not just once
+- **Compiled Kernels Are Preserved Across Sleep/Wake:** After warmup during initial load (8.4s), all subsequent wake-ups have fast first inference (0.45s, 0.93s) proving kernels stay cached
+- **Minimal Warmup Sufficient:** A single 1-token inference is enough to trigger full JIT compilation and CUDA graph capture, making warmup very cheap
+- **Trade Initial Load Time for Consistent Performance:** The 8.4s warmup cost is paid once and amortized across all model switches
 - **Recommendation: Always Use Warm-Up** for production workloads where consistent, fast inference is expected

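As a sketch of the "minimal warmup" point above: a single 1-token generation right after load is what triggers JIT compilation and CUDA graph capture, and it only needs to happen once before the first sleep (the model name is a placeholder).

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", enable_sleep_mode=True)

# One tiny generation triggers kernel JIT compilation and CUDA graph capture
# during the initial load (the 8.4s warmup cost discussed above).
llm.generate(["warmup"], SamplingParams(max_tokens=1))

# Every later sleep/wake cycle reuses the compiled kernels, so the first
# request after wake stays fast instead of paying the 5-7x penalty.
llm.sleep(level=1)
llm.wake_up()
```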
### Impact of Quantization on Sleep Mode
@@ -388,6 +392,7 @@ Does quantization (FP8) affect Sleep Mode performance? We tested the same worklo
 Inference time comparison: BF16 vs FP8 quantization with Sleep Mode.<br>
+<strong>Inference time = prefill + decode (first request after wake/load).</strong> Each request uses a different question to avoid caching, limited to 100 tokens output.<br>
 Error bars show min/max variation across multiple runs. Values displayed on bars.<br>
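For reference, a minimal sketch of the FP8 side of this comparison, assuming a checkpoint and GPU that vLLM can serve with `quantization="fp8"`; the model name is a placeholder.

```python
from vllm import LLM, SamplingParams

# Same Sleep Mode flow as the BF16 runs, but with FP8 weights: roughly half the
# bytes to offload on sleep(level=1) and to copy back on wake_up().
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder checkpoint
    quantization="fp8",                # online FP8 quantization (hardware permitting)
    enable_sleep_mode=True,
)

llm.generate(["warmup"], SamplingParams(max_tokens=1))
llm.sleep(level=1)
llm.wake_up()
llm.generate(["Summarize the trade-offs of FP8 inference."], SamplingParams(max_tokens=100))
```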
@@ -455,4 +460,4 @@ The future of LLM serving is multi-model. Sleep Mode makes it practical today.

 ## Acknowledgements
-Special thanks to **Vensen Mu**, **Jeff Aw**, **Jun Kang Chow**, **Tun Jian Tan**, **Pin Siang Tan**, **Amir Balwel**, and **Kaichao You** for developing the Sleep Mode feature and inspiring this blog post.
+Special thanks to **Vensen Mu**, **Jeff Aw**, **Jun Kang Chow**, **Tun Jian Tan**, **Pin Siang Tan**, **Amir Balwel**, **Ye Hur Cheong**, and **Kaichao You** for developing the Sleep Mode feature and inspiring this blog post.