Commit 4b7e854

edit cache warmup
1 parent 0f1aa07 commit 4b7e854

1 file changed: +1 -1 lines changed

_posts/2025-10-26-zero_reload_model_switching_with_vllm_sleep_mode.md

Lines changed: 1 addition & 1 deletion
@@ -30,7 +30,7 @@ Even with instant weight loading, every cold start pays hidden costs that Sleep
 | 2. Memory allocator setup | CUDA allocator initialization | ❌ Every time | ✅ Preserved |
 | 3. CUDA graph capture | Record execution graphs | ❌ Every time | ✅ Preserved |
 | 4. GPU kernel JIT compilation | DeepGEMM, FlashInfer, TorchInductor | ❌ Every time | ✅ Preserved (after initial warmup) |
-| 5. Cache warm-up | First-request overhead | ❌ Every time | ✅ Preserved (after initial warmup) |
+| 5. Cache warm-up | First-request overhead | ❌ Every time | ⚡ Quick re-warm |
 
 By keeping the process alive, Sleep Mode preserves infrastructure (#2-3) and avoids expensive reinitialization. This is why benchmarks show **Sleep Mode inference is 61-88% faster** than cold starts.
