edit cache warmup

tanpinsiang · tanpinsiang · commit 4b7e85453082 · 2025-10-27T14:12:13.000+08:00
diff --git a/_posts/2025-10-26-zero_reload_model_switching_with_vllm_sleep_mode.md b/_posts/2025-10-26-zero_reload_model_switching_with_vllm_sleep_mode.md
@@ -30,7 +30,7 @@ Even with instant weight loading, every cold start pays hidden costs that Sleep
 | 2. Memory allocator setup | CUDA allocator initialization | ❌ Every time | ✅ Preserved |
 | 3. CUDA graph capture | Record execution graphs | ❌ Every time | ✅ Preserved |
 | 4. GPU kernel JIT compilation | DeepGEMM, FlashInfer, TorchInductor | ❌ Every time | ✅ Preserved (after initial warmup) |
-| 5. Cache warm-up | First-request overhead | ❌ Every time | ✅ Preserved (after initial warmup) |
+| 5. Cache warm-up | First-request overhead | ❌ Every time | ⚡ Quick re-warm |
 
 By keeping the process alive, Sleep Mode preserves infrastructure (#2-3) and avoids expensive reinitialization. This is why benchmarks show **Sleep Mode inference is 61-88% faster** than cold starts.