diff --git a/_posts/2025-10-26-sleep-mode.md b/_posts/2025-10-26-sleep-mode.md
new file mode 100644
index 0000000..316207f
--- /dev/null
+++ b/_posts/2025-10-26-sleep-mode.md
@@ -0,0 +1,471 @@
---
layout: post
title: "Zero-Reload Model Switching with vLLM Sleep Mode"
author: "Embedded LLM"
image: /assets/figures/2025-vllm-sleep-mode/sleepmode.png
thumbnail-img: /assets/figures/2025-vllm-sleep-mode/sleepmode.png
share-img: /assets/figures/2025-vllm-sleep-mode/sleepmode.png
---

## Introduction

**The multi-model serving problem:** You have two LLMs that each fit on your GPU, but not both at once. Traditional solutions force a bad tradeoff:

1. **Keep both models loaded** → Requires 2x the GPU memory (expensive, often impossible)
2. **Reload models on-demand** → 30-100+ seconds per switch (slow, wasteful)

![vLLM Sleep Mode](/assets/figures/2025-vllm-sleep-mode/sleepmode.png)

**vLLM Sleep Mode offers a third way:** Models hibernate in seconds and wake up fast, delivering the efficiency of on-demand loading with the speed of persistent serving.

### Two Sleep Levels for Different Needs

- **Level 1:** Offloads weights to CPU RAM (fastest wake time)
- **Level 2:** Discards weights entirely (nearly as fast wake time, minimal RAM usage)

Both levels are **18-200x faster** than a full reload and work seamlessly with Tensor Parallelism (TP), Pipeline Parallelism (PP), and Expert Parallelism (EP).

### Why Sleep Mode Beats Fast Weight Loaders

Even with instant weight loading, every cold start pays hidden costs that Sleep Mode avoids:

| Cost | Description | Fast Weight Loaders | Sleep Mode |
|------|-------------|---------------------|------------|
| 1. VRAM load time | Copying weights to GPU | ✅ Optimized | ✅ Preserved |
| 2. Memory allocator setup | CUDA allocator initialization | ❌ Every time | ✅ Preserved |
| 3. CUDA graph capture | Record execution graphs | ❌ Every time | ✅ Preserved |
| 4. GPU kernel JIT compilation | DeepGEMM, FlashInfer, TorchInductor | ❌ Every time | ✅ Preserved (after initial warmup) |
| 5. Cache warm-up | First-request overhead | ❌ Every time | ⚡ Quick re-warm |

By keeping the process alive, Sleep Mode preserves infrastructure (#2-4) and avoids expensive reinitialization. This is why benchmarks show **Sleep Mode inference is 61-88% faster** than cold starts.

**This post covers:**
- Comprehensive benchmarks across model sizes (0.6B to 235B) and GPUs (A4000 to A100)
- Technical deep-dives explaining the performance gains
- Ablation studies on warm-up impact and FP8 quantization
- A decision guide for choosing the right sleep level

## Quick Start: Using Sleep Mode

### Online Serving API

Start two vLLM servers with Sleep Mode enabled:

```bash
# Terminal 1: Start Phi-3-vision
export VLLM_SERVER_DEV_MODE=1
vllm serve microsoft/Phi-3-vision-128k-instruct --enable-sleep-mode --port 8001

# Terminal 2: Start Qwen3-0.6B
export VLLM_SERVER_DEV_MODE=1
vllm serve Qwen/Qwen3-0.6B --enable-sleep-mode --port 8002
```

### Sleep and Wake Models

```bash
# Put Phi-3-vision to sleep (Level 2 - minimal RAM usage)
curl -X POST 'localhost:8001/sleep?level=2'

# Put Qwen3-0.6B to sleep (Level 1 - weights offloaded to CPU RAM)
curl -X POST 'localhost:8002/sleep?level=1'

# Wake up Phi-3-vision for inference
curl -X POST 'localhost:8001/wake_up'
curl -X POST 'localhost:8001/collective_rpc' \
  -H 'Content-Type: application/json' \
  -d '{"method":"reload_weights"}'

# IMPORTANT: Reset prefix cache after waking (Level 2 only)
curl -X POST 'localhost:8001/reset_prefix_cache'

# Now run inference on Phi-3-vision...
# (your inference requests here)

# Put back to sleep when done
curl -X POST 'localhost:8001/sleep?level=2'

# Wake up Qwen3-0.6B
curl -X POST 'localhost:8002/wake_up'
# (Level 1 doesn't need reload_weights or reset_prefix_cache)

# Run inference on Qwen3-0.6B...
```

> [!NOTE]
> For Level 2 sleep, you must call `reload_weights` and `reset_prefix_cache` after waking. Level 1 sleep doesn't require these extra steps.

> [!WARNING]
> **Security:** The `/sleep`, `/wake_up`, `/collective_rpc`, and `/reset_prefix_cache` endpoints require `VLLM_SERVER_DEV_MODE=1` and should only be exposed in trusted networks. These administrative endpoints can disrupt service and are intended for closed environments like training clusters or backend applications.

## Performance Overview

Let's see how Sleep Mode performs compared to traditional model reloading.

### Sleep Mode L1 vs No Sleep Mode Performance

The interactive chart below shows the **total time to perform 5 model switches**: running inference on Model A, switching to Model B, running inference on Model B, then repeating this pattern (A→B→A→B→A→B).

**With Sleep Mode:** Models sleep/wake between switches, preserving infrastructure.
**Without Sleep Mode:** Each switch requires a full vLLM restart and reload.

*Model A: Qwen3-235B-A22B-Instruct-2507-FP8 (TP=4) | Model B: Qwen3-Coder-30B-A3B-Instruct (TP=1)*
*GPU: A100 | vLLM 0.11.0 | Sleep Level: 1 | Compilation: cudagraph_mode: FULL_AND_PIECEWISE*
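The sleep/wake endpoint sequence from the Quick Start can be captured in a small client-side helper. This is an illustrative sketch: `wake_calls` and `switch_plan` are our own functions, not part of vLLM; they simply encode the rule above that a Level 2 wake needs `reload_weights` and `reset_prefix_cache`, while Level 1 does not.

```python
# Sketch: build the ordered list of HTTP endpoints to POST when moving
# traffic from one vLLM server to another. The helpers are hypothetical;
# only the endpoint paths come from the vLLM dev API shown above.

def wake_calls(port: int, level: int) -> list[str]:
    """Endpoints to POST, in order, to wake a server that slept at `level`."""
    calls = [f"localhost:{port}/wake_up"]
    if level == 2:
        # Level 2 discarded the weights, so reload them and reset the
        # prefix cache after waking. Level 1 skips both steps.
        calls.append(f"localhost:{port}/collective_rpc")   # body: {"method": "reload_weights"}
        calls.append(f"localhost:{port}/reset_prefix_cache")
    return calls

def switch_plan(sleep_port: int, wake_port: int, level: int) -> list[str]:
    """Sleep the active server, then wake the other one."""
    return [f"localhost:{sleep_port}/sleep?level={level}"] + wake_calls(wake_port, level)
```

In a real client each URL would be POSTed with `requests.post(...)`, and the servers must be running with `VLLM_SERVER_DEV_MODE=1` as noted above.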
## Inference Performance Boost

Beyond faster model switching, Sleep Mode also delivers **faster inference times**. Because models are already warmed up when woken from sleep, they skip the cold start overhead that affects freshly loaded models.

*Inference time comparison showing wake mode (already warmed up) vs cold start (just loaded).*
*Inference time = prefill + decode (first request after wake/load). Each request uses a different question to avoid caching, limited to 100 tokens output.*
*Error bars show min/max variation across multiple runs. Values displayed on bars.*
*GPU: A100 | vLLM 0.11.0 | Sleep Level: 1 | Compilation: cudagraph_mode: FULL_AND_PIECEWISE*
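The 61-88% figure is plain percentage-reduction arithmetic. As a sketch, using the example timings quoted elsewhere in this post (0.92s first inference after wake vs 3.72s after a cold start; treat them as illustrative):

```python
def percent_faster(cold_s: float, warm_s: float) -> float:
    """Percentage reduction in first-inference latency vs a cold start."""
    return (cold_s - warm_s) / cold_s * 100

# Illustrative numbers from the post: 0.92s wake vs 3.72s cold start.
improvement = percent_faster(3.72, 0.92)   # ~75% faster
speedup = 3.72 / 0.92                      # ~4x, the low end of the "4-7x slower" range
```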
#### Why Sleep Mode Improves Inference Speed

The 61-88% inference speedup isn't from faster weight loading; it comes from **preserving expensive infrastructure** that cold starts must rebuild from scratch.

**What Sleep Mode Preserves:**

| Component | Preserved? | Cold Start Must Pay |
|-----------|-----------|---------------------|
| Memory allocator (CuMemAllocator) | ✅ Yes | ❌ Reinitialize every time |
| CUDA graphs | ✅ Yes | ❌ Re-capture every time |
| Process state (Python, CUDA context) | ✅ Yes | ❌ Restart every time |
| GPU kernel JIT cache | ✅ Yes (after initial warmup) | ❌ Recompile every time |

**The Critical Difference:**

- **Without Sleep Mode:** Process dies on unload → **you CANNOT benefit from pre-warm-up**
  - Must restart the Python process and CUDA context
  - Must reinitialize the memory allocator
  - Must re-capture CUDA graphs
  - Must re-JIT compile kernels (DeepGEMM, FlashInfer, TorchInductor)
  - **Result:** First inference is **4-7x slower** (see benchmarks: 0.92s wake vs 3.72s cold start)
- **With Sleep Mode:** Process stays alive → **pre-warm-up pays off**
  - ✅ Allocator, graphs, process state, and JIT kernels all preserved after initial warmup
  - **Result:** First inference stays fast (~1s), avoiding the 3-4s cold start penalty

> [!NOTE]
> Timing varies significantly by model size, GPU generation, and configuration. See the [Impact of Warm-Up](#impact-of-warm-up-on-sleep-mode) section for detailed measurements showing a 5-7x slowdown without warm-up.

## Model Switching Performance

The most dramatic benefit of Sleep Mode is in model switching time. Waking a sleeping model is **18-20x faster** than loading a fresh vLLM instance.

*Model switching time: Wake from sleep vs cold start (fresh load).*
*Error bars show min/max variation across multiple runs. Values displayed on bars.*
*GPU: A100 | vLLM 0.11.0 | Sleep Level: 1 | Compilation: cudagraph_mode: FULL_AND_PIECEWISE*
## Hardware Scalability: A4000 GPU Results

Sleep Mode benefits aren't limited to high-end GPUs. Here's the same workload on an **A4000 GPU** with smaller models, demonstrating that the performance gains scale across different hardware tiers and model sizes.

*Model A: Qwen3-0.6B | Model B: Phi-3-vision-128k-instruct*
*GPU: A4000 (TP=1) | vLLM 0.11.0 | Sleep Level: 1 | Compilation: cudagraph_mode: FULL_AND_PIECEWISE*

### A4000: Inference Performance
*Inference time comparison on A4000: wake mode (already warmed up) vs cold start (just loaded).*
*Inference time = prefill + decode (first request after wake/load). Each request uses a different question to avoid caching, limited to 100 tokens output.*
*Error bars show min/max variation across multiple runs. Values displayed on bars.*
*GPU: A4000 (TP=1) | vLLM 0.11.0 | Sleep Level: 1 | Compilation: cudagraph_mode: FULL_AND_PIECEWISE*

### A4000: Model Switching Performance

*Model switching time on A4000: Wake from sleep vs cold start (fresh load).*
*Error bars show min/max variation across multiple runs. Values displayed on bars.*
*GPU: A4000 (TP=1) | vLLM 0.11.0 | Sleep Level: 1 | Compilation: cudagraph_mode: FULL_AND_PIECEWISE*
**Key Observations on A4000:**
- **Inference Performance:** Wake mode delivers 83% faster inference for Qwen3-0.6B and 81% faster for Phi-3-vision
- **Model Switching:** Wake times are incredibly fast (~0.1-0.8s), achieving a **58-203x speedup** vs cold starts
- **Total time savings: 62%** (85s vs 226s for 5 model switches)
- **Near-instant switching** for small models (0.1s wake time), making multi-model serving feel seamless
- Demonstrates that Sleep Mode is effective across different GPU classes and model sizes

## Sleep Levels: Choosing the Right Mode

vLLM Sleep Mode offers two levels with different tradeoffs:

**Level 1 (Default):** Offloads model weights to CPU memory, discards KV cache
- **Fastest wake times** (~0.1-0.8s for small models, ~3-6s for large models)
- **Requires sufficient CPU RAM** to store model weights
- **Best for:** Systems with adequate CPU memory, frequent model switching

**Level 2:** Discards model weights and KV cache, keeps only buffers (rope scaling tensors, etc.) in CPU
- **Slower wake times** (~0.8-2.6s for small models) due to weight reload from disk
- **Minimal CPU RAM usage** - only small buffers retained
- **Best for:** Systems with limited CPU RAM, or managing many models that won't all fit in memory

### Performance Comparison: Level 1 vs Level 2 vs No Sleep

*Model A: Qwen3-0.6B | Model B: Phi-3-vision-128k-instruct*
*GPU: A100 (TP=1) | vLLM 0.11.0 | Compilation: cudagraph_mode: FULL_AND_PIECEWISE*
*Comparing all three modes: Level 1 (fastest), Level 2 (minimal RAM), No Sleep. Hover for exact timing.*
**Performance Summary:**

| Mode | Total Time | Wake Time (A/B) | CPU RAM | Best For |
|------|------------|-----------------|---------|----------|
| **No Sleep** | 357.1s | N/A (full reload) | Minimal | Single model, no switching |
| **Level 1** | 112.6s | 0.26s / 0.82s | High (~GB per model) | Frequent switching, ample RAM |
| **Level 2** | 124.6s | 0.85s / 2.58s | Minimal (~MB per model) | Limited RAM, cost optimization |

**Key Insights:**
- **Level 1 is fastest** (68% faster than no sleep) but needs significant CPU RAM
- **Level 2 is nearly as fast** (65% faster than no sleep) with minimal RAM requirements
- **Level 2 wake is ~3x slower than Level 1** (0.85s vs 0.26s for Qwen3-0.6B) due to weight reload
- Both sleep modes deliver **massive improvements** over no sleep mode

#### Why Level 2 is Still Faster Than No Sleep Mode

At first glance, this seems counterintuitive: **Level 2 reloads weights from SSD** (just like "No Sleep Mode"), so why are its switches **23-45x faster?**

**The answer: weight loading is only ONE of FIVE costs.**

When you reload a model without Sleep Mode, you pay all of these:

| Cost | Level 2 | No Sleep Mode |
|------|---------|---------------|
| 1. Weight load (SSD → VRAM) | ❌ Must pay | ❌ Must pay |
| 2. Process initialization | ✅ **Skipped** | ❌ Must pay |
| 3. Memory allocator setup | ✅ **Skipped** | ❌ Must pay |
| 4. CUDA graph capture | ✅ **Skipped** | ❌ Must pay |
| 5. GPU kernel JIT compilation | ✅ **Preserved (already compiled)** | ❌ Full compilation + warm-up |

**Level 2 Strategy:**
- Weight reload from SSD (same as No Sleep)
- **Everything else preserved:** Process state, allocator instance, CUDA graphs, and compiled JIT kernels all intact
- **No recompilation needed:** Kernels were compiled during initial warmup and remain cached
- **Average per switch: ~2.6s** (see benchmark data below)

**No Sleep Mode Reality:**
- Weight reload from SSD (same as Level 2)
- **Everything else rebuilt:** Process restart + allocator init + graph re-capture
- **JIT kernels:** Full compilation + explicit warm-up routine (`kernel_warmup()` + dummy runs)
- **Average per switch: ~48s** (see benchmark data below)

**The benchmark data proves it.** For 5 model switches:
- **Level 2:** 124.6s total (switch operations average ~2.6s)
- **No Sleep:** 357.1s total (switch operations average ~48s)

Even though both reload weights from SSD, Level 2 is **2.9x faster overall** because it preserves the expensive infrastructure (process state, allocator, CUDA graphs) that No Sleep Mode must rebuild from scratch every single time.

### Level 2: Inference Performance

*Inference time comparison with Sleep Level 2: wake mode vs cold start.*
*Inference time = prefill + decode (first request after wake/load). Each request uses a different question to avoid caching, limited to 100 tokens output.*
*Error bars show min/max variation across multiple runs. Values displayed on bars.*
*GPU: A100 (TP=1) | vLLM 0.11.0 | Sleep Level: 2 | Compilation: cudagraph_mode: FULL_AND_PIECEWISE*

### Level 2: Model Switching Performance
*Model switching time with Sleep Level 2: wake from sleep vs cold start.*
*Error bars show min/max variation across multiple runs. Values displayed on bars.*
*GPU: A100 (TP=1) | vLLM 0.11.0 | Sleep Level: 2 | Compilation: cudagraph_mode: FULL_AND_PIECEWISE*
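The five-cost accounting behind the Level 2 numbers can be sketched as plain arithmetic. The per-cost durations below are made-up placeholders (chosen only so the totals land near the post's ~2.6s and ~48s per-switch averages), not measured components:

```python
# Illustrative cost model: a switch pays only the costs it cannot skip.
# All durations are invented placeholders, not benchmark measurements.
COSTS_S = {
    "weight_load": 2.5,       # SSD -> VRAM (paid by Level 2 AND No Sleep)
    "process_init": 20.0,     # Python + CUDA context startup
    "allocator_setup": 1.5,   # CuMemAllocator initialization
    "graph_capture": 14.0,    # CUDA graph re-capture
    "jit_compile": 10.0,      # DeepGEMM / FlashInfer / TorchInductor
}

LEVEL2_PAYS = {"weight_load"}        # everything else stays alive in the process
NO_SLEEP_PAYS = set(COSTS_S)         # process died: pay all five costs

level2_switch = sum(COSTS_S[c] for c in LEVEL2_PAYS)      # ~2.5s per switch
no_sleep_switch = sum(COSTS_S[c] for c in NO_SLEEP_PAYS)  # ~48s per switch
```

The point of the sketch: shrinking `weight_load` (a faster weight loader) barely moves `no_sleep_switch`, because the other four costs dominate; keeping the process alive removes them entirely.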
**Key Observations:**

| Metric | No Sleep | Level 2 | Improvement |
|--------|----------|---------|-------------|
| **Total Time (5 switches)** | 357.1s | 124.6s | **65% faster** |
| **Qwen3-0.6B Switch Time** | 37.6s avg | 0.85s avg | **45x faster** |
| **Phi-3-vision Switch Time** | 58.1s avg | 2.58s avg | **23x faster** |
| **Qwen3-0.6B Inference** | 3.67s avg | 0.53s avg | **86% faster** |
| **Phi-3-vision Inference** | 6.30s avg | 0.76s avg | **88% faster** |
| **Wake Time vs Level 1** | - | 3-10x slower | Trade CPU RAM for speed |

**When to Use Level 2:**
- **Limited CPU RAM:** The system cannot hold all model weights in CPU memory
- **Cost Optimization:** Cheaper cloud instances with less CPU RAM
- **Many Models:** Switching between many models where CPU memory is a constraint
- **Still Significant Gains:** Even with weight reload, Level 2 switches are 23-45x faster than no sleep mode

**Level 1 vs Level 2 Comparison:**
- Level 1: ~0.1-0.8s wake time, needs ~10-100GB+ CPU RAM per model
- Level 2: ~0.8-2.6s wake time, needs only ~MB of CPU RAM per model
- Both are dramatically faster than a full reload (~20-100s)

## Ablation Studies

### Impact of Warm-Up on Sleep Mode

Does skipping the warm-up phase affect performance? Warm-up pre-compiles CUDA graphs during initial load, which can take several seconds. Let's compare with and without warm-up.

*Model A: Qwen3-0.6B | Model B: Phi-3-vision-128k-instruct*
*GPU: A100 (TP=1) | vLLM 0.11.0 | Sleep Level: 1 | Compilation: cudagraph_mode: FULL_AND_PIECEWISE*
*Comparing with warm-up (pre-compiled) vs without warm-up (lazy compilation). Hover for exact timing.*
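A warm-up can be as small as a single 1-token generation issued right after load. The payload below assumes the standard OpenAI-compatible `/v1/completions` schema that a vLLM server exposes; the `warmup_payload` helper itself is our own sketch:

```python
# Sketch: minimal warm-up request. One 1-token generation is enough to
# trigger JIT compilation and CUDA graph capture, which then persist
# across sleep/wake cycles. `warmup_payload` is a hypothetical helper;
# the field names follow the OpenAI-compatible completions schema.

def warmup_payload(model: str) -> dict:
    return {
        "model": model,
        "prompt": "warmup",
        "max_tokens": 1,       # a single token is sufficient to compile everything
        "temperature": 0.0,    # deterministic, cheapest possible request
    }

# e.g. requests.post("http://localhost:8002/v1/completions",
#                    json=warmup_payload("Qwen/Qwen3-0.6B"))
```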
**Key Findings:**

| Metric | With Warm-Up | Without Warm-Up | Difference |
|--------|--------------|-----------------|------------|
| **Initial Load Time** | 108.7s (includes 8.4s warm-up) | 101.1s (no warm-up) | 7.6s saved initially |
| **First Inference (A)** | 0.45s | 2.59s | **5.8x slower** without warm-up |
| **First Inference (B)** | 0.93s | 6.61s | **7.1x slower** without warm-up |
| **Subsequent Inferences** | 0.43s avg | 0.41s avg | No difference |
| **Total Time (5 switches)** | 119.5s | 119.0s | Nearly identical |

**Insights:**
- **Warm-Up Compiles Kernels Once, Benefits All Wake Cycles:** With initial warmup, JIT compilation and CUDA graph capture happen once during load and are preserved across all subsequent sleep/wake cycles
- **Without Warm-Up, the First Request Pays the Compilation Cost:** The 5.8-7.1x slowdown hits the first inference after the initial load; in effect, the first user request becomes the warm-up (subsequent inferences drop back to ~0.4s)
- **Compiled Kernels Are Preserved Across Sleep/Wake:** After warmup during initial load (8.4s), all subsequent wake-ups have fast first inference (0.45s, 0.93s), proving kernels stay cached
- **Minimal Warmup Is Sufficient:** A single 1-token inference is enough to trigger full JIT compilation and CUDA graph capture, making warmup very cheap
- **Trade Initial Load Time for Consistent Performance:** The 8.4s warmup cost is paid once and amortized across all model switches
- **Recommendation: Always Use Warm-Up** for production workloads where consistent, fast inference is expected

### Impact of Quantization on Sleep Mode

Does quantization (FP8) affect Sleep Mode performance? We tested the same workload with and without FP8 quantization on an A100 GPU.

*Model A: Qwen3-0.6B | Model B: Phi-3-vision-128k-instruct*
*GPU: A100 (TP=1) | vLLM 0.11.0 | Sleep Level: 1 | Compilation: cudagraph_mode: FULL_AND_PIECEWISE*
*Comparing BF16 (baseline) vs FP8 quantization. Hover for exact timing.*

### Ablation: Inference Performance (BF16 vs FP8)
*Inference time comparison: BF16 vs FP8 quantization with Sleep Mode.*
*Inference time = prefill + decode (first request after wake/load). Each request uses a different question to avoid caching, limited to 100 tokens output.*
*Error bars show min/max variation across multiple runs. Values displayed on bars.*
*GPU: A100 (TP=1) | vLLM 0.11.0 | Sleep Level: 1 | Compilation: cudagraph_mode: FULL_AND_PIECEWISE*

### Ablation: Model Switching (BF16 vs FP8)
*Model switching time: BF16 vs FP8 quantization with Sleep Mode.*
*Error bars show min/max variation across multiple runs. Values displayed on bars.*
*GPU: A100 (TP=1) | vLLM 0.11.0 | Sleep Level: 1 | Compilation: cudagraph_mode: FULL_AND_PIECEWISE*
**Key Findings:**

| Metric | BF16 | FP8 | Improvement |
|--------|------|-----|-------------|
| **Total Time (5 switches)** | 108.2s | 113.6s | -5% (slightly slower) |
| **Qwen3-0.6B Wake Time** | 0.27s avg | 0.18s avg | **33% faster** |
| **Phi-3-vision Wake Time** | 0.90s avg | 0.78s avg | **13% faster** |
| **Qwen3-0.6B Inference** | 0.41s avg | 0.44s avg | -7% (slightly slower) |
| **Phi-3-vision Inference** | 0.81s avg | 0.57s avg | **30% faster** |
| **Initial Load Time** | 90.5s | 96.9s | -7% (slower load) |

**Insights:**
- **FP8 has faster wake operations** (13-33% faster) due to less memory movement
- **FP8 improves inference for larger models** (30% faster for Phi-3-vision) but shows minimal difference for tiny models
- **Initial load takes longer with FP8** due to quantization overhead during loading
- **After initial load, FP8 provides smoother switching** with faster wake cycles
- For workloads with frequent switching, FP8's faster wake times can offset the longer initial load

## Decision Guide: Which Sleep Level to Use?

### Use Sleep Level 1 When:
- You have sufficient CPU RAM to hold all model weights
- You need the fastest possible wake times (0.1-6s)
- You're switching models very frequently (every few seconds/minutes)
- Inference latency consistency is critical

### Use Sleep Level 2 When:
- CPU RAM is limited (can't hold all model weights)
- You're optimizing cloud costs (cheaper instances with less RAM)
- You have many models to manage (10+)

### Skip Sleep Mode When:
- You're only using a single model (no switching needed)
- Model switches are extremely rare (once per day/week)
- Both models fit simultaneously in GPU memory

## Conclusion

vLLM Sleep Mode transforms multi-model GPU serving from a 30-100 second reload penalty into sub-second switches.
The benchmarks speak for themselves:

- **18-200x faster model switching** depending on model size and hardware
- **61-88% faster inference** for warmed models vs cold starts
- **65-68% total time savings** across complete workloads
- **Works at every scale:** 0.6B to 235B parameters, small and large GPUs

The future of LLM serving is multi-model. Sleep Mode makes it practical today.

## Acknowledgements

Special thanks to **Vensen Mu**, **Jeff Aw**, **Jun Kang Chow**, **Tun Jian Tan**, **Pin Siang Tan**, **Amir Balwel**, **Ye Hur Cheong**, **Zhiyao Cen** and **Kaichao You** for developing the Sleep Mode feature and this blog post.

diff --git a/assets/figures/2025-vllm-sleep-mode/plotly-ablation-inference.js b/assets/figures/2025-vllm-sleep-mode/plotly-ablation-inference.js
new file mode 100644
index 0000000..0f9b772
--- /dev/null
+++ b/assets/figures/2025-vllm-sleep-mode/plotly-ablation-inference.js
@@ -0,0 +1,91 @@
document.addEventListener('DOMContentLoaded', function() {
  // Ablation inference data: BF16 vs FP8
  const ablationInferenceData = {
    "ModelA": {
      name: "Qwen3-0.6B",
      bf16: [0.41, 0.4, 0.41],
      fp8: [0.43, 0.43, 0.45]
    },
    "ModelB": {
      name: "Phi-3-vision-128k",
      bf16: [0.9, 0.74, 0.8],
      fp8: [0.69, 0.59, 0.44]
    }
  };

  function calcStatsAblInf(values) {
    const mean = values.reduce((a, b) => a + b, 0) / values.length;
    const min = Math.min(...values);
    const max = Math.max(...values);
    return { mean, errorMinus: mean - min, errorPlus: max - mean };
  }

  const modelsAblInf = Object.keys(ablationInferenceData);
  const bf16StatsInf = modelsAblInf.map(m => calcStatsAblInf(ablationInferenceData[m].bf16));
  const fp8StatsInf = modelsAblInf.map(m => calcStatsAblInf(ablationInferenceData[m].fp8));

  const bf16TraceInf = {
    x: modelsAblInf.map(m => ablationInferenceData[m].name),
    y: bf16StatsInf.map(s => s.mean),
    name: "BF16",
    type: "bar",
    marker: { color: "#1f77b4" },
    error_y: {
      type: "data",
      symmetric:
false,
      array: bf16StatsInf.map(s => s.errorPlus),
      arrayminus: bf16StatsInf.map(s => s.errorMinus),
      color: "#0d4a6e",
      thickness: 2,
      width: 6
    },
    text: bf16StatsInf.map(s => s.mean.toFixed(2) + "s"),
    textposition: "outside",
    textfont: { size: 12, color: "#1f77b4", weight: "bold" },
    hovertemplate: "%{x}<br>BF16: %{y:.2f}s"
  };

  const fp8TraceInf = {
    x: modelsAblInf.map(m => ablationInferenceData[m].name),
    y: fp8StatsInf.map(s => s.mean),
    name: "FP8",
    type: "bar",
    marker: { color: "#ff7f0e" },
    error_y: {
      type: "data",
      symmetric: false,
      array: fp8StatsInf.map(s => s.errorPlus),
      arrayminus: fp8StatsInf.map(s => s.errorMinus),
      color: "#cc6600",
      thickness: 2,
      width: 6
    },
    text: fp8StatsInf.map(s => s.mean.toFixed(2) + "s"),
    textposition: "outside",
    textfont: { size: 12, color: "#ff7f0e", weight: "bold" },
    hovertemplate: "%{x}
FP8: %{y:.2f}s" + }; + + Plotly.newPlot("plotly-ablation-inference", [bf16TraceInf, fp8TraceInf], { + barmode: "group", + bargap: 0.15, + bargroupgap: 0.1, + margin: { l: 60, r: 30, t: 40, b: 50 }, + xaxis: { + title: "", + tickangle: 0 + }, + yaxis: { + title: "Inference Time (seconds)", + range: [0, Math.max(...bf16StatsInf.map(s => s.mean + s.errorPlus), ...fp8StatsInf.map(s => s.mean + s.errorPlus)) * 1.25] + }, + hovermode: "closest", + legend: { + x: 0.5, + y: 1.15, + xanchor: "center", + yanchor: "top", + orientation: "h" + } + }, {displayModeBar: true, responsive: true}); +}); diff --git a/assets/figures/2025-vllm-sleep-mode/plotly-ablation-quant.js b/assets/figures/2025-vllm-sleep-mode/plotly-ablation-quant.js new file mode 100644 index 0000000..85a4ec8 --- /dev/null +++ b/assets/figures/2025-vllm-sleep-mode/plotly-ablation-quant.js @@ -0,0 +1,140 @@ +document.addEventListener('DOMContentLoaded', function() { + // Ablation study: BF16 vs FP8 quantization + const timingDataAblation = { + "Sleep Mode (BF16)": [ + { event: "A Model Load", duration: 32.56 }, + { event: "A Model Warm Up", duration: 2.69 }, + { event: "B Model Load", duration: 57.96 }, + { event: "B Model Warm Up", duration: 5.92 }, + { event: "A Model Wake up", duration: 0.28 }, + { event: "A Model Prompt", duration: 0.41 }, + { event: "A Model Sleep", duration: 0.09 }, + { event: "B Model Wake Up", duration: 0.89 }, + { event: "B Model Prompt", duration: 0.9 }, + { event: "B Model Sleep", duration: 0.48 }, + { event: "A Model Wake up", duration: 0.27 }, + { event: "A Model Prompt", duration: 0.4 }, + { event: "A Model Sleep", duration: 0.1 }, + { event: "B Model Wake Up", duration: 0.93 }, + { event: "B Model Prompt", duration: 0.74 }, + { event: "B Model Sleep", duration: 0.5 }, + { event: "A Model Wake up", duration: 0.27 }, + { event: "A Model Prompt", duration: 0.41 }, + { event: "A Model Sleep", duration: 0.1 }, + { event: "B Model Wake Up", duration: 0.88 }, + { event: "B Model Prompt", 
duration: 0.8 } + ], + "Sleep Mode (FP8)": [ + { event: "A Model Load", duration: 37.71 }, + { event: "A Model Warm Up", duration: 2.34 }, + { event: "B Model Load", duration: 57.79 }, + { event: "B Model Warm Up", duration: 6.37 }, + { event: "A Model Wake up", duration: 0.18 }, + { event: "A Model Prompt", duration: 0.43 }, + { event: "A Model Sleep", duration: 0.06 }, + { event: "B Model Wake Up", duration: 0.79 }, + { event: "B Model Prompt", duration: 0.69 }, + { event: "B Model Sleep", duration: 0.31 }, + { event: "A Model Wake up", duration: 0.19 }, + { event: "A Model Prompt", duration: 0.43 }, + { event: "A Model Sleep", duration: 0.06 }, + { event: "B Model Wake Up", duration: 0.77 }, + { event: "B Model Prompt", duration: 0.59 }, + { event: "B Model Sleep", duration: 0.31 }, + { event: "A Model Wake up", duration: 0.16 }, + { event: "A Model Prompt", duration: 0.45 }, + { event: "A Model Sleep", duration: 0.07 }, + { event: "B Model Wake Up", duration: 0.78 }, + { event: "B Model Prompt", duration: 0.44 } + ] + }; + + // Convert to segment format + function createSegmentsAblation(timingData) { + const segments = []; + + Object.entries(timingData).forEach(([scenario, events]) => { + let cumulativeTime = 0; + + events.forEach(({ event, duration }) => { + const [who, ...stageParts] = event.split(' '); + const stage = stageParts.join(' '); + + let action, category; + if (stage.includes('Load')) { + action = 'Load'; + category = `${who} Load`; + } else if (stage.includes('Wake')) { + action = 'Wake'; + category = `${who} Wake`; + } else if (stage.includes('Prompt')) { + action = 'Prompt'; + category = `${who} Prompt`; + } else if (stage.includes('Sleep')) { + action = 'Sleep'; + category = `${who} Sleep`; + } else if (stage.includes('Warm')) { + action = 'Load'; + category = `${who} Load`; + } + + segments.push({ + scenario, + who, + stage, + action, + start: cumulativeTime, + end: cumulativeTime + duration, + duration, + category + }); + + cumulativeTime += 
duration; + }); + }); + + return segments; + } + + const segmentsAblation = createSegmentsAblation(timingDataAblation); + const colorMapAblation = {"A Load": "#1f77b4", "B Load": "#ff7f0e", "A Wake": "#2ca02c", "B Wake": "#17becf", "A Sleep": "#9467bd", "B Sleep": "#8c564b", "A Prompt": "#e377c2", "B Prompt": "#7f7f7f"}; + const categoriesAblation = Object.keys(colorMapAblation); + + const xAblation = segmentsAblation.map(d => d.duration); + const baseAblation = segmentsAblation.map(d => d.start); + const yAblation = segmentsAblation.map(d => d.scenario); + const colorsAblation = segmentsAblation.map(d => colorMapAblation[d.category]); + const customAblation = segmentsAblation.map(d => [d.scenario, d.category, d.stage, d.start, d.end]); + + const barsAblation = { + type: "bar", + orientation: "h", + x: xAblation, base: baseAblation, y: yAblation, + marker: { color: colorsAblation, line: {width:1, color:"rgba(0,0,0,0.35)"} }, + hovertemplate: + "%{customdata[0]}
%{customdata[1]} — %{customdata[2]}
"+ + "Start %{customdata[3]:.2f}s → End %{customdata[4]:.2f}s
"+ + "%{x:.2f}s", + customdata: customAblation, + showlegend: false + }; + + const legendTracesAblation = categoriesAblation.map(name => ({ + type: "scatter", mode: "markers", x:[null], y:[null], + name, marker: {color: colorMapAblation[name], size: 10}, + hoverinfo:"skip", showlegend:true + })); + + Plotly.newPlot("plotly-ablation-quant", [barsAblation, ...legendTracesAblation], { + barmode: "overlay", + bargap: 0.05, + margin: {l: 140, r: 30, t: 20, b: 40}, + xaxis: { title: "Time (seconds)", range: [0, 115] }, + yaxis: { + categoryorder: "array", + categoryarray: ["Sleep Mode (FP8)", "Sleep Mode (BF16)"] + }, + hovermode: "closest", + dragmode: "pan" + }, {displayModeBar: true, responsive: true}); +}); diff --git a/assets/figures/2025-vllm-sleep-mode/plotly-ablation-switching.js b/assets/figures/2025-vllm-sleep-mode/plotly-ablation-switching.js new file mode 100644 index 0000000..e2f0f94 --- /dev/null +++ b/assets/figures/2025-vllm-sleep-mode/plotly-ablation-switching.js @@ -0,0 +1,105 @@ +document.addEventListener('DOMContentLoaded', function() { + // Ablation switching data: BF16 vs FP8 + const ablationSwitchingData = { + "ModelA": { + name: "Qwen3-0.6B", + bf16: [0.28, 0.27, 0.27], + fp8: [0.18, 0.19, 0.16] + }, + "ModelB": { + name: "Phi-3-vision-128k", + bf16: [0.89, 0.93, 0.88], + fp8: [0.79, 0.77, 0.78] + } + }; + + function calcStatsAblSwitch(values) { + const mean = values.reduce((a, b) => a + b, 0) / values.length; + const min = Math.min(...values); + const max = Math.max(...values); + return { mean, errorMinus: mean - min, errorPlus: max - mean }; + } + + const modelsAblSwitch = Object.keys(ablationSwitchingData); + const bf16StatsSwitch = modelsAblSwitch.map(m => calcStatsAblSwitch(ablationSwitchingData[m].bf16)); + const fp8StatsSwitch = modelsAblSwitch.map(m => calcStatsAblSwitch(ablationSwitchingData[m].fp8)); + + const bf16TraceSwitch = { + x: modelsAblSwitch.map(m => ablationSwitchingData[m].name), + y: bf16StatsSwitch.map(s => s.mean), + name: 
"BF16",
    type: "bar",
    marker: { color: "#1f77b4" },
    error_y: {
      type: "data",
      symmetric: false,
      array: bf16StatsSwitch.map(s => s.errorPlus),
      arrayminus: bf16StatsSwitch.map(s => s.errorMinus),
      color: "#0d4a6e",
      thickness: 2,
      width: 6
    },
    text: bf16StatsSwitch.map(s => s.mean.toFixed(2) + "s"),
    textposition: "outside",
    textfont: { size: 12, color: "#1f77b4", weight: "bold" },
    hovertemplate: "%{x}<br>BF16: %{y:.2f}s"
  };

  const fp8TraceSwitch = {
    x: modelsAblSwitch.map(m => ablationSwitchingData[m].name),
    y: fp8StatsSwitch.map(s => s.mean),
    name: "FP8",
    type: "bar",
    marker: { color: "#ff7f0e" },
    error_y: {
      type: "data",
      symmetric: false,
      array: fp8StatsSwitch.map(s => s.errorPlus),
      arrayminus: fp8StatsSwitch.map(s => s.errorMinus),
      color: "#cc6600",
      thickness: 2,
      width: 6
    },
    text: fp8StatsSwitch.map(s => s.mean.toFixed(2) + "s"),
    textposition: "outside",
    textfont: { size: 12, color: "#ff7f0e", weight: "bold" },
    hovertemplate: "%{x}
FP8: %{y:.2f}s" + }; + + // Calculate speedup percentages for annotation + const speedupsSwitchAbl = bf16StatsSwitch.map((bf16, i) => { + const reduction = ((bf16.mean - fp8StatsSwitch[i].mean) / bf16.mean * 100).toFixed(0); + return reduction; + }); + + Plotly.newPlot("plotly-ablation-switching", [bf16TraceSwitch, fp8TraceSwitch], { + barmode: "group", + bargap: 0.15, + bargroupgap: 0.1, + margin: { l: 60, r: 30, t: 40, b: 50 }, + xaxis: { + title: "", + tickangle: 0 + }, + yaxis: { + title: "Wake Time (seconds)", + range: [0, Math.max(...bf16StatsSwitch.map(s => s.mean + s.errorPlus)) * 1.3] + }, + hovermode: "closest", + legend: { + x: 0.5, + y: 1.15, + xanchor: "center", + yanchor: "top", + orientation: "h" + }, + annotations: modelsAblSwitch.map((m, i) => ({ + x: ablationSwitchingData[m].name, + y: bf16StatsSwitch[i].mean + bf16StatsSwitch[i].errorPlus + 0.07, + text: `${speedupsSwitchAbl[i]}% faster`, + showarrow: false, + font: { size: 11, color: "#ff7f0e", weight: "bold" }, + xanchor: "center" + })) + }, {displayModeBar: true, responsive: true}); +}); diff --git a/assets/figures/2025-vllm-sleep-mode/plotly-ablation-warmup.js b/assets/figures/2025-vllm-sleep-mode/plotly-ablation-warmup.js new file mode 100644 index 0000000..2469df2 --- /dev/null +++ b/assets/figures/2025-vllm-sleep-mode/plotly-ablation-warmup.js @@ -0,0 +1,138 @@ +document.addEventListener('DOMContentLoaded', function() { + // Ablation study: With vs Without Warm-Up + const timingDataWarmup = { + "With Warm-Up": [ + { event: "A Model Load", duration: 37.65 }, + { event: "A Model Warm Up", duration: 2.39 }, + { event: "B Model Load", duration: 62.69 }, + { event: "B Model Warm Up", duration: 6 }, + { event: "A Model Wake up", duration: 0.24 }, + { event: "A Model Prompt", duration: 0.45 }, + { event: "A Model Sleep", duration: 0.09 }, + { event: "B Model Wake Up", duration: 0.89 }, + { event: "B Model Prompt", duration: 0.93 }, + { event: "B Model Sleep", duration: 0.47 }, + { event: "A Model 
Wake up", duration: 0.23 }, + { event: "A Model Prompt", duration: 0.43 }, + { event: "A Model Sleep", duration: 0.1 }, + { event: "B Model Wake Up", duration: 0.87 }, + { event: "B Model Prompt", duration: 0.73 }, + { event: "B Model Sleep", duration: 0.46 }, + { event: "A Model Wake up", duration: 0.23 }, + { event: "A Model Prompt", duration: 0.46 }, + { event: "A Model Sleep", duration: 0.09 }, + { event: "B Model Wake Up", duration: 0.85 }, + { event: "B Model Prompt", duration: 0.73 } + ], + "Without Warm-Up": [ + { event: "A Model Load", duration: 37.91 }, + { event: "B Model Load", duration: 63.16 }, + { event: "A Model Wake up", duration: 0.24 }, + { event: "A Model Prompt", duration: 2.59 }, + { event: "A Model Sleep", duration: 0.09 }, + { event: "B Model Wake Up", duration: 0.91 }, + { event: "B Model Prompt", duration: 6.61 }, + { event: "B Model Sleep", duration: 0.44 }, + { event: "A Model Wake up", duration: 0.26 }, + { event: "A Model Prompt", duration: 0.41 }, + { event: "A Model Sleep", duration: 0.09 }, + { event: "B Model Wake Up", duration: 0.87 }, + { event: "B Model Prompt", duration: 0.7 }, + { event: "B Model Sleep", duration: 0.43 }, + { event: "A Model Wake up", duration: 0.27 }, + { event: "A Model Prompt", duration: 0.42 }, + { event: "A Model Sleep", duration: 0.1 }, + { event: "B Model Wake Up", duration: 0.86 }, + { event: "B Model Prompt", duration: 0.7 } + ] + }; + + // Convert to segment format + function createSegmentsWarmup(timingData) { + const segments = []; + + Object.entries(timingData).forEach(([scenario, events]) => { + let cumulativeTime = 0; + + events.forEach(({ event, duration }) => { + const [who, ...stageParts] = event.split(' '); + const stage = stageParts.join(' '); + + let action, category; + if (stage.includes('Load')) { + action = 'Load'; + category = `${who} Load`; + } else if (stage.includes('Wake')) { + action = 'Wake'; + category = `${who} Wake`; + } else if (stage.includes('Prompt')) { + action = 'Prompt'; 
+ category = `${who} Prompt`; + } else if (stage.includes('Sleep')) { + action = 'Sleep'; + category = `${who} Sleep`; + } else if (stage.includes('Warm')) { + action = 'Load'; + category = `${who} Load`; + } + + segments.push({ + scenario, + who, + stage, + action, + start: cumulativeTime, + end: cumulativeTime + duration, + duration, + category + }); + + cumulativeTime += duration; + }); + }); + + return segments; + } + + const segmentsWarmup = createSegmentsWarmup(timingDataWarmup); + const colorMapWarmup = {"A Load": "#1f77b4", "B Load": "#ff7f0e", "A Wake": "#2ca02c", "B Wake": "#17becf", "A Sleep": "#9467bd", "B Sleep": "#8c564b", "A Prompt": "#e377c2", "B Prompt": "#7f7f7f"}; + const categoriesWarmup = Object.keys(colorMapWarmup); + + const xWarmup = segmentsWarmup.map(d => d.duration); + const baseWarmup = segmentsWarmup.map(d => d.start); + const yWarmup = segmentsWarmup.map(d => d.scenario); + const colorsWarmup = segmentsWarmup.map(d => colorMapWarmup[d.category]); + const customWarmup = segmentsWarmup.map(d => [d.scenario, d.category, d.stage, d.start, d.end]); + + const barsWarmup = { + type: "bar", + orientation: "h", + x: xWarmup, base: baseWarmup, y: yWarmup, + marker: { color: colorsWarmup, line: {width:1, color:"rgba(0,0,0,0.35)"} }, + hovertemplate: + "%{customdata[0]}
<br>%{customdata[1]} — %{customdata[2]}<br>
"+ + "Start %{customdata[3]:.2f}s → End %{customdata[4]:.2f}s<br>
"+ + "%{x:.2f}s", + customdata: customWarmup, + showlegend: false + }; + + const legendTracesWarmup = categoriesWarmup.map(name => ({ + type: "scatter", mode: "markers", x:[null], y:[null], + name, marker: {color: colorMapWarmup[name], size: 10}, + hoverinfo:"skip", showlegend:true + })); + + Plotly.newPlot("plotly-ablation-warmup", [barsWarmup, ...legendTracesWarmup], { + barmode: "overlay", + bargap: 0.05, + margin: {l: 140, r: 30, t: 20, b: 40}, + xaxis: { title: "Time (seconds)", range: [0, 120] }, + yaxis: { + categoryorder: "array", + categoryarray: ["Without Warm-Up", "With Warm-Up"] + }, + hovermode: "closest", + dragmode: "pan" + }, {displayModeBar: true, responsive: true}); +}); diff --git a/assets/figures/2025-vllm-sleep-mode/plotly-inference-a4000.js b/assets/figures/2025-vllm-sleep-mode/plotly-inference-a4000.js new file mode 100644 index 0000000..5f1f803 --- /dev/null +++ b/assets/figures/2025-vllm-sleep-mode/plotly-inference-a4000.js @@ -0,0 +1,104 @@ +document.addEventListener('DOMContentLoaded', function() { + // A4000 Inference data + const inferenceDataA4000 = { + "ModelA": { + name: "Qwen3-0.6B", + wake: [0.44, 0.43, 0.43], + cold: [2.64, 2.5, 2.63] + }, + "ModelB": { + name: "Phi-3-vision-128k(4B)", + wake: [2.04, 1.73, 1.61], + cold: [9.78, 9.01, 9.79] + } + }; + + function calcStatsInfA4000(values) { + const mean = values.reduce((a, b) => a + b, 0) / values.length; + const min = Math.min(...values); + const max = Math.max(...values); + return { mean, errorMinus: mean - min, errorPlus: max - mean }; + } + + const modelsInfA4000 = Object.keys(inferenceDataA4000); + const wakeStatsInfA4000 = modelsInfA4000.map(m => calcStatsInfA4000(inferenceDataA4000[m].wake)); + const coldStatsInfA4000 = modelsInfA4000.map(m => calcStatsInfA4000(inferenceDataA4000[m].cold)); + + const wakeTraceInfA4000 = { + x: modelsInfA4000.map(m => inferenceDataA4000[m].name), + y: wakeStatsInfA4000.map(s => s.mean), + name: "Wake Mode (Warmed Up)", + type: "bar", + marker: 
{ color: "#2ca02c" }, + error_y: { + type: "data", + symmetric: false, + array: wakeStatsInfA4000.map(s => s.errorPlus), + arrayminus: wakeStatsInfA4000.map(s => s.errorMinus), + color: "#1a5e1a", + thickness: 2, + width: 6 + }, + text: wakeStatsInfA4000.map(s => s.mean.toFixed(2) + "s"), + textposition: "outside", + textfont: { size: 12, color: "#2ca02c", weight: "bold" }, + hovertemplate: "%{x}
<br>Wake Mode: %{y:.2f}s" + }; + + const coldTraceInfA4000 = { + x: modelsInfA4000.map(m => inferenceDataA4000[m].name), + y: coldStatsInfA4000.map(s => s.mean), + name: "Cold Start (Just Loaded)", + type: "bar", + marker: { color: "#d62728" }, + error_y: { + type: "data", + symmetric: false, + array: coldStatsInfA4000.map(s => s.errorPlus), + arrayminus: coldStatsInfA4000.map(s => s.errorMinus), + color: "#8b1518", + thickness: 2, + width: 6 + }, + text: coldStatsInfA4000.map(s => s.mean.toFixed(2) + "s"), + textposition: "outside", + textfont: { size: 12, color: "#d62728", weight: "bold" }, + hovertemplate: "%{x}<br>
Cold Start: %{y:.2f}s" + }; + + const speedupsInfA4000 = wakeStatsInfA4000.map((w, i) => { + const reduction = ((coldStatsInfA4000[i].mean - w.mean) / coldStatsInfA4000[i].mean * 100).toFixed(0); + return reduction; + }); + + Plotly.newPlot("plotly-inference-a4000", [wakeTraceInfA4000, coldTraceInfA4000], { + barmode: "group", + bargap: 0.15, + bargroupgap: 0.1, + margin: { l: 60, r: 30, t: 40, b: 50 }, + xaxis: { + title: "", + tickangle: 0 + }, + yaxis: { + title: "Inference Time (seconds)", + range: [0, Math.max(...coldStatsInfA4000.map(s => s.mean + s.errorPlus)) * 1.2] + }, + hovermode: "closest", + legend: { + x: 0.5, + y: 1.15, + xanchor: "center", + yanchor: "top", + orientation: "h" + }, + annotations: modelsInfA4000.map((m, i) => ({ + x: inferenceDataA4000[m].name, + y: coldStatsInfA4000[i].mean + coldStatsInfA4000[i].errorPlus + 0.6, + text: `${speedupsInfA4000[i]}% faster`, + showarrow: false, + font: { size: 11, color: "#2ca02c", weight: "bold" }, + xanchor: "center" + })) + }, {displayModeBar: true, responsive: true}); +}); diff --git a/assets/figures/2025-vllm-sleep-mode/plotly-inference-comparison.js b/assets/figures/2025-vllm-sleep-mode/plotly-inference-comparison.js new file mode 100644 index 0000000..80afa76 --- /dev/null +++ b/assets/figures/2025-vllm-sleep-mode/plotly-inference-comparison.js @@ -0,0 +1,107 @@ +document.addEventListener('DOMContentLoaded', function() { + // Raw data: Wake Inference Time vs Cold Start Inference Time + const inferenceData = { + "ModelA": { + name: "Qwen3-235B-A22B (TP=4)", + wake: [1.8, 1.7, 0.92], + cold: [3.8, 3.7, 3.72] + }, + "ModelB": { + name: "Qwen3-Coder-30B (TP=1)", + wake: [1.0, 0.93, 0.54], + cold: [3.7, 2.9, 2.45] + } + }; + + // Calculate mean and error bars for each model + function calcStats(values) { + const mean = values.reduce((a, b) => a + b, 0) / values.length; + const min = Math.min(...values); + const max = Math.max(...values); + return { mean, errorMinus: mean - min, errorPlus: max - mean }; 
+ } + + // Prepare traces for both models + const models = Object.keys(inferenceData); + const wakeStats = models.map(m => calcStats(inferenceData[m].wake)); + const coldStats = models.map(m => calcStats(inferenceData[m].cold)); + + const wakeTrace = { + x: models.map(m => inferenceData[m].name), + y: wakeStats.map(s => s.mean), + name: "Wake Mode (Warmed Up)", + type: "bar", + marker: { color: "#2ca02c" }, + error_y: { + type: "data", + symmetric: false, + array: wakeStats.map(s => s.errorPlus), + arrayminus: wakeStats.map(s => s.errorMinus), + color: "#1a5e1a", + thickness: 2, + width: 6 + }, + text: wakeStats.map(s => s.mean.toFixed(2) + "s"), + textposition: "outside", + textfont: { size: 12, color: "#2ca02c", weight: "bold" }, + hovertemplate: "%{x}
<br>Wake Mode: %{y:.2f}s" + }; + + const coldTrace = { + x: models.map(m => inferenceData[m].name), + y: coldStats.map(s => s.mean), + name: "Cold Start (Just Loaded)", + type: "bar", + marker: { color: "#d62728" }, + error_y: { + type: "data", + symmetric: false, + array: coldStats.map(s => s.errorPlus), + arrayminus: coldStats.map(s => s.errorMinus), + color: "#8b1518", + thickness: 2, + width: 6 + }, + text: coldStats.map(s => s.mean.toFixed(2) + "s"), + textposition: "outside", + textfont: { size: 12, color: "#d62728", weight: "bold" }, + hovertemplate: "%{x}<br>
Cold Start: %{y:.2f}s" + }; + + // Calculate speedup percentages for annotation + const speedups = wakeStats.map((w, i) => { + const reduction = ((coldStats[i].mean - w.mean) / coldStats[i].mean * 100).toFixed(0); + return reduction; + }); + + Plotly.newPlot("plotly-inference-comparison", [wakeTrace, coldTrace], { + barmode: "group", + bargap: 0.15, + bargroupgap: 0.1, + margin: { l: 60, r: 30, t: 40, b: 50 }, + xaxis: { + title: "", + tickangle: 0 + }, + yaxis: { + title: "Inference Time (seconds)", + range: [0, Math.max(...coldStats.map(s => s.mean + s.errorPlus)) * 1.2] + }, + hovermode: "closest", + legend: { + x: 0.5, + y: 1.15, + xanchor: "center", + yanchor: "top", + orientation: "h" + }, + annotations: models.map((m, i) => ({ + x: inferenceData[m].name, + y: coldStats[i].mean + coldStats[i].errorPlus + 0.3, + text: `${speedups[i]}% faster`, + showarrow: false, + font: { size: 11, color: "#2ca02c", weight: "bold" }, + xanchor: "center" + })) + }, {displayModeBar: true, responsive: true}); +}); diff --git a/assets/figures/2025-vllm-sleep-mode/plotly-level2-inference.js b/assets/figures/2025-vllm-sleep-mode/plotly-level2-inference.js new file mode 100644 index 0000000..48082f2 --- /dev/null +++ b/assets/figures/2025-vllm-sleep-mode/plotly-level2-inference.js @@ -0,0 +1,104 @@ +document.addEventListener('DOMContentLoaded', function() { + // Level 2 inference data + const level2InferenceData = { + "ModelA": { + name: "Qwen3-0.6B", + wake: [0.68, 0.46, 0.44], + cold: [4.66, 3.8, 2.56] + }, + "ModelB": { + name: "Phi-3-vision-128k", + wake: [0.78, 0.77, 0.72], + cold: [6.55, 6.21, 6.15] + } + }; + + function calcStatsLevel2Inf(values) { + const mean = values.reduce((a, b) => a + b, 0) / values.length; + const min = Math.min(...values); + const max = Math.max(...values); + return { mean, errorMinus: mean - min, errorPlus: max - mean }; + } + + const modelsLevel2Inf = Object.keys(level2InferenceData); + const wakeStatsLevel2Inf = modelsLevel2Inf.map(m => 
calcStatsLevel2Inf(level2InferenceData[m].wake)); + const coldStatsLevel2Inf = modelsLevel2Inf.map(m => calcStatsLevel2Inf(level2InferenceData[m].cold)); + + const wakeTraceLevel2Inf = { + x: modelsLevel2Inf.map(m => level2InferenceData[m].name), + y: wakeStatsLevel2Inf.map(s => s.mean), + name: "Wake Mode (Level 2)", + type: "bar", + marker: { color: "#2ca02c" }, + error_y: { + type: "data", + symmetric: false, + array: wakeStatsLevel2Inf.map(s => s.errorPlus), + arrayminus: wakeStatsLevel2Inf.map(s => s.errorMinus), + color: "#1a5e1a", + thickness: 2, + width: 6 + }, + text: wakeStatsLevel2Inf.map(s => s.mean.toFixed(2) + "s"), + textposition: "outside", + textfont: { size: 12, color: "#2ca02c", weight: "bold" }, + hovertemplate: "%{x}
<br>Wake Mode: %{y:.2f}s" + }; + + const coldTraceLevel2Inf = { + x: modelsLevel2Inf.map(m => level2InferenceData[m].name), + y: coldStatsLevel2Inf.map(s => s.mean), + name: "Cold Start", + type: "bar", + marker: { color: "#d62728" }, + error_y: { + type: "data", + symmetric: false, + array: coldStatsLevel2Inf.map(s => s.errorPlus), + arrayminus: coldStatsLevel2Inf.map(s => s.errorMinus), + color: "#8b1518", + thickness: 2, + width: 6 + }, + text: coldStatsLevel2Inf.map(s => s.mean.toFixed(2) + "s"), + textposition: "outside", + textfont: { size: 12, color: "#d62728", weight: "bold" }, + hovertemplate: "%{x}<br>
Cold Start: %{y:.2f}s" + }; + + const speedupsLevel2Inf = wakeStatsLevel2Inf.map((w, i) => { + const reduction = ((coldStatsLevel2Inf[i].mean - w.mean) / coldStatsLevel2Inf[i].mean * 100).toFixed(0); + return reduction; + }); + + Plotly.newPlot("plotly-level2-inference", [wakeTraceLevel2Inf, coldTraceLevel2Inf], { + barmode: "group", + bargap: 0.15, + bargroupgap: 0.1, + margin: { l: 60, r: 30, t: 40, b: 50 }, + xaxis: { + title: "", + tickangle: 0 + }, + yaxis: { + title: "Inference Time (seconds)", + range: [0, Math.max(...coldStatsLevel2Inf.map(s => s.mean + s.errorPlus)) * 1.2] + }, + hovermode: "closest", + legend: { + x: 0.5, + y: 1.15, + xanchor: "center", + yanchor: "top", + orientation: "h" + }, + annotations: modelsLevel2Inf.map((m, i) => ({ + x: level2InferenceData[m].name, + y: coldStatsLevel2Inf[i].mean + coldStatsLevel2Inf[i].errorPlus + 0.4, + text: `${speedupsLevel2Inf[i]}% faster`, + showarrow: false, + font: { size: 11, color: "#2ca02c", weight: "bold" }, + xanchor: "center" + })) + }, {displayModeBar: true, responsive: true}); +}); diff --git a/assets/figures/2025-vllm-sleep-mode/plotly-level2-switching.js b/assets/figures/2025-vllm-sleep-mode/plotly-level2-switching.js new file mode 100644 index 0000000..87d7c18 --- /dev/null +++ b/assets/figures/2025-vllm-sleep-mode/plotly-level2-switching.js @@ -0,0 +1,104 @@ +document.addEventListener('DOMContentLoaded', function() { + // Level 2 switching data + const level2SwitchingData = { + "ModelA": { + name: "Qwen3-0.6B", + wake: [0.91, 0.78, 0.85], + cold: [38.53, 37.21, 38.15] + }, + "ModelB": { + name: "Phi-3-vision-128k", + wake: [2.55, 2.62, 2.58], + cold: [58.52, 57.65, 58.2] + } + }; + + function calcStatsLevel2Switch(values) { + const mean = values.reduce((a, b) => a + b, 0) / values.length; + const min = Math.min(...values); + const max = Math.max(...values); + return { mean, errorMinus: mean - min, errorPlus: max - mean }; + } + + const modelsLevel2Switch = Object.keys(level2SwitchingData); + 
const wakeStatsLevel2Switch = modelsLevel2Switch.map(m => calcStatsLevel2Switch(level2SwitchingData[m].wake)); + const coldStatsLevel2Switch = modelsLevel2Switch.map(m => calcStatsLevel2Switch(level2SwitchingData[m].cold)); + + const wakeTraceLevel2Switch = { + x: modelsLevel2Switch.map(m => level2SwitchingData[m].name), + y: wakeStatsLevel2Switch.map(s => s.mean), + name: "Wake from Sleep (Level 2)", + type: "bar", + marker: { color: "#2ca02c" }, + error_y: { + type: "data", + symmetric: false, + array: wakeStatsLevel2Switch.map(s => s.errorPlus), + arrayminus: wakeStatsLevel2Switch.map(s => s.errorMinus), + color: "#1a5e1a", + thickness: 2, + width: 6 + }, + text: wakeStatsLevel2Switch.map(s => s.mean.toFixed(2) + "s"), + textposition: "outside", + textfont: { size: 12, color: "#2ca02c", weight: "bold" }, + hovertemplate: "%{x}
<br>Wake Time: %{y:.2f}s" + }; + + const coldTraceLevel2Switch = { + x: modelsLevel2Switch.map(m => level2SwitchingData[m].name), + y: coldStatsLevel2Switch.map(s => s.mean), + name: "Cold Start (Fresh Load)", + type: "bar", + marker: { color: "#d62728" }, + error_y: { + type: "data", + symmetric: false, + array: coldStatsLevel2Switch.map(s => s.errorPlus), + arrayminus: coldStatsLevel2Switch.map(s => s.errorMinus), + color: "#8b1518", + thickness: 2, + width: 6 + }, + text: coldStatsLevel2Switch.map(s => s.mean.toFixed(1) + "s"), + textposition: "outside", + textfont: { size: 12, color: "#d62728", weight: "bold" }, + hovertemplate: "%{x}<br>
Cold Start: %{y:.2f}s" + }; + + const speedupsLevel2Switch = wakeStatsLevel2Switch.map((w, i) => { + const speedup = (coldStatsLevel2Switch[i].mean / w.mean).toFixed(0); + return speedup; + }); + + Plotly.newPlot("plotly-level2-switching", [wakeTraceLevel2Switch, coldTraceLevel2Switch], { + barmode: "group", + bargap: 0.15, + bargroupgap: 0.1, + margin: { l: 60, r: 30, t: 40, b: 50 }, + xaxis: { + title: "", + tickangle: 0 + }, + yaxis: { + title: "Switching Time (seconds)", + range: [0, Math.max(...coldStatsLevel2Switch.map(s => s.mean + s.errorPlus)) * 1.15] + }, + hovermode: "closest", + legend: { + x: 0.5, + y: 1.15, + xanchor: "center", + yanchor: "top", + orientation: "h" + }, + annotations: modelsLevel2Switch.map((m, i) => ({ + x: level2SwitchingData[m].name, + y: coldStatsLevel2Switch[i].mean + coldStatsLevel2Switch[i].errorPlus + 3, + text: `${speedupsLevel2Switch[i]}x faster`, + showarrow: false, + font: { size: 11, color: "#2ca02c", weight: "bold" }, + xanchor: "center" + })) + }, {displayModeBar: true, responsive: true}); +}); diff --git a/assets/figures/2025-vllm-sleep-mode/plotly-sleep-levels-comparison.js b/assets/figures/2025-vllm-sleep-mode/plotly-sleep-levels-comparison.js new file mode 100644 index 0000000..9de47fc --- /dev/null +++ b/assets/figures/2025-vllm-sleep-mode/plotly-sleep-levels-comparison.js @@ -0,0 +1,154 @@ +document.addEventListener('DOMContentLoaded', function() { + // Sleep Levels Comparison timing data + const timingDataLevelsComp = { + "Sleep Mode (Level 1)": [ + { event: "A Model Load", duration: 36.27 }, + { event: "A Model Warm Up", duration: 2.53 }, + { event: "B Model Load", duration: 58.24 }, + { event: "B Model Warm Up", duration: 5.95 }, + { event: "A Model Wake up", duration: 0.25 }, + { event: "A Model Prompt", duration: 0.43 }, + { event: "A Model Sleep", duration: 0.09 }, + { event: "B Model Wake Up", duration: 0.82 }, + { event: "B Model Prompt", duration: 0.86 }, + { event: "B Model Sleep", duration: 0.41 }, + { 
event: "A Model Wake up", duration: 0.28 }, + { event: "A Model Prompt", duration: 0.41 }, + { event: "A Model Sleep", duration: 0.1 }, + { event: "B Model Wake Up", duration: 0.82 }, + { event: "B Model Prompt", duration: 0.71 }, + { event: "B Model Sleep", duration: 0.42 }, + { event: "A Model Wake up", duration: 0.25 }, + { event: "A Model Prompt", duration: 0.45 }, + { event: "A Model Sleep", duration: 0.09 }, + { event: "B Model Wake Up", duration: 0.83 }, + { event: "B Model Prompt", duration: 0.71 } + ], + "Sleep Mode (Level 2)": [ + { event: "A Model Load", duration: 38.55 }, + { event: "A Model Warm Up", duration: 2.53 }, + { event: "B Model Load", duration: 61.23 }, + { event: "B Model Warm Up", duration: 5.75 }, + { event: "A Model Wake up", duration: 0.91 }, + { event: "A Model Prompt", duration: 0.68 }, + { event: "A Model Sleep", duration: 0.13 }, + { event: "B Model Wake Up", duration: 2.55 }, + { event: "B Model Prompt", duration: 0.78 }, + { event: "B Model Sleep", duration: 0.46 }, + { event: "A Model Wake up", duration: 0.78 }, + { event: "A Model Prompt", duration: 0.46 }, + { event: "A Model Sleep", duration: 0.12 }, + { event: "B Model Wake Up", duration: 2.62 }, + { event: "B Model Prompt", duration: 0.77 }, + { event: "B Model Sleep", duration: 0.45 }, + { event: "A Model Wake up", duration: 0.85 }, + { event: "A Model Prompt", duration: 0.44 }, + { event: "A Model Sleep", duration: 0.09 }, + { event: "B Model Wake Up", duration: 2.58 }, + { event: "B Model Prompt", duration: 0.72 } + ], + "WITHOUT Sleep Mode": [ + { event: "A Model Load", duration: 38.53 }, + { event: "A Model Prompt", duration: 4.66 }, + { event: "B Model Load", duration: 58.52 }, + { event: "B Model Prompt", duration: 6.55 }, + { event: "A Model Load", duration: 37.21 }, + { event: "A Model Prompt", duration: 3.8 }, + { event: "B Model Load", duration: 57.65 }, + { event: "B Model Prompt", duration: 6.21 }, + { event: "A Model Load", duration: 38.15 }, + { event: "A Model 
Prompt", duration: 2.56 }, + { event: "B Model Load", duration: 58.2 }, + { event: "B Model Prompt", duration: 6.15 } + ] + }; + + // Convert to segment format + function createSegmentsLevelsComp(timingData) { + const segments = []; + + Object.entries(timingData).forEach(([scenario, events]) => { + let cumulativeTime = 0; + + events.forEach(({ event, duration }) => { + const [who, ...stageParts] = event.split(' '); + const stage = stageParts.join(' '); + + let action, category; + if (stage.includes('Load')) { + action = 'Load'; + category = `${who} Load`; + } else if (stage.includes('Wake')) { + action = 'Wake'; + category = `${who} Wake`; + } else if (stage.includes('Prompt')) { + action = 'Prompt'; + category = `${who} Prompt`; + } else if (stage.includes('Sleep')) { + action = 'Sleep'; + category = `${who} Sleep`; + } else if (stage.includes('Warm')) { + action = 'Load'; + category = `${who} Load`; + } + + segments.push({ + scenario, + who, + stage, + action, + start: cumulativeTime, + end: cumulativeTime + duration, + duration, + category + }); + + cumulativeTime += duration; + }); + }); + + return segments; + } + + const segmentsLevelsComp = createSegmentsLevelsComp(timingDataLevelsComp); + const colorMapLevelsComp = {"A Load": "#1f77b4", "B Load": "#ff7f0e", "A Wake": "#2ca02c", "B Wake": "#17becf", "A Sleep": "#9467bd", "B Sleep": "#8c564b", "A Prompt": "#e377c2", "B Prompt": "#7f7f7f"}; + const categoriesLevelsComp = Object.keys(colorMapLevelsComp); + + const xLevelsComp = segmentsLevelsComp.map(d => d.duration); + const baseLevelsComp = segmentsLevelsComp.map(d => d.start); + const yLevelsComp = segmentsLevelsComp.map(d => d.scenario); + const colorsLevelsComp = segmentsLevelsComp.map(d => colorMapLevelsComp[d.category]); + const customLevelsComp = segmentsLevelsComp.map(d => [d.scenario, d.category, d.stage, d.start, d.end]); + + const barsLevelsComp = { + type: "bar", + orientation: "h", + x: xLevelsComp, base: baseLevelsComp, y: yLevelsComp, + marker: { 
color: colorsLevelsComp, line: {width:1, color:"rgba(0,0,0,0.35)"} }, + hovertemplate: + "%{customdata[0]}
<br>%{customdata[1]} — %{customdata[2]}<br>
"+ + "Start %{customdata[3]:.2f}s → End %{customdata[4]:.2f}s<br>
"+ + "%{x:.2f}s", + customdata: customLevelsComp, + showlegend: false + }; + + const legendTracesLevelsComp = categoriesLevelsComp.map(name => ({ + type: "scatter", mode: "markers", x:[null], y:[null], + name, marker: {color: colorMapLevelsComp[name], size: 10}, + hoverinfo:"skip", showlegend:true + })); + + Plotly.newPlot("plotly-sleep-levels-comparison", [barsLevelsComp, ...legendTracesLevelsComp], { + barmode: "overlay", + bargap: 0.05, + margin: {l: 160, r: 30, t: 20, b: 40}, + xaxis: { title: "Time (seconds)", range: [0, 365] }, + yaxis: { + categoryorder: "array", + categoryarray: ["WITHOUT Sleep Mode", "Sleep Mode (Level 2)", "Sleep Mode (Level 1)"] + }, + hovermode: "closest", + dragmode: "pan" + }, {displayModeBar: true, responsive: true}); +}); diff --git a/assets/figures/2025-vllm-sleep-mode/plotly-sleep-mode-a4000.js b/assets/figures/2025-vllm-sleep-mode/plotly-sleep-mode-a4000.js new file mode 100644 index 0000000..d029412 --- /dev/null +++ b/assets/figures/2025-vllm-sleep-mode/plotly-sleep-mode-a4000.js @@ -0,0 +1,134 @@ +document.addEventListener('DOMContentLoaded', function() { + // A4000 GPU timing data + const timingDataA4000 = { + "WITH Sleep Mode (L1)": [ + { event: "A Model Load", duration: 21.01 }, + { event: "A Model Warm up", duration: 2.49 }, + { event: "B Model Load", duration: 46.01 }, + { event: "B Model Warm up", duration: 7.37 }, + { event: "A Model Wake up", duration: 0.11 }, + { event: "A Model Prompt", duration: 0.44 }, + { event: "A Model Sleep", duration: 0.13 }, + { event: "B Model Wake Up", duration: 0.8 }, + { event: "B Model Prompt", duration: 2.04 }, + { event: "B Model Sleep", duration: 0.68 }, + { event: "A Model Wake up", duration: 0.1 }, + { event: "A Model Prompt", duration: 0.43 }, + { event: "A Model Sleep", duration: 0.13 }, + { event: "B Model Wake Up", duration: 0.8 }, + { event: "B Model Prompt", duration: 1.73 }, + { event: "B Model Sleep", duration: 0.68 }, + { event: "A Model Wake up", duration: 0.1 }, + { 
event: "A Model Prompt", duration: 0.43 }, + { event: "A Model Sleep", duration: 0.13 }, + { event: "B Model Wake Up", duration: 0.8 }, + { event: "B Model Prompt", duration: 1.61 } + ], + "WITHOUT Sleep Mode": [ + { event: "A Model Load", duration: 21.04 }, + { event: "A Model Prompt", duration: 2.64 }, + { event: "B Model Load", duration: 46.01 }, + { event: "B Model Prompt", duration: 9.78 }, + { event: "A Model Load", duration: 20.98 }, + { event: "A Model Prompt", duration: 2.5 }, + { event: "B Model Load", duration: 46.02 }, + { event: "B Model Prompt", duration: 9.01 }, + { event: "A Model Load", duration: 20.98 }, + { event: "A Model Prompt", duration: 2.63 }, + { event: "B Model Load", duration: 46.02 }, + { event: "B Model Prompt", duration: 9.79 } + ] + }; + + // Convert simplified data to full segment format + function createSegmentsA4000(timingData) { + const segments = []; + + Object.entries(timingData).forEach(([scenario, events]) => { + let cumulativeTime = 0; + + events.forEach(({ event, duration }) => { + const [who, ...stageParts] = event.split(' '); + const stage = stageParts.join(' '); + + // Determine action and category from stage + let action, category; + if (stage.includes('Load')) { + action = 'Load'; + category = `${who} Load`; + } else if (stage.includes('Wake')) { + action = 'Wake'; + category = `${who} Wake`; + } else if (stage.includes('Prompt')) { + action = 'Prompt'; + category = `${who} Prompt`; + } else if (stage.includes('Sleep')) { + action = 'Sleep'; + category = `${who} Sleep`; + } else if (stage.includes('Warm up')) { + action = 'Load'; + category = `${who} Load`; + } + + segments.push({ + scenario, + who, + stage, + action, + start: cumulativeTime, + end: cumulativeTime + duration, + duration, + category + }); + + cumulativeTime += duration; + }); + }); + + return segments; + } + + const segmentsA4000 = createSegmentsA4000(timingDataA4000); + const colorMapA4000 = {"A Load": "#1f77b4", "B Load": "#ff7f0e", "A Wake": 
"#2ca02c", "B Wake": "#17becf", "A Sleep": "#9467bd", "B Sleep": "#8c564b", "A Prompt": "#e377c2", "B Prompt": "#7f7f7f"}; + const categoriesA4000 = Object.keys(colorMapA4000); + + // Build arrays for a single stacked-horizontal bar trace using "base" + const xA4000 = segmentsA4000.map(d => d.duration); + const baseA4000 = segmentsA4000.map(d => d.start); + const yA4000 = segmentsA4000.map(d => d.scenario); + const colorsA4000 = segmentsA4000.map(d => colorMapA4000[d.category]); + const customA4000 = segmentsA4000.map(d => [d.scenario, d.category, d.stage, d.start, d.end]); + + const barsA4000 = { + type: "bar", + orientation: "h", + x: xA4000, base: baseA4000, y: yA4000, + marker: { color: colorsA4000, line: {width:1, color:"rgba(0,0,0,0.35)"} }, + hovertemplate: + "%{customdata[0]}
<br>%{customdata[1]} — %{customdata[2]}<br>
"+ + "Start %{customdata[3]:.2f}s → End %{customdata[4]:.2f}s<br>
"+ + "%{x:.2f}s", + customdata: customA4000, + showlegend: false + }; + + // Legend-only dummies to produce a clean 8-item legend + const legendTracesA4000 = categoriesA4000.map(name => ({ + type: "scatter", mode: "markers", x:[null], y:[null], + name, marker: {color: colorMapA4000[name], size: 10}, + hoverinfo:"skip", showlegend:true + })); + + Plotly.newPlot("plotly-sleep-mode-a4000", [barsA4000, ...legendTracesA4000], { + barmode: "overlay", + bargap: 0.05, + margin: {l: 140, r: 30, t: 20, b: 40}, + xaxis: { title: "Time (seconds)", range: [0, 235] }, + yaxis: { + categoryorder: "array", + categoryarray: ["WITHOUT Sleep Mode", "WITH Sleep Mode (L1)"] + }, + hovermode: "closest", + dragmode: "pan" + }, {displayModeBar: true, responsive: true}); +}); diff --git a/assets/figures/2025-vllm-sleep-mode/plotly-sleep-mode.js b/assets/figures/2025-vllm-sleep-mode/plotly-sleep-mode.js new file mode 100644 index 0000000..ef92aa3 --- /dev/null +++ b/assets/figures/2025-vllm-sleep-mode/plotly-sleep-mode.js @@ -0,0 +1,131 @@ +document.addEventListener('DOMContentLoaded', function() { + const timingData = { + "WITH Sleep Mode (L1)": [ + { event: "A Model Load", duration: 97.61 }, + { event: "A Model Warm up", duration: 2.38 }, + { event: "B Model Load", duration: 47.63 }, + { event: "B Model Warm up", duration: 2.42 }, + { event: "A Model Wake up", duration: 5.66 }, + { event: "A Model Prompt", duration: 1.8 }, + { event: "A Model Sleep", duration: 6.01 }, + { event: "B Model Wake Up", duration: 2.89 }, + { event: "B Model Prompt", duration: 1 }, + { event: "B Model Sleep", duration: 2.78 }, + { event: "A Model Wake up", duration: 5.29 }, + { event: "A Model Prompt", duration: 1.7 }, + { event: "A Model Sleep", duration: 5.78 }, + { event: "B Model Wake Up", duration: 2.86 }, + { event: "B Model Prompt", duration: 0.93 }, + { event: "B Model Sleep", duration: 2.78 }, + { event: "A Model Wake up", duration: 5.27 }, + { event: "A Model Prompt", duration: 0.92 }, + { event: "A 
Model Sleep", duration: 5.89 }, + { event: "B Model Wake Up", duration: 2.85 }, + { event: "B Model Prompt", duration: 0.54 } + ], + "WITHOUT Sleep Mode": [ + { event: "A Model Load", duration: 97.9 }, + { event: "A Model Prompt", duration: 3.8 }, + { event: "B Model Load", duration: 47.33 }, + { event: "B Model Prompt", duration: 3.7 }, + { event: "A Model Load", duration: 97.4 }, + { event: "A Model Prompt", duration: 3.7 }, + { event: "B Model Load", duration: 47.47 }, + { event: "B Model Prompt", duration: 2.9 }, + { event: "A Model Load", duration: 97.71 }, + { event: "A Model Prompt", duration: 3.72 }, + { event: "B Model Load", duration: 47.46 }, + { event: "B Model Prompt", duration: 2.45 } + ] + }; + + function createSegments(timingData) { + const segments = []; + + Object.entries(timingData).forEach(([scenario, events]) => { + let cumulativeTime = 0; + + events.forEach(({ event, duration }) => { + const [who, ...stageParts] = event.split(' '); + const stage = stageParts.join(' '); + + // Determine action and category from stage + let action, category; + if (stage.includes('Load')) { + action = 'Load'; + category = `${who} Load`; + } else if (stage.includes('Wake')) { + action = 'Wake'; + category = `${who} Wake`; + } else if (stage.includes('Prompt')) { + action = 'Prompt'; + category = `${who} Prompt`; + } else if (stage.includes('Sleep')) { + action = 'Sleep'; + category = `${who} Sleep`; + } else if (stage.includes('Warm up')) { + action = 'Load'; + category = `${who} Load`; + } + + segments.push({ + scenario, + who, + stage, + action, + start: cumulativeTime, + end: cumulativeTime + duration, + duration, + category + }); + + cumulativeTime += duration; + }); + }); + + return segments; + } + + const segments = createSegments(timingData); + const colorMap = {"A Load": "#1f77b4", "B Load": "#ff7f0e", "A Wake": "#2ca02c", "B Wake": "#17becf", "A Sleep": "#9467bd", "B Sleep": "#8c564b", "A Prompt": "#e377c2", "B Prompt": "#7f7f7f"}; + const categories = 
Object.keys(colorMap); + + // Build arrays for a single stacked-horizontal bar trace using "base" + const x = segments.map(d => d.duration); + const base = segments.map(d => d.start); + const y = segments.map(d => d.scenario); + const colors = segments.map(d => colorMap[d.category]); + const custom = segments.map(d => [d.scenario, d.category, d.stage, d.start, d.end]); + + const bars = { + type: "bar", + orientation: "h", + x, base, y, + marker: { color: colors, line: {width:1, color:"rgba(0,0,0,0.35)"} }, + hovertemplate: + "%{customdata[0]}<br>%{customdata[1]} — %{customdata[2]}<br>"+ + "Start %{customdata[3]:.2f}s → End %{customdata[4]:.2f}s<br>"+ + "%{x:.2f}s", + customdata: custom, + showlegend: false + }; + + const legendTraces = categories.map(name => ({ + type: "scatter", mode: "markers", x:[null], y:[null], + name, marker: {color: colorMap[name], size: 10}, + hoverinfo:"skip", showlegend:true + })); + + Plotly.newPlot("plotly-sleep-mode", [bars, ...legendTraces], { + barmode: "overlay", + bargap: 0.05, + margin: {l: 140, r: 30, t: 20, b: 40}, + xaxis: { title: "Time (seconds)", range: [0, 478.32] }, + yaxis: { + categoryorder: "array", + categoryarray: ["WITHOUT Sleep Mode", "WITH Sleep Mode (L1)"] + }, + hovermode: "closest", + dragmode: "pan" + }, {displayModeBar: true, responsive: true}); +}); diff --git a/assets/figures/2025-vllm-sleep-mode/plotly-switching-a4000.js b/assets/figures/2025-vllm-sleep-mode/plotly-switching-a4000.js new file mode 100644 index 0000000..4013f62 --- /dev/null +++ b/assets/figures/2025-vllm-sleep-mode/plotly-switching-a4000.js @@ -0,0 +1,104 @@ +document.addEventListener('DOMContentLoaded', function() { + // A4000 Switching data + const switchingDataA4000 = { + "ModelA": { + name: "Qwen3-0.6B", + wake: [0.11, 0.1, 0.1], + cold: [21.04, 20.98, 20.98] + }, + "ModelB": { + name: "Phi-3-vision-128k(4B)", + wake: [0.8, 0.8, 0.8], + cold: [46.01, 46.02, 46.02] + } + }; + + function calcStatsSwitchA4000(values) { + const mean = values.reduce((a, b) => a + b, 0) / values.length; + const min = Math.min(...values); + const max = Math.max(...values); + return { mean, errorMinus: mean - min, errorPlus: max - mean }; + } + + const modelsSwitchA4000 = Object.keys(switchingDataA4000); + const wakeStatsSwitchA4000 = modelsSwitchA4000.map(m => calcStatsSwitchA4000(switchingDataA4000[m].wake)); + const coldStatsSwitchA4000 = modelsSwitchA4000.map(m => calcStatsSwitchA4000(switchingDataA4000[m].cold)); + + const wakeTraceSwitchA4000 = { + x: modelsSwitchA4000.map(m => switchingDataA4000[m].name), + y: wakeStatsSwitchA4000.map(s => s.mean), + name: "Wake from Sleep", + type: "bar", + 
marker: { color: "#2ca02c" }, + error_y: { + type: "data", + symmetric: false, + array: wakeStatsSwitchA4000.map(s => s.errorPlus), + arrayminus: wakeStatsSwitchA4000.map(s => s.errorMinus), + color: "#1a5e1a", + thickness: 2, + width: 6 + }, + text: wakeStatsSwitchA4000.map(s => s.mean.toFixed(2) + "s"), + textposition: "outside", + textfont: { size: 12, color: "#2ca02c", weight: "bold" }, + hovertemplate: "%{x}<br>Wake Time: %{y:.2f}s" + }; + + const coldTraceSwitchA4000 = { + x: modelsSwitchA4000.map(m => switchingDataA4000[m].name), + y: coldStatsSwitchA4000.map(s => s.mean), + name: "Cold Start (Fresh Load)", + type: "bar", + marker: { color: "#d62728" }, + error_y: { + type: "data", + symmetric: false, + array: coldStatsSwitchA4000.map(s => s.errorPlus), + arrayminus: coldStatsSwitchA4000.map(s => s.errorMinus), + color: "#8b1518", + thickness: 2, + width: 6 + }, + text: coldStatsSwitchA4000.map(s => s.mean.toFixed(1) + "s"), + textposition: "outside", + textfont: { size: 12, color: "#d62728", weight: "bold" }, + hovertemplate: "%{x}<br>Cold Start: %{y:.2f}s" + }; + + const speedupsSwitchA4000 = wakeStatsSwitchA4000.map((w, i) => { + const speedup = (coldStatsSwitchA4000[i].mean / w.mean).toFixed(0); + return speedup; + }); + + Plotly.newPlot("plotly-switching-a4000", [wakeTraceSwitchA4000, coldTraceSwitchA4000], { + barmode: "group", + bargap: 0.15, + bargroupgap: 0.1, + margin: { l: 60, r: 30, t: 40, b: 50 }, + xaxis: { + title: "", + tickangle: 0 + }, + yaxis: { + title: "Switching Time (seconds)", + range: [0, Math.max(...coldStatsSwitchA4000.map(s => s.mean + s.errorPlus)) * 1.15] + }, + hovermode: "closest", + legend: { + x: 0.5, + y: 1.15, + xanchor: "center", + yanchor: "top", + orientation: "h" + }, + annotations: modelsSwitchA4000.map((m, i) => ({ + x: switchingDataA4000[m].name, + y: coldStatsSwitchA4000[i].mean + coldStatsSwitchA4000[i].errorPlus + 3, + text: `${speedupsSwitchA4000[i]}x faster`, + showarrow: false, + font: { size: 11, color: "#2ca02c", weight: "bold" }, + xanchor: "center" + })) + }, {displayModeBar: true, responsive: true}); +}); diff --git a/assets/figures/2025-vllm-sleep-mode/plotly-switching-comparison.js b/assets/figures/2025-vllm-sleep-mode/plotly-switching-comparison.js new file mode 100644 index 0000000..3130701 --- /dev/null +++ b/assets/figures/2025-vllm-sleep-mode/plotly-switching-comparison.js @@ -0,0 +1,107 @@ +document.addEventListener('DOMContentLoaded', function() { + // Raw data: Wake Time vs Cold Start Time + const switchingData = { + "ModelA": { + name: "Qwen3-235B-A22B (TP=4)", + wake: [5.66, 5.29, 5.27], + cold: [97.9, 97.4, 97.71] + }, + "ModelB": { + name: "Qwen3-Coder-30B (TP=1)", + wake: [2.89, 2.86, 2.85], + cold: [47.33, 47.47, 47.46] + } + }; + + // Calculate mean and error bars for each model + function calcStatsSwitch(values) { + const mean = values.reduce((a, b) => a + b, 0) / values.length; + const min = Math.min(...values); + const max = Math.max(...values); + return { mean, errorMinus: mean - min, errorPlus: max - mean }; + } + + // 
Prepare traces for both models + const modelsSwitch = Object.keys(switchingData); + const wakeStatsSwitch = modelsSwitch.map(m => calcStatsSwitch(switchingData[m].wake)); + const coldStatsSwitch = modelsSwitch.map(m => calcStatsSwitch(switchingData[m].cold)); + + const wakeTraceSwitch = { + x: modelsSwitch.map(m => switchingData[m].name), + y: wakeStatsSwitch.map(s => s.mean), + name: "Wake from Sleep", + type: "bar", + marker: { color: "#2ca02c" }, + error_y: { + type: "data", + symmetric: false, + array: wakeStatsSwitch.map(s => s.errorPlus), + arrayminus: wakeStatsSwitch.map(s => s.errorMinus), + color: "#1a5e1a", + thickness: 2, + width: 6 + }, + text: wakeStatsSwitch.map(s => s.mean.toFixed(2) + "s"), + textposition: "outside", + textfont: { size: 12, color: "#2ca02c", weight: "bold" }, + hovertemplate: "%{x}<br>Wake Time: %{y:.2f}s" + }; + + const coldTraceSwitch = { + x: modelsSwitch.map(m => switchingData[m].name), + y: coldStatsSwitch.map(s => s.mean), + name: "Cold Start (Fresh Load)", + type: "bar", + marker: { color: "#d62728" }, + error_y: { + type: "data", + symmetric: false, + array: coldStatsSwitch.map(s => s.errorPlus), + arrayminus: coldStatsSwitch.map(s => s.errorMinus), + color: "#8b1518", + thickness: 2, + width: 6 + }, + text: coldStatsSwitch.map(s => s.mean.toFixed(1) + "s"), + textposition: "outside", + textfont: { size: 12, color: "#d62728", weight: "bold" }, + hovertemplate: "%{x}<br>Cold Start: %{y:.2f}s" + }; + + // Calculate speedup multiples for annotation + const speedupsSwitch = wakeStatsSwitch.map((w, i) => { + const speedup = (coldStatsSwitch[i].mean / w.mean).toFixed(0); + return speedup; + }); + + Plotly.newPlot("plotly-switching-comparison", [wakeTraceSwitch, coldTraceSwitch], { + barmode: "group", + bargap: 0.15, + bargroupgap: 0.1, + margin: { l: 60, r: 30, t: 40, b: 50 }, + xaxis: { + title: "", + tickangle: 0 + }, + yaxis: { + title: "Switching Time (seconds)", + range: [0, Math.max(...coldStatsSwitch.map(s => s.mean + s.errorPlus)) * 1.15] + }, + hovermode: "closest", + legend: { + x: 0.5, + y: 1.15, + xanchor: "center", + yanchor: "top", + orientation: "h" + }, + annotations: modelsSwitch.map((m, i) => ({ + x: switchingData[m].name, + y: coldStatsSwitch[i].mean + coldStatsSwitch[i].errorPlus + 5, + text: `${speedupsSwitch[i]}x faster`, + showarrow: false, + font: { size: 11, color: "#2ca02c", weight: "bold" }, + xanchor: "center" + })) + }, {displayModeBar: true, responsive: true}); +}); diff --git a/assets/figures/2025-vllm-sleep-mode/sleepmode.png b/assets/figures/2025-vllm-sleep-mode/sleepmode.png new file mode 100644 index 0000000..4a918ec Binary files /dev/null and b/assets/figures/2025-vllm-sleep-mode/sleepmode.png differ