diff --git a/_posts/2025-10-26-sleep-mode.md b/_posts/2025-10-26-sleep-mode.md
new file mode 100644
index 0000000..316207f
--- /dev/null
+++ b/_posts/2025-10-26-sleep-mode.md
@@ -0,0 +1,471 @@
+---
+layout: post
+title: "Zero-Reload Model Switching with vLLM Sleep Mode"
+author: "Embedded LLM"
+image: /assets/figures/2025-vllm-sleep-mode/sleepmode.png
+thumbnail-img: /assets/figures/2025-vllm-sleep-mode/sleepmode.png
+share-img: /assets/figures/2025-vllm-sleep-mode/sleepmode.png
+---
+
+## Introduction
+
+**The multi-model serving problem:** You have two LLMs that each fit on your GPU, but not both at once. Traditional solutions force a bad tradeoff:
+
+1. **Keep both models loaded** → Requires 2x the GPU memory (expensive, often impossible)
+2. **Reload models on-demand** → 30-100+ seconds per switch (slow, wasteful)
+
+
+
+**vLLM Sleep Mode offers a third way:** Models hibernate in seconds and wake up fast—delivering the efficiency of on-demand loading with the speed of persistent serving.
+
+### Two Sleep Levels for Different Needs
+
+- **Level 1:** Offloads weights to CPU RAM (fast wake time)
+- **Level 2:** Discards weights entirely (wake still takes only seconds, minimal RAM usage)
+
+Both levels are **18-200x faster** than full reload and work seamlessly with Tensor Parallelism (TP), Pipeline Parallelism (PP), and Expert Parallelism (EP).
+
+### Why Sleep Mode Beats Fast Weight Loaders
+
+Even with instant weight loading, every cold start pays hidden costs that Sleep Mode avoids:
+
+| Cost | Description | Fast Weight Loaders | Sleep Mode |
+|------|-------------|---------------------|------------|
+| 1. VRAM load time | Copying weights to GPU | ✅ Optimized | ✅ Preserved |
+| 2. Memory allocator setup | CUDA allocator initialization | ❌ Every time | ✅ Preserved |
+| 3. CUDA graph capture | Record execution graphs | ❌ Every time | ✅ Preserved |
+| 4. GPU kernel JIT compilation | DeepGEMM, FlashInfer, TorchInductor | ❌ Every time | ✅ Preserved (after initial warmup) |
+| 5. Cache warm-up | First-request overhead | ❌ Every time | ⚡ Quick re-warm |
+
+By keeping the process alive, Sleep Mode preserves infrastructure (#2-4) and avoids expensive reinitialization. This is why benchmarks show **Sleep Mode inference is 61-88% faster** than cold starts.
+
+**This post covers:**
+- Comprehensive benchmarks across model sizes (0.6B to 235B) and GPUs (A4000 to A100)
+- Technical deep-dives explaining the performance gains
+- Ablation studies on warm-up impact and FP8 quantization
+- Decision guide for choosing the right sleep level
+
+## Quick Start: Using Sleep Mode
+
+### Online Serving API
+
+Start two vLLM servers with Sleep Mode enabled:
+
+```bash
+# Terminal 1: Start Phi-3-vision
+export VLLM_SERVER_DEV_MODE=1
+vllm serve microsoft/Phi-3-vision-128k-instruct --enable-sleep-mode --port 8001
+
+# Terminal 2: Start Qwen3-0.6B
+export VLLM_SERVER_DEV_MODE=1
+vllm serve Qwen/Qwen3-0.6B --enable-sleep-mode --port 8002
+```
+
+### Sleep and Wake Models
+
+```bash
+# Put Phi-3-vision to sleep (Level 2 - minimal RAM usage)
+curl -X POST 'localhost:8001/sleep?level=2'
+
+# Put Qwen3-0.6B to sleep (Level 2)
+curl -X POST 'localhost:8002/sleep?level=2'
+
+# Wake up Phi-3-vision for inference
+curl -X POST 'localhost:8001/wake_up'
+curl -X POST 'localhost:8001/collective_rpc' \
+ -H 'Content-Type: application/json' \
+ -d '{"method":"reload_weights"}'
+
+# IMPORTANT: Reset prefix cache after waking (Level 2 only)
+curl -X POST 'localhost:8001/reset_prefix_cache'
+
+# Now run inference on Phi-3-vision...
+# (your inference requests here)
+
+# Put back to sleep when done
+curl -X POST 'localhost:8001/sleep?level=2'
+
+# Wake up Qwen3-0.6B
+curl -X POST 'localhost:8002/wake_up'
+# (Level 1 doesn't need reload_weights or reset_prefix_cache)
+
+# Run inference on Qwen3-0.6B...
+```
+
+> [!NOTE]
+> For Level 2 sleep, you must call `reload_weights` and `reset_prefix_cache` after waking. Level 1 sleep doesn't require these extra steps.
+
+> [!WARNING]
+> **Security:** The `/sleep`, `/wake_up`, `/collective_rpc`, and `/reset_prefix_cache` endpoints require `VLLM_SERVER_DEV_MODE=1` and should only be exposed in trusted networks. These administrative endpoints can disrupt service and are intended for closed environments like training clusters or backend applications.
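
The wake-up sequence differs by level, and getting the order wrong leaves a Level 2 server serving without reloaded weights or with a stale prefix cache. A small client-side helper (hypothetical, not part of vLLM; the endpoint paths match the curl calls above) makes the required ordering explicit:

```python
def wake_sequence(port: int, sleep_level: int) -> list[str]:
    """Ordered endpoints to call when waking a server slept at `sleep_level`."""
    base = f"http://localhost:{port}"
    calls = [f"{base}/wake_up"]
    if sleep_level == 2:
        # Level 2 discarded the weights, so reload them
        # (POST body: {"method": "reload_weights"}) and then
        # invalidate the prefix cache before serving traffic.
        calls.append(f"{base}/collective_rpc")
        calls.append(f"{base}/reset_prefix_cache")
    return calls
```

For Level 1, `wake_sequence(8002, 1)` returns only the `/wake_up` call, matching the note above.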
+
+## Performance Overview
+
+Let's see how Sleep Mode performs compared to traditional model reloading.
+
+### Sleep Mode L1 vs No Sleep Mode Performance
+
+The interactive chart below shows the **total time to perform 5 model switches**: running inference on Model A, switching to Model B, running inference on Model B, then repeating this pattern (A→B→A→B→A→B).
+
+**With Sleep Mode:** Models sleep/wake between switches, preserving infrastructure.
+**Without Sleep Mode:** Each switch requires a full vLLM restart and reload.
+
+
+
+
+
+ Model A: Qwen3-235B-A22B-Instruct-2507-FP8 (TP=4) | Model B: Qwen3-Coder-30B-A3B-Instruct (TP=1)
+ GPU: A100 | vLLM 0.11.0 | Sleep Level: 1 | Compilation: cudagraph_mode: FULL_AND_PIECEWISE
+
+
+
+
+
+## Inference Performance Boost
+
+Beyond faster model switching, Sleep Mode also delivers **faster inference times**. Because models are already warmed up when woken from sleep, they skip the cold start overhead that affects freshly loaded models.
+
+
+
+
+ Inference time comparison showing wake mode (already warmed up) vs cold start (just loaded).
+ Inference time = prefill + decode (first request after wake/load). Each request uses a different question to avoid caching, limited to 100 tokens output.
+ Error bars show min/max variation across multiple runs. Values displayed on bars.
+ GPU: A100 | vLLM 0.11.0 | Sleep Level: 1 | Compilation: cudagraph_mode: FULL_AND_PIECEWISE
+
+
+
+
+#### Why Sleep Mode Improves Inference Speed
+
+The 61-88% inference speedup isn't from faster weight loading—it's from **preserving expensive infrastructure** that cold starts must rebuild from scratch.
+
+**What Sleep Mode Preserves:**
+
+| Component | Preserved? | Cold Start Must Pay |
+|-----------|-----------|---------------------|
+| Memory allocator (CuMemAllocator) | ✅ Yes | ❌ Reinitialize every time |
+| CUDA graphs | ✅ Yes | ❌ Re-capture every time |
+| Process state (Python, CUDA context) | ✅ Yes | ❌ Restart every time |
+| GPU kernel JIT cache | ✅ Yes (after initial warmup) | ❌ Recompile every time |
+
+**The Critical Difference:**
+
+- **Without Sleep Mode:** Process dies on unload → **You CANNOT benefit from pre-warm-up**
+ - Must restart Python process and CUDA context
+ - Must reinitialize memory allocator
+ - Must re-capture CUDA graphs
+ - Must re-JIT compile kernels (DeepGEMM, FlashInfer, TorchInductor)
+ - **Result:** First inference is **4-7x slower** (see benchmarks: 0.92s wake vs 3.72s cold start)
+
+- **With Sleep Mode:** Process stays alive → **Pre-warm-up pays off**
+ - ✅ Allocator, graphs, process state, and JIT kernels all preserved after initial warmup
+ - **Result:** First inference stays fast (~1s), avoiding the 3-4s cold start penalty
+
+> [!NOTE]
+> Timing varies significantly by model size, GPU generation, and configuration. See the [Impact of Warm-Up](#impact-of-warm-up-on-sleep-mode) section for detailed measurements showing 5-7x slowdown without warm-up.
+
+## Model Switching Performance
+
+The most dramatic benefit of Sleep Mode is in model switching time. Waking a sleeping model is **18-20x faster** than loading a fresh vLLM instance.
+
+
+
+
+ Model switching time: Wake from sleep vs cold start (fresh load).
+ Error bars show min/max variation across multiple runs. Values displayed on bars.
+ GPU: A100 | vLLM 0.11.0 | Sleep Level: 1 | Compilation: cudagraph_mode: FULL_AND_PIECEWISE
+
+
+
+
+## Hardware Scalability: A4000 GPU Results
+
+Sleep Mode benefits aren't limited to high-end GPUs. Here's the same workload on an **A4000 GPU** with smaller models, demonstrating that the performance gains scale across different hardware tiers and model sizes.
+
+
+
+
+
+ Model A: Qwen3-0.6B | Model B: Phi-3-vision-128k-instruct
+ GPU: A4000 (TP=1) | vLLM 0.11.0 | Sleep Level: 1 | Compilation: cudagraph_mode: FULL_AND_PIECEWISE
+
+
+
+
+
+### A4000: Inference Performance
+
+
+
+
+ Inference time comparison on A4000: wake mode (already warmed up) vs cold start (just loaded).
+ Inference time = prefill + decode (first request after wake/load). Each request uses a different question to avoid caching, limited to 100 tokens output.
+ Error bars show min/max variation across multiple runs. Values displayed on bars.
+ GPU: A4000 (TP=1) | vLLM 0.11.0 | Sleep Level: 1 | Compilation: cudagraph_mode: FULL_AND_PIECEWISE
+
+
+
+
+### A4000: Model Switching Performance
+
+
+
+
+ Model switching time on A4000: Wake from sleep vs cold start (fresh load).
+ Error bars show min/max variation across multiple runs. Values displayed on bars.
+ GPU: A4000 (TP=1) | vLLM 0.11.0 | Sleep Level: 1 | Compilation: cudagraph_mode: FULL_AND_PIECEWISE
+
+
+
+
+**Key Observations on A4000:**
+- **Inference Performance:** Wake mode delivers 83% faster inference for Qwen3-0.6B and 81% faster for Phi-3-vision
+- **Model Switching:** Wake times are incredibly fast (~0.1-0.8s), achieving **58-203x speedup** vs cold starts
+- **Total time savings: 62%** (85s vs 226s for 5 model switches)
+- **Near-instant switching** for small models (0.1s wake time), making multi-model serving feel seamless
+- Demonstrates that Sleep Mode is effective across different GPU classes and model sizes
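
The 62% figure follows directly from the reported totals; a quick check:

```python
# Totals reported above for 5 model switches on the A4000.
sleep_total = 85.0      # seconds with Sleep Mode
no_sleep_total = 226.0  # seconds with full reloads

savings = (1 - sleep_total / no_sleep_total) * 100
print(f"{savings:.0f}% total time saved")  # 62%
```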
+
+## Sleep Levels: Choosing the Right Mode
+
+vLLM Sleep Mode offers two levels with different tradeoffs:
+
+**Level 1 (Default):** Offloads model weights to CPU memory, discards KV cache
+- **Fastest wake times** (~0.1-0.8s for small models, ~3-6s for large models)
+- **Requires sufficient CPU RAM** to store model weights
+- **Best for:** Systems with adequate CPU memory, frequent model switching
+
+**Level 2:** Discards model weights and KV cache, keeps only buffers (rope scaling tensors, etc.) in CPU
+- **Slower wake times** (~0.8-2.6s for small models) due to weight reload from disk
+- **Minimal CPU RAM usage** - only small buffers retained
+- **Best for:** Systems with limited CPU RAM or when managing many models that won't all fit in memory
+
+### Performance Comparison: Level 1 vs Level 2 vs No Sleep
+
+
+
+
+ Model A: Qwen3-0.6B | Model B: Phi-3-vision-128k-instruct
+ GPU: A100 (TP=1) | vLLM 0.11.0 | Compilation: cudagraph_mode: FULL_AND_PIECEWISE
+ Comparing all three modes: Level 1 (fastest), Level 2 (minimal RAM), No Sleep. Hover for exact timing.
+
+
+
+
+**Performance Summary:**
+
+| Mode | Total Time | Wake Time (A/B) | CPU RAM | Best For |
+|------|------------|-----------------|---------|----------|
+| **No Sleep** | 357.1s | N/A (full reload) | Minimal | Single model, no switching |
+| **Level 1** | 112.6s | 0.26s / 0.82s | High (~GB per model) | Frequent switching, ample RAM |
+| **Level 2** | 124.6s | 0.85s / 2.58s | Minimal (~MB per model) | Limited RAM, cost optimization |
+
+**Key Insights:**
+- **Level 1 is fastest** (68% faster than no sleep) but needs significant CPU RAM
+- **Level 2 is nearly as fast** (65% faster than no sleep) with minimal RAM requirements
+- **Level 2 wake is ~3x slower than Level 1** (0.85s vs 0.26s for Qwen3-0.6B) due to weight reload
+- Both sleep modes deliver **massive improvements** over no sleep mode
+
+#### Why Level 2 is Still Faster Than No Sleep Mode
+
+At first glance, this seems counterintuitive: **Level 2 reloads weights from SSD** (just like "No Sleep Mode"), so why are its switches **23-45x faster?**
+
+**The Answer: Weight loading is only ONE of FIVE costs**
+
+When you reload a model without Sleep Mode, you pay all these costs:
+
+| Cost | Level 2 | No Sleep Mode |
+|------|---------|---------------|
+| 1. Weight load (SSD → VRAM) | ❌ Must pay | ❌ Must pay |
+| 2. Process initialization | ✅ **Skipped** | ❌ Must pay |
+| 3. Memory allocator setup | ✅ **Skipped** | ❌ Must pay |
+| 4. CUDA graph capture | ✅ **Skipped** | ❌ Must pay |
+| 5. GPU kernel JIT compilation | ✅ **Preserved (already compiled)** | ❌ Full compilation + warm-up |
+
+**Level 2 Strategy:**
+- Weight reload from SSD (same as No Sleep)
+- **Everything else preserved:** Process state, allocator instance, CUDA graphs, and compiled JIT kernels all intact
+- **No recompilation needed:** Kernels were compiled during initial warmup and remain cached
+- **Average per switch: ~2.6s** (see benchmark data above)
+
+**No Sleep Mode Reality:**
+- Weight reload from SSD (same as Level 2)
+- **Everything else rebuilt:** Process restart + allocator init + graph re-capture
+- **JIT kernels:** Full compilation + explicit warm-up routine (`kernel_warmup()` + dummy runs)
+- **Average per switch: ~48s** (see benchmark data above)
+
+**The benchmark data proves it:** For 5 model switches:
+- **Level 2:** 124.6s total (average ~2.6s per switch)
+- **No Sleep:** 357.1s total (average ~48s per switch)
+
+Even though both reload weights from SSD, Level 2 is **2.9x faster overall** because it preserves the expensive infrastructure (process state, allocator, CUDA graphs) that No Sleep Mode must rebuild from scratch every single time.
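
The headline ratios can be recomputed from the totals and per-switch averages above:

```python
# 5-switch totals and per-switch averages from the benchmark tables.
no_sleep_total, level2_total = 357.1, 124.6
print(round(no_sleep_total / level2_total, 1))  # 2.9 (overall speedup)

print(round(37.6 / 0.85))  # ~44x per switch for Qwen3-0.6B (the tables round to 45x)
print(round(58.1 / 2.58))  # ~23x per switch for Phi-3-vision
```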
+
+### Level 2: Inference Performance
+
+
+
+
+ Inference time comparison with Sleep Level 2: wake mode vs cold start.
+ Inference time = prefill + decode (first request after wake/load). Each request uses a different question to avoid caching, limited to 100 tokens output.
+ Error bars show min/max variation across multiple runs. Values displayed on bars.
+ GPU: A100 (TP=1) | vLLM 0.11.0 | Sleep Level: 2 | Compilation: cudagraph_mode: FULL_AND_PIECEWISE
+
+
+
+
+### Level 2: Model Switching Performance
+
+
+
+
+ Model switching time with Sleep Level 2: wake from sleep vs cold start.
+ Error bars show min/max variation across multiple runs. Values displayed on bars.
+ GPU: A100 (TP=1) | vLLM 0.11.0 | Sleep Level: 2 | Compilation: cudagraph_mode: FULL_AND_PIECEWISE
+
+
+
+
+**Key Observations:**
+
+| Metric | No Sleep | Level 2 | Improvement |
+|--------|----------|---------|-------------|
+| **Total Time (5 switches)** | 357.1s | 124.6s | **65% faster** |
+| **Qwen3-0.6B Switch Time** | 37.6s avg | 0.85s avg | **45x faster** |
+| **Phi-3-vision Switch Time** | 58.1s avg | 2.58s avg | **23x faster** |
+| **Qwen3-0.6B Inference** | 3.67s avg | 0.53s avg | **86% faster** |
+| **Phi-3-vision Inference** | 6.30s avg | 0.76s avg | **88% faster** |
+| **Wake Time vs Level 1** | - | ~3x slower | Trade CPU RAM for speed |
+
+**When to Use Level 2:**
+- **Limited CPU RAM:** System cannot hold all model weights in CPU memory
+- **Cost Optimization:** Cheaper cloud instances with less CPU RAM
+- **Many Models:** Switching between many models where CPU memory is a constraint
+- **Still Significant Gains:** Even with weight reload, Level 2 is 23-45x faster than no sleep mode
+
+**Level 1 vs Level 2 Comparison:**
+- Level 1: ~0.1-0.8s wake time (small models), needs CPU RAM roughly equal to the model's weight size (a few GB to 100GB+ per model)
+- Level 2: ~0.8-2.6s wake time, needs only ~MB CPU RAM per model
+- Both dramatically faster than full reload (~20-100s)
+
+## Ablation Studies
+
+### Impact of Warm-Up on Sleep Mode
+
+Does skipping the warm-up phase affect performance? Warm-up pre-compiles CUDA graphs during initial load, which can take several seconds. Let's compare with and without warm-up.
+
+
+
+
+ Model A: Qwen3-0.6B | Model B: Phi-3-vision-128k-instruct
+ GPU: A100 (TP=1) | vLLM 0.11.0 | Sleep Level: 1 | Compilation: cudagraph_mode: FULL_AND_PIECEWISE
+ Comparing with warm-up (pre-compiled) vs without warm-up (lazy compilation). Hover for exact timing.
+
+
+
+
+**Key Findings:**
+
+| Metric | With Warm-Up | Without Warm-Up | Difference |
+|--------|--------------|-----------------|------------|
+| **Initial Load Time** | 108.7s (includes 8.4s warm-up) | 101.1s (no warm-up) | 7.6s saved initially |
+| **First Inference (A)** | 0.45s | 2.59s | **5.8x slower** without warm-up |
+| **First Inference (B)** | 0.93s | 6.61s | **7.1x slower** without warm-up |
+| **Subsequent Inferences** | 0.43s avg | 0.41s avg | No difference |
+| **Total Time (5 switches)** | 119.5s | 119.0s | Nearly identical |
+
+**Insights:**
+- **Warm-Up Compiles Kernels Once, Benefits All Wake Cycles:** With initial warmup, JIT compilation and CUDA graph capture happen once during load and are preserved across all subsequent sleep/wake cycles
+- **Without Warm-Up, the First Inference Pays the Compilation Cost:** The 5-7x slowdown lands on the first request after each model's initial load; once paid, the lazily compiled kernels persist across later sleep/wake cycles, which is why the totals end up nearly identical
+- **Compiled Kernels Are Preserved Across Sleep/Wake:** After warmup during initial load (8.4s), all subsequent wake-ups have fast first inference (0.45s, 0.93s) proving kernels stay cached
+- **Minimal Warmup Sufficient:** A single 1-token inference is enough to trigger full JIT compilation and CUDA graph capture, making warmup very cheap
+- **Trade Initial Load Time for Consistent Performance:** The 8.4s warmup cost is paid once and amortized across all model switches
+- **Recommendation: Always Use Warm-Up** for production workloads where consistent, fast inference is expected
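
The near-identical totals are no coincidence: the one-time warm-up cost roughly equals the first-inference penalties it eliminates. A quick check against the table:

```python
warmup_cost = 8.4        # one-time warm-up during initial load (s)
penalty_a = 2.59 - 0.45  # extra first-inference time for Model A without warm-up
penalty_b = 6.61 - 0.93  # extra first-inference time for Model B without warm-up

deferred = penalty_a + penalty_b
print(round(deferred, 2))  # 7.82s deferred cost vs 8.4s paid up front
```

Warm-up simply moves the compilation cost to load time, where it does not disturb request latency.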
+
+### Impact of Quantization on Sleep Mode
+
+Does quantization (FP8) affect Sleep Mode performance? We tested the same workload with and without FP8 quantization on A100 GPU.
+
+
+
+
+ Model A: Qwen3-0.6B | Model B: Phi-3-vision-128k-instruct
+ GPU: A100 (TP=1) | vLLM 0.11.0 | Sleep Level: 1 | Compilation: cudagraph_mode: FULL_AND_PIECEWISE
+ Comparing BF16 (baseline) vs FP8 quantization. Hover for exact timing.
+
+
+
+
+### Ablation: Inference Performance (BF16 vs FP8)
+
+
+
+
+ Inference time comparison: BF16 vs FP8 quantization with Sleep Mode.
+ Inference time = prefill + decode (first request after wake/load). Each request uses a different question to avoid caching, limited to 100 tokens output.
+ Error bars show min/max variation across multiple runs. Values displayed on bars.
+ GPU: A100 (TP=1) | vLLM 0.11.0 | Sleep Level: 1 | Compilation: cudagraph_mode: FULL_AND_PIECEWISE
+
+
+
+
+### Ablation: Model Switching (BF16 vs FP8)
+
+
+
+
+ Model switching time: BF16 vs FP8 quantization with Sleep Mode.
+ Error bars show min/max variation across multiple runs. Values displayed on bars.
+ GPU: A100 (TP=1) | vLLM 0.11.0 | Sleep Level: 1 | Compilation: cudagraph_mode: FULL_AND_PIECEWISE
+
+
+
+
+**Key Findings:**
+
+| Metric | BF16 | FP8 | Improvement |
+|--------|------|-----|-------------|
+| **Total Time (5 switches)** | 108.2s | 113.6s | -5% (slightly slower) |
+| **Qwen3-0.6B Wake Time** | 0.27s avg | 0.18s avg | **33% faster** |
+| **Phi-3-vision Wake Time** | 0.90s avg | 0.78s avg | **13% faster** |
+| **Qwen3-0.6B Inference** | 0.41s avg | 0.44s avg | -7% (slightly slower) |
+| **Phi-3-vision Inference** | 0.81s avg | 0.57s avg | **30% faster** |
+| **Initial Load Time** | 90.5s | 96.9s | -7% (longer warmup) |
+
+**Insights:**
+- **FP8 has faster wake operations** (13-33% faster) due to less memory movement
+- **FP8 improves inference for larger models** (30% faster for Phi-3-vision) but shows minimal difference for tiny models
+- **Initial load takes longer with FP8** due to quantization overhead during warmup
+- **After initial load, FP8 provides smoother switching** with faster wake cycles
+- For workloads with frequent switching, FP8's faster wake times can offset the longer initial load
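
As a back-of-envelope estimate (using the averaged table values above, so treat it as indicative only), FP8's longer initial load is repaid after roughly 15 A+B switch cycles:

```python
extra_load = 96.9 - 90.5  # FP8's additional initial load time (s)

# Net saving per A+B switch cycle: faster wakes for both models,
# slightly slower Qwen3-0.6B inference, faster Phi-3-vision inference.
per_cycle = (0.27 - 0.18) + (0.90 - 0.78) + (0.41 - 0.44) + (0.81 - 0.57)

print(round(extra_load / per_cycle))  # ~15 cycles to break even
```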
+
+## Decision Guide: Which Sleep Level to Use?
+
+### Use Sleep Level 1 When:
+- You have sufficient CPU RAM to hold all model weights
+- You need the fastest possible wake times (0.1-6s)
+- You're switching models very frequently (every few seconds/minutes)
+- Inference latency consistency is critical
+
+### Use Sleep Level 2 When:
+- CPU RAM is limited (can't hold all model weights)
+- You're optimizing cloud costs (cheaper instances with less RAM)
+- You have many models to manage (10+)
+
+### Skip Sleep Mode When:
+- You're only using a single model (no switching needed)
+- Model switches are extremely rare (once per day/week)
+- Both models fit simultaneously in GPU memory
+
+## Conclusion
+
+vLLM Sleep Mode transforms multi-model GPU serving from a 30-100 second reload penalty into switches that take a few seconds or less. The benchmarks speak for themselves:
+
+- **18-200x faster model switching** depending on model size and hardware
+- **61-88% faster inference** for warmed models vs cold starts
+- **65-68% total time savings** across complete workloads
+- **Works at every scale:** 0.6B to 235B parameters, small and large GPUs
+
+The future of LLM serving is multi-model. Sleep Mode makes it practical today.
+
+## Acknowledgements
+
+Special thanks to **Vensen Mu**, **Jeff Aw**, **Jun Kang Chow**, **Tun Jian Tan**, **Pin Siang Tan**, **Amir Balwel**, **Ye Hur Cheong**, **Zhiyao Cen** and **Kaichao You** for developing the Sleep Mode feature and this blog post.
diff --git a/assets/figures/2025-vllm-sleep-mode/plotly-ablation-inference.js b/assets/figures/2025-vllm-sleep-mode/plotly-ablation-inference.js
new file mode 100644
index 0000000..0f9b772
--- /dev/null
+++ b/assets/figures/2025-vllm-sleep-mode/plotly-ablation-inference.js
@@ -0,0 +1,91 @@
+document.addEventListener('DOMContentLoaded', function() {
+ // Ablation inference data: BF16 vs FP8
+ const ablationInferenceData = {
+ "ModelA": {
+ name: "Qwen3-0.6B",
+ bf16: [0.41, 0.4, 0.41],
+ fp8: [0.43, 0.43, 0.45]
+ },
+ "ModelB": {
+ name: "Phi-3-vision-128k",
+ bf16: [0.9, 0.74, 0.8],
+ fp8: [0.69, 0.59, 0.44]
+ }
+ };
+
+ function calcStatsAblInf(values) {
+ const mean = values.reduce((a, b) => a + b, 0) / values.length;
+ const min = Math.min(...values);
+ const max = Math.max(...values);
+ return { mean, errorMinus: mean - min, errorPlus: max - mean };
+ }
+
+ const modelsAblInf = Object.keys(ablationInferenceData);
+ const bf16StatsInf = modelsAblInf.map(m => calcStatsAblInf(ablationInferenceData[m].bf16));
+ const fp8StatsInf = modelsAblInf.map(m => calcStatsAblInf(ablationInferenceData[m].fp8));
+
+ const bf16TraceInf = {
+ x: modelsAblInf.map(m => ablationInferenceData[m].name),
+ y: bf16StatsInf.map(s => s.mean),
+ name: "BF16",
+ type: "bar",
+ marker: { color: "#1f77b4" },
+ error_y: {
+ type: "data",
+ symmetric: false,
+ array: bf16StatsInf.map(s => s.errorPlus),
+ arrayminus: bf16StatsInf.map(s => s.errorMinus),
+ color: "#0d4a6e",
+ thickness: 2,
+ width: 6
+ },
+ text: bf16StatsInf.map(s => s.mean.toFixed(2) + "s"),
+ textposition: "outside",
+ textfont: { size: 12, color: "#1f77b4", weight: "bold" },
+ hovertemplate: "%{x}<br>BF16: %{y:.2f}s"
+ };
+
+ const fp8TraceInf = {
+ x: modelsAblInf.map(m => ablationInferenceData[m].name),
+ y: fp8StatsInf.map(s => s.mean),
+ name: "FP8",
+ type: "bar",
+ marker: { color: "#ff7f0e" },
+ error_y: {
+ type: "data",
+ symmetric: false,
+ array: fp8StatsInf.map(s => s.errorPlus),
+ arrayminus: fp8StatsInf.map(s => s.errorMinus),
+ color: "#cc6600",
+ thickness: 2,
+ width: 6
+ },
+ text: fp8StatsInf.map(s => s.mean.toFixed(2) + "s"),
+ textposition: "outside",
+ textfont: { size: 12, color: "#ff7f0e", weight: "bold" },
+ hovertemplate: "%{x}<br>FP8: %{y:.2f}s"
+ };
+
+ Plotly.newPlot("plotly-ablation-inference", [bf16TraceInf, fp8TraceInf], {
+ barmode: "group",
+ bargap: 0.15,
+ bargroupgap: 0.1,
+ margin: { l: 60, r: 30, t: 40, b: 50 },
+ xaxis: {
+ title: "",
+ tickangle: 0
+ },
+ yaxis: {
+ title: "Inference Time (seconds)",
+ range: [0, Math.max(...bf16StatsInf.map(s => s.mean + s.errorPlus), ...fp8StatsInf.map(s => s.mean + s.errorPlus)) * 1.25]
+ },
+ hovermode: "closest",
+ legend: {
+ x: 0.5,
+ y: 1.15,
+ xanchor: "center",
+ yanchor: "top",
+ orientation: "h"
+ }
+ }, {displayModeBar: true, responsive: true});
+});
diff --git a/assets/figures/2025-vllm-sleep-mode/plotly-ablation-quant.js b/assets/figures/2025-vllm-sleep-mode/plotly-ablation-quant.js
new file mode 100644
index 0000000..85a4ec8
--- /dev/null
+++ b/assets/figures/2025-vllm-sleep-mode/plotly-ablation-quant.js
@@ -0,0 +1,140 @@
+document.addEventListener('DOMContentLoaded', function() {
+ // Ablation study: BF16 vs FP8 quantization
+ const timingDataAblation = {
+ "Sleep Mode (BF16)": [
+ { event: "A Model Load", duration: 32.56 },
+ { event: "A Model Warm Up", duration: 2.69 },
+ { event: "B Model Load", duration: 57.96 },
+ { event: "B Model Warm Up", duration: 5.92 },
+ { event: "A Model Wake up", duration: 0.28 },
+ { event: "A Model Prompt", duration: 0.41 },
+ { event: "A Model Sleep", duration: 0.09 },
+ { event: "B Model Wake Up", duration: 0.89 },
+ { event: "B Model Prompt", duration: 0.9 },
+ { event: "B Model Sleep", duration: 0.48 },
+ { event: "A Model Wake up", duration: 0.27 },
+ { event: "A Model Prompt", duration: 0.4 },
+ { event: "A Model Sleep", duration: 0.1 },
+ { event: "B Model Wake Up", duration: 0.93 },
+ { event: "B Model Prompt", duration: 0.74 },
+ { event: "B Model Sleep", duration: 0.5 },
+ { event: "A Model Wake up", duration: 0.27 },
+ { event: "A Model Prompt", duration: 0.41 },
+ { event: "A Model Sleep", duration: 0.1 },
+ { event: "B Model Wake Up", duration: 0.88 },
+ { event: "B Model Prompt", duration: 0.8 }
+ ],
+ "Sleep Mode (FP8)": [
+ { event: "A Model Load", duration: 37.71 },
+ { event: "A Model Warm Up", duration: 2.34 },
+ { event: "B Model Load", duration: 57.79 },
+ { event: "B Model Warm Up", duration: 6.37 },
+ { event: "A Model Wake up", duration: 0.18 },
+ { event: "A Model Prompt", duration: 0.43 },
+ { event: "A Model Sleep", duration: 0.06 },
+ { event: "B Model Wake Up", duration: 0.79 },
+ { event: "B Model Prompt", duration: 0.69 },
+ { event: "B Model Sleep", duration: 0.31 },
+ { event: "A Model Wake up", duration: 0.19 },
+ { event: "A Model Prompt", duration: 0.43 },
+ { event: "A Model Sleep", duration: 0.06 },
+ { event: "B Model Wake Up", duration: 0.77 },
+ { event: "B Model Prompt", duration: 0.59 },
+ { event: "B Model Sleep", duration: 0.31 },
+ { event: "A Model Wake up", duration: 0.16 },
+ { event: "A Model Prompt", duration: 0.45 },
+ { event: "A Model Sleep", duration: 0.07 },
+ { event: "B Model Wake Up", duration: 0.78 },
+ { event: "B Model Prompt", duration: 0.44 }
+ ]
+ };
+
+ // Convert to segment format
+ function createSegmentsAblation(timingData) {
+ const segments = [];
+
+ Object.entries(timingData).forEach(([scenario, events]) => {
+ let cumulativeTime = 0;
+
+ events.forEach(({ event, duration }) => {
+ const [who, ...stageParts] = event.split(' ');
+ const stage = stageParts.join(' ');
+
+ let action, category;
+ if (stage.includes('Load')) {
+ action = 'Load';
+ category = `${who} Load`;
+ } else if (stage.includes('Wake')) {
+ action = 'Wake';
+ category = `${who} Wake`;
+ } else if (stage.includes('Prompt')) {
+ action = 'Prompt';
+ category = `${who} Prompt`;
+ } else if (stage.includes('Sleep')) {
+ action = 'Sleep';
+ category = `${who} Sleep`;
+ } else if (stage.includes('Warm')) {
+ action = 'Load';
+ category = `${who} Load`;
+ }
+
+ segments.push({
+ scenario,
+ who,
+ stage,
+ action,
+ start: cumulativeTime,
+ end: cumulativeTime + duration,
+ duration,
+ category
+ });
+
+ cumulativeTime += duration;
+ });
+ });
+
+ return segments;
+ }
+
+ const segmentsAblation = createSegmentsAblation(timingDataAblation);
+ const colorMapAblation = {"A Load": "#1f77b4", "B Load": "#ff7f0e", "A Wake": "#2ca02c", "B Wake": "#17becf", "A Sleep": "#9467bd", "B Sleep": "#8c564b", "A Prompt": "#e377c2", "B Prompt": "#7f7f7f"};
+ const categoriesAblation = Object.keys(colorMapAblation);
+
+ const xAblation = segmentsAblation.map(d => d.duration);
+ const baseAblation = segmentsAblation.map(d => d.start);
+ const yAblation = segmentsAblation.map(d => d.scenario);
+ const colorsAblation = segmentsAblation.map(d => colorMapAblation[d.category]);
+ const customAblation = segmentsAblation.map(d => [d.scenario, d.category, d.stage, d.start, d.end]);
+
+ const barsAblation = {
+ type: "bar",
+ orientation: "h",
+ x: xAblation, base: baseAblation, y: yAblation,
+ marker: { color: colorsAblation, line: {width:1, color:"rgba(0,0,0,0.35)"} },
+ hovertemplate:
+ "%{customdata[0]}<br>%{customdata[1]} — %{customdata[2]}<br>" +
+ "Start %{customdata[3]:.2f}s → End %{customdata[4]:.2f}s<br>" +
+ "%{x:.2f}s",
+ customdata: customAblation,
+ showlegend: false
+ };
+
+ const legendTracesAblation = categoriesAblation.map(name => ({
+ type: "scatter", mode: "markers", x:[null], y:[null],
+ name, marker: {color: colorMapAblation[name], size: 10},
+ hoverinfo:"skip", showlegend:true
+ }));
+
+ Plotly.newPlot("plotly-ablation-quant", [barsAblation, ...legendTracesAblation], {
+ barmode: "overlay",
+ bargap: 0.05,
+ margin: {l: 140, r: 30, t: 20, b: 40},
+ xaxis: { title: "Time (seconds)", range: [0, 115] },
+ yaxis: {
+ categoryorder: "array",
+ categoryarray: ["Sleep Mode (FP8)", "Sleep Mode (BF16)"]
+ },
+ hovermode: "closest",
+ dragmode: "pan"
+ }, {displayModeBar: true, responsive: true});
+});
diff --git a/assets/figures/2025-vllm-sleep-mode/plotly-ablation-switching.js b/assets/figures/2025-vllm-sleep-mode/plotly-ablation-switching.js
new file mode 100644
index 0000000..e2f0f94
--- /dev/null
+++ b/assets/figures/2025-vllm-sleep-mode/plotly-ablation-switching.js
@@ -0,0 +1,105 @@
+document.addEventListener('DOMContentLoaded', function() {
+ // Ablation switching data: BF16 vs FP8
+ const ablationSwitchingData = {
+ "ModelA": {
+ name: "Qwen3-0.6B",
+ bf16: [0.28, 0.27, 0.27],
+ fp8: [0.18, 0.19, 0.16]
+ },
+ "ModelB": {
+ name: "Phi-3-vision-128k",
+ bf16: [0.89, 0.93, 0.88],
+ fp8: [0.79, 0.77, 0.78]
+ }
+ };
+
+ function calcStatsAblSwitch(values) {
+ const mean = values.reduce((a, b) => a + b, 0) / values.length;
+ const min = Math.min(...values);
+ const max = Math.max(...values);
+ return { mean, errorMinus: mean - min, errorPlus: max - mean };
+ }
+
+ const modelsAblSwitch = Object.keys(ablationSwitchingData);
+ const bf16StatsSwitch = modelsAblSwitch.map(m => calcStatsAblSwitch(ablationSwitchingData[m].bf16));
+ const fp8StatsSwitch = modelsAblSwitch.map(m => calcStatsAblSwitch(ablationSwitchingData[m].fp8));
+
+ const bf16TraceSwitch = {
+ x: modelsAblSwitch.map(m => ablationSwitchingData[m].name),
+ y: bf16StatsSwitch.map(s => s.mean),
+ name: "BF16",
+ type: "bar",
+ marker: { color: "#1f77b4" },
+ error_y: {
+ type: "data",
+ symmetric: false,
+ array: bf16StatsSwitch.map(s => s.errorPlus),
+ arrayminus: bf16StatsSwitch.map(s => s.errorMinus),
+ color: "#0d4a6e",
+ thickness: 2,
+ width: 6
+ },
+ text: bf16StatsSwitch.map(s => s.mean.toFixed(2) + "s"),
+ textposition: "outside",
+ textfont: { size: 12, color: "#1f77b4", weight: "bold" },
+ hovertemplate: "%{x}<br>BF16: %{y:.2f}s"
+ };
+
+ const fp8TraceSwitch = {
+ x: modelsAblSwitch.map(m => ablationSwitchingData[m].name),
+ y: fp8StatsSwitch.map(s => s.mean),
+ name: "FP8",
+ type: "bar",
+ marker: { color: "#ff7f0e" },
+ error_y: {
+ type: "data",
+ symmetric: false,
+ array: fp8StatsSwitch.map(s => s.errorPlus),
+ arrayminus: fp8StatsSwitch.map(s => s.errorMinus),
+ color: "#cc6600",
+ thickness: 2,
+ width: 6
+ },
+ text: fp8StatsSwitch.map(s => s.mean.toFixed(2) + "s"),
+ textposition: "outside",
+ textfont: { size: 12, color: "#ff7f0e", weight: "bold" },
+ hovertemplate: "%{x}<br>FP8: %{y:.2f}s"
+ };
+
+ // Calculate speedup percentages for annotation
+ const speedupsSwitchAbl = bf16StatsSwitch.map((bf16, i) => {
+ const reduction = ((bf16.mean - fp8StatsSwitch[i].mean) / bf16.mean * 100).toFixed(0);
+ return reduction;
+ });
+
+ Plotly.newPlot("plotly-ablation-switching", [bf16TraceSwitch, fp8TraceSwitch], {
+ barmode: "group",
+ bargap: 0.15,
+ bargroupgap: 0.1,
+ margin: { l: 60, r: 30, t: 40, b: 50 },
+ xaxis: {
+ title: "",
+ tickangle: 0
+ },
+ yaxis: {
+ title: "Wake Time (seconds)",
+ range: [0, Math.max(...bf16StatsSwitch.map(s => s.mean + s.errorPlus)) * 1.3]
+ },
+ hovermode: "closest",
+ legend: {
+ x: 0.5,
+ y: 1.15,
+ xanchor: "center",
+ yanchor: "top",
+ orientation: "h"
+ },
+ annotations: modelsAblSwitch.map((m, i) => ({
+ x: ablationSwitchingData[m].name,
+ y: bf16StatsSwitch[i].mean + bf16StatsSwitch[i].errorPlus + 0.07,
+ text: `${speedupsSwitchAbl[i]}% faster`,
+ showarrow: false,
+ font: { size: 11, color: "#ff7f0e", weight: "bold" },
+ xanchor: "center"
+ }))
+ }, {displayModeBar: true, responsive: true});
+});
diff --git a/assets/figures/2025-vllm-sleep-mode/plotly-ablation-warmup.js b/assets/figures/2025-vllm-sleep-mode/plotly-ablation-warmup.js
new file mode 100644
index 0000000..2469df2
--- /dev/null
+++ b/assets/figures/2025-vllm-sleep-mode/plotly-ablation-warmup.js
@@ -0,0 +1,138 @@
+document.addEventListener('DOMContentLoaded', function() {
+ // Ablation study: With vs Without Warm-Up
+ const timingDataWarmup = {
+ "With Warm-Up": [
+ { event: "A Model Load", duration: 37.65 },
+ { event: "A Model Warm Up", duration: 2.39 },
+ { event: "B Model Load", duration: 62.69 },
+ { event: "B Model Warm Up", duration: 6 },
+ { event: "A Model Wake up", duration: 0.24 },
+ { event: "A Model Prompt", duration: 0.45 },
+ { event: "A Model Sleep", duration: 0.09 },
+ { event: "B Model Wake Up", duration: 0.89 },
+ { event: "B Model Prompt", duration: 0.93 },
+ { event: "B Model Sleep", duration: 0.47 },
+ { event: "A Model Wake up", duration: 0.23 },
+ { event: "A Model Prompt", duration: 0.43 },
+ { event: "A Model Sleep", duration: 0.1 },
+ { event: "B Model Wake Up", duration: 0.87 },
+ { event: "B Model Prompt", duration: 0.73 },
+ { event: "B Model Sleep", duration: 0.46 },
+ { event: "A Model Wake up", duration: 0.23 },
+ { event: "A Model Prompt", duration: 0.46 },
+ { event: "A Model Sleep", duration: 0.09 },
+ { event: "B Model Wake Up", duration: 0.85 },
+ { event: "B Model Prompt", duration: 0.73 }
+ ],
+ "Without Warm-Up": [
+ { event: "A Model Load", duration: 37.91 },
+ { event: "B Model Load", duration: 63.16 },
+ { event: "A Model Wake up", duration: 0.24 },
+ { event: "A Model Prompt", duration: 2.59 },
+ { event: "A Model Sleep", duration: 0.09 },
+ { event: "B Model Wake Up", duration: 0.91 },
+ { event: "B Model Prompt", duration: 6.61 },
+ { event: "B Model Sleep", duration: 0.44 },
+ { event: "A Model Wake up", duration: 0.26 },
+ { event: "A Model Prompt", duration: 0.41 },
+ { event: "A Model Sleep", duration: 0.09 },
+ { event: "B Model Wake Up", duration: 0.87 },
+ { event: "B Model Prompt", duration: 0.7 },
+ { event: "B Model Sleep", duration: 0.43 },
+ { event: "A Model Wake up", duration: 0.27 },
+ { event: "A Model Prompt", duration: 0.42 },
+ { event: "A Model Sleep", duration: 0.1 },
+ { event: "B Model Wake Up", duration: 0.86 },
+ { event: "B Model Prompt", duration: 0.7 }
+ ]
+ };
+
+ // Convert to segment format
+ function createSegmentsWarmup(timingData) {
+ const segments = [];
+
+ Object.entries(timingData).forEach(([scenario, events]) => {
+ let cumulativeTime = 0;
+
+ events.forEach(({ event, duration }) => {
+ const [who, ...stageParts] = event.split(' ');
+ const stage = stageParts.join(' ');
+
+ let action, category;
+ if (stage.includes('Load')) {
+ action = 'Load';
+ category = `${who} Load`;
+ } else if (stage.includes('Wake')) {
+ action = 'Wake';
+ category = `${who} Wake`;
+ } else if (stage.includes('Prompt')) {
+ action = 'Prompt';
+ category = `${who} Prompt`;
+ } else if (stage.includes('Sleep')) {
+ action = 'Sleep';
+ category = `${who} Sleep`;
+ } else if (stage.includes('Warm')) {
+ action = 'Load';
+ category = `${who} Load`;
+ }
+
+ segments.push({
+ scenario,
+ who,
+ stage,
+ action,
+ start: cumulativeTime,
+ end: cumulativeTime + duration,
+ duration,
+ category
+ });
+
+ cumulativeTime += duration;
+ });
+ });
+
+ return segments;
+ }
+
+ const segmentsWarmup = createSegmentsWarmup(timingDataWarmup);
+ const colorMapWarmup = {"A Load": "#1f77b4", "B Load": "#ff7f0e", "A Wake": "#2ca02c", "B Wake": "#17becf", "A Sleep": "#9467bd", "B Sleep": "#8c564b", "A Prompt": "#e377c2", "B Prompt": "#7f7f7f"};
+ const categoriesWarmup = Object.keys(colorMapWarmup);
+
+ const xWarmup = segmentsWarmup.map(d => d.duration);
+ const baseWarmup = segmentsWarmup.map(d => d.start);
+ const yWarmup = segmentsWarmup.map(d => d.scenario);
+ const colorsWarmup = segmentsWarmup.map(d => colorMapWarmup[d.category]);
+ const customWarmup = segmentsWarmup.map(d => [d.scenario, d.category, d.stage, d.start, d.end]);
+
+ const barsWarmup = {
+ type: "bar",
+ orientation: "h",
+ x: xWarmup, base: baseWarmup, y: yWarmup,
+ marker: { color: colorsWarmup, line: {width:1, color:"rgba(0,0,0,0.35)"} },
+    hovertemplate:
+      "%{customdata[0]}<br>%{customdata[1]} — %{customdata[2]}<br>"+
+      "Start %{customdata[3]:.2f}s → End %{customdata[4]:.2f}s<br>"+
+      "%{x:.2f}s",
+ customdata: customWarmup,
+ showlegend: false
+ };
+
+ const legendTracesWarmup = categoriesWarmup.map(name => ({
+ type: "scatter", mode: "markers", x:[null], y:[null],
+ name, marker: {color: colorMapWarmup[name], size: 10},
+ hoverinfo:"skip", showlegend:true
+ }));
+
+ Plotly.newPlot("plotly-ablation-warmup", [barsWarmup, ...legendTracesWarmup], {
+ barmode: "overlay",
+ bargap: 0.05,
+ margin: {l: 140, r: 30, t: 20, b: 40},
+ xaxis: { title: "Time (seconds)", range: [0, 120] },
+ yaxis: {
+ categoryorder: "array",
+ categoryarray: ["Without Warm-Up", "With Warm-Up"]
+ },
+ hovermode: "closest",
+ dragmode: "pan"
+ }, {displayModeBar: true, responsive: true});
+});
diff --git a/assets/figures/2025-vllm-sleep-mode/plotly-inference-a4000.js b/assets/figures/2025-vllm-sleep-mode/plotly-inference-a4000.js
new file mode 100644
index 0000000..5f1f803
--- /dev/null
+++ b/assets/figures/2025-vllm-sleep-mode/plotly-inference-a4000.js
@@ -0,0 +1,104 @@
+document.addEventListener('DOMContentLoaded', function() {
+ // A4000 Inference data
+ const inferenceDataA4000 = {
+ "ModelA": {
+ name: "Qwen3-0.6B",
+ wake: [0.44, 0.43, 0.43],
+ cold: [2.64, 2.5, 2.63]
+ },
+ "ModelB": {
+ name: "Phi-3-vision-128k(4B)",
+ wake: [2.04, 1.73, 1.61],
+ cold: [9.78, 9.01, 9.79]
+ }
+ };
+
+ function calcStatsInfA4000(values) {
+ const mean = values.reduce((a, b) => a + b, 0) / values.length;
+ const min = Math.min(...values);
+ const max = Math.max(...values);
+ return { mean, errorMinus: mean - min, errorPlus: max - mean };
+ }
+
+ const modelsInfA4000 = Object.keys(inferenceDataA4000);
+ const wakeStatsInfA4000 = modelsInfA4000.map(m => calcStatsInfA4000(inferenceDataA4000[m].wake));
+ const coldStatsInfA4000 = modelsInfA4000.map(m => calcStatsInfA4000(inferenceDataA4000[m].cold));
+
+ const wakeTraceInfA4000 = {
+ x: modelsInfA4000.map(m => inferenceDataA4000[m].name),
+ y: wakeStatsInfA4000.map(s => s.mean),
+ name: "Wake Mode (Warmed Up)",
+ type: "bar",
+ marker: { color: "#2ca02c" },
+ error_y: {
+ type: "data",
+ symmetric: false,
+ array: wakeStatsInfA4000.map(s => s.errorPlus),
+ arrayminus: wakeStatsInfA4000.map(s => s.errorMinus),
+ color: "#1a5e1a",
+ thickness: 2,
+ width: 6
+ },
+ text: wakeStatsInfA4000.map(s => s.mean.toFixed(2) + "s"),
+ textposition: "outside",
+ textfont: { size: 12, color: "#2ca02c", weight: "bold" },
+    hovertemplate: "%{x}<br>Wake Mode: %{y:.2f}s"
+ };
+
+ const coldTraceInfA4000 = {
+ x: modelsInfA4000.map(m => inferenceDataA4000[m].name),
+ y: coldStatsInfA4000.map(s => s.mean),
+ name: "Cold Start (Just Loaded)",
+ type: "bar",
+ marker: { color: "#d62728" },
+ error_y: {
+ type: "data",
+ symmetric: false,
+ array: coldStatsInfA4000.map(s => s.errorPlus),
+ arrayminus: coldStatsInfA4000.map(s => s.errorMinus),
+ color: "#8b1518",
+ thickness: 2,
+ width: 6
+ },
+ text: coldStatsInfA4000.map(s => s.mean.toFixed(2) + "s"),
+ textposition: "outside",
+ textfont: { size: 12, color: "#d62728", weight: "bold" },
+    hovertemplate: "%{x}<br>Cold Start: %{y:.2f}s"
+ };
+
+ const speedupsInfA4000 = wakeStatsInfA4000.map((w, i) => {
+ const reduction = ((coldStatsInfA4000[i].mean - w.mean) / coldStatsInfA4000[i].mean * 100).toFixed(0);
+ return reduction;
+ });
+
+ Plotly.newPlot("plotly-inference-a4000", [wakeTraceInfA4000, coldTraceInfA4000], {
+ barmode: "group",
+ bargap: 0.15,
+ bargroupgap: 0.1,
+ margin: { l: 60, r: 30, t: 40, b: 50 },
+ xaxis: {
+ title: "",
+ tickangle: 0
+ },
+ yaxis: {
+ title: "Inference Time (seconds)",
+ range: [0, Math.max(...coldStatsInfA4000.map(s => s.mean + s.errorPlus)) * 1.2]
+ },
+ hovermode: "closest",
+ legend: {
+ x: 0.5,
+ y: 1.15,
+ xanchor: "center",
+ yanchor: "top",
+ orientation: "h"
+ },
+ annotations: modelsInfA4000.map((m, i) => ({
+ x: inferenceDataA4000[m].name,
+ y: coldStatsInfA4000[i].mean + coldStatsInfA4000[i].errorPlus + 0.6,
+ text: `${speedupsInfA4000[i]}% faster`,
+ showarrow: false,
+ font: { size: 11, color: "#2ca02c", weight: "bold" },
+ xanchor: "center"
+ }))
+ }, {displayModeBar: true, responsive: true});
+});
diff --git a/assets/figures/2025-vllm-sleep-mode/plotly-inference-comparison.js b/assets/figures/2025-vllm-sleep-mode/plotly-inference-comparison.js
new file mode 100644
index 0000000..80afa76
--- /dev/null
+++ b/assets/figures/2025-vllm-sleep-mode/plotly-inference-comparison.js
@@ -0,0 +1,107 @@
+document.addEventListener('DOMContentLoaded', function() {
+ // Raw data: Wake Inference Time vs Cold Start Inference Time
+ const inferenceData = {
+ "ModelA": {
+ name: "Qwen3-235B-A22B (TP=4)",
+ wake: [1.8, 1.7, 0.92],
+ cold: [3.8, 3.7, 3.72]
+ },
+ "ModelB": {
+ name: "Qwen3-Coder-30B (TP=1)",
+ wake: [1.0, 0.93, 0.54],
+ cold: [3.7, 2.9, 2.45]
+ }
+ };
+
+ // Calculate mean and error bars for each model
+ function calcStats(values) {
+ const mean = values.reduce((a, b) => a + b, 0) / values.length;
+ const min = Math.min(...values);
+ const max = Math.max(...values);
+ return { mean, errorMinus: mean - min, errorPlus: max - mean };
+ }
+
+ // Prepare traces for both models
+ const models = Object.keys(inferenceData);
+ const wakeStats = models.map(m => calcStats(inferenceData[m].wake));
+ const coldStats = models.map(m => calcStats(inferenceData[m].cold));
+
+ const wakeTrace = {
+ x: models.map(m => inferenceData[m].name),
+ y: wakeStats.map(s => s.mean),
+ name: "Wake Mode (Warmed Up)",
+ type: "bar",
+ marker: { color: "#2ca02c" },
+ error_y: {
+ type: "data",
+ symmetric: false,
+ array: wakeStats.map(s => s.errorPlus),
+ arrayminus: wakeStats.map(s => s.errorMinus),
+ color: "#1a5e1a",
+ thickness: 2,
+ width: 6
+ },
+ text: wakeStats.map(s => s.mean.toFixed(2) + "s"),
+ textposition: "outside",
+ textfont: { size: 12, color: "#2ca02c", weight: "bold" },
+    hovertemplate: "%{x}<br>Wake Mode: %{y:.2f}s"
+ };
+
+ const coldTrace = {
+ x: models.map(m => inferenceData[m].name),
+ y: coldStats.map(s => s.mean),
+ name: "Cold Start (Just Loaded)",
+ type: "bar",
+ marker: { color: "#d62728" },
+ error_y: {
+ type: "data",
+ symmetric: false,
+ array: coldStats.map(s => s.errorPlus),
+ arrayminus: coldStats.map(s => s.errorMinus),
+ color: "#8b1518",
+ thickness: 2,
+ width: 6
+ },
+ text: coldStats.map(s => s.mean.toFixed(2) + "s"),
+ textposition: "outside",
+ textfont: { size: 12, color: "#d62728", weight: "bold" },
+    hovertemplate: "%{x}<br>Cold Start: %{y:.2f}s"
+ };
+
+ // Calculate speedup percentages for annotation
+ const speedups = wakeStats.map((w, i) => {
+ const reduction = ((coldStats[i].mean - w.mean) / coldStats[i].mean * 100).toFixed(0);
+ return reduction;
+ });
+
+ Plotly.newPlot("plotly-inference-comparison", [wakeTrace, coldTrace], {
+ barmode: "group",
+ bargap: 0.15,
+ bargroupgap: 0.1,
+ margin: { l: 60, r: 30, t: 40, b: 50 },
+ xaxis: {
+ title: "",
+ tickangle: 0
+ },
+ yaxis: {
+ title: "Inference Time (seconds)",
+ range: [0, Math.max(...coldStats.map(s => s.mean + s.errorPlus)) * 1.2]
+ },
+ hovermode: "closest",
+ legend: {
+ x: 0.5,
+ y: 1.15,
+ xanchor: "center",
+ yanchor: "top",
+ orientation: "h"
+ },
+ annotations: models.map((m, i) => ({
+ x: inferenceData[m].name,
+ y: coldStats[i].mean + coldStats[i].errorPlus + 0.3,
+ text: `${speedups[i]}% faster`,
+ showarrow: false,
+ font: { size: 11, color: "#2ca02c", weight: "bold" },
+ xanchor: "center"
+ }))
+ }, {displayModeBar: true, responsive: true});
+});
diff --git a/assets/figures/2025-vllm-sleep-mode/plotly-level2-inference.js b/assets/figures/2025-vllm-sleep-mode/plotly-level2-inference.js
new file mode 100644
index 0000000..48082f2
--- /dev/null
+++ b/assets/figures/2025-vllm-sleep-mode/plotly-level2-inference.js
@@ -0,0 +1,104 @@
+document.addEventListener('DOMContentLoaded', function() {
+ // Level 2 inference data
+ const level2InferenceData = {
+ "ModelA": {
+ name: "Qwen3-0.6B",
+ wake: [0.68, 0.46, 0.44],
+ cold: [4.66, 3.8, 2.56]
+ },
+ "ModelB": {
+ name: "Phi-3-vision-128k",
+ wake: [0.78, 0.77, 0.72],
+ cold: [6.55, 6.21, 6.15]
+ }
+ };
+
+ function calcStatsLevel2Inf(values) {
+ const mean = values.reduce((a, b) => a + b, 0) / values.length;
+ const min = Math.min(...values);
+ const max = Math.max(...values);
+ return { mean, errorMinus: mean - min, errorPlus: max - mean };
+ }
+
+ const modelsLevel2Inf = Object.keys(level2InferenceData);
+ const wakeStatsLevel2Inf = modelsLevel2Inf.map(m => calcStatsLevel2Inf(level2InferenceData[m].wake));
+ const coldStatsLevel2Inf = modelsLevel2Inf.map(m => calcStatsLevel2Inf(level2InferenceData[m].cold));
+
+ const wakeTraceLevel2Inf = {
+ x: modelsLevel2Inf.map(m => level2InferenceData[m].name),
+ y: wakeStatsLevel2Inf.map(s => s.mean),
+ name: "Wake Mode (Level 2)",
+ type: "bar",
+ marker: { color: "#2ca02c" },
+ error_y: {
+ type: "data",
+ symmetric: false,
+ array: wakeStatsLevel2Inf.map(s => s.errorPlus),
+ arrayminus: wakeStatsLevel2Inf.map(s => s.errorMinus),
+ color: "#1a5e1a",
+ thickness: 2,
+ width: 6
+ },
+ text: wakeStatsLevel2Inf.map(s => s.mean.toFixed(2) + "s"),
+ textposition: "outside",
+ textfont: { size: 12, color: "#2ca02c", weight: "bold" },
+    hovertemplate: "%{x}<br>Wake Mode: %{y:.2f}s"
+ };
+
+ const coldTraceLevel2Inf = {
+ x: modelsLevel2Inf.map(m => level2InferenceData[m].name),
+ y: coldStatsLevel2Inf.map(s => s.mean),
+ name: "Cold Start",
+ type: "bar",
+ marker: { color: "#d62728" },
+ error_y: {
+ type: "data",
+ symmetric: false,
+ array: coldStatsLevel2Inf.map(s => s.errorPlus),
+ arrayminus: coldStatsLevel2Inf.map(s => s.errorMinus),
+ color: "#8b1518",
+ thickness: 2,
+ width: 6
+ },
+ text: coldStatsLevel2Inf.map(s => s.mean.toFixed(2) + "s"),
+ textposition: "outside",
+ textfont: { size: 12, color: "#d62728", weight: "bold" },
+    hovertemplate: "%{x}<br>Cold Start: %{y:.2f}s"
+ };
+
+ const speedupsLevel2Inf = wakeStatsLevel2Inf.map((w, i) => {
+ const reduction = ((coldStatsLevel2Inf[i].mean - w.mean) / coldStatsLevel2Inf[i].mean * 100).toFixed(0);
+ return reduction;
+ });
+
+ Plotly.newPlot("plotly-level2-inference", [wakeTraceLevel2Inf, coldTraceLevel2Inf], {
+ barmode: "group",
+ bargap: 0.15,
+ bargroupgap: 0.1,
+ margin: { l: 60, r: 30, t: 40, b: 50 },
+ xaxis: {
+ title: "",
+ tickangle: 0
+ },
+ yaxis: {
+ title: "Inference Time (seconds)",
+ range: [0, Math.max(...coldStatsLevel2Inf.map(s => s.mean + s.errorPlus)) * 1.2]
+ },
+ hovermode: "closest",
+ legend: {
+ x: 0.5,
+ y: 1.15,
+ xanchor: "center",
+ yanchor: "top",
+ orientation: "h"
+ },
+ annotations: modelsLevel2Inf.map((m, i) => ({
+ x: level2InferenceData[m].name,
+ y: coldStatsLevel2Inf[i].mean + coldStatsLevel2Inf[i].errorPlus + 0.4,
+ text: `${speedupsLevel2Inf[i]}% faster`,
+ showarrow: false,
+ font: { size: 11, color: "#2ca02c", weight: "bold" },
+ xanchor: "center"
+ }))
+ }, {displayModeBar: true, responsive: true});
+});
diff --git a/assets/figures/2025-vllm-sleep-mode/plotly-level2-switching.js b/assets/figures/2025-vllm-sleep-mode/plotly-level2-switching.js
new file mode 100644
index 0000000..87d7c18
--- /dev/null
+++ b/assets/figures/2025-vllm-sleep-mode/plotly-level2-switching.js
@@ -0,0 +1,104 @@
+document.addEventListener('DOMContentLoaded', function() {
+ // Level 2 switching data
+ const level2SwitchingData = {
+ "ModelA": {
+ name: "Qwen3-0.6B",
+ wake: [0.91, 0.78, 0.85],
+ cold: [38.53, 37.21, 38.15]
+ },
+ "ModelB": {
+ name: "Phi-3-vision-128k",
+ wake: [2.55, 2.62, 2.58],
+ cold: [58.52, 57.65, 58.2]
+ }
+ };
+
+ function calcStatsLevel2Switch(values) {
+ const mean = values.reduce((a, b) => a + b, 0) / values.length;
+ const min = Math.min(...values);
+ const max = Math.max(...values);
+ return { mean, errorMinus: mean - min, errorPlus: max - mean };
+ }
+
+ const modelsLevel2Switch = Object.keys(level2SwitchingData);
+ const wakeStatsLevel2Switch = modelsLevel2Switch.map(m => calcStatsLevel2Switch(level2SwitchingData[m].wake));
+ const coldStatsLevel2Switch = modelsLevel2Switch.map(m => calcStatsLevel2Switch(level2SwitchingData[m].cold));
+
+ const wakeTraceLevel2Switch = {
+ x: modelsLevel2Switch.map(m => level2SwitchingData[m].name),
+ y: wakeStatsLevel2Switch.map(s => s.mean),
+ name: "Wake from Sleep (Level 2)",
+ type: "bar",
+ marker: { color: "#2ca02c" },
+ error_y: {
+ type: "data",
+ symmetric: false,
+ array: wakeStatsLevel2Switch.map(s => s.errorPlus),
+ arrayminus: wakeStatsLevel2Switch.map(s => s.errorMinus),
+ color: "#1a5e1a",
+ thickness: 2,
+ width: 6
+ },
+ text: wakeStatsLevel2Switch.map(s => s.mean.toFixed(2) + "s"),
+ textposition: "outside",
+ textfont: { size: 12, color: "#2ca02c", weight: "bold" },
+    hovertemplate: "%{x}<br>Wake Time: %{y:.2f}s"
+ };
+
+ const coldTraceLevel2Switch = {
+ x: modelsLevel2Switch.map(m => level2SwitchingData[m].name),
+ y: coldStatsLevel2Switch.map(s => s.mean),
+ name: "Cold Start (Fresh Load)",
+ type: "bar",
+ marker: { color: "#d62728" },
+ error_y: {
+ type: "data",
+ symmetric: false,
+ array: coldStatsLevel2Switch.map(s => s.errorPlus),
+ arrayminus: coldStatsLevel2Switch.map(s => s.errorMinus),
+ color: "#8b1518",
+ thickness: 2,
+ width: 6
+ },
+ text: coldStatsLevel2Switch.map(s => s.mean.toFixed(1) + "s"),
+ textposition: "outside",
+ textfont: { size: 12, color: "#d62728", weight: "bold" },
+    hovertemplate: "%{x}<br>Cold Start: %{y:.2f}s"
+ };
+
+ const speedupsLevel2Switch = wakeStatsLevel2Switch.map((w, i) => {
+ const speedup = (coldStatsLevel2Switch[i].mean / w.mean).toFixed(0);
+ return speedup;
+ });
+
+ Plotly.newPlot("plotly-level2-switching", [wakeTraceLevel2Switch, coldTraceLevel2Switch], {
+ barmode: "group",
+ bargap: 0.15,
+ bargroupgap: 0.1,
+ margin: { l: 60, r: 30, t: 40, b: 50 },
+ xaxis: {
+ title: "",
+ tickangle: 0
+ },
+ yaxis: {
+ title: "Switching Time (seconds)",
+ range: [0, Math.max(...coldStatsLevel2Switch.map(s => s.mean + s.errorPlus)) * 1.15]
+ },
+ hovermode: "closest",
+ legend: {
+ x: 0.5,
+ y: 1.15,
+ xanchor: "center",
+ yanchor: "top",
+ orientation: "h"
+ },
+ annotations: modelsLevel2Switch.map((m, i) => ({
+ x: level2SwitchingData[m].name,
+ y: coldStatsLevel2Switch[i].mean + coldStatsLevel2Switch[i].errorPlus + 3,
+ text: `${speedupsLevel2Switch[i]}x faster`,
+ showarrow: false,
+ font: { size: 11, color: "#2ca02c", weight: "bold" },
+ xanchor: "center"
+ }))
+ }, {displayModeBar: true, responsive: true});
+});
diff --git a/assets/figures/2025-vllm-sleep-mode/plotly-sleep-levels-comparison.js b/assets/figures/2025-vllm-sleep-mode/plotly-sleep-levels-comparison.js
new file mode 100644
index 0000000..9de47fc
--- /dev/null
+++ b/assets/figures/2025-vllm-sleep-mode/plotly-sleep-levels-comparison.js
@@ -0,0 +1,154 @@
+document.addEventListener('DOMContentLoaded', function() {
+ // Sleep Levels Comparison timing data
+ const timingDataLevelsComp = {
+ "Sleep Mode (Level 1)": [
+ { event: "A Model Load", duration: 36.27 },
+ { event: "A Model Warm Up", duration: 2.53 },
+ { event: "B Model Load", duration: 58.24 },
+ { event: "B Model Warm Up", duration: 5.95 },
+ { event: "A Model Wake up", duration: 0.25 },
+ { event: "A Model Prompt", duration: 0.43 },
+ { event: "A Model Sleep", duration: 0.09 },
+ { event: "B Model Wake Up", duration: 0.82 },
+ { event: "B Model Prompt", duration: 0.86 },
+ { event: "B Model Sleep", duration: 0.41 },
+ { event: "A Model Wake up", duration: 0.28 },
+ { event: "A Model Prompt", duration: 0.41 },
+ { event: "A Model Sleep", duration: 0.1 },
+ { event: "B Model Wake Up", duration: 0.82 },
+ { event: "B Model Prompt", duration: 0.71 },
+ { event: "B Model Sleep", duration: 0.42 },
+ { event: "A Model Wake up", duration: 0.25 },
+ { event: "A Model Prompt", duration: 0.45 },
+ { event: "A Model Sleep", duration: 0.09 },
+ { event: "B Model Wake Up", duration: 0.83 },
+ { event: "B Model Prompt", duration: 0.71 }
+ ],
+ "Sleep Mode (Level 2)": [
+ { event: "A Model Load", duration: 38.55 },
+ { event: "A Model Warm Up", duration: 2.53 },
+ { event: "B Model Load", duration: 61.23 },
+ { event: "B Model Warm Up", duration: 5.75 },
+ { event: "A Model Wake up", duration: 0.91 },
+ { event: "A Model Prompt", duration: 0.68 },
+ { event: "A Model Sleep", duration: 0.13 },
+ { event: "B Model Wake Up", duration: 2.55 },
+ { event: "B Model Prompt", duration: 0.78 },
+ { event: "B Model Sleep", duration: 0.46 },
+ { event: "A Model Wake up", duration: 0.78 },
+ { event: "A Model Prompt", duration: 0.46 },
+ { event: "A Model Sleep", duration: 0.12 },
+ { event: "B Model Wake Up", duration: 2.62 },
+ { event: "B Model Prompt", duration: 0.77 },
+ { event: "B Model Sleep", duration: 0.45 },
+ { event: "A Model Wake up", duration: 0.85 },
+ { event: "A Model Prompt", duration: 0.44 },
+ { event: "A Model Sleep", duration: 0.09 },
+ { event: "B Model Wake Up", duration: 2.58 },
+ { event: "B Model Prompt", duration: 0.72 }
+ ],
+ "WITHOUT Sleep Mode": [
+ { event: "A Model Load", duration: 38.53 },
+ { event: "A Model Prompt", duration: 4.66 },
+ { event: "B Model Load", duration: 58.52 },
+ { event: "B Model Prompt", duration: 6.55 },
+ { event: "A Model Load", duration: 37.21 },
+ { event: "A Model Prompt", duration: 3.8 },
+ { event: "B Model Load", duration: 57.65 },
+ { event: "B Model Prompt", duration: 6.21 },
+ { event: "A Model Load", duration: 38.15 },
+ { event: "A Model Prompt", duration: 2.56 },
+ { event: "B Model Load", duration: 58.2 },
+ { event: "B Model Prompt", duration: 6.15 }
+ ]
+ };
+
+ // Convert to segment format
+ function createSegmentsLevelsComp(timingData) {
+ const segments = [];
+
+ Object.entries(timingData).forEach(([scenario, events]) => {
+ let cumulativeTime = 0;
+
+ events.forEach(({ event, duration }) => {
+ const [who, ...stageParts] = event.split(' ');
+ const stage = stageParts.join(' ');
+
+ let action, category;
+ if (stage.includes('Load')) {
+ action = 'Load';
+ category = `${who} Load`;
+ } else if (stage.includes('Wake')) {
+ action = 'Wake';
+ category = `${who} Wake`;
+ } else if (stage.includes('Prompt')) {
+ action = 'Prompt';
+ category = `${who} Prompt`;
+ } else if (stage.includes('Sleep')) {
+ action = 'Sleep';
+ category = `${who} Sleep`;
+ } else if (stage.includes('Warm')) {
+ action = 'Load';
+ category = `${who} Load`;
+ }
+
+ segments.push({
+ scenario,
+ who,
+ stage,
+ action,
+ start: cumulativeTime,
+ end: cumulativeTime + duration,
+ duration,
+ category
+ });
+
+ cumulativeTime += duration;
+ });
+ });
+
+ return segments;
+ }
+
+ const segmentsLevelsComp = createSegmentsLevelsComp(timingDataLevelsComp);
+ const colorMapLevelsComp = {"A Load": "#1f77b4", "B Load": "#ff7f0e", "A Wake": "#2ca02c", "B Wake": "#17becf", "A Sleep": "#9467bd", "B Sleep": "#8c564b", "A Prompt": "#e377c2", "B Prompt": "#7f7f7f"};
+ const categoriesLevelsComp = Object.keys(colorMapLevelsComp);
+
+ const xLevelsComp = segmentsLevelsComp.map(d => d.duration);
+ const baseLevelsComp = segmentsLevelsComp.map(d => d.start);
+ const yLevelsComp = segmentsLevelsComp.map(d => d.scenario);
+ const colorsLevelsComp = segmentsLevelsComp.map(d => colorMapLevelsComp[d.category]);
+ const customLevelsComp = segmentsLevelsComp.map(d => [d.scenario, d.category, d.stage, d.start, d.end]);
+
+ const barsLevelsComp = {
+ type: "bar",
+ orientation: "h",
+ x: xLevelsComp, base: baseLevelsComp, y: yLevelsComp,
+ marker: { color: colorsLevelsComp, line: {width:1, color:"rgba(0,0,0,0.35)"} },
+    hovertemplate:
+      "%{customdata[0]}<br>%{customdata[1]} — %{customdata[2]}<br>"+
+      "Start %{customdata[3]:.2f}s → End %{customdata[4]:.2f}s<br>"+
+      "%{x:.2f}s",
+ customdata: customLevelsComp,
+ showlegend: false
+ };
+
+ const legendTracesLevelsComp = categoriesLevelsComp.map(name => ({
+ type: "scatter", mode: "markers", x:[null], y:[null],
+ name, marker: {color: colorMapLevelsComp[name], size: 10},
+ hoverinfo:"skip", showlegend:true
+ }));
+
+ Plotly.newPlot("plotly-sleep-levels-comparison", [barsLevelsComp, ...legendTracesLevelsComp], {
+ barmode: "overlay",
+ bargap: 0.05,
+ margin: {l: 160, r: 30, t: 20, b: 40},
+ xaxis: { title: "Time (seconds)", range: [0, 365] },
+ yaxis: {
+ categoryorder: "array",
+ categoryarray: ["WITHOUT Sleep Mode", "Sleep Mode (Level 2)", "Sleep Mode (Level 1)"]
+ },
+ hovermode: "closest",
+ dragmode: "pan"
+ }, {displayModeBar: true, responsive: true});
+});
diff --git a/assets/figures/2025-vllm-sleep-mode/plotly-sleep-mode-a4000.js b/assets/figures/2025-vllm-sleep-mode/plotly-sleep-mode-a4000.js
new file mode 100644
index 0000000..d029412
--- /dev/null
+++ b/assets/figures/2025-vllm-sleep-mode/plotly-sleep-mode-a4000.js
@@ -0,0 +1,134 @@
+document.addEventListener('DOMContentLoaded', function() {
+ // A4000 GPU timing data
+ const timingDataA4000 = {
+ "WITH Sleep Mode (L1)": [
+ { event: "A Model Load", duration: 21.01 },
+ { event: "A Model Warm up", duration: 2.49 },
+ { event: "B Model Load", duration: 46.01 },
+ { event: "B Model Warm up", duration: 7.37 },
+ { event: "A Model Wake up", duration: 0.11 },
+ { event: "A Model Prompt", duration: 0.44 },
+ { event: "A Model Sleep", duration: 0.13 },
+ { event: "B Model Wake Up", duration: 0.8 },
+ { event: "B Model Prompt", duration: 2.04 },
+ { event: "B Model Sleep", duration: 0.68 },
+ { event: "A Model Wake up", duration: 0.1 },
+ { event: "A Model Prompt", duration: 0.43 },
+ { event: "A Model Sleep", duration: 0.13 },
+ { event: "B Model Wake Up", duration: 0.8 },
+ { event: "B Model Prompt", duration: 1.73 },
+ { event: "B Model Sleep", duration: 0.68 },
+ { event: "A Model Wake up", duration: 0.1 },
+ { event: "A Model Prompt", duration: 0.43 },
+ { event: "A Model Sleep", duration: 0.13 },
+ { event: "B Model Wake Up", duration: 0.8 },
+ { event: "B Model Prompt", duration: 1.61 }
+ ],
+ "WITHOUT Sleep Mode": [
+ { event: "A Model Load", duration: 21.04 },
+ { event: "A Model Prompt", duration: 2.64 },
+ { event: "B Model Load", duration: 46.01 },
+ { event: "B Model Prompt", duration: 9.78 },
+ { event: "A Model Load", duration: 20.98 },
+ { event: "A Model Prompt", duration: 2.5 },
+ { event: "B Model Load", duration: 46.02 },
+ { event: "B Model Prompt", duration: 9.01 },
+ { event: "A Model Load", duration: 20.98 },
+ { event: "A Model Prompt", duration: 2.63 },
+ { event: "B Model Load", duration: 46.02 },
+ { event: "B Model Prompt", duration: 9.79 }
+ ]
+ };
+
+ // Convert simplified data to full segment format
+ function createSegmentsA4000(timingData) {
+ const segments = [];
+
+ Object.entries(timingData).forEach(([scenario, events]) => {
+ let cumulativeTime = 0;
+
+ events.forEach(({ event, duration }) => {
+ const [who, ...stageParts] = event.split(' ');
+ const stage = stageParts.join(' ');
+
+ // Determine action and category from stage
+ let action, category;
+ if (stage.includes('Load')) {
+ action = 'Load';
+ category = `${who} Load`;
+ } else if (stage.includes('Wake')) {
+ action = 'Wake';
+ category = `${who} Wake`;
+ } else if (stage.includes('Prompt')) {
+ action = 'Prompt';
+ category = `${who} Prompt`;
+ } else if (stage.includes('Sleep')) {
+ action = 'Sleep';
+ category = `${who} Sleep`;
+ } else if (stage.includes('Warm up')) {
+ action = 'Load';
+ category = `${who} Load`;
+ }
+
+ segments.push({
+ scenario,
+ who,
+ stage,
+ action,
+ start: cumulativeTime,
+ end: cumulativeTime + duration,
+ duration,
+ category
+ });
+
+ cumulativeTime += duration;
+ });
+ });
+
+ return segments;
+ }
+
+ const segmentsA4000 = createSegmentsA4000(timingDataA4000);
+ const colorMapA4000 = {"A Load": "#1f77b4", "B Load": "#ff7f0e", "A Wake": "#2ca02c", "B Wake": "#17becf", "A Sleep": "#9467bd", "B Sleep": "#8c564b", "A Prompt": "#e377c2", "B Prompt": "#7f7f7f"};
+ const categoriesA4000 = Object.keys(colorMapA4000);
+
+ // Build arrays for a single stacked-horizontal bar trace using "base"
+ const xA4000 = segmentsA4000.map(d => d.duration);
+ const baseA4000 = segmentsA4000.map(d => d.start);
+ const yA4000 = segmentsA4000.map(d => d.scenario);
+ const colorsA4000 = segmentsA4000.map(d => colorMapA4000[d.category]);
+ const customA4000 = segmentsA4000.map(d => [d.scenario, d.category, d.stage, d.start, d.end]);
+
+ const barsA4000 = {
+ type: "bar",
+ orientation: "h",
+ x: xA4000, base: baseA4000, y: yA4000,
+ marker: { color: colorsA4000, line: {width:1, color:"rgba(0,0,0,0.35)"} },
+    hovertemplate:
+      "%{customdata[0]}<br>%{customdata[1]} — %{customdata[2]}<br>"+
+      "Start %{customdata[3]:.2f}s → End %{customdata[4]:.2f}s<br>"+
+      "%{x:.2f}s",
+ customdata: customA4000,
+ showlegend: false
+ };
+
+ // Legend-only dummies to produce a clean 8-item legend
+ const legendTracesA4000 = categoriesA4000.map(name => ({
+ type: "scatter", mode: "markers", x:[null], y:[null],
+ name, marker: {color: colorMapA4000[name], size: 10},
+ hoverinfo:"skip", showlegend:true
+ }));
+
+ Plotly.newPlot("plotly-sleep-mode-a4000", [barsA4000, ...legendTracesA4000], {
+ barmode: "overlay",
+ bargap: 0.05,
+ margin: {l: 140, r: 30, t: 20, b: 40},
+ xaxis: { title: "Time (seconds)", range: [0, 235] },
+ yaxis: {
+ categoryorder: "array",
+ categoryarray: ["WITHOUT Sleep Mode", "WITH Sleep Mode (L1)"]
+ },
+ hovermode: "closest",
+ dragmode: "pan"
+ }, {displayModeBar: true, responsive: true});
+});
diff --git a/assets/figures/2025-vllm-sleep-mode/plotly-sleep-mode.js b/assets/figures/2025-vllm-sleep-mode/plotly-sleep-mode.js
new file mode 100644
index 0000000..ef92aa3
--- /dev/null
+++ b/assets/figures/2025-vllm-sleep-mode/plotly-sleep-mode.js
@@ -0,0 +1,131 @@
+document.addEventListener('DOMContentLoaded', function() {
+ const timingData = {
+ "WITH Sleep Mode (L1)": [
+ { event: "A Model Load", duration: 97.61 },
+ { event: "A Model Warm up", duration: 2.38 },
+ { event: "B Model Load", duration: 47.63 },
+ { event: "B Model Warm up", duration: 2.42 },
+ { event: "A Model Wake up", duration: 5.66 },
+ { event: "A Model Prompt", duration: 1.8 },
+ { event: "A Model Sleep", duration: 6.01 },
+ { event: "B Model Wake Up", duration: 2.89 },
+ { event: "B Model Prompt", duration: 1 },
+ { event: "B Model Sleep", duration: 2.78 },
+ { event: "A Model Wake up", duration: 5.29 },
+ { event: "A Model Prompt", duration: 1.7 },
+ { event: "A Model Sleep", duration: 5.78 },
+ { event: "B Model Wake Up", duration: 2.86 },
+ { event: "B Model Prompt", duration: 0.93 },
+ { event: "B Model Sleep", duration: 2.78 },
+ { event: "A Model Wake up", duration: 5.27 },
+ { event: "A Model Prompt", duration: 0.92 },
+ { event: "A Model Sleep", duration: 5.89 },
+ { event: "B Model Wake Up", duration: 2.85 },
+ { event: "B Model Prompt", duration: 0.54 }
+ ],
+ "WITHOUT Sleep Mode": [
+ { event: "A Model Load", duration: 97.9 },
+ { event: "A Model Prompt", duration: 3.8 },
+ { event: "B Model Load", duration: 47.33 },
+ { event: "B Model Prompt", duration: 3.7 },
+ { event: "A Model Load", duration: 97.4 },
+ { event: "A Model Prompt", duration: 3.7 },
+ { event: "B Model Load", duration: 47.47 },
+ { event: "B Model Prompt", duration: 2.9 },
+ { event: "A Model Load", duration: 97.71 },
+ { event: "A Model Prompt", duration: 3.72 },
+ { event: "B Model Load", duration: 47.46 },
+ { event: "B Model Prompt", duration: 2.45 }
+ ]
+ };
+
+ function createSegments(timingData) {
+ const segments = [];
+
+ Object.entries(timingData).forEach(([scenario, events]) => {
+ let cumulativeTime = 0;
+
+ events.forEach(({ event, duration }) => {
+ const [who, ...stageParts] = event.split(' ');
+ const stage = stageParts.join(' ');
+
+ // Determine action and category from stage
+ let action, category;
+ if (stage.includes('Load')) {
+ action = 'Load';
+ category = `${who} Load`;
+ } else if (stage.includes('Wake')) {
+ action = 'Wake';
+ category = `${who} Wake`;
+ } else if (stage.includes('Prompt')) {
+ action = 'Prompt';
+ category = `${who} Prompt`;
+ } else if (stage.includes('Sleep')) {
+ action = 'Sleep';
+ category = `${who} Sleep`;
+ } else if (stage.includes('Warm up')) {
+ action = 'Load';
+ category = `${who} Load`;
+ } else {
+ // Fallback: keep action/category defined so the colorMap lookup never gets undefined
+ action = stage;
+ category = `${who} ${stage}`;
+ }
+
+ segments.push({
+ scenario,
+ who,
+ stage,
+ action,
+ start: cumulativeTime,
+ end: cumulativeTime + duration,
+ duration,
+ category
+ });
+
+ cumulativeTime += duration;
+ });
+ });
+
+ return segments;
+ }
+
+ const segments = createSegments(timingData);
+ const colorMap = {"A Load": "#1f77b4", "B Load": "#ff7f0e", "A Wake": "#2ca02c", "B Wake": "#17becf", "A Sleep": "#9467bd", "B Sleep": "#8c564b", "A Prompt": "#e377c2", "B Prompt": "#7f7f7f"};
+ const categories = Object.keys(colorMap);
+
+ // Build arrays for a single stacked-horizontal bar trace using "base"
+ const x = segments.map(d => d.duration);
+ const base = segments.map(d => d.start);
+ const y = segments.map(d => d.scenario);
+ const colors = segments.map(d => colorMap[d.category]);
+ const custom = segments.map(d => [d.scenario, d.category, d.stage, d.start, d.end]);
+
+ const bars = {
+ type: "bar",
+ orientation: "h",
+ x, base, y,
+ marker: { color: colors, line: {width:1, color:"rgba(0,0,0,0.35)"} },
+ hovertemplate:
+ "%{customdata[0]}<br>%{customdata[1]} — %{customdata[2]}<br>" +
+ "Start %{customdata[3]:.2f}s → End %{customdata[4]:.2f}s<br>" +
+ "%{x:.2f}s",
+ customdata: custom,
+ showlegend: false
+ };
+
+ const legendTraces = categories.map(name => ({
+ type: "scatter", mode: "markers", x:[null], y:[null],
+ name, marker: {color: colorMap[name], size: 10},
+ hoverinfo:"skip", showlegend:true
+ }));
+
+ Plotly.newPlot("plotly-sleep-mode", [bars, ...legendTraces], {
+ barmode: "overlay",
+ bargap: 0.05,
+ margin: {l: 140, r: 30, t: 20, b: 40},
+ xaxis: { title: "Time (seconds)", range: [0, 478.32] },
+ yaxis: {
+ categoryorder: "array",
+ categoryarray: ["WITHOUT Sleep Mode", "WITH Sleep Mode (L1)"]
+ },
+ hovermode: "closest",
+ dragmode: "pan"
+ }, {displayModeBar: true, responsive: true});
+});
diff --git a/assets/figures/2025-vllm-sleep-mode/plotly-switching-a4000.js b/assets/figures/2025-vllm-sleep-mode/plotly-switching-a4000.js
new file mode 100644
index 0000000..4013f62
--- /dev/null
+++ b/assets/figures/2025-vllm-sleep-mode/plotly-switching-a4000.js
@@ -0,0 +1,104 @@
+document.addEventListener('DOMContentLoaded', function() {
+ // A4000 Switching data
+ const switchingDataA4000 = {
+ "ModelA": {
+ name: "Qwen3-0.6B",
+ wake: [0.11, 0.1, 0.1],
+ cold: [21.04, 20.98, 20.98]
+ },
+ "ModelB": {
+ name: "Phi-3-vision-128k(4B)",
+ wake: [0.8, 0.8, 0.8],
+ cold: [46.01, 46.02, 46.02]
+ }
+ };
+
+ function calcStatsSwitchA4000(values) {
+ const mean = values.reduce((a, b) => a + b, 0) / values.length;
+ const min = Math.min(...values);
+ const max = Math.max(...values);
+ return { mean, errorMinus: mean - min, errorPlus: max - mean };
+ }
+
+ const modelsSwitchA4000 = Object.keys(switchingDataA4000);
+ const wakeStatsSwitchA4000 = modelsSwitchA4000.map(m => calcStatsSwitchA4000(switchingDataA4000[m].wake));
+ const coldStatsSwitchA4000 = modelsSwitchA4000.map(m => calcStatsSwitchA4000(switchingDataA4000[m].cold));
+
+ const wakeTraceSwitchA4000 = {
+ x: modelsSwitchA4000.map(m => switchingDataA4000[m].name),
+ y: wakeStatsSwitchA4000.map(s => s.mean),
+ name: "Wake from Sleep",
+ type: "bar",
+ marker: { color: "#2ca02c" },
+ error_y: {
+ type: "data",
+ symmetric: false,
+ array: wakeStatsSwitchA4000.map(s => s.errorPlus),
+ arrayminus: wakeStatsSwitchA4000.map(s => s.errorMinus),
+ color: "#1a5e1a",
+ thickness: 2,
+ width: 6
+ },
+ text: wakeStatsSwitchA4000.map(s => s.mean.toFixed(2) + "s"),
+ textposition: "outside",
+ textfont: { size: 12, color: "#2ca02c", weight: "bold" },
+ hovertemplate: "%{x}<br>Wake Time: %{y:.2f}s"
+ };
+
+ const coldTraceSwitchA4000 = {
+ x: modelsSwitchA4000.map(m => switchingDataA4000[m].name),
+ y: coldStatsSwitchA4000.map(s => s.mean),
+ name: "Cold Start (Fresh Load)",
+ type: "bar",
+ marker: { color: "#d62728" },
+ error_y: {
+ type: "data",
+ symmetric: false,
+ array: coldStatsSwitchA4000.map(s => s.errorPlus),
+ arrayminus: coldStatsSwitchA4000.map(s => s.errorMinus),
+ color: "#8b1518",
+ thickness: 2,
+ width: 6
+ },
+ text: coldStatsSwitchA4000.map(s => s.mean.toFixed(1) + "s"),
+ textposition: "outside",
+ textfont: { size: 12, color: "#d62728", weight: "bold" },
+ hovertemplate: "%{x}<br>Cold Start: %{y:.2f}s"
+ };
+
+ // Calculate speedup multiples for annotation
+ const speedupsSwitchA4000 = wakeStatsSwitchA4000.map((w, i) => {
+ const speedup = (coldStatsSwitchA4000[i].mean / w.mean).toFixed(0);
+ return speedup;
+ });
+
+ Plotly.newPlot("plotly-switching-a4000", [wakeTraceSwitchA4000, coldTraceSwitchA4000], {
+ barmode: "group",
+ bargap: 0.15,
+ bargroupgap: 0.1,
+ margin: { l: 60, r: 30, t: 40, b: 50 },
+ xaxis: {
+ title: "",
+ tickangle: 0
+ },
+ yaxis: {
+ title: "Switching Time (seconds)",
+ range: [0, Math.max(...coldStatsSwitchA4000.map(s => s.mean + s.errorPlus)) * 1.15]
+ },
+ hovermode: "closest",
+ legend: {
+ x: 0.5,
+ y: 1.15,
+ xanchor: "center",
+ yanchor: "top",
+ orientation: "h"
+ },
+ annotations: modelsSwitchA4000.map((m, i) => ({
+ x: switchingDataA4000[m].name,
+ y: coldStatsSwitchA4000[i].mean + coldStatsSwitchA4000[i].errorPlus + 3,
+ text: `${speedupsSwitchA4000[i]}x faster`,
+ showarrow: false,
+ font: { size: 11, color: "#2ca02c", weight: "bold" },
+ xanchor: "center"
+ }))
+ }, {displayModeBar: true, responsive: true});
+});
diff --git a/assets/figures/2025-vllm-sleep-mode/plotly-switching-comparison.js b/assets/figures/2025-vllm-sleep-mode/plotly-switching-comparison.js
new file mode 100644
index 0000000..3130701
--- /dev/null
+++ b/assets/figures/2025-vllm-sleep-mode/plotly-switching-comparison.js
@@ -0,0 +1,107 @@
+document.addEventListener('DOMContentLoaded', function() {
+ // Raw data: Wake Time vs Cold Start Time
+ const switchingData = {
+ "ModelA": {
+ name: "Qwen3-235B-A22B (TP=4)",
+ wake: [5.66, 5.29, 5.27],
+ cold: [97.9, 97.4, 97.71]
+ },
+ "ModelB": {
+ name: "Qwen3-Coder-30B (TP=1)",
+ wake: [2.89, 2.86, 2.85],
+ cold: [47.33, 47.47, 47.46]
+ }
+ };
+
+ // Calculate mean and error bars for each model
+ function calcStatsSwitch(values) {
+ const mean = values.reduce((a, b) => a + b, 0) / values.length;
+ const min = Math.min(...values);
+ const max = Math.max(...values);
+ return { mean, errorMinus: mean - min, errorPlus: max - mean };
+ }
+
+ // Prepare traces for both models
+ const modelsSwitch = Object.keys(switchingData);
+ const wakeStatsSwitch = modelsSwitch.map(m => calcStatsSwitch(switchingData[m].wake));
+ const coldStatsSwitch = modelsSwitch.map(m => calcStatsSwitch(switchingData[m].cold));
+
+ const wakeTraceSwitch = {
+ x: modelsSwitch.map(m => switchingData[m].name),
+ y: wakeStatsSwitch.map(s => s.mean),
+ name: "Wake from Sleep",
+ type: "bar",
+ marker: { color: "#2ca02c" },
+ error_y: {
+ type: "data",
+ symmetric: false,
+ array: wakeStatsSwitch.map(s => s.errorPlus),
+ arrayminus: wakeStatsSwitch.map(s => s.errorMinus),
+ color: "#1a5e1a",
+ thickness: 2,
+ width: 6
+ },
+ text: wakeStatsSwitch.map(s => s.mean.toFixed(2) + "s"),
+ textposition: "outside",
+ textfont: { size: 12, color: "#2ca02c", weight: "bold" },
+ hovertemplate: "%{x}<br>Wake Time: %{y:.2f}s"
+ };
+
+ const coldTraceSwitch = {
+ x: modelsSwitch.map(m => switchingData[m].name),
+ y: coldStatsSwitch.map(s => s.mean),
+ name: "Cold Start (Fresh Load)",
+ type: "bar",
+ marker: { color: "#d62728" },
+ error_y: {
+ type: "data",
+ symmetric: false,
+ array: coldStatsSwitch.map(s => s.errorPlus),
+ arrayminus: coldStatsSwitch.map(s => s.errorMinus),
+ color: "#8b1518",
+ thickness: 2,
+ width: 6
+ },
+ text: coldStatsSwitch.map(s => s.mean.toFixed(1) + "s"),
+ textposition: "outside",
+ textfont: { size: 12, color: "#d62728", weight: "bold" },
+ hovertemplate: "%{x}<br>Cold Start: %{y:.2f}s"
+ };
+
+ // Calculate speedup multiples for annotation
+ const speedupsSwitch = wakeStatsSwitch.map((w, i) => {
+ const speedup = (coldStatsSwitch[i].mean / w.mean).toFixed(0);
+ return speedup;
+ });
+
+ Plotly.newPlot("plotly-switching-comparison", [wakeTraceSwitch, coldTraceSwitch], {
+ barmode: "group",
+ bargap: 0.15,
+ bargroupgap: 0.1,
+ margin: { l: 60, r: 30, t: 40, b: 50 },
+ xaxis: {
+ title: "",
+ tickangle: 0
+ },
+ yaxis: {
+ title: "Switching Time (seconds)",
+ range: [0, Math.max(...coldStatsSwitch.map(s => s.mean + s.errorPlus)) * 1.15]
+ },
+ hovermode: "closest",
+ legend: {
+ x: 0.5,
+ y: 1.15,
+ xanchor: "center",
+ yanchor: "top",
+ orientation: "h"
+ },
+ annotations: modelsSwitch.map((m, i) => ({
+ x: switchingData[m].name,
+ y: coldStatsSwitch[i].mean + coldStatsSwitch[i].errorPlus + 5,
+ text: `${speedupsSwitch[i]}x faster`,
+ showarrow: false,
+ font: { size: 11, color: "#2ca02c", weight: "bold" },
+ xanchor: "center"
+ }))
+ }, {displayModeBar: true, responsive: true});
+});
diff --git a/assets/figures/2025-vllm-sleep-mode/sleepmode.png b/assets/figures/2025-vllm-sleep-mode/sleepmode.png
new file mode 100644
index 0000000..4a918ec
Binary files /dev/null and b/assets/figures/2025-vllm-sleep-mode/sleepmode.png differ