diff --git a/_posts/2025-10-26-sleep-mode.md b/_posts/2025-10-26-sleep-mode.md
new file mode 100644
index 0000000..316207f
--- /dev/null
+++ b/_posts/2025-10-26-sleep-mode.md
@@ -0,0 +1,471 @@
---
layout: post
title: "Zero-Reload Model Switching with vLLM Sleep Mode"
author: "Embedded LLM"
image: /assets/figures/2025-vllm-sleep-mode/sleepmode.png
thumbnail-img: /assets/figures/2025-vllm-sleep-mode/sleepmode.png
share-img: /assets/figures/2025-vllm-sleep-mode/sleepmode.png
---

## Introduction

**The multi-model serving problem:** You have two LLMs that each fit on your GPU, but not both at once. Traditional solutions force a bad tradeoff:

1. **Keep both models loaded** → Requires 2x the GPU memory (expensive, often impossible)
2. **Reload models on-demand** → 30-100+ seconds per switch (slow, wasteful)

![vLLM Sleep Mode](/assets/figures/2025-vllm-sleep-mode/sleepmode.png)

**vLLM Sleep Mode offers a third way:** Models hibernate in seconds and wake up fast, delivering the efficiency of on-demand loading with the speed of persistent serving.

### Two Sleep Levels for Different Needs

- **Level 1:** Offloads weights to CPU RAM (fastest wake time)
- **Level 2:** Discards weights entirely (nearly as fast wake time, minimal RAM usage)

Both levels are **18-200x faster** than a full reload and work seamlessly with Tensor Parallelism (TP), Pipeline Parallelism (PP), and Expert Parallelism (EP).

### Why Sleep Mode Beats Fast Weight Loaders

Even with instant weight loading, every cold start pays hidden costs that Sleep Mode avoids:

| Cost | Description | Fast Weight Loaders | Sleep Mode |
|------|-------------|---------------------|------------|
| 1. VRAM load time | Copying weights to GPU | ✅ Optimized | ✅ Preserved |
| 2. Memory allocator setup | CUDA allocator initialization | ❌ Every time | ✅ Preserved |
| 3. CUDA graph capture | Record execution graphs | ❌ Every time | ✅ Preserved |
| 4. GPU kernel JIT compilation | DeepGEMM, FlashInfer, TorchInductor | ❌ Every time | ✅ Preserved (after initial warmup) |
| 5. Cache warm-up | First-request overhead | ❌ Every time | ⚡ Quick re-warm |

By keeping the process alive, Sleep Mode preserves infrastructure (#2-4) and avoids expensive reinitialization. This is why benchmarks show **Sleep Mode inference is 61-88% faster** than cold starts.

**This post covers:**
- Comprehensive benchmarks across model sizes (0.6B to 235B) and GPUs (A4000 to A100)
- Technical deep-dives explaining the performance gains
- Ablation studies on warm-up impact and FP8 quantization
- A decision guide for choosing the right sleep level

## Quick Start: Using Sleep Mode

### Online Serving API

Start two vLLM servers with Sleep Mode enabled:

```bash
# Terminal 1: Start Phi-3-vision
export VLLM_SERVER_DEV_MODE=1
vllm serve microsoft/Phi-3-vision-128k-instruct --enable-sleep-mode --port 8001

# Terminal 2: Start Qwen3-0.6B
export VLLM_SERVER_DEV_MODE=1
vllm serve Qwen/Qwen3-0.6B --enable-sleep-mode --port 8002
```

### Sleep and Wake Models

```bash
# Put Phi-3-vision to sleep (Level 2 - minimal RAM usage)
curl -X POST 'localhost:8001/sleep?level=2'

# Put Qwen3-0.6B to sleep (Level 1 - weights offloaded to CPU RAM)
curl -X POST 'localhost:8002/sleep?level=1'

# Wake up Phi-3-vision for inference
curl -X POST 'localhost:8001/wake_up'
curl -X POST 'localhost:8001/collective_rpc' \
  -H 'Content-Type: application/json' \
  -d '{"method":"reload_weights"}'

# IMPORTANT: Reset prefix cache after waking (Level 2 only)
curl -X POST 'localhost:8001/reset_prefix_cache'

# Now run inference on Phi-3-vision...
# (your inference requests here)

# Put back to sleep when done
curl -X POST 'localhost:8001/sleep?level=2'

# Wake up Qwen3-0.6B
curl -X POST 'localhost:8002/wake_up'
# (Level 1 doesn't need reload_weights or reset_prefix_cache)

# Run inference on Qwen3-0.6B...
```

> [!NOTE]
> For Level 2 sleep, you must call `reload_weights` and `reset_prefix_cache` after waking. Level 1 sleep doesn't require these extra steps.

> [!WARNING]
> **Security:** The `/sleep`, `/wake_up`, `/collective_rpc`, and `/reset_prefix_cache` endpoints require `VLLM_SERVER_DEV_MODE=1` and should only be exposed in trusted networks. These administrative endpoints can disrupt service and are intended for closed environments like training clusters or backend applications.

## Performance Overview

Let's see how Sleep Mode performs compared to traditional model reloading.

### Sleep Mode L1 vs No Sleep Mode Performance

The interactive chart below shows the **total time to perform 5 model switches**: running inference on Model A, switching to Model B, running inference on Model B, then repeating this pattern (A→B→A→B→A→B).

**With Sleep Mode:** Models sleep/wake between switches, preserving infrastructure.
**Without Sleep Mode:** Each switch requires a full vLLM restart and reload.

*Model A: Qwen3-235B-A22B-Instruct-2507-FP8 (TP=4) | Model B: Qwen3-Coder-30B-A3B-Instruct (TP=1)*
*GPU: A100 | vLLM 0.11.0 | Sleep Level: 1 | Compilation: cudagraph_mode: FULL_AND_PIECEWISE*
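The sleep/wake endpoint sequence from the Quick Start can be captured in a small client-side helper. This is an illustrative sketch: `wake_calls` and `switch_plan` are our own functions, not part of vLLM; they simply encode the rule above that a Level 2 wake needs `reload_weights` and `reset_prefix_cache`, while Level 1 does not.

```python
# Sketch: build the ordered list of HTTP endpoints to POST when moving
# traffic from one vLLM server to another. The helpers are hypothetical;
# only the endpoint paths come from the vLLM dev API shown above.

def wake_calls(port: int, level: int) -> list[str]:
    """Endpoints to POST, in order, to wake a server that slept at `level`."""
    calls = [f"localhost:{port}/wake_up"]
    if level == 2:
        # Level 2 discarded the weights, so reload them and reset the
        # prefix cache after waking. Level 1 skips both steps.
        calls.append(f"localhost:{port}/collective_rpc")   # body: {"method": "reload_weights"}
        calls.append(f"localhost:{port}/reset_prefix_cache")
    return calls

def switch_plan(sleep_port: int, wake_port: int, level: int) -> list[str]:
    """Sleep the active server, then wake the other one."""
    return [f"localhost:{sleep_port}/sleep?level={level}"] + wake_calls(wake_port, level)
```

In a real client each URL would be POSTed with `requests.post(...)`, and the servers must be running with `VLLM_SERVER_DEV_MODE=1` as noted above.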
## Inference Performance Boost

Beyond faster model switching, Sleep Mode also delivers **faster inference times**. Because models are already warmed up when woken from sleep, they skip the cold start overhead that affects freshly loaded models.

*Inference time comparison showing wake mode (already warmed up) vs cold start (just loaded).*
*Inference time = prefill + decode (first request after wake/load). Each request uses a different question to avoid caching, limited to 100 tokens output.*
*Error bars show min/max variation across multiple runs. Values displayed on bars.*
*GPU: A100 | vLLM 0.11.0 | Sleep Level: 1 | Compilation: cudagraph_mode: FULL_AND_PIECEWISE*
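The 61-88% figure is plain percentage-reduction arithmetic. As a sketch, using the example timings quoted elsewhere in this post (0.92s first inference after wake vs 3.72s after a cold start; treat them as illustrative):

```python
def percent_faster(cold_s: float, warm_s: float) -> float:
    """Percentage reduction in first-inference latency vs a cold start."""
    return (cold_s - warm_s) / cold_s * 100

# Illustrative numbers from the post: 0.92s wake vs 3.72s cold start.
improvement = percent_faster(3.72, 0.92)   # ~75% faster
speedup = 3.72 / 0.92                      # ~4x, the low end of the "4-7x slower" range
```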
#### Why Sleep Mode Improves Inference Speed

The 61-88% inference speedup isn't from faster weight loading; it comes from **preserving expensive infrastructure** that cold starts must rebuild from scratch.

**What Sleep Mode Preserves:**

| Component | Preserved? | Cold Start Must Pay |
|-----------|-----------|---------------------|
| Memory allocator (CuMemAllocator) | ✅ Yes | ❌ Reinitialize every time |
| CUDA graphs | ✅ Yes | ❌ Re-capture every time |
| Process state (Python, CUDA context) | ✅ Yes | ❌ Restart every time |
| GPU kernel JIT cache | ✅ Yes (after initial warmup) | ❌ Recompile every time |

**The Critical Difference:**

- **Without Sleep Mode:** Process dies on unload → **you CANNOT benefit from pre-warm-up**
  - Must restart the Python process and CUDA context
  - Must reinitialize the memory allocator
  - Must re-capture CUDA graphs
  - Must re-JIT compile kernels (DeepGEMM, FlashInfer, TorchInductor)
  - **Result:** First inference is **4-7x slower** (see benchmarks: 0.92s wake vs 3.72s cold start)
- **With Sleep Mode:** Process stays alive → **pre-warm-up pays off**
  - ✅ Allocator, graphs, process state, and JIT kernels all preserved after initial warmup
  - **Result:** First inference stays fast (~1s), avoiding the 3-4s cold start penalty

> [!NOTE]
> Timing varies significantly by model size, GPU generation, and configuration. See the [Impact of Warm-Up](#impact-of-warm-up-on-sleep-mode) section for detailed measurements showing a 5-7x slowdown without warm-up.

## Model Switching Performance

The most dramatic benefit of Sleep Mode is in model switching time. Waking a sleeping model is **18-20x faster** than loading a fresh vLLM instance.

*Model switching time: Wake from sleep vs cold start (fresh load).*
*Error bars show min/max variation across multiple runs. Values displayed on bars.*
*GPU: A100 | vLLM 0.11.0 | Sleep Level: 1 | Compilation: cudagraph_mode: FULL_AND_PIECEWISE*
## Hardware Scalability: A4000 GPU Results

Sleep Mode benefits aren't limited to high-end GPUs. Here's the same workload on an **A4000 GPU** with smaller models, demonstrating that the performance gains scale across different hardware tiers and model sizes.

*Model A: Qwen3-0.6B | Model B: Phi-3-vision-128k-instruct*
*GPU: A4000 (TP=1) | vLLM 0.11.0 | Sleep Level: 1 | Compilation: cudagraph_mode: FULL_AND_PIECEWISE*

### A4000: Inference Performance
*Inference time comparison on A4000: wake mode (already warmed up) vs cold start (just loaded).*
*Inference time = prefill + decode (first request after wake/load). Each request uses a different question to avoid caching, limited to 100 tokens output.*
*Error bars show min/max variation across multiple runs. Values displayed on bars.*
*GPU: A4000 (TP=1) | vLLM 0.11.0 | Sleep Level: 1 | Compilation: cudagraph_mode: FULL_AND_PIECEWISE*

### A4000: Model Switching Performance

*Model switching time on A4000: Wake from sleep vs cold start (fresh load).*
*Error bars show min/max variation across multiple runs. Values displayed on bars.*
*GPU: A4000 (TP=1) | vLLM 0.11.0 | Sleep Level: 1 | Compilation: cudagraph_mode: FULL_AND_PIECEWISE*
**Key Observations on A4000:**
- **Inference Performance:** Wake mode delivers 83% faster inference for Qwen3-0.6B and 81% faster for Phi-3-vision
- **Model Switching:** Wake times are incredibly fast (~0.1-0.8s), achieving a **58-203x speedup** vs cold starts
- **Total time savings: 62%** (85s vs 226s for 5 model switches)
- **Near-instant switching** for small models (0.1s wake time), making multi-model serving feel seamless
- Demonstrates that Sleep Mode is effective across different GPU classes and model sizes

## Sleep Levels: Choosing the Right Mode

vLLM Sleep Mode offers two levels with different tradeoffs:

**Level 1 (Default):** Offloads model weights to CPU memory, discards KV cache
- **Fastest wake times** (~0.1-0.8s for small models, ~3-6s for large models)
- **Requires sufficient CPU RAM** to store model weights
- **Best for:** Systems with adequate CPU memory, frequent model switching

**Level 2:** Discards model weights and KV cache, keeps only buffers (rope scaling tensors, etc.) in CPU
- **Slower wake times** (~0.8-2.6s for small models) due to weight reload from disk
- **Minimal CPU RAM usage** - only small buffers retained
- **Best for:** Systems with limited CPU RAM, or managing many models that won't all fit in memory

### Performance Comparison: Level 1 vs Level 2 vs No Sleep

*Model A: Qwen3-0.6B | Model B: Phi-3-vision-128k-instruct*
*GPU: A100 (TP=1) | vLLM 0.11.0 | Compilation: cudagraph_mode: FULL_AND_PIECEWISE*
*Comparing all three modes: Level 1 (fastest), Level 2 (minimal RAM), No Sleep. Hover for exact timing.*
**Performance Summary:**

| Mode | Total Time | Wake Time (A/B) | CPU RAM | Best For |
|------|------------|-----------------|---------|----------|
| **No Sleep** | 357.1s | N/A (full reload) | Minimal | Single model, no switching |
| **Level 1** | 112.6s | 0.26s / 0.82s | High (~GB per model) | Frequent switching, ample RAM |
| **Level 2** | 124.6s | 0.85s / 2.58s | Minimal (~MB per model) | Limited RAM, cost optimization |

**Key Insights:**
- **Level 1 is fastest** (68% faster than no sleep) but needs significant CPU RAM
- **Level 2 is nearly as fast** (65% faster than no sleep) with minimal RAM requirements
- **Level 2 wake is ~3x slower than Level 1** (0.85s vs 0.26s for Qwen3-0.6B) due to weight reload
- Both sleep modes deliver **massive improvements** over no sleep mode

#### Why Level 2 is Still Faster Than No Sleep Mode

At first glance, this seems counterintuitive: **Level 2 reloads weights from SSD** (just like "No Sleep Mode"), so why are its switches **23-45x faster?**

**The answer: weight loading is only ONE of FIVE costs.**

When you reload a model without Sleep Mode, you pay all of these:

| Cost | Level 2 | No Sleep Mode |
|------|---------|---------------|
| 1. Weight load (SSD → VRAM) | ❌ Must pay | ❌ Must pay |
| 2. Process initialization | ✅ **Skipped** | ❌ Must pay |
| 3. Memory allocator setup | ✅ **Skipped** | ❌ Must pay |
| 4. CUDA graph capture | ✅ **Skipped** | ❌ Must pay |
| 5. GPU kernel JIT compilation | ✅ **Preserved (already compiled)** | ❌ Full compilation + warm-up |

**Level 2 Strategy:**
- Weight reload from SSD (same as No Sleep)
- **Everything else preserved:** Process state, allocator instance, CUDA graphs, and compiled JIT kernels all intact
- **No recompilation needed:** Kernels were compiled during initial warmup and remain cached
- **Average per switch: ~2.6s** (see benchmark data below)

**No Sleep Mode Reality:**
- Weight reload from SSD (same as Level 2)
- **Everything else rebuilt:** Process restart + allocator init + graph re-capture
- **JIT kernels:** Full compilation + explicit warm-up routine (`kernel_warmup()` + dummy runs)
- **Average per switch: ~48s** (see benchmark data below)

**The benchmark data proves it.** For 5 model switches:
- **Level 2:** 124.6s total (switch operations average ~2.6s)
- **No Sleep:** 357.1s total (switch operations average ~48s)

Even though both reload weights from SSD, Level 2 is **2.9x faster overall** because it preserves the expensive infrastructure (process state, allocator, CUDA graphs) that No Sleep Mode must rebuild from scratch every single time.

### Level 2: Inference Performance

*Inference time comparison with Sleep Level 2: wake mode vs cold start.*
*Inference time = prefill + decode (first request after wake/load). Each request uses a different question to avoid caching, limited to 100 tokens output.*
*Error bars show min/max variation across multiple runs. Values displayed on bars.*
*GPU: A100 (TP=1) | vLLM 0.11.0 | Sleep Level: 2 | Compilation: cudagraph_mode: FULL_AND_PIECEWISE*

### Level 2: Model Switching Performance
*Model switching time with Sleep Level 2: wake from sleep vs cold start.*
*Error bars show min/max variation across multiple runs. Values displayed on bars.*
*GPU: A100 (TP=1) | vLLM 0.11.0 | Sleep Level: 2 | Compilation: cudagraph_mode: FULL_AND_PIECEWISE*
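The five-cost accounting behind the Level 2 numbers can be sketched as plain arithmetic. The per-cost durations below are made-up placeholders (chosen only so the totals land near the post's ~2.6s and ~48s per-switch averages), not measured components:

```python
# Illustrative cost model: a switch pays only the costs it cannot skip.
# All durations are invented placeholders, not benchmark measurements.
COSTS_S = {
    "weight_load": 2.5,       # SSD -> VRAM (paid by Level 2 AND No Sleep)
    "process_init": 20.0,     # Python + CUDA context startup
    "allocator_setup": 1.5,   # CuMemAllocator initialization
    "graph_capture": 14.0,    # CUDA graph re-capture
    "jit_compile": 10.0,      # DeepGEMM / FlashInfer / TorchInductor
}

LEVEL2_PAYS = {"weight_load"}        # everything else stays alive in the process
NO_SLEEP_PAYS = set(COSTS_S)         # process died: pay all five costs

level2_switch = sum(COSTS_S[c] for c in LEVEL2_PAYS)      # ~2.5s per switch
no_sleep_switch = sum(COSTS_S[c] for c in NO_SLEEP_PAYS)  # ~48s per switch
```

The point of the sketch: shrinking `weight_load` (a faster weight loader) barely moves `no_sleep_switch`, because the other four costs dominate; keeping the process alive removes them entirely.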
**Key Observations:**

| Metric | No Sleep | Level 2 | Improvement |
|--------|----------|---------|-------------|
| **Total Time (5 switches)** | 357.1s | 124.6s | **65% faster** |
| **Qwen3-0.6B Switch Time** | 37.6s avg | 0.85s avg | **45x faster** |
| **Phi-3-vision Switch Time** | 58.1s avg | 2.58s avg | **23x faster** |
| **Qwen3-0.6B Inference** | 3.67s avg | 0.53s avg | **86% faster** |
| **Phi-3-vision Inference** | 6.30s avg | 0.76s avg | **88% faster** |
| **Wake Time vs Level 1** | - | 3-10x slower | Trade CPU RAM for speed |

**When to Use Level 2:**
- **Limited CPU RAM:** The system cannot hold all model weights in CPU memory
- **Cost Optimization:** Cheaper cloud instances with less CPU RAM
- **Many Models:** Switching between many models where CPU memory is a constraint
- **Still Significant Gains:** Even with weight reload, Level 2 switches are 23-45x faster than no sleep mode

**Level 1 vs Level 2 Comparison:**
- Level 1: ~0.1-0.8s wake time, needs ~10-100GB+ CPU RAM per model
- Level 2: ~0.8-2.6s wake time, needs only ~MB of CPU RAM per model
- Both are dramatically faster than a full reload (~20-100s)

## Ablation Studies

### Impact of Warm-Up on Sleep Mode

Does skipping the warm-up phase affect performance? Warm-up pre-compiles CUDA graphs during initial load, which can take several seconds. Let's compare with and without warm-up.

*Model A: Qwen3-0.6B | Model B: Phi-3-vision-128k-instruct*
*GPU: A100 (TP=1) | vLLM 0.11.0 | Sleep Level: 1 | Compilation: cudagraph_mode: FULL_AND_PIECEWISE*
*Comparing with warm-up (pre-compiled) vs without warm-up (lazy compilation). Hover for exact timing.*
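A warm-up can be as small as a single 1-token generation issued right after load. The payload below assumes the standard OpenAI-compatible `/v1/completions` schema that a vLLM server exposes; the `warmup_payload` helper itself is our own sketch:

```python
# Sketch: minimal warm-up request. One 1-token generation is enough to
# trigger JIT compilation and CUDA graph capture, which then persist
# across sleep/wake cycles. `warmup_payload` is a hypothetical helper;
# the field names follow the OpenAI-compatible completions schema.

def warmup_payload(model: str) -> dict:
    return {
        "model": model,
        "prompt": "warmup",
        "max_tokens": 1,       # a single token is sufficient to compile everything
        "temperature": 0.0,    # deterministic, cheapest possible request
    }

# e.g. requests.post("http://localhost:8002/v1/completions",
#                    json=warmup_payload("Qwen/Qwen3-0.6B"))
```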
**Key Findings:**

| Metric | With Warm-Up | Without Warm-Up | Difference |
|--------|--------------|-----------------|------------|
| **Initial Load Time** | 108.7s (includes 8.4s warm-up) | 101.1s (no warm-up) | 7.6s saved initially |
| **First Inference (A)** | 0.45s | 2.59s | **5.8x slower** without warm-up |
| **First Inference (B)** | 0.93s | 6.61s | **7.1x slower** without warm-up |
| **Subsequent Inferences** | 0.43s avg | 0.41s avg | No difference |
| **Total Time (5 switches)** | 119.5s | 119.0s | Nearly identical |

**Insights:**
- **Warm-Up Compiles Kernels Once, Benefits All Wake Cycles:** With initial warmup, JIT compilation and CUDA graph capture happen once during load and are preserved across all subsequent sleep/wake cycles
- **Without Warm-Up, the First Request Pays the Compilation Cost:** The 5.8-7.1x slowdown hits the first inference after the initial load; in effect, the first user request becomes the warm-up (subsequent inferences drop back to ~0.4s)
- **Compiled Kernels Are Preserved Across Sleep/Wake:** After warmup during initial load (8.4s), all subsequent wake-ups have fast first inference (0.45s, 0.93s), proving kernels stay cached
- **Minimal Warmup Is Sufficient:** A single 1-token inference is enough to trigger full JIT compilation and CUDA graph capture, making warmup very cheap
- **Trade Initial Load Time for Consistent Performance:** The 8.4s warmup cost is paid once and amortized across all model switches
- **Recommendation: Always Use Warm-Up** for production workloads where consistent, fast inference is expected

### Impact of Quantization on Sleep Mode

Does quantization (FP8) affect Sleep Mode performance? We tested the same workload with and without FP8 quantization on an A100 GPU.

*Model A: Qwen3-0.6B | Model B: Phi-3-vision-128k-instruct*
*GPU: A100 (TP=1) | vLLM 0.11.0 | Sleep Level: 1 | Compilation: cudagraph_mode: FULL_AND_PIECEWISE*
*Comparing BF16 (baseline) vs FP8 quantization. Hover for exact timing.*

### Ablation: Inference Performance (BF16 vs FP8)
*Inference time comparison: BF16 vs FP8 quantization with Sleep Mode.*
*Inference time = prefill + decode (first request after wake/load). Each request uses a different question to avoid caching, limited to 100 tokens output.*
*Error bars show min/max variation across multiple runs. Values displayed on bars.*
*GPU: A100 (TP=1) | vLLM 0.11.0 | Sleep Level: 1 | Compilation: cudagraph_mode: FULL_AND_PIECEWISE*

### Ablation: Model Switching (BF16 vs FP8)
*Model switching time: BF16 vs FP8 quantization with Sleep Mode.*
*Error bars show min/max variation across multiple runs. Values displayed on bars.*
*GPU: A100 (TP=1) | vLLM 0.11.0 | Sleep Level: 1 | Compilation: cudagraph_mode: FULL_AND_PIECEWISE*
**Key Findings:**

| Metric | BF16 | FP8 | Improvement |
|--------|------|-----|-------------|
| **Total Time (5 switches)** | 108.2s | 113.6s | -5% (slightly slower) |
| **Qwen3-0.6B Wake Time** | 0.27s avg | 0.18s avg | **33% faster** |
| **Phi-3-vision Wake Time** | 0.90s avg | 0.78s avg | **13% faster** |
| **Qwen3-0.6B Inference** | 0.41s avg | 0.44s avg | -7% (slightly slower) |
| **Phi-3-vision Inference** | 0.81s avg | 0.57s avg | **30% faster** |
| **Initial Load Time** | 90.5s | 96.9s | -7% (slower load) |

**Insights:**
- **FP8 has faster wake operations** (13-33% faster) due to less memory movement
- **FP8 improves inference for larger models** (30% faster for Phi-3-vision) but shows minimal difference for tiny models
- **Initial load takes longer with FP8** due to quantization overhead during loading
- **After initial load, FP8 provides smoother switching** with faster wake cycles
- For workloads with frequent switching, FP8's faster wake times can offset the longer initial load

## Decision Guide: Which Sleep Level to Use?

### Use Sleep Level 1 When:
- You have sufficient CPU RAM to hold all model weights
- You need the fastest possible wake times (0.1-6s)
- You're switching models very frequently (every few seconds/minutes)
- Inference latency consistency is critical

### Use Sleep Level 2 When:
- CPU RAM is limited (can't hold all model weights)
- You're optimizing cloud costs (cheaper instances with less RAM)
- You have many models to manage (10+)

### Skip Sleep Mode When:
- You're only using a single model (no switching needed)
- Model switches are extremely rare (once per day/week)
- Both models fit simultaneously in GPU memory

## Conclusion

vLLM Sleep Mode transforms multi-model GPU serving from a 30-100 second reload penalty into sub-second switches.
The benchmarks speak for themselves:

- **18-200x faster model switching** depending on model size and hardware
- **61-88% faster inference** for warmed models vs cold starts
- **65-68% total time savings** across complete workloads
- **Works at every scale:** 0.6B to 235B parameters, small and large GPUs

The future of LLM serving is multi-model. Sleep Mode makes it practical today.

## Acknowledgements

Special thanks to **Vensen Mu**, **Jeff Aw**, **Jun Kang Chow**, **Tun Jian Tan**, **Pin Siang Tan**, **Amir Balwel**, **Ye Hur Cheong**, **Zhiyao Cen** and **Kaichao You** for developing the Sleep Mode feature and this blog post.

diff --git a/assets/figures/2025-vllm-sleep-mode/plotly-ablation-inference.js b/assets/figures/2025-vllm-sleep-mode/plotly-ablation-inference.js
new file mode 100644
index 0000000..0f9b772
--- /dev/null
+++ b/assets/figures/2025-vllm-sleep-mode/plotly-ablation-inference.js
@@ -0,0 +1,91 @@
document.addEventListener('DOMContentLoaded', function() {
  // Ablation inference data: BF16 vs FP8
  const ablationInferenceData = {
    "ModelA": {
      name: "Qwen3-0.6B",
      bf16: [0.41, 0.4, 0.41],
      fp8: [0.43, 0.43, 0.45]
    },
    "ModelB": {
      name: "Phi-3-vision-128k",
      bf16: [0.9, 0.74, 0.8],
      fp8: [0.69, 0.59, 0.44]
    }
  };

  function calcStatsAblInf(values) {
    const mean = values.reduce((a, b) => a + b, 0) / values.length;
    const min = Math.min(...values);
    const max = Math.max(...values);
    return { mean, errorMinus: mean - min, errorPlus: max - mean };
  }

  const modelsAblInf = Object.keys(ablationInferenceData);
  const bf16StatsInf = modelsAblInf.map(m => calcStatsAblInf(ablationInferenceData[m].bf16));
  const fp8StatsInf = modelsAblInf.map(m => calcStatsAblInf(ablationInferenceData[m].fp8));

  const bf16TraceInf = {
    x: modelsAblInf.map(m => ablationInferenceData[m].name),
    y: bf16StatsInf.map(s => s.mean),
    name: "BF16",
    type: "bar",
    marker: { color: "#1f77b4" },
    error_y: {
      type: "data",
      symmetric:
false,
      array: bf16StatsInf.map(s => s.errorPlus),
      arrayminus: bf16StatsInf.map(s => s.errorMinus),
      color: "#0d4a6e",
      thickness: 2,
      width: 6
    },
    text: bf16StatsInf.map(s => s.mean.toFixed(2) + "s"),
    textposition: "outside",
    textfont: { size: 12, color: "#1f77b4", weight: "bold" },
    hovertemplate: "%{x}<br>BF16: %{y:.2f}s"
  };

  const fp8TraceInf = {
    x: modelsAblInf.map(m => ablationInferenceData[m].name),
    y: fp8StatsInf.map(s => s.mean),
    name: "FP8",
    type: "bar",
    marker: { color: "#ff7f0e" },
    error_y: {
      type: "data",
      symmetric: false,
      array: fp8StatsInf.map(s => s.errorPlus),
      arrayminus: fp8StatsInf.map(s => s.errorMinus),
      color: "#cc6600",
      thickness: 2,
      width: 6
    },
    text: fp8StatsInf.map(s => s.mean.toFixed(2) + "s"),
    textposition: "outside",
    textfont: { size: 12, color: "#ff7f0e", weight: "bold" },
    hovertemplate: "%{x}
FP8: %{y:.2f}s" + }; + + Plotly.newPlot("plotly-ablation-inference", [bf16TraceInf, fp8TraceInf], { + barmode: "group", + bargap: 0.15, + bargroupgap: 0.1, + margin: { l: 60, r: 30, t: 40, b: 50 }, + xaxis: { + title: "", + tickangle: 0 + }, + yaxis: { + title: "Inference Time (seconds)", + range: [0, Math.max(...bf16StatsInf.map(s => s.mean + s.errorPlus), ...fp8StatsInf.map(s => s.mean + s.errorPlus)) * 1.25] + }, + hovermode: "closest", + legend: { + x: 0.5, + y: 1.15, + xanchor: "center", + yanchor: "top", + orientation: "h" + } + }, {displayModeBar: true, responsive: true}); +}); diff --git a/assets/figures/2025-vllm-sleep-mode/plotly-ablation-quant.js b/assets/figures/2025-vllm-sleep-mode/plotly-ablation-quant.js new file mode 100644 index 0000000..85a4ec8 --- /dev/null +++ b/assets/figures/2025-vllm-sleep-mode/plotly-ablation-quant.js @@ -0,0 +1,140 @@ +document.addEventListener('DOMContentLoaded', function() { + // Ablation study: BF16 vs FP8 quantization + const timingDataAblation = { + "Sleep Mode (BF16)": [ + { event: "A Model Load", duration: 32.56 }, + { event: "A Model Warm Up", duration: 2.69 }, + { event: "B Model Load", duration: 57.96 }, + { event: "B Model Warm Up", duration: 5.92 }, + { event: "A Model Wake up", duration: 0.28 }, + { event: "A Model Prompt", duration: 0.41 }, + { event: "A Model Sleep", duration: 0.09 }, + { event: "B Model Wake Up", duration: 0.89 }, + { event: "B Model Prompt", duration: 0.9 }, + { event: "B Model Sleep", duration: 0.48 }, + { event: "A Model Wake up", duration: 0.27 }, + { event: "A Model Prompt", duration: 0.4 }, + { event: "A Model Sleep", duration: 0.1 }, + { event: "B Model Wake Up", duration: 0.93 }, + { event: "B Model Prompt", duration: 0.74 }, + { event: "B Model Sleep", duration: 0.5 }, + { event: "A Model Wake up", duration: 0.27 }, + { event: "A Model Prompt", duration: 0.41 }, + { event: "A Model Sleep", duration: 0.1 }, + { event: "B Model Wake Up", duration: 0.88 }, + { event: "B Model Prompt", 
duration: 0.8 } + ], + "Sleep Mode (FP8)": [ + { event: "A Model Load", duration: 37.71 }, + { event: "A Model Warm Up", duration: 2.34 }, + { event: "B Model Load", duration: 57.79 }, + { event: "B Model Warm Up", duration: 6.37 }, + { event: "A Model Wake up", duration: 0.18 }, + { event: "A Model Prompt", duration: 0.43 }, + { event: "A Model Sleep", duration: 0.06 }, + { event: "B Model Wake Up", duration: 0.79 }, + { event: "B Model Prompt", duration: 0.69 }, + { event: "B Model Sleep", duration: 0.31 }, + { event: "A Model Wake up", duration: 0.19 }, + { event: "A Model Prompt", duration: 0.43 }, + { event: "A Model Sleep", duration: 0.06 }, + { event: "B Model Wake Up", duration: 0.77 }, + { event: "B Model Prompt", duration: 0.59 }, + { event: "B Model Sleep", duration: 0.31 }, + { event: "A Model Wake up", duration: 0.16 }, + { event: "A Model Prompt", duration: 0.45 }, + { event: "A Model Sleep", duration: 0.07 }, + { event: "B Model Wake Up", duration: 0.78 }, + { event: "B Model Prompt", duration: 0.44 } + ] + }; + + // Convert to segment format + function createSegmentsAblation(timingData) { + const segments = []; + + Object.entries(timingData).forEach(([scenario, events]) => { + let cumulativeTime = 0; + + events.forEach(({ event, duration }) => { + const [who, ...stageParts] = event.split(' '); + const stage = stageParts.join(' '); + + let action, category; + if (stage.includes('Load')) { + action = 'Load'; + category = `${who} Load`; + } else if (stage.includes('Wake')) { + action = 'Wake'; + category = `${who} Wake`; + } else if (stage.includes('Prompt')) { + action = 'Prompt'; + category = `${who} Prompt`; + } else if (stage.includes('Sleep')) { + action = 'Sleep'; + category = `${who} Sleep`; + } else if (stage.includes('Warm')) { + action = 'Load'; + category = `${who} Load`; + } + + segments.push({ + scenario, + who, + stage, + action, + start: cumulativeTime, + end: cumulativeTime + duration, + duration, + category + }); + + cumulativeTime += 
duration; + }); + }); + + return segments; + } + + const segmentsAblation = createSegmentsAblation(timingDataAblation); + const colorMapAblation = {"A Load": "#1f77b4", "B Load": "#ff7f0e", "A Wake": "#2ca02c", "B Wake": "#17becf", "A Sleep": "#9467bd", "B Sleep": "#8c564b", "A Prompt": "#e377c2", "B Prompt": "#7f7f7f"}; + const categoriesAblation = Object.keys(colorMapAblation); + + const xAblation = segmentsAblation.map(d => d.duration); + const baseAblation = segmentsAblation.map(d => d.start); + const yAblation = segmentsAblation.map(d => d.scenario); + const colorsAblation = segmentsAblation.map(d => colorMapAblation[d.category]); + const customAblation = segmentsAblation.map(d => [d.scenario, d.category, d.stage, d.start, d.end]); + + const barsAblation = { + type: "bar", + orientation: "h", + x: xAblation, base: baseAblation, y: yAblation, + marker: { color: colorsAblation, line: {width:1, color:"rgba(0,0,0,0.35)"} }, + hovertemplate: + "%{customdata[0]}
%{customdata[1]} — %{customdata[2]}
"+ + "Start %{customdata[3]:.2f}s → End %{customdata[4]:.2f}s
"+ + "%{x:.2f}s", + customdata: customAblation, + showlegend: false + }; + + const legendTracesAblation = categoriesAblation.map(name => ({ + type: "scatter", mode: "markers", x:[null], y:[null], + name, marker: {color: colorMapAblation[name], size: 10}, + hoverinfo:"skip", showlegend:true + })); + + Plotly.newPlot("plotly-ablation-quant", [barsAblation, ...legendTracesAblation], { + barmode: "overlay", + bargap: 0.05, + margin: {l: 140, r: 30, t: 20, b: 40}, + xaxis: { title: "Time (seconds)", range: [0, 115] }, + yaxis: { + categoryorder: "array", + categoryarray: ["Sleep Mode (FP8)", "Sleep Mode (BF16)"] + }, + hovermode: "closest", + dragmode: "pan" + }, {displayModeBar: true, responsive: true}); +}); diff --git a/assets/figures/2025-vllm-sleep-mode/plotly-ablation-switching.js b/assets/figures/2025-vllm-sleep-mode/plotly-ablation-switching.js new file mode 100644 index 0000000..e2f0f94 --- /dev/null +++ b/assets/figures/2025-vllm-sleep-mode/plotly-ablation-switching.js @@ -0,0 +1,105 @@ +document.addEventListener('DOMContentLoaded', function() { + // Ablation switching data: BF16 vs FP8 + const ablationSwitchingData = { + "ModelA": { + name: "Qwen3-0.6B", + bf16: [0.28, 0.27, 0.27], + fp8: [0.18, 0.19, 0.16] + }, + "ModelB": { + name: "Phi-3-vision-128k", + bf16: [0.89, 0.93, 0.88], + fp8: [0.79, 0.77, 0.78] + } + }; + + function calcStatsAblSwitch(values) { + const mean = values.reduce((a, b) => a + b, 0) / values.length; + const min = Math.min(...values); + const max = Math.max(...values); + return { mean, errorMinus: mean - min, errorPlus: max - mean }; + } + + const modelsAblSwitch = Object.keys(ablationSwitchingData); + const bf16StatsSwitch = modelsAblSwitch.map(m => calcStatsAblSwitch(ablationSwitchingData[m].bf16)); + const fp8StatsSwitch = modelsAblSwitch.map(m => calcStatsAblSwitch(ablationSwitchingData[m].fp8)); + + const bf16TraceSwitch = { + x: modelsAblSwitch.map(m => ablationSwitchingData[m].name), + y: bf16StatsSwitch.map(s => s.mean), + name: 
"BF16",
    type: "bar",
    marker: { color: "#1f77b4" },
    error_y: {
      type: "data",
      symmetric: false,
      array: bf16StatsSwitch.map(s => s.errorPlus),
      arrayminus: bf16StatsSwitch.map(s => s.errorMinus),
      color: "#0d4a6e",
      thickness: 2,
      width: 6
    },
    text: bf16StatsSwitch.map(s => s.mean.toFixed(2) + "s"),
    textposition: "outside",
    textfont: { size: 12, color: "#1f77b4", weight: "bold" },
    hovertemplate: "%{x}<br>BF16: %{y:.2f}s"
  };

  const fp8TraceSwitch = {
    x: modelsAblSwitch.map(m => ablationSwitchingData[m].name),
    y: fp8StatsSwitch.map(s => s.mean),
    name: "FP8",
    type: "bar",
    marker: { color: "#ff7f0e" },
    error_y: {
      type: "data",
      symmetric: false,
      array: fp8StatsSwitch.map(s => s.errorPlus),
      arrayminus: fp8StatsSwitch.map(s => s.errorMinus),
      color: "#cc6600",
      thickness: 2,
      width: 6
    },
    text: fp8StatsSwitch.map(s => s.mean.toFixed(2) + "s"),
    textposition: "outside",
    textfont: { size: 12, color: "#ff7f0e", weight: "bold" },
    hovertemplate: "%{x}
FP8: %{y:.2f}s" + }; + + // Calculate speedup percentages for annotation + const speedupsSwitchAbl = bf16StatsSwitch.map((bf16, i) => { + const reduction = ((bf16.mean - fp8StatsSwitch[i].mean) / bf16.mean * 100).toFixed(0); + return reduction; + }); + + Plotly.newPlot("plotly-ablation-switching", [bf16TraceSwitch, fp8TraceSwitch], { + barmode: "group", + bargap: 0.15, + bargroupgap: 0.1, + margin: { l: 60, r: 30, t: 40, b: 50 }, + xaxis: { + title: "", + tickangle: 0 + }, + yaxis: { + title: "Wake Time (seconds)", + range: [0, Math.max(...bf16StatsSwitch.map(s => s.mean + s.errorPlus)) * 1.3] + }, + hovermode: "closest", + legend: { + x: 0.5, + y: 1.15, + xanchor: "center", + yanchor: "top", + orientation: "h" + }, + annotations: modelsAblSwitch.map((m, i) => ({ + x: ablationSwitchingData[m].name, + y: bf16StatsSwitch[i].mean + bf16StatsSwitch[i].errorPlus + 0.07, + text: `${speedupsSwitchAbl[i]}% faster`, + showarrow: false, + font: { size: 11, color: "#ff7f0e", weight: "bold" }, + xanchor: "center" + })) + }, {displayModeBar: true, responsive: true}); +}); diff --git a/assets/figures/2025-vllm-sleep-mode/plotly-ablation-warmup.js b/assets/figures/2025-vllm-sleep-mode/plotly-ablation-warmup.js new file mode 100644 index 0000000..2469df2 --- /dev/null +++ b/assets/figures/2025-vllm-sleep-mode/plotly-ablation-warmup.js @@ -0,0 +1,138 @@ +document.addEventListener('DOMContentLoaded', function() { + // Ablation study: With vs Without Warm-Up + const timingDataWarmup = { + "With Warm-Up": [ + { event: "A Model Load", duration: 37.65 }, + { event: "A Model Warm Up", duration: 2.39 }, + { event: "B Model Load", duration: 62.69 }, + { event: "B Model Warm Up", duration: 6 }, + { event: "A Model Wake up", duration: 0.24 }, + { event: "A Model Prompt", duration: 0.45 }, + { event: "A Model Sleep", duration: 0.09 }, + { event: "B Model Wake Up", duration: 0.89 }, + { event: "B Model Prompt", duration: 0.93 }, + { event: "B Model Sleep", duration: 0.47 }, + { event: "A Model 
Wake up", duration: 0.23 }, + { event: "A Model Prompt", duration: 0.43 }, + { event: "A Model Sleep", duration: 0.1 }, + { event: "B Model Wake Up", duration: 0.87 }, + { event: "B Model Prompt", duration: 0.73 }, + { event: "B Model Sleep", duration: 0.46 }, + { event: "A Model Wake up", duration: 0.23 }, + { event: "A Model Prompt", duration: 0.46 }, + { event: "A Model Sleep", duration: 0.09 }, + { event: "B Model Wake Up", duration: 0.85 }, + { event: "B Model Prompt", duration: 0.73 } + ], + "Without Warm-Up": [ + { event: "A Model Load", duration: 37.91 }, + { event: "B Model Load", duration: 63.16 }, + { event: "A Model Wake up", duration: 0.24 }, + { event: "A Model Prompt", duration: 2.59 }, + { event: "A Model Sleep", duration: 0.09 }, + { event: "B Model Wake Up", duration: 0.91 }, + { event: "B Model Prompt", duration: 6.61 }, + { event: "B Model Sleep", duration: 0.44 }, + { event: "A Model Wake up", duration: 0.26 }, + { event: "A Model Prompt", duration: 0.41 }, + { event: "A Model Sleep", duration: 0.09 }, + { event: "B Model Wake Up", duration: 0.87 }, + { event: "B Model Prompt", duration: 0.7 }, + { event: "B Model Sleep", duration: 0.43 }, + { event: "A Model Wake up", duration: 0.27 }, + { event: "A Model Prompt", duration: 0.42 }, + { event: "A Model Sleep", duration: 0.1 }, + { event: "B Model Wake Up", duration: 0.86 }, + { event: "B Model Prompt", duration: 0.7 } + ] + }; + + // Convert to segment format + function createSegmentsWarmup(timingData) { + const segments = []; + + Object.entries(timingData).forEach(([scenario, events]) => { + let cumulativeTime = 0; + + events.forEach(({ event, duration }) => { + const [who, ...stageParts] = event.split(' '); + const stage = stageParts.join(' '); + + let action, category; + if (stage.includes('Load')) { + action = 'Load'; + category = `${who} Load`; + } else if (stage.includes('Wake')) { + action = 'Wake'; + category = `${who} Wake`; + } else if (stage.includes('Prompt')) { + action = 'Prompt'; 
+ category = `${who} Prompt`; + } else if (stage.includes('Sleep')) { + action = 'Sleep'; + category = `${who} Sleep`; + } else if (stage.includes('Warm')) { + action = 'Load'; + category = `${who} Load`; + } + + segments.push({ + scenario, + who, + stage, + action, + start: cumulativeTime, + end: cumulativeTime + duration, + duration, + category + }); + + cumulativeTime += duration; + }); + }); + + return segments; + } + + const segmentsWarmup = createSegmentsWarmup(timingDataWarmup); + const colorMapWarmup = {"A Load": "#1f77b4", "B Load": "#ff7f0e", "A Wake": "#2ca02c", "B Wake": "#17becf", "A Sleep": "#9467bd", "B Sleep": "#8c564b", "A Prompt": "#e377c2", "B Prompt": "#7f7f7f"}; + const categoriesWarmup = Object.keys(colorMapWarmup); + + const xWarmup = segmentsWarmup.map(d => d.duration); + const baseWarmup = segmentsWarmup.map(d => d.start); + const yWarmup = segmentsWarmup.map(d => d.scenario); + const colorsWarmup = segmentsWarmup.map(d => colorMapWarmup[d.category]); + const customWarmup = segmentsWarmup.map(d => [d.scenario, d.category, d.stage, d.start, d.end]); + + const barsWarmup = { + type: "bar", + orientation: "h", + x: xWarmup, base: baseWarmup, y: yWarmup, + marker: { color: colorsWarmup, line: {width:1, color:"rgba(0,0,0,0.35)"} }, + hovertemplate: + "%{customdata[0]}
<br>%{customdata[1]} — %{customdata[2]}<br>
"+ + "Start %{customdata[3]:.2f}s → End %{customdata[4]:.2f}s<br>
"+ + "%{x:.2f}s", + customdata: customWarmup, + showlegend: false + }; + + const legendTracesWarmup = categoriesWarmup.map(name => ({ + type: "scatter", mode: "markers", x:[null], y:[null], + name, marker: {color: colorMapWarmup[name], size: 10}, + hoverinfo:"skip", showlegend:true + })); + + Plotly.newPlot("plotly-ablation-warmup", [barsWarmup, ...legendTracesWarmup], { + barmode: "overlay", + bargap: 0.05, + margin: {l: 140, r: 30, t: 20, b: 40}, + xaxis: { title: "Time (seconds)", range: [0, 120] }, + yaxis: { + categoryorder: "array", + categoryarray: ["Without Warm-Up", "With Warm-Up"] + }, + hovermode: "closest", + dragmode: "pan" + }, {displayModeBar: true, responsive: true}); +}); diff --git a/assets/figures/2025-vllm-sleep-mode/plotly-inference-a4000.js b/assets/figures/2025-vllm-sleep-mode/plotly-inference-a4000.js new file mode 100644 index 0000000..5f1f803 --- /dev/null +++ b/assets/figures/2025-vllm-sleep-mode/plotly-inference-a4000.js @@ -0,0 +1,104 @@ +document.addEventListener('DOMContentLoaded', function() { + // A4000 Inference data + const inferenceDataA4000 = { + "ModelA": { + name: "Qwen3-0.6B", + wake: [0.44, 0.43, 0.43], + cold: [2.64, 2.5, 2.63] + }, + "ModelB": { + name: "Phi-3-vision-128k(4B)", + wake: [2.04, 1.73, 1.61], + cold: [9.78, 9.01, 9.79] + } + }; + + function calcStatsInfA4000(values) { + const mean = values.reduce((a, b) => a + b, 0) / values.length; + const min = Math.min(...values); + const max = Math.max(...values); + return { mean, errorMinus: mean - min, errorPlus: max - mean }; + } + + const modelsInfA4000 = Object.keys(inferenceDataA4000); + const wakeStatsInfA4000 = modelsInfA4000.map(m => calcStatsInfA4000(inferenceDataA4000[m].wake)); + const coldStatsInfA4000 = modelsInfA4000.map(m => calcStatsInfA4000(inferenceDataA4000[m].cold)); + + const wakeTraceInfA4000 = { + x: modelsInfA4000.map(m => inferenceDataA4000[m].name), + y: wakeStatsInfA4000.map(s => s.mean), + name: "Wake Mode (Warmed Up)", + type: "bar", + marker: 
{ color: "#2ca02c" }, + error_y: { + type: "data", + symmetric: false, + array: wakeStatsInfA4000.map(s => s.errorPlus), + arrayminus: wakeStatsInfA4000.map(s => s.errorMinus), + color: "#1a5e1a", + thickness: 2, + width: 6 + }, + text: wakeStatsInfA4000.map(s => s.mean.toFixed(2) + "s"), + textposition: "outside", + textfont: { size: 12, color: "#2ca02c", weight: "bold" }, + hovertemplate: "%{x}
<br>Wake Mode: %{y:.2f}s" + }; + + const coldTraceInfA4000 = { + x: modelsInfA4000.map(m => inferenceDataA4000[m].name), + y: coldStatsInfA4000.map(s => s.mean), + name: "Cold Start (Just Loaded)", + type: "bar", + marker: { color: "#d62728" }, + error_y: { + type: "data", + symmetric: false, + array: coldStatsInfA4000.map(s => s.errorPlus), + arrayminus: coldStatsInfA4000.map(s => s.errorMinus), + color: "#8b1518", + thickness: 2, + width: 6 + }, + text: coldStatsInfA4000.map(s => s.mean.toFixed(2) + "s"), + textposition: "outside", + textfont: { size: 12, color: "#d62728", weight: "bold" }, + hovertemplate: "%{x}<br>
Cold Start: %{y:.2f}s" + }; + + const speedupsInfA4000 = wakeStatsInfA4000.map((w, i) => { + const reduction = ((coldStatsInfA4000[i].mean - w.mean) / coldStatsInfA4000[i].mean * 100).toFixed(0); + return reduction; + }); + + Plotly.newPlot("plotly-inference-a4000", [wakeTraceInfA4000, coldTraceInfA4000], { + barmode: "group", + bargap: 0.15, + bargroupgap: 0.1, + margin: { l: 60, r: 30, t: 40, b: 50 }, + xaxis: { + title: "", + tickangle: 0 + }, + yaxis: { + title: "Inference Time (seconds)", + range: [0, Math.max(...coldStatsInfA4000.map(s => s.mean + s.errorPlus)) * 1.2] + }, + hovermode: "closest", + legend: { + x: 0.5, + y: 1.15, + xanchor: "center", + yanchor: "top", + orientation: "h" + }, + annotations: modelsInfA4000.map((m, i) => ({ + x: inferenceDataA4000[m].name, + y: coldStatsInfA4000[i].mean + coldStatsInfA4000[i].errorPlus + 0.6, + text: `${speedupsInfA4000[i]}% faster`, + showarrow: false, + font: { size: 11, color: "#2ca02c", weight: "bold" }, + xanchor: "center" + })) + }, {displayModeBar: true, responsive: true}); +}); diff --git a/assets/figures/2025-vllm-sleep-mode/plotly-inference-comparison.js b/assets/figures/2025-vllm-sleep-mode/plotly-inference-comparison.js new file mode 100644 index 0000000..80afa76 --- /dev/null +++ b/assets/figures/2025-vllm-sleep-mode/plotly-inference-comparison.js @@ -0,0 +1,107 @@ +document.addEventListener('DOMContentLoaded', function() { + // Raw data: Wake Inference Time vs Cold Start Inference Time + const inferenceData = { + "ModelA": { + name: "Qwen3-235B-A22B (TP=4)", + wake: [1.8, 1.7, 0.92], + cold: [3.8, 3.7, 3.72] + }, + "ModelB": { + name: "Qwen3-Coder-30B (TP=1)", + wake: [1.0, 0.93, 0.54], + cold: [3.7, 2.9, 2.45] + } + }; + + // Calculate mean and error bars for each model + function calcStats(values) { + const mean = values.reduce((a, b) => a + b, 0) / values.length; + const min = Math.min(...values); + const max = Math.max(...values); + return { mean, errorMinus: mean - min, errorPlus: max - mean }; 
+ } + + // Prepare traces for both models + const models = Object.keys(inferenceData); + const wakeStats = models.map(m => calcStats(inferenceData[m].wake)); + const coldStats = models.map(m => calcStats(inferenceData[m].cold)); + + const wakeTrace = { + x: models.map(m => inferenceData[m].name), + y: wakeStats.map(s => s.mean), + name: "Wake Mode (Warmed Up)", + type: "bar", + marker: { color: "#2ca02c" }, + error_y: { + type: "data", + symmetric: false, + array: wakeStats.map(s => s.errorPlus), + arrayminus: wakeStats.map(s => s.errorMinus), + color: "#1a5e1a", + thickness: 2, + width: 6 + }, + text: wakeStats.map(s => s.mean.toFixed(2) + "s"), + textposition: "outside", + textfont: { size: 12, color: "#2ca02c", weight: "bold" }, + hovertemplate: "%{x}
<br>Wake Mode: %{y:.2f}s" + }; + + const coldTrace = { + x: models.map(m => inferenceData[m].name), + y: coldStats.map(s => s.mean), + name: "Cold Start (Just Loaded)", + type: "bar", + marker: { color: "#d62728" }, + error_y: { + type: "data", + symmetric: false, + array: coldStats.map(s => s.errorPlus), + arrayminus: coldStats.map(s => s.errorMinus), + color: "#8b1518", + thickness: 2, + width: 6 + }, + text: coldStats.map(s => s.mean.toFixed(2) + "s"), + textposition: "outside", + textfont: { size: 12, color: "#d62728", weight: "bold" }, + hovertemplate: "%{x}<br>
Cold Start: %{y:.2f}s" + }; + + // Calculate speedup percentages for annotation + const speedups = wakeStats.map((w, i) => { + const reduction = ((coldStats[i].mean - w.mean) / coldStats[i].mean * 100).toFixed(0); + return reduction; + }); + + Plotly.newPlot("plotly-inference-comparison", [wakeTrace, coldTrace], { + barmode: "group", + bargap: 0.15, + bargroupgap: 0.1, + margin: { l: 60, r: 30, t: 40, b: 50 }, + xaxis: { + title: "", + tickangle: 0 + }, + yaxis: { + title: "Inference Time (seconds)", + range: [0, Math.max(...coldStats.map(s => s.mean + s.errorPlus)) * 1.2] + }, + hovermode: "closest", + legend: { + x: 0.5, + y: 1.15, + xanchor: "center", + yanchor: "top", + orientation: "h" + }, + annotations: models.map((m, i) => ({ + x: inferenceData[m].name, + y: coldStats[i].mean + coldStats[i].errorPlus + 0.3, + text: `${speedups[i]}% faster`, + showarrow: false, + font: { size: 11, color: "#2ca02c", weight: "bold" }, + xanchor: "center" + })) + }, {displayModeBar: true, responsive: true}); +}); diff --git a/assets/figures/2025-vllm-sleep-mode/plotly-level2-inference.js b/assets/figures/2025-vllm-sleep-mode/plotly-level2-inference.js new file mode 100644 index 0000000..48082f2 --- /dev/null +++ b/assets/figures/2025-vllm-sleep-mode/plotly-level2-inference.js @@ -0,0 +1,104 @@ +document.addEventListener('DOMContentLoaded', function() { + // Level 2 inference data + const level2InferenceData = { + "ModelA": { + name: "Qwen3-0.6B", + wake: [0.68, 0.46, 0.44], + cold: [4.66, 3.8, 2.56] + }, + "ModelB": { + name: "Phi-3-vision-128k", + wake: [0.78, 0.77, 0.72], + cold: [6.55, 6.21, 6.15] + } + }; + + function calcStatsLevel2Inf(values) { + const mean = values.reduce((a, b) => a + b, 0) / values.length; + const min = Math.min(...values); + const max = Math.max(...values); + return { mean, errorMinus: mean - min, errorPlus: max - mean }; + } + + const modelsLevel2Inf = Object.keys(level2InferenceData); + const wakeStatsLevel2Inf = modelsLevel2Inf.map(m => 
calcStatsLevel2Inf(level2InferenceData[m].wake)); + const coldStatsLevel2Inf = modelsLevel2Inf.map(m => calcStatsLevel2Inf(level2InferenceData[m].cold)); + + const wakeTraceLevel2Inf = { + x: modelsLevel2Inf.map(m => level2InferenceData[m].name), + y: wakeStatsLevel2Inf.map(s => s.mean), + name: "Wake Mode (Level 2)", + type: "bar", + marker: { color: "#2ca02c" }, + error_y: { + type: "data", + symmetric: false, + array: wakeStatsLevel2Inf.map(s => s.errorPlus), + arrayminus: wakeStatsLevel2Inf.map(s => s.errorMinus), + color: "#1a5e1a", + thickness: 2, + width: 6 + }, + text: wakeStatsLevel2Inf.map(s => s.mean.toFixed(2) + "s"), + textposition: "outside", + textfont: { size: 12, color: "#2ca02c", weight: "bold" }, + hovertemplate: "%{x}
<br>Wake Mode: %{y:.2f}s" + }; + + const coldTraceLevel2Inf = { + x: modelsLevel2Inf.map(m => level2InferenceData[m].name), + y: coldStatsLevel2Inf.map(s => s.mean), + name: "Cold Start", + type: "bar", + marker: { color: "#d62728" }, + error_y: { + type: "data", + symmetric: false, + array: coldStatsLevel2Inf.map(s => s.errorPlus), + arrayminus: coldStatsLevel2Inf.map(s => s.errorMinus), + color: "#8b1518", + thickness: 2, + width: 6 + }, + text: coldStatsLevel2Inf.map(s => s.mean.toFixed(2) + "s"), + textposition: "outside", + textfont: { size: 12, color: "#d62728", weight: "bold" }, + hovertemplate: "%{x}<br>
Cold Start: %{y:.2f}s" + }; + + const speedupsLevel2Inf = wakeStatsLevel2Inf.map((w, i) => { + const reduction = ((coldStatsLevel2Inf[i].mean - w.mean) / coldStatsLevel2Inf[i].mean * 100).toFixed(0); + return reduction; + }); + + Plotly.newPlot("plotly-level2-inference", [wakeTraceLevel2Inf, coldTraceLevel2Inf], { + barmode: "group", + bargap: 0.15, + bargroupgap: 0.1, + margin: { l: 60, r: 30, t: 40, b: 50 }, + xaxis: { + title: "", + tickangle: 0 + }, + yaxis: { + title: "Inference Time (seconds)", + range: [0, Math.max(...coldStatsLevel2Inf.map(s => s.mean + s.errorPlus)) * 1.2] + }, + hovermode: "closest", + legend: { + x: 0.5, + y: 1.15, + xanchor: "center", + yanchor: "top", + orientation: "h" + }, + annotations: modelsLevel2Inf.map((m, i) => ({ + x: level2InferenceData[m].name, + y: coldStatsLevel2Inf[i].mean + coldStatsLevel2Inf[i].errorPlus + 0.4, + text: `${speedupsLevel2Inf[i]}% faster`, + showarrow: false, + font: { size: 11, color: "#2ca02c", weight: "bold" }, + xanchor: "center" + })) + }, {displayModeBar: true, responsive: true}); +}); diff --git a/assets/figures/2025-vllm-sleep-mode/plotly-level2-switching.js b/assets/figures/2025-vllm-sleep-mode/plotly-level2-switching.js new file mode 100644 index 0000000..87d7c18 --- /dev/null +++ b/assets/figures/2025-vllm-sleep-mode/plotly-level2-switching.js @@ -0,0 +1,104 @@ +document.addEventListener('DOMContentLoaded', function() { + // Level 2 switching data + const level2SwitchingData = { + "ModelA": { + name: "Qwen3-0.6B", + wake: [0.91, 0.78, 0.85], + cold: [38.53, 37.21, 38.15] + }, + "ModelB": { + name: "Phi-3-vision-128k", + wake: [2.55, 2.62, 2.58], + cold: [58.52, 57.65, 58.2] + } + }; + + function calcStatsLevel2Switch(values) { + const mean = values.reduce((a, b) => a + b, 0) / values.length; + const min = Math.min(...values); + const max = Math.max(...values); + return { mean, errorMinus: mean - min, errorPlus: max - mean }; + } + + const modelsLevel2Switch = Object.keys(level2SwitchingData); + 
const wakeStatsLevel2Switch = modelsLevel2Switch.map(m => calcStatsLevel2Switch(level2SwitchingData[m].wake)); + const coldStatsLevel2Switch = modelsLevel2Switch.map(m => calcStatsLevel2Switch(level2SwitchingData[m].cold)); + + const wakeTraceLevel2Switch = { + x: modelsLevel2Switch.map(m => level2SwitchingData[m].name), + y: wakeStatsLevel2Switch.map(s => s.mean), + name: "Wake from Sleep (Level 2)", + type: "bar", + marker: { color: "#2ca02c" }, + error_y: { + type: "data", + symmetric: false, + array: wakeStatsLevel2Switch.map(s => s.errorPlus), + arrayminus: wakeStatsLevel2Switch.map(s => s.errorMinus), + color: "#1a5e1a", + thickness: 2, + width: 6 + }, + text: wakeStatsLevel2Switch.map(s => s.mean.toFixed(2) + "s"), + textposition: "outside", + textfont: { size: 12, color: "#2ca02c", weight: "bold" }, + hovertemplate: "%{x}
<br>Wake Time: %{y:.2f}s" + }; + + const coldTraceLevel2Switch = { + x: modelsLevel2Switch.map(m => level2SwitchingData[m].name), + y: coldStatsLevel2Switch.map(s => s.mean), + name: "Cold Start (Fresh Load)", + type: "bar", + marker: { color: "#d62728" }, + error_y: { + type: "data", + symmetric: false, + array: coldStatsLevel2Switch.map(s => s.errorPlus), + arrayminus: coldStatsLevel2Switch.map(s => s.errorMinus), + color: "#8b1518", + thickness: 2, + width: 6 + }, + text: coldStatsLevel2Switch.map(s => s.mean.toFixed(1) + "s"), + textposition: "outside", + textfont: { size: 12, color: "#d62728", weight: "bold" }, + hovertemplate: "%{x}<br>
Cold Start: %{y:.2f}s" + }; + + const speedupsLevel2Switch = wakeStatsLevel2Switch.map((w, i) => { + const speedup = (coldStatsLevel2Switch[i].mean / w.mean).toFixed(0); + return speedup; + }); + + Plotly.newPlot("plotly-level2-switching", [wakeTraceLevel2Switch, coldTraceLevel2Switch], { + barmode: "group", + bargap: 0.15, + bargroupgap: 0.1, + margin: { l: 60, r: 30, t: 40, b: 50 }, + xaxis: { + title: "", + tickangle: 0 + }, + yaxis: { + title: "Switching Time (seconds)", + range: [0, Math.max(...coldStatsLevel2Switch.map(s => s.mean + s.errorPlus)) * 1.15] + }, + hovermode: "closest", + legend: { + x: 0.5, + y: 1.15, + xanchor: "center", + yanchor: "top", + orientation: "h" + }, + annotations: modelsLevel2Switch.map((m, i) => ({ + x: level2SwitchingData[m].name, + y: coldStatsLevel2Switch[i].mean + coldStatsLevel2Switch[i].errorPlus + 3, + text: `${speedupsLevel2Switch[i]}x faster`, + showarrow: false, + font: { size: 11, color: "#2ca02c", weight: "bold" }, + xanchor: "center" + })) + }, {displayModeBar: true, responsive: true}); +}); diff --git a/assets/figures/2025-vllm-sleep-mode/plotly-sleep-levels-comparison.js b/assets/figures/2025-vllm-sleep-mode/plotly-sleep-levels-comparison.js new file mode 100644 index 0000000..9de47fc --- /dev/null +++ b/assets/figures/2025-vllm-sleep-mode/plotly-sleep-levels-comparison.js @@ -0,0 +1,154 @@ +document.addEventListener('DOMContentLoaded', function() { + // Sleep Levels Comparison timing data + const timingDataLevelsComp = { + "Sleep Mode (Level 1)": [ + { event: "A Model Load", duration: 36.27 }, + { event: "A Model Warm Up", duration: 2.53 }, + { event: "B Model Load", duration: 58.24 }, + { event: "B Model Warm Up", duration: 5.95 }, + { event: "A Model Wake up", duration: 0.25 }, + { event: "A Model Prompt", duration: 0.43 }, + { event: "A Model Sleep", duration: 0.09 }, + { event: "B Model Wake Up", duration: 0.82 }, + { event: "B Model Prompt", duration: 0.86 }, + { event: "B Model Sleep", duration: 0.41 }, + { 
event: "A Model Wake up", duration: 0.28 }, + { event: "A Model Prompt", duration: 0.41 }, + { event: "A Model Sleep", duration: 0.1 }, + { event: "B Model Wake Up", duration: 0.82 }, + { event: "B Model Prompt", duration: 0.71 }, + { event: "B Model Sleep", duration: 0.42 }, + { event: "A Model Wake up", duration: 0.25 }, + { event: "A Model Prompt", duration: 0.45 }, + { event: "A Model Sleep", duration: 0.09 }, + { event: "B Model Wake Up", duration: 0.83 }, + { event: "B Model Prompt", duration: 0.71 } + ], + "Sleep Mode (Level 2)": [ + { event: "A Model Load", duration: 38.55 }, + { event: "A Model Warm Up", duration: 2.53 }, + { event: "B Model Load", duration: 61.23 }, + { event: "B Model Warm Up", duration: 5.75 }, + { event: "A Model Wake up", duration: 0.91 }, + { event: "A Model Prompt", duration: 0.68 }, + { event: "A Model Sleep", duration: 0.13 }, + { event: "B Model Wake Up", duration: 2.55 }, + { event: "B Model Prompt", duration: 0.78 }, + { event: "B Model Sleep", duration: 0.46 }, + { event: "A Model Wake up", duration: 0.78 }, + { event: "A Model Prompt", duration: 0.46 }, + { event: "A Model Sleep", duration: 0.12 }, + { event: "B Model Wake Up", duration: 2.62 }, + { event: "B Model Prompt", duration: 0.77 }, + { event: "B Model Sleep", duration: 0.45 }, + { event: "A Model Wake up", duration: 0.85 }, + { event: "A Model Prompt", duration: 0.44 }, + { event: "A Model Sleep", duration: 0.09 }, + { event: "B Model Wake Up", duration: 2.58 }, + { event: "B Model Prompt", duration: 0.72 } + ], + "WITHOUT Sleep Mode": [ + { event: "A Model Load", duration: 38.53 }, + { event: "A Model Prompt", duration: 4.66 }, + { event: "B Model Load", duration: 58.52 }, + { event: "B Model Prompt", duration: 6.55 }, + { event: "A Model Load", duration: 37.21 }, + { event: "A Model Prompt", duration: 3.8 }, + { event: "B Model Load", duration: 57.65 }, + { event: "B Model Prompt", duration: 6.21 }, + { event: "A Model Load", duration: 38.15 }, + { event: "A Model 
Prompt", duration: 2.56 }, + { event: "B Model Load", duration: 58.2 }, + { event: "B Model Prompt", duration: 6.15 } + ] + }; + + // Convert to segment format + function createSegmentsLevelsComp(timingData) { + const segments = []; + + Object.entries(timingData).forEach(([scenario, events]) => { + let cumulativeTime = 0; + + events.forEach(({ event, duration }) => { + const [who, ...stageParts] = event.split(' '); + const stage = stageParts.join(' '); + + let action, category; + if (stage.includes('Load')) { + action = 'Load'; + category = `${who} Load`; + } else if (stage.includes('Wake')) { + action = 'Wake'; + category = `${who} Wake`; + } else if (stage.includes('Prompt')) { + action = 'Prompt'; + category = `${who} Prompt`; + } else if (stage.includes('Sleep')) { + action = 'Sleep'; + category = `${who} Sleep`; + } else if (stage.includes('Warm')) { + action = 'Load'; + category = `${who} Load`; + } + + segments.push({ + scenario, + who, + stage, + action, + start: cumulativeTime, + end: cumulativeTime + duration, + duration, + category + }); + + cumulativeTime += duration; + }); + }); + + return segments; + } + + const segmentsLevelsComp = createSegmentsLevelsComp(timingDataLevelsComp); + const colorMapLevelsComp = {"A Load": "#1f77b4", "B Load": "#ff7f0e", "A Wake": "#2ca02c", "B Wake": "#17becf", "A Sleep": "#9467bd", "B Sleep": "#8c564b", "A Prompt": "#e377c2", "B Prompt": "#7f7f7f"}; + const categoriesLevelsComp = Object.keys(colorMapLevelsComp); + + const xLevelsComp = segmentsLevelsComp.map(d => d.duration); + const baseLevelsComp = segmentsLevelsComp.map(d => d.start); + const yLevelsComp = segmentsLevelsComp.map(d => d.scenario); + const colorsLevelsComp = segmentsLevelsComp.map(d => colorMapLevelsComp[d.category]); + const customLevelsComp = segmentsLevelsComp.map(d => [d.scenario, d.category, d.stage, d.start, d.end]); + + const barsLevelsComp = { + type: "bar", + orientation: "h", + x: xLevelsComp, base: baseLevelsComp, y: yLevelsComp, + marker: { 
color: colorsLevelsComp, line: {width:1, color:"rgba(0,0,0,0.35)"} }, + hovertemplate: + "%{customdata[0]}
<br>%{customdata[1]} — %{customdata[2]}<br>
"+ + "Start %{customdata[3]:.2f}s → End %{customdata[4]:.2f}s<br>
"+ + "%{x:.2f}s", + customdata: customLevelsComp, + showlegend: false + }; + + const legendTracesLevelsComp = categoriesLevelsComp.map(name => ({ + type: "scatter", mode: "markers", x:[null], y:[null], + name, marker: {color: colorMapLevelsComp[name], size: 10}, + hoverinfo:"skip", showlegend:true + })); + + Plotly.newPlot("plotly-sleep-levels-comparison", [barsLevelsComp, ...legendTracesLevelsComp], { + barmode: "overlay", + bargap: 0.05, + margin: {l: 160, r: 30, t: 20, b: 40}, + xaxis: { title: "Time (seconds)", range: [0, 365] }, + yaxis: { + categoryorder: "array", + categoryarray: ["WITHOUT Sleep Mode", "Sleep Mode (Level 2)", "Sleep Mode (Level 1)"] + }, + hovermode: "closest", + dragmode: "pan" + }, {displayModeBar: true, responsive: true}); +}); diff --git a/assets/figures/2025-vllm-sleep-mode/plotly-sleep-mode-a4000.js b/assets/figures/2025-vllm-sleep-mode/plotly-sleep-mode-a4000.js new file mode 100644 index 0000000..d029412 --- /dev/null +++ b/assets/figures/2025-vllm-sleep-mode/plotly-sleep-mode-a4000.js @@ -0,0 +1,134 @@ +document.addEventListener('DOMContentLoaded', function() { + // A4000 GPU timing data + const timingDataA4000 = { + "WITH Sleep Mode (L1)": [ + { event: "A Model Load", duration: 21.01 }, + { event: "A Model Warm up", duration: 2.49 }, + { event: "B Model Load", duration: 46.01 }, + { event: "B Model Warm up", duration: 7.37 }, + { event: "A Model Wake up", duration: 0.11 }, + { event: "A Model Prompt", duration: 0.44 }, + { event: "A Model Sleep", duration: 0.13 }, + { event: "B Model Wake Up", duration: 0.8 }, + { event: "B Model Prompt", duration: 2.04 }, + { event: "B Model Sleep", duration: 0.68 }, + { event: "A Model Wake up", duration: 0.1 }, + { event: "A Model Prompt", duration: 0.43 }, + { event: "A Model Sleep", duration: 0.13 }, + { event: "B Model Wake Up", duration: 0.8 }, + { event: "B Model Prompt", duration: 1.73 }, + { event: "B Model Sleep", duration: 0.68 }, + { event: "A Model Wake up", duration: 0.1 }, + { 
event: "A Model Prompt", duration: 0.43 }, + { event: "A Model Sleep", duration: 0.13 }, + { event: "B Model Wake Up", duration: 0.8 }, + { event: "B Model Prompt", duration: 1.61 } + ], + "WITHOUT Sleep Mode": [ + { event: "A Model Load", duration: 21.04 }, + { event: "A Model Prompt", duration: 2.64 }, + { event: "B Model Load", duration: 46.01 }, + { event: "B Model Prompt", duration: 9.78 }, + { event: "A Model Load", duration: 20.98 }, + { event: "A Model Prompt", duration: 2.5 }, + { event: "B Model Load", duration: 46.02 }, + { event: "B Model Prompt", duration: 9.01 }, + { event: "A Model Load", duration: 20.98 }, + { event: "A Model Prompt", duration: 2.63 }, + { event: "B Model Load", duration: 46.02 }, + { event: "B Model Prompt", duration: 9.79 } + ] + }; + + // Convert simplified data to full segment format + function createSegmentsA4000(timingData) { + const segments = []; + + Object.entries(timingData).forEach(([scenario, events]) => { + let cumulativeTime = 0; + + events.forEach(({ event, duration }) => { + const [who, ...stageParts] = event.split(' '); + const stage = stageParts.join(' '); + + // Determine action and category from stage + let action, category; + if (stage.includes('Load')) { + action = 'Load'; + category = `${who} Load`; + } else if (stage.includes('Wake')) { + action = 'Wake'; + category = `${who} Wake`; + } else if (stage.includes('Prompt')) { + action = 'Prompt'; + category = `${who} Prompt`; + } else if (stage.includes('Sleep')) { + action = 'Sleep'; + category = `${who} Sleep`; + } else if (stage.includes('Warm up')) { + action = 'Load'; + category = `${who} Load`; + } + + segments.push({ + scenario, + who, + stage, + action, + start: cumulativeTime, + end: cumulativeTime + duration, + duration, + category + }); + + cumulativeTime += duration; + }); + }); + + return segments; + } + + const segmentsA4000 = createSegmentsA4000(timingDataA4000); + const colorMapA4000 = {"A Load": "#1f77b4", "B Load": "#ff7f0e", "A Wake": 
"#2ca02c", "B Wake": "#17becf", "A Sleep": "#9467bd", "B Sleep": "#8c564b", "A Prompt": "#e377c2", "B Prompt": "#7f7f7f"}; + const categoriesA4000 = Object.keys(colorMapA4000); + + // Build arrays for a single stacked-horizontal bar trace using "base" + const xA4000 = segmentsA4000.map(d => d.duration); + const baseA4000 = segmentsA4000.map(d => d.start); + const yA4000 = segmentsA4000.map(d => d.scenario); + const colorsA4000 = segmentsA4000.map(d => colorMapA4000[d.category]); + const customA4000 = segmentsA4000.map(d => [d.scenario, d.category, d.stage, d.start, d.end]); + + const barsA4000 = { + type: "bar", + orientation: "h", + x: xA4000, base: baseA4000, y: yA4000, + marker: { color: colorsA4000, line: {width:1, color:"rgba(0,0,0,0.35)"} }, + hovertemplate: + "%{customdata[0]}
<br>%{customdata[1]} — %{customdata[2]}<br>
"+ + "Start %{customdata[3]:.2f}s → End %{customdata[4]:.2f}s<br>
"+ + "%{x:.2f}s", + customdata: customA4000, + showlegend: false + }; + + // Legend-only dummies to produce a clean 8-item legend + const legendTracesA4000 = categoriesA4000.map(name => ({ + type: "scatter", mode: "markers", x:[null], y:[null], + name, marker: {color: colorMapA4000[name], size: 10}, + hoverinfo:"skip", showlegend:true + })); + + Plotly.newPlot("plotly-sleep-mode-a4000", [barsA4000, ...legendTracesA4000], { + barmode: "overlay", + bargap: 0.05, + margin: {l: 140, r: 30, t: 20, b: 40}, + xaxis: { title: "Time (seconds)", range: [0, 235] }, + yaxis: { + categoryorder: "array", + categoryarray: ["WITHOUT Sleep Mode", "WITH Sleep Mode (L1)"] + }, + hovermode: "closest", + dragmode: "pan" + }, {displayModeBar: true, responsive: true}); +}); diff --git a/assets/figures/2025-vllm-sleep-mode/plotly-sleep-mode.js b/assets/figures/2025-vllm-sleep-mode/plotly-sleep-mode.js new file mode 100644 index 0000000..ef92aa3 --- /dev/null +++ b/assets/figures/2025-vllm-sleep-mode/plotly-sleep-mode.js @@ -0,0 +1,131 @@ +document.addEventListener('DOMContentLoaded', function() { + const timingData = { + "WITH Sleep Mode (L1)": [ + { event: "A Model Load", duration: 97.61 }, + { event: "A Model Warm up", duration: 2.38 }, + { event: "B Model Load", duration: 47.63 }, + { event: "B Model Warm up", duration: 2.42 }, + { event: "A Model Wake up", duration: 5.66 }, + { event: "A Model Prompt", duration: 1.8 }, + { event: "A Model Sleep", duration: 6.01 }, + { event: "B Model Wake Up", duration: 2.89 }, + { event: "B Model Prompt", duration: 1 }, + { event: "B Model Sleep", duration: 2.78 }, + { event: "A Model Wake up", duration: 5.29 }, + { event: "A Model Prompt", duration: 1.7 }, + { event: "A Model Sleep", duration: 5.78 }, + { event: "B Model Wake Up", duration: 2.86 }, + { event: "B Model Prompt", duration: 0.93 }, + { event: "B Model Sleep", duration: 2.78 }, + { event: "A Model Wake up", duration: 5.27 }, + { event: "A Model Prompt", duration: 0.92 }, + { event: "A 
Model Sleep", duration: 5.89 }, + { event: "B Model Wake Up", duration: 2.85 }, + { event: "B Model Prompt", duration: 0.54 } + ], + "WITHOUT Sleep Mode": [ + { event: "A Model Load", duration: 97.9 }, + { event: "A Model Prompt", duration: 3.8 }, + { event: "B Model Load", duration: 47.33 }, + { event: "B Model Prompt", duration: 3.7 }, + { event: "A Model Load", duration: 97.4 }, + { event: "A Model Prompt", duration: 3.7 }, + { event: "B Model Load", duration: 47.47 }, + { event: "B Model Prompt", duration: 2.9 }, + { event: "A Model Load", duration: 97.71 }, + { event: "A Model Prompt", duration: 3.72 }, + { event: "B Model Load", duration: 47.46 }, + { event: "B Model Prompt", duration: 2.45 } + ] + }; + + function createSegments(timingData) { + const segments = []; + + Object.entries(timingData).forEach(([scenario, events]) => { + let cumulativeTime = 0; + + events.forEach(({ event, duration }) => { + const [who, ...stageParts] = event.split(' '); + const stage = stageParts.join(' '); + + // Determine action and category from stage + let action, category; + if (stage.includes('Load')) { + action = 'Load'; + category = `${who} Load`; + } else if (stage.includes('Wake')) { + action = 'Wake'; + category = `${who} Wake`; + } else if (stage.includes('Prompt')) { + action = 'Prompt'; + category = `${who} Prompt`; + } else if (stage.includes('Sleep')) { + action = 'Sleep'; + category = `${who} Sleep`; + } else if (stage.includes('Warm up')) { + action = 'Load'; + category = `${who} Load`; + } + + segments.push({ + scenario, + who, + stage, + action, + start: cumulativeTime, + end: cumulativeTime + duration, + duration, + category + }); + + cumulativeTime += duration; + }); + }); + + return segments; + } + + const segments = createSegments(timingData); + const colorMap = {"A Load": "#1f77b4", "B Load": "#ff7f0e", "A Wake": "#2ca02c", "B Wake": "#17becf", "A Sleep": "#9467bd", "B Sleep": "#8c564b", "A Prompt": "#e377c2", "B Prompt": "#7f7f7f"}; + const categories = 
Object.keys(colorMap); + + // Build arrays for a single stacked-horizontal bar trace using "base" + const x = segments.map(d => d.duration); + const base = segments.map(d => d.start); + const y = segments.map(d => d.scenario); + const colors = segments.map(d => colorMap[d.category]); + const custom = segments.map(d => [d.scenario, d.category, d.stage, d.start, d.end]); + + const bars = { + type: "bar", + orientation: "h", + x, base, y, + marker: { color: colors, line: {width:1, color:"rgba(0,0,0,0.35)"} }, + hovertemplate: + "%{customdata[0]}<br>%{customdata[1]} — %{customdata[2]}<br>"+ + "Start %{customdata[3]:.2f}s → End %{customdata[4]:.2f}s<br>"+ + "%{x:.2f}s", + customdata: custom, + showlegend: false + }; + + const legendTraces = categories.map(name => ({ + type: "scatter", mode: "markers", x:[null], y:[null], + name, marker: {color: colorMap[name], size: 10}, + hoverinfo:"skip", showlegend:true + })); + + Plotly.newPlot("plotly-sleep-mode", [bars, ...legendTraces], { + barmode: "overlay", + bargap: 0.05, + margin: {l: 140, r: 30, t: 20, b: 40}, + xaxis: { title: "Time (seconds)", range: [0, 478.32] }, + yaxis: { + categoryorder: "array", + categoryarray: ["WITHOUT Sleep Mode", "WITH Sleep Mode (L1)"] + }, + hovermode: "closest", + dragmode: "pan" + }, {displayModeBar: true, responsive: true}); +}); diff --git a/assets/figures/2025-vllm-sleep-mode/plotly-switching-a4000.js b/assets/figures/2025-vllm-sleep-mode/plotly-switching-a4000.js new file mode 100644 index 0000000..4013f62 --- /dev/null +++ b/assets/figures/2025-vllm-sleep-mode/plotly-switching-a4000.js @@ -0,0 +1,104 @@ +document.addEventListener('DOMContentLoaded', function() { + // A4000 Switching data + const switchingDataA4000 = { + "ModelA": { + name: "Qwen3-0.6B", + wake: [0.11, 0.1, 0.1], + cold: [21.04, 20.98, 20.98] + }, + "ModelB": { + name: "Phi-3-vision-128k(4B)", + wake: [0.8, 0.8, 0.8], + cold: [46.01, 46.02, 46.02] + } + }; + + function calcStatsSwitchA4000(values) { + const mean = values.reduce((a, b) => a + b, 0) / values.length; + const min = Math.min(...values); + const max = Math.max(...values); + return { mean, errorMinus: mean - min, errorPlus: max - mean }; + } + + const modelsSwitchA4000 = Object.keys(switchingDataA4000); + const wakeStatsSwitchA4000 = modelsSwitchA4000.map(m => calcStatsSwitchA4000(switchingDataA4000[m].wake)); + const coldStatsSwitchA4000 = modelsSwitchA4000.map(m => calcStatsSwitchA4000(switchingDataA4000[m].cold)); + + const wakeTraceSwitchA4000 = { + x: modelsSwitchA4000.map(m => switchingDataA4000[m].name), + y: wakeStatsSwitchA4000.map(s => s.mean), + name: "Wake from Sleep", + type: "bar", + 
marker: { color: "#2ca02c" }, + error_y: { + type: "data", + symmetric: false, + array: wakeStatsSwitchA4000.map(s => s.errorPlus), + arrayminus: wakeStatsSwitchA4000.map(s => s.errorMinus), + color: "#1a5e1a", + thickness: 2, + width: 6 + }, + text: wakeStatsSwitchA4000.map(s => s.mean.toFixed(2) + "s"), + textposition: "outside", + textfont: { size: 12, color: "#2ca02c", weight: "bold" }, + hovertemplate: "%{x}<br>Wake Time: %{y:.2f}s" + }; + + const coldTraceSwitchA4000 = { + x: modelsSwitchA4000.map(m => switchingDataA4000[m].name), + y: coldStatsSwitchA4000.map(s => s.mean), + name: "Cold Start (Fresh Load)", + type: "bar", + marker: { color: "#d62728" }, + error_y: { + type: "data", + symmetric: false, + array: coldStatsSwitchA4000.map(s => s.errorPlus), + arrayminus: coldStatsSwitchA4000.map(s => s.errorMinus), + color: "#8b1518", + thickness: 2, + width: 6 + }, + text: coldStatsSwitchA4000.map(s => s.mean.toFixed(1) + "s"), + textposition: "outside", + textfont: { size: 12, color: "#d62728", weight: "bold" }, + hovertemplate: "%{x}<br>Cold Start: %{y:.2f}s" + }; + + const speedupsSwitchA4000 = wakeStatsSwitchA4000.map((w, i) => { + const speedup = (coldStatsSwitchA4000[i].mean / w.mean).toFixed(0); + return speedup; + }); + + Plotly.newPlot("plotly-switching-a4000", [wakeTraceSwitchA4000, coldTraceSwitchA4000], { + barmode: "group", + bargap: 0.15, + bargroupgap: 0.1, + margin: { l: 60, r: 30, t: 40, b: 50 }, + xaxis: { + title: "", + tickangle: 0 + }, + yaxis: { + title: "Switching Time (seconds)", + range: [0, Math.max(...coldStatsSwitchA4000.map(s => s.mean + s.errorPlus)) * 1.15] + }, + hovermode: "closest", + legend: { + x: 0.5, + y: 1.15, + xanchor: "center", + yanchor: "top", + orientation: "h" + }, + annotations: modelsSwitchA4000.map((m, i) => ({ + x: switchingDataA4000[m].name, + y: coldStatsSwitchA4000[i].mean + coldStatsSwitchA4000[i].errorPlus + 3, + text: `${speedupsSwitchA4000[i]}x faster`, + showarrow: false, + font: { size: 11, color: "#2ca02c", weight: "bold" }, + xanchor: "center" + })) + }, {displayModeBar: true, responsive: true}); +}); diff --git a/assets/figures/2025-vllm-sleep-mode/plotly-switching-comparison.js b/assets/figures/2025-vllm-sleep-mode/plotly-switching-comparison.js new file mode 100644 index 0000000..3130701 --- /dev/null +++ b/assets/figures/2025-vllm-sleep-mode/plotly-switching-comparison.js @@ -0,0 +1,107 @@ +document.addEventListener('DOMContentLoaded', function() { + // Raw data: Wake Time vs Cold Start Time + const switchingData = { + "ModelA": { + name: "Qwen3-235B-A22B (TP=4)", + wake: [5.66, 5.29, 5.27], + cold: [97.9, 97.4, 97.71] + }, + "ModelB": { + name: "Qwen3-Coder-30B (TP=1)", + wake: [2.89, 2.86, 2.85], + cold: [47.33, 47.47, 47.46] + } + }; + + // Calculate mean and error bars for each model + function calcStatsSwitch(values) { + const mean = values.reduce((a, b) => a + b, 0) / values.length; + const min = Math.min(...values); + const max = Math.max(...values); + return { mean, errorMinus: mean - min, errorPlus: max - mean }; + } + + // 
Prepare traces for both models + const modelsSwitch = Object.keys(switchingData); + const wakeStatsSwitch = modelsSwitch.map(m => calcStatsSwitch(switchingData[m].wake)); + const coldStatsSwitch = modelsSwitch.map(m => calcStatsSwitch(switchingData[m].cold)); + + const wakeTraceSwitch = { + x: modelsSwitch.map(m => switchingData[m].name), + y: wakeStatsSwitch.map(s => s.mean), + name: "Wake from Sleep", + type: "bar", + marker: { color: "#2ca02c" }, + error_y: { + type: "data", + symmetric: false, + array: wakeStatsSwitch.map(s => s.errorPlus), + arrayminus: wakeStatsSwitch.map(s => s.errorMinus), + color: "#1a5e1a", + thickness: 2, + width: 6 + }, + text: wakeStatsSwitch.map(s => s.mean.toFixed(2) + "s"), + textposition: "outside", + textfont: { size: 12, color: "#2ca02c", weight: "bold" }, + hovertemplate: "%{x}<br>Wake Time: %{y:.2f}s" + }; + + const coldTraceSwitch = { + x: modelsSwitch.map(m => switchingData[m].name), + y: coldStatsSwitch.map(s => s.mean), + name: "Cold Start (Fresh Load)", + type: "bar", + marker: { color: "#d62728" }, + error_y: { + type: "data", + symmetric: false, + array: coldStatsSwitch.map(s => s.errorPlus), + arrayminus: coldStatsSwitch.map(s => s.errorMinus), + color: "#8b1518", + thickness: 2, + width: 6 + }, + text: coldStatsSwitch.map(s => s.mean.toFixed(1) + "s"), + textposition: "outside", + textfont: { size: 12, color: "#d62728", weight: "bold" }, + hovertemplate: "%{x}<br>Cold Start: %{y:.2f}s" + }; + + // Calculate speedup multiples for annotation + const speedupsSwitch = wakeStatsSwitch.map((w, i) => { + const speedup = (coldStatsSwitch[i].mean / w.mean).toFixed(0); + return speedup; + }); + + Plotly.newPlot("plotly-switching-comparison", [wakeTraceSwitch, coldTraceSwitch], { + barmode: "group", + bargap: 0.15, + bargroupgap: 0.1, + margin: { l: 60, r: 30, t: 40, b: 50 }, + xaxis: { + title: "", + tickangle: 0 + }, + yaxis: { + title: "Switching Time (seconds)", + range: [0, Math.max(...coldStatsSwitch.map(s => s.mean + s.errorPlus)) * 1.15] + }, + hovermode: "closest", + legend: { + x: 0.5, + y: 1.15, + xanchor: "center", + yanchor: "top", + orientation: "h" + }, + annotations: modelsSwitch.map((m, i) => ({ + x: switchingData[m].name, + y: coldStatsSwitch[i].mean + coldStatsSwitch[i].errorPlus + 5, + text: `${speedupsSwitch[i]}x faster`, + showarrow: false, + font: { size: 11, color: "#2ca02c", weight: "bold" }, + xanchor: "center" + })) + }, {displayModeBar: true, responsive: true}); +}); diff --git a/assets/figures/2025-vllm-sleep-mode/sleepmode.png b/assets/figures/2025-vllm-sleep-mode/sleepmode.png new file mode 100644 index 0000000..4a918ec Binary files /dev/null and b/assets/figures/2025-vllm-sleep-mode/sleepmode.png differ