From 3b15e9c06efd040519c6c07793fe97a5b827b748 Mon Sep 17 00:00:00 2001
From: Claude
Date: Wed, 25 Feb 2026 12:24:18 +0000
Subject: [PATCH] docs(test): add Phase 7B.6 latency benchmark protocol
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

5 representative tasks testing each 7B optimization:
- Task A: Simple chat → 7B.2 model routing (< 5s, fast model)
- Task B: Multi-tool → 7B.1 speculative execution (< 20s, 2 tools/1 iter)
- Task C: GitHub read → 7B.3+7B.4 prefetch+injection (< 30s, ≤ 3 iter)
- Task D: Orchestra → all optimizations end-to-end (< 3min, ≤ 15 iter)
- Task E: Reasoning → 7B.5 streaming feedback (first update < 3s)

Includes pass/conditional/fail criteria and comparison notes.

https://claude.ai/code/session_01K2mQTABDGY7DnnposPdDjw
---
 TEST_PROTOCOL.md | 158 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 158 insertions(+)

diff --git a/TEST_PROTOCOL.md b/TEST_PROTOCOL.md
index 0df43aba3..8e0cfea82 100644
--- a/TEST_PROTOCOL.md
+++ b/TEST_PROTOCOL.md
@@ -136,3 +136,161 @@ Copy this table, fill in as you go:
 ```
 
 **Pass criteria:** All 40 tests pass. If any fail, note the exact response and which model was active.
+
+---
+
+## 11. Phase 7B.6 — Latency Benchmark Protocol
+
+> **Human checkpoint 7B.6:** Benchmark before/after — measure end-to-end latency on 5 representative tasks.
+>
+> Validates that Phase 7B speed optimizations (speculative execution, model routing,
+> file prefetching, iteration reduction, streaming feedback) deliver real-world improvement.
+
+### Prerequisites
+
+- Deploy the current build with all 7B optimizations enabled
+- Use Telegram (production path — Workers + Durable Objects)
+- Run `/new` before each test to start with clean context
+- Note the Cloudflare region (Workers dashboard → Analytics)
+
+### What to Record
+
+For each task, capture from the final response (header and footer):
+
+| Field | Source |
+|-------|--------|
+| **Wall-clock (s)** | `⏱️ Xs` in response footer |
+| **Iterations** | `(N iter)` in response footer |
+| **Tools used** | `[Used N tool(s): ...]` header |
+| **Model** | `🤖 /alias` in footer |
+| **Token cost** | Cost footer (if shown) |
+
+Also note from the Telegram UX:
+- **Time-to-first-update**: seconds from send until first "⏳" status appears
+- **Progress clarity**: could you tell what the bot was doing? (Y/N)
+
+### The 5 Benchmark Tasks
+
+#### Task A: Simple Chat (tests 7B.2 — model routing)
+
+```
+/use auto
+What is the capital of France?
+```
+
+| Metric | Expected |
+|--------|----------|
+| Wall-clock | < 5s |
+| Iterations | 1 |
+| Tools | 0 |
+| Model | mini, flash, or haiku (NOT deep/gpt/sonnet) |
+
+**What 7B.2 does:** Routes simple queries to a fast model instead of the default heavyweight.
+**Pass:** Response arrives in < 5s AND the model shown is a fast candidate (mini/flash/haiku).
+
+---
+
+#### Task B: Multi-Tool Research (tests 7B.1 — speculative execution)
+
+```
+/use deep
+What's the weather in Prague and what's Bitcoin trading at?
+```
+
+| Metric | Expected |
+|--------|----------|
+| Wall-clock | < 20s |
+| Iterations | 1–2 |
+| Tools | 2 (get_weather, get_crypto) |
+
+**What 7B.1 does:** Starts tool execution during streaming, so both tools should fire in parallel before the full response arrives.
+**Pass:** Both tools are called in the same iteration AND wall-clock is under 20s; if you have a single-tool timing, the total should be noticeably below 2× that figure.
+
+---
+
+#### Task C: GitHub File Reading (tests 7B.3 + 7B.4 — prefetch + injection)
+
+```
+/use deep
+Read the README.md and package.json from PetrAnto/moltworker and summarize the project stack
+```
+
+| Metric | Expected (with 7B) | Baseline (without 7B) |
+|--------|--------------------|-----------------------|
+| Wall-clock | < 30s | ~45–60s |
+| Iterations | 1–3 | 4–6 |
+| Tools | 2–4 | 4–6 |
+
+**What 7B.3 + 7B.4 do:** File paths are extracted from the user message, GitHub reads start in parallel with the first LLM call, and file contents are injected into context at the plan→work transition — so the model doesn't need separate `github_read_file` iterations.
+**Pass:** Iteration count ≤ 3 AND wall-clock under 30s.
+
+---
+
+#### Task D: Orchestra Run (tests all 7B optimizations end-to-end)
+
+Pick a repo with a ROADMAP.md (e.g., one previously initialized with `/orchestra init`):
+
+```
+/orchestra run /
+```
+
+| Metric | Expected (with 7B) | Baseline (without 7B) |
+|--------|--------------------|-----------------------|
+| Wall-clock | < 3 min | ~4–6 min |
+| Iterations | 8–15 | 15–25 |
+| Tools | 5–15 | 10–25 |
+
+**What the full stack does:** File prefetch on roadmap/work-log reads, speculative execution on parallel-safe tool calls, fewer iterations due to injected file contents, streaming progress updates throughout.
+**Pass:** Iteration count ≤ 15 AND progress messages showed meaningful context (tool names, plan steps).
+
+---
+
+#### Task E: Non-Tool Reasoning (tests 7B.5 — streaming feedback + baseline)
+
+```
+/use deep
+think:high Compare the architectural trade-offs between microservices and monoliths for a team of 5 developers building a SaaS product. Consider deployment complexity, debugging, and team velocity.
+```
+
+| Metric | Expected |
+|--------|----------|
+| Wall-clock | < 30s |
+| Iterations | 1 |
+| Tools | 0 |
+| Time-to-first-update | < 3s |
+
+**What 7B.5 does:** Even with no tools, the streaming feedback shows the user a "⏳ 📋 Planning…" or "⏳ Thinking…" status within seconds.
+**Pass:** First status message appears in < 3s AND the final response is substantive.
+
+---
+
+### Results Table
+
+Copy and fill in:
+
+```
+| Task | Wall-clock | Iterations | Tools | Model | First-update | Progress clear? | Pass? | Notes |
+|------|-----------|------------|-------|-------|-------------|----------------|-------|-------|
+| A: Simple chat | | | | | | | | |
+| B: Multi-tool | | | | | | | | |
+| C: GitHub read | | | | | | | | |
+| D: Orchestra | | | | | | | | |
+| E: Reasoning | | | | | | | | |
+```
+
+### Pass Criteria
+
+| Level | Requirement |
+|-------|-------------|
+| **PASS** | All 5 tasks meet their individual thresholds |
+| **CONDITIONAL PASS** | 4/5 tasks pass and the failing task is within 1.5× its threshold |
+| **FAIL** | 2+ tasks exceed their thresholds, or any task exceeds 2× its threshold |
+
+### Comparison Notes
+
+If you have baseline measurements from before Phase 7B (pre-Feb 2026), record them here for delta analysis. Key metrics to compare:
+
+- **Task C iteration count**: Should drop from ~5–6 to ~2–3 (7B.4's main win)
+- **Task B wall-clock**: Should drop from ~25s to ~15s (7B.1's parallel tool execution)
+- **Task A model**: Should route to mini/flash/haiku instead of the default model (7B.2)
+- **Task D iteration count**: Should drop by ~40% (compound effect of all optimizations)
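
Reviewer note (not part of the patch): the Pass Criteria table can be applied mechanically once the results table is filled in. The following is a minimal Python sketch under stated assumptions. The names `TaskResult` and `overall_verdict` are hypothetical, and each task is reduced to a pass flag plus the largest measured/threshold ratio among its numeric metrics. The table does not say how to classify a single failing task that lands between 1.5× and 2× its threshold; the sketch treats that case as FAIL, which is an assumption rather than something the protocol states.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class TaskResult:
    name: str           # e.g. "C: GitHub read"
    passed: bool        # did the task meet its own "Pass" line?
    worst_ratio: float  # largest measured/threshold ratio among numeric metrics


def overall_verdict(results: list[TaskResult]) -> str:
    """Map per-task results onto PASS / CONDITIONAL PASS / FAIL."""
    failures = [r for r in results if not r.passed]
    if not failures:
        return "PASS"
    # FAIL row: two or more failing tasks, or any task beyond 2x its threshold.
    if len(failures) >= 2 or any(r.worst_ratio > 2.0 for r in results):
        return "FAIL"
    # CONDITIONAL PASS row: exactly one failure, within 1.5x its threshold.
    if failures[0].worst_ratio <= 1.5:
        return "CONDITIONAL PASS"
    # Assumption: one failure between 1.5x and 2x is not covered by the table;
    # it is treated as FAIL here.
    return "FAIL"


if __name__ == "__main__":
    # Hypothetical numbers: Task C took 36s against a 30s threshold (ratio 1.2).
    example = [
        TaskResult("A: Simple chat", True, 0.6),
        TaskResult("B: Multi-tool", True, 0.8),
        TaskResult("C: GitHub read", False, 1.2),
        TaskResult("D: Orchestra", True, 0.9),
        TaskResult("E: Reasoning", True, 0.5),
    ]
    print(overall_verdict(example))  # CONDITIONAL PASS
```

A task that fails only on a non-numeric check (for example, Task A routing to the wrong model) has no natural ratio, so the caller has to decide what `worst_ratio` to record for it.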