From 3b15e9c06efd040519c6c07793fe97a5b827b748 Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Wed, 25 Feb 2026 12:24:18 +0000
Subject: [PATCH] docs(test): add Phase 7B.6 latency benchmark protocol
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

5 representative tasks testing each 7B optimization:
- Task A: Simple chat → 7B.2 model routing (< 5s, fast model)
- Task B: Multi-tool → 7B.1 speculative execution (< 20s, 2 tools/1 iter)
- Task C: GitHub read → 7B.3+7B.4 prefetch+injection (< 30s, ≤ 3 iter)
- Task D: Orchestra → all optimizations end-to-end (< 3min, ≤ 15 iter)
- Task E: Reasoning → 7B.5 streaming feedback (first update < 3s)

Includes pass/conditional/fail criteria and comparison notes.

https://claude.ai/code/session_01K2mQTABDGY7DnnposPdDjw
---
 TEST_PROTOCOL.md | 158 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 158 insertions(+)
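
Reviewer note: the Pass Criteria table added by this patch can be read as a
small decision rule. The sketch below is one possible reading, not part of the
protocol itself. The TaskResult type, its field names, and the sample numbers
are hypothetical, and it assumes that any run which is neither PASS nor
CONDITIONAL PASS counts as FAIL.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    name: str         # task label, e.g. "C: GitHub read"
    measured: float   # measured value for the task's numeric limit (e.g. seconds)
    threshold: float  # the limit from the task's "Expected" column
    passed: bool      # outcome of the task's own "Pass:" line


def overall_verdict(results: list[TaskResult]) -> str:
    """Map per-task results to PASS / CONDITIONAL PASS / FAIL."""
    failing = [r for r in results if not r.passed]
    if not failing:
        return "PASS"
    # Exactly one failing task that stays within 1.5x of its threshold.
    if len(failing) == 1 and failing[0].measured <= 1.5 * failing[0].threshold:
        return "CONDITIONAL PASS"
    # Assumption: everything else is treated as FAIL.
    return "FAIL"


# Illustrative values only, not real measurements.
example = [
    TaskResult("A: Simple chat", 3.0, 5.0, True),
    TaskResult("B: Multi-tool", 14.0, 20.0, True),
    TaskResult("C: GitHub read", 38.0, 30.0, False),
    TaskResult("D: Orchestra", 150.0, 180.0, True),
    TaskResult("E: Reasoning", 2.0, 3.0, True),
]
print(overall_verdict(example))  # CONDITIONAL PASS (38s is within 1.5 x 30s)
```

Keeping the per-task pass flag separate from the numeric ratio matters because
some criteria (such as the routed model in Task A) are not numeric.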

diff --git a/TEST_PROTOCOL.md b/TEST_PROTOCOL.md
index 0df43aba3..8e0cfea82 100644
--- a/TEST_PROTOCOL.md
+++ b/TEST_PROTOCOL.md
@@ -136,3 +136,161 @@ Copy this table, fill in as you go:
 ```
 
 **Pass criteria:** All 40 tests pass. If any fail, note the exact response and which model was active.
+
+---
+
+## 11. Phase 7B.6 — Latency Benchmark Protocol
+
+> **Human checkpoint 7B.6:** Benchmark before/after — measure end-to-end latency on 5 representative tasks.
+>
+> Validates that Phase 7B speed optimizations (speculative execution, model routing,
+> file prefetching, iteration reduction, streaming feedback) deliver real-world improvement.
+
+### Prerequisites
+
+- Deploy the current build with all 7B optimizations enabled
+- Use Telegram (production path — Workers + Durable Objects)
+- Run `/new` before each test to start with clean context
+- Note the Cloudflare region (Workers dashboard → Analytics)
+
+### What to Record
+
+For each task, capture from the final response footer:
+
+| Field | Source |
+|-------|--------|
+| **Wall-clock (s)** | `⏱️ Xs` in response footer |
+| **Iterations** | `(N iter)` in response footer |
+| **Tools used** | `[Used N tool(s): ...]` header |
+| **Model** | `🤖 /alias` in footer |
+| **Token cost** | Cost footer (if shown) |
+
+Also note from the Telegram UX:
+- **Time-to-first-update**: seconds from send until first "⏳" status appears
+- **Progress clarity**: could you tell what the bot was doing? (Y/N)
+
+### The 5 Benchmark Tasks
+
+#### Task A: Simple Chat (tests 7B.2 — model routing)
+
+```
+/use auto
+What is the capital of France?
+```
+
+| Metric | Expected |
+|--------|----------|
+| Wall-clock | < 5s |
+| Iterations | 1 |
+| Tools | 0 |
+| Model | mini, flash, or haiku (NOT deep/gpt/sonnet) |
+
+**What 7B.2 does:** Routes simple queries to a fast model instead of the default heavyweight.
+**Pass:** Response arrives in < 5s AND model shown is a fast candidate (mini/flash/haiku).
+
+---
+
+#### Task B: Multi-Tool Research (tests 7B.1 — speculative execution)
+
+```
+/use deep
+What's the weather in Prague and what's Bitcoin trading at?
+```
+
+| Metric | Expected |
+|--------|----------|
+| Wall-clock | < 20s |
+| Iterations | 1–2 |
+| Tools | 2 (get_weather, get_crypto) |
+
+**What 7B.1 does:** Starts tool execution during streaming — both tools should fire in parallel before the full response arrives.
+**Pass:** Both tools called in a single iteration AND wall-clock under 20s, which should be noticeably lower than 2× the single-tool time.
+
+---
+
+#### Task C: GitHub File Reading (tests 7B.3 + 7B.4 — prefetch + injection)
+
+```
+/use deep
+Read the README.md and package.json from PetrAnto/moltworker and summarize the project stack
+```
+
+| Metric | Expected (with 7B) | Baseline (without 7B) |
+|--------|--------------------|-----------------------|
+| Wall-clock | < 30s | ~45–60s |
+| Iterations | 1–3 | 4–6 |
+| Tools | 2–4 | 4–6 |
+
+**What 7B.3 + 7B.4 do:** File paths are extracted from the user message, GitHub reads start in parallel with the first LLM call, and file contents are injected into context at the plan→work transition — so the model doesn't need separate `github_read_file` iterations.
+**Pass:** Iteration count ≤ 3 AND wall-clock under 30s.
+
+---
+
+#### Task D: Orchestra Run (tests all 7B optimizations end-to-end)
+
+Pick a repo with a ROADMAP.md (e.g., one previously initialized with `/orchestra init`):
+
+```
+/orchestra run <owner>/<repo>
+```
+
+| Metric | Expected (with 7B) | Baseline (without 7B) |
+|--------|--------------------|-----------------------|
+| Wall-clock | < 3 min | ~4–6 min |
+| Iterations | 8–15 | 15–25 |
+| Tools | 5–15 | 10–25 |
+
+**What the full stack does:** File prefetch on roadmap/work-log reads, speculative execution on parallel-safe tool calls, fewer iterations due to injected file contents, streaming progress updates throughout.
+**Pass:** Iteration count ≤ 15 AND progress messages showed meaningful context (tool names, plan steps).
+
+---
+
+#### Task E: Non-Tool Reasoning (tests 7B.5 — streaming feedback + baseline)
+
+```
+/use deep
+think:high Compare the architectural trade-offs between microservices and monoliths for a team of 5 developers building a SaaS product. Consider deployment complexity, debugging, and team velocity.
+```
+
+| Metric | Expected |
+|--------|----------|
+| Wall-clock | < 30s |
+| Iterations | 1 |
+| Tools | 0 |
+| Time-to-first-update | < 3s |
+
+**What 7B.5 does:** Even with no tools, the streaming feedback shows the user a "⏳ 📋 Planning…" or "⏳ Thinking…" status within seconds.
+**Pass:** First status message appears in < 3s AND final response is substantive.
+
+---
+
+### Results Table
+
+Copy and fill in:
+
+```
+| Task | Wall-clock | Iterations | Tools | Model | First-update | Progress clear? | Pass? | Notes |
+|------|-----------|------------|-------|-------|-------------|----------------|-------|-------|
+| A: Simple chat | | | | | | | | |
+| B: Multi-tool | | | | | | | | |
+| C: GitHub read | | | | | | | | |
+| D: Orchestra | | | | | | | | |
+| E: Reasoning | | | | | | | | |
+```
+
+### Pass Criteria
+
+| Level | Requirement |
+|-------|-------------|
+| **PASS** | All 5 tasks meet their individual thresholds |
+| **CONDITIONAL PASS** | 4 of 5 tasks pass AND the failing task is within 1.5× its threshold |
+| **FAIL** | 2+ tasks miss their thresholds, OR the single failing task exceeds 1.5× its threshold |
+
+### Comparison Notes
+
+If you have baseline measurements from before Phase 7B (pre-Feb 2026), record them here for delta analysis. Key metrics to compare:
+
+- **Task C iteration count**: Should drop from 4–6 to 1–3, matching the Task C table (7B.4's main win)
+- **Task B wall-clock**: Should drop from ~25s to ~15s (7B.1's parallel tool execution)
+- **Task A model**: Should route to a fast model (mini/flash/haiku) instead of the default model (7B.2)
+- **Task D iteration count**: Should drop by ~40% (compound effect of all optimizations)