Test-Time Compute Infrastructure

Research Question: Can a 3B parameter model augmented with test-time compute strategies match a 70B parameter model on network troubleshooting tasks?

Problem Statement

Large language models (70B+ parameters) demonstrate strong performance on network troubleshooting scenarios -- routing loops, BGP peering failures, MTU mismatches, and similar diagnostic tasks. However, deploying 70B models requires expensive GPU infrastructure (multiple A100s or equivalent).

This project investigates whether a smaller 3B model, combined with test-time compute strategies (repeated sampling, voting, self-refinement), can close the accuracy gap at a fraction of the inference cost.

Architecture

                    +----------------------+
                    |    Scenario Engine   |
                    |   (6 failure modes)  |
                    +----------+-----------+
                               |
                    +----------v-----------+
                    |   Mock LLM Backend   |
                    |  3B / 70B simulated  |
                    +----------+-----------+
                               |
               +---------------+----------------+
               |               |                |
      +--------v------+ +------v-------+ +------v--------+
      | Best-of-N     | | Majority     | | Self-         |
      | Sampling      | | Voting       | | Refinement    |
      +--------+------+ +------+-------+ +------+--------+
               |               |                |
      +--------v------+ +------v-------+ +------v--------+
      | Chain-of-     | | Tree-of-     | | Composite     |
      | Thought       | | Thought      | | Strategy      |
      +--------+------+ +------+-------+ +------+--------+
               |               |                |
               +---------------+----------------+
                               |
                    +----------v-----------+
                    |    Scoring Engine    |
                    |  accuracy / depth /  |
                    | completeness / time  |
                    +----------+-----------+
                               |
                    +----------v-----------+
                    |   Analysis Module    |
                    | cost-benefit curves  |
                    |  scaling analysis    |
                    +----------------------+
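The flow above can be sketched as a single function composition. The names below (`run_pipeline`, `Score`, `Model`, `Strategy`) are illustrative only, not the project's actual API:

```python
from dataclasses import dataclass
from typing import Callable

# Illustrative type aliases; the real interfaces live in
# strategies.py and scoring.py and may differ.
Model = Callable[[str], str]            # scenario prompt -> answer
Strategy = Callable[[Model, str], str]  # (model, prompt) -> answer

@dataclass
class Score:
    accuracy: float
    reasoning_depth: float
    completeness: float

def run_pipeline(scenario: str, model: Model, strategy: Strategy,
                 scorer: Callable[[str, str], Score]) -> Score:
    """Scenario -> strategy-wrapped model -> scoring, mirroring the diagram."""
    answer = strategy(model, scenario)
    return scorer(scenario, answer)
```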

Network Troubleshooting Scenarios

| ID      | Scenario                                 | Root Cause                                     | Difficulty |
|---------|------------------------------------------|------------------------------------------------|------------|
| NET-001 | Routing loop between three routers       | Redistributed static route with missing filter | Hard       |
| NET-002 | MTU black hole on IPsec tunnel           | Path MTU discovery blocked by firewall         | Medium     |
| NET-003 | BGP peering flap every 90 seconds        | Mismatched hold timers (default vs custom)     | Medium     |
| NET-004 | DNS resolution fails for internal hosts  | Split-horizon DNS with missing zone delegation | Easy       |
| NET-005 | OSPF neighbor stuck in EXSTART           | MTU mismatch on OSPF interface                 | Medium     |
| NET-006 | Intermittent ACL drops on return traffic | Stateless ACL missing established/related rule | Hard       |
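In code, each row of the table might be represented as a small record. The `Scenario` class below is a sketch; the actual definitions live in scenarios.py and may differ:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Scenario:
    id: str
    description: str
    root_cause: str
    difficulty: str  # "Easy", "Medium", or "Hard"

# Two rows from the table above, expressed as data.
SCENARIOS = [
    Scenario("NET-002", "MTU black hole on IPsec tunnel",
             "Path MTU discovery blocked by firewall", "Medium"),
    Scenario("NET-005", "OSPF neighbor stuck in EXSTART",
             "MTU mismatch on OSPF interface", "Medium"),
]
```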

Test-Time Compute Strategies

  1. Best-of-N Sampling -- Generate N candidate solutions, score each, and return the best.
  2. Majority Voting -- Generate N solutions and return the most common diagnostic conclusion.
  3. Self-Refinement -- Generate a solution, critique it, then refine it based on the critique.
  4. Chain-of-Thought -- Prepend structured reasoning steps before answering.
  5. Tree-of-Thought -- Branch into multiple reasoning paths, evaluate and prune them, then converge.
  6. Composite -- Combine chain-of-thought, majority voting, and self-refinement in sequence.
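As a rough sketch, the first two strategies reduce to a few lines each. The `model` and `score` callables are stand-ins, not this repo's API:

```python
from collections import Counter
from typing import Callable

Model = Callable[[str], str]  # prompt -> candidate diagnosis

def best_of_n(model: Model, prompt: str,
              score: Callable[[str], float], n: int = 8) -> str:
    """Sample n candidate diagnoses and keep the highest-scoring one."""
    candidates = [model(prompt) for _ in range(n)]
    return max(candidates, key=score)

def majority_vote(model: Model, prompt: str, n: int = 8) -> str:
    """Sample n candidates and return the most common conclusion."""
    candidates = [model(prompt) for _ in range(n)]
    return Counter(candidates).most_common(1)[0][0]
```

Best-of-N needs an external scorer; majority voting only needs answers to be comparable for equality, which is why real implementations normalize the diagnostic conclusion before counting.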

Results Summary

Across 6 network troubleshooting scenarios with 50 evaluation runs each:

| Configuration                        | Accuracy | Reasoning Depth | Completeness | Avg Latency (relative) |
|--------------------------------------|----------|-----------------|--------------|------------------------|
| 70B baseline                         | 0.82     | 0.85            | 0.80         | 1.0x                   |
| 3B baseline                          | 0.45     | 0.50            | 0.42         | 0.15x                  |
| 3B + Best-of-8                       | 0.68     | 0.65            | 0.62         | 1.2x                   |
| 3B + Majority Vote (N=8)             | 0.70     | 0.62            | 0.64         | 1.2x                   |
| 3B + Self-Refine (2 rounds)          | 0.72     | 0.75            | 0.70         | 0.45x                  |
| 3B + Chain-of-Thought                | 0.65     | 0.78            | 0.63         | 0.18x                  |
| 3B + Tree-of-Thought (b=3, d=2)      | 0.74     | 0.80            | 0.72         | 0.9x                   |
| 3B + Composite (CoT + Vote + Refine) | 0.79     | 0.82            | 0.78         | 1.8x                   |

Key finding: The composite strategy (chain-of-thought + majority voting + self-refinement) brings the 3B model within 3 percentage points of the 70B baseline on accuracy, at roughly 1.8x the single-inference latency of the larger model -- but at 10-15x lower hardware cost.
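The trade-off can be checked with back-of-envelope arithmetic from the table; the 10x hardware-cost ratio below is the conservative end of the 10-15x estimate above:

```python
# Numbers from the results table above.
acc_70b, lat_70b = 0.82, 1.0   # 70B baseline
acc_3b,  lat_3b  = 0.79, 1.8   # 3B + composite strategy
hw_cost_ratio = 10             # conservative end of the 10-15x estimate

accuracy_gap = acc_70b - acc_3b
# Cost per query scales with latency x hardware cost rate.
relative_cost = lat_3b / (lat_70b * hw_cost_ratio)

print(f"accuracy gap:  {accuracy_gap:.2f}")   # -> 0.03
print(f"relative cost: {relative_cost:.2f}")  # -> 0.18
```

In other words, under these assumptions the composite 3B setup gives up 3 points of accuracy for roughly a 5x reduction in per-query cost.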

Installation

pip install -e ".[dev]"

Usage

# Run the full benchmark suite
ttc-bench run

# Run with specific strategy
ttc-bench run --strategy best-of-n --samples 8

# Analyze pre-computed results
ttc-bench analyze --results fixtures/precomputed_results.json

# Generate a summary report
ttc-bench report --output report.txt

Development

# Lint
ruff check src/ tests/

# Test
pytest -v

# Format
ruff format src/ tests/

Project Structure

src/test_time_compute_infrastructure/
    __init__.py          # Package root
    scenarios.py         # Network troubleshooting scenario definitions
    mock_llm.py          # Simulated 3B and 70B model backends
    strategies.py        # Test-time compute strategies
    scoring.py           # Multi-dimensional scoring engine
    analysis.py          # Cost-benefit and scaling analysis
    cli.py               # Click CLI (ttc-bench)
tests/
    test_scenarios.py    # Scenario validation tests
    test_mock_llm.py     # Mock LLM behavior tests
    test_strategies.py   # Strategy correctness tests
    test_scoring.py      # Scoring engine tests
    test_analysis.py     # Analysis module tests
    test_cli.py          # CLI integration tests
fixtures/
    precomputed_results.json   # Pre-computed benchmark results

License

MIT -- see LICENSE.

Citation

If you use this benchmark in your research:

@software{wade2026ttc,
  author = {Wade, Corey},
  title = {Test-Time Compute Infrastructure: Network Troubleshooting Benchmark},
  year = {2026},
  url = {https://github.com/cwccie/test-time-compute-infrastructure}
}
