Research Question: Can a 3B parameter model augmented with test-time compute strategies match a 70B parameter model on network troubleshooting tasks?
Large language models (70B+ parameters) demonstrate strong performance on network troubleshooting scenarios -- routing loops, BGP peering failures, MTU mismatches, and similar diagnostic tasks. However, deploying 70B models requires expensive GPU infrastructure (multiple A100s or equivalent).
This project investigates whether a smaller 3B model, combined with test-time compute strategies (repeated sampling, voting, self-refinement), can close the accuracy gap at a fraction of the inference cost.
```
                +---------------------+
                |   Scenario Engine   |
                |  (6 failure modes)  |
                +----------+----------+
                           |
                +----------v----------+
                |   Mock LLM Backend  |
                | 3B / 70B simulated  |
                +----------+----------+
                           |
        +------------------+------------------+
        |                  |                  |
+-------v-------+  +-------v-------+  +-------v-------+
|   Best-of-N   |  |   Majority    |  |     Self-     |
|   Sampling    |  |    Voting     |  |  Refinement   |
+-------+-------+  +-------+-------+  +-------+-------+
        |                  |                  |
+-------v-------+  +-------v-------+  +-------v-------+
|   Chain-of-   |  |   Tree-of-    |  |   Composite   |
|    Thought    |  |    Thought    |  |   Strategy    |
+-------+-------+  +-------+-------+  +-------+-------+
        |                  |                  |
        +------------------+------------------+
                           |
                +----------v----------+
                |   Scoring Engine    |
                | accuracy / depth /  |
                | completeness / time |
                +----------+----------+
                           |
                +----------v----------+
                |   Analysis Module   |
                | cost-benefit curves |
                |  scaling analysis   |
                +---------------------+
```
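The mock backend can be thought of as a seeded sampler whose answer quality tracks a per-model accuracy parameter. A minimal sketch of the idea (the class and method names here are illustrative, not the actual `mock_llm.py` API):

```python
import random

class MockLLM:
    """Simulated model backend: returns the correct diagnosis with a
    fixed per-model probability, otherwise a random distractor."""

    def __init__(self, accuracy: float, seed: int = 0):
        self.accuracy = accuracy        # e.g. ~0.45 for "3B", ~0.82 for "70B"
        self.rng = random.Random(seed)  # seeded for reproducible benchmark runs

    def generate(self, correct: str, distractors: list[str]) -> str:
        # Succeed with probability `accuracy`, otherwise pick a wrong answer.
        if self.rng.random() < self.accuracy:
            return correct
        return self.rng.choice(distractors)

small = MockLLM(accuracy=0.45, seed=42)
answers = [small.generate("mtu-mismatch", ["dns", "acl"]) for _ in range(1000)]
rate = answers.count("mtu-mismatch") / len(answers)  # close to 0.45
```

Seeding each backend makes every strategy comparison repeatable across the 50-run evaluation batches.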
| ID | Scenario | Root Cause | Difficulty |
|---|---|---|---|
| NET-001 | Routing loop between three routers | Redistributed static route with missing filter | Hard |
| NET-002 | MTU black hole on IPsec tunnel | Path MTU discovery blocked by firewall | Medium |
| NET-003 | BGP peering flap every 90 seconds | Mismatched hold timers (default vs custom) | Medium |
| NET-004 | DNS resolution fails for internal hosts | Split-horizon DNS with missing zone delegation | Easy |
| NET-005 | OSPF neighbor stuck in EXSTART | MTU mismatch on OSPF interface | Medium |
| NET-006 | Intermittent ACL drops on return traffic | Stateless ACL missing established/related rule | Hard |
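Each row of the table maps naturally to a small immutable record that the scoring engine can compare against. A hypothetical shape (field names are assumptions, not the actual `scenarios.py` definitions):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Scenario:
    id: str           # e.g. "NET-005"
    description: str  # symptom presented to the model
    root_cause: str   # ground-truth diagnosis used for scoring
    difficulty: str   # "Easy" | "Medium" | "Hard"

NET_005 = Scenario(
    id="NET-005",
    description="OSPF neighbor stuck in EXSTART",
    root_cause="MTU mismatch on OSPF interface",
    difficulty="Medium",
)
```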
- Best-of-N Sampling -- Generate N candidate solutions, score each, and return the highest-scoring one.
- Majority Voting -- Generate N solutions and return the most common diagnostic conclusion.
- Self-Refinement -- Generate a solution, critique it, then refine it based on the critique.
- Chain-of-Thought -- Prepend structured reasoning steps before answering.
- Tree-of-Thought -- Branch into multiple reasoning paths, evaluate and prune them, then converge on an answer.
- Composite -- Chain-of-thought prompting followed by majority voting and self-refinement, stacking the strategies above.
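Majority voting is the simplest of these to sketch: draw N candidate diagnoses and return the mode. A minimal illustration, where `sample_fn` is a stand-in for whatever produces a single candidate diagnosis:

```python
import random
from collections import Counter
from collections.abc import Callable

def majority_vote(sample_fn: Callable[[], str], n: int = 8) -> str:
    """Draw n candidate diagnoses and return the most common one.
    On a tie, the first-encountered candidate wins (Counter ordering)."""
    votes = Counter(sample_fn() for _ in range(n))
    return votes.most_common(1)[0][0]

# A weak sampler that is right only 45% of the time, but whose errors are
# spread across several distractors: the majority answer is right far more
# often than any single sample.
rng = random.Random(7)
def weak_sampler() -> str:
    if rng.random() < 0.45:
        return "mtu-mismatch"
    return rng.choice(["dns-delegation", "acl-drop", "bgp-timer", "routing-loop"])

result = majority_vote(weak_sampler, n=8)
```

This is why voting lifts the 3B baseline even though each individual sample is unchanged: the correct answer only has to be the plurality, not the majority.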
Across 6 network troubleshooting scenarios with 50 evaluation runs each:
| Configuration | Accuracy | Reasoning Depth | Completeness | Avg Latency (relative) |
|---|---|---|---|---|
| 70B baseline | 0.82 | 0.85 | 0.80 | 1.0x |
| 3B baseline | 0.45 | 0.50 | 0.42 | 0.15x |
| 3B + Best-of-8 | 0.68 | 0.65 | 0.62 | 1.2x |
| 3B + Majority Vote (N=8) | 0.70 | 0.62 | 0.64 | 1.2x |
| 3B + Self-Refine (2 rounds) | 0.72 | 0.75 | 0.70 | 0.45x |
| 3B + Chain-of-Thought | 0.65 | 0.78 | 0.63 | 0.18x |
| 3B + Tree-of-Thought (b=3, d=2) | 0.74 | 0.80 | 0.72 | 0.9x |
| 3B + Composite (CoT + Vote + Refine) | 0.79 | 0.82 | 0.78 | 1.8x |
Key finding: The composite strategy (chain-of-thought + majority voting + self-refinement) brings the 3B model within 3 percentage points of the 70B baseline on accuracy, at roughly 1.8x the single-inference latency of the larger model -- but at 10-15x lower hardware cost.
```
pip install -e ".[dev]"
```

```
# Run the full benchmark suite
ttc-bench run

# Run with a specific strategy
ttc-bench run --strategy best-of-n --samples 8

# Analyze pre-computed results
ttc-bench analyze --results fixtures/precomputed_results.json

# Generate a summary report
ttc-bench report --output report.txt
```

```
# Lint
ruff check src/ tests/

# Test
pytest -v

# Format
ruff format src/ tests/
```

```
src/test_time_compute_infrastructure/
    __init__.py           # Package root
    scenarios.py          # Network troubleshooting scenario definitions
    mock_llm.py           # Simulated 3B and 70B model backends
    strategies.py         # Test-time compute strategies
    scoring.py            # Multi-dimensional scoring engine
    analysis.py           # Cost-benefit and scaling analysis
    cli.py                # Click CLI (ttc-bench)
tests/
    test_scenarios.py     # Scenario validation tests
    test_mock_llm.py      # Mock LLM behavior tests
    test_strategies.py    # Strategy correctness tests
    test_scoring.py       # Scoring engine tests
    test_analysis.py      # Analysis module tests
    test_cli.py           # CLI integration tests
fixtures/
    precomputed_results.json  # Pre-computed benchmark results
```
MIT -- see LICENSE.
If you use this benchmark in your research:
```bibtex
@software{wade2026ttc,
  author = {Wade, Corey},
  title  = {Test-Time Compute Infrastructure: Network Troubleshooting Benchmark},
  year   = {2026},
  url    = {https://github.com/cwccie/test-time-compute-infrastructure}
}
```