Agent Teams Eval: comparing Claude Code Agent Teams vs subagents for bug-fixing on Ruff. Ceiling effect — 8/8 solve rate, zero peer communication.

kar-ganap/ate

Agent Teams Eval (ate)

Experimental comparison of Claude Code with Agent Teams (symmetric peers) against default Claude Code (hub-and-spoke subagents) on bug triage and fixing in the Ruff Python linter (a Rust codebase).

Key finding: Ceiling effect — Claude Opus 4.6 solves all 8 bugs (8/8) in under 10 minutes regardless of treatment condition. Zero peer-to-peer communication observed across all team treatments. Agent Teams functions as a parallelism engine, not a collaboration tool, for tasks within the model's capability frontier.
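To make the "symmetric peers" versus "hub-and-spoke" distinction concrete, here is a minimal sketch of the two communication topologies as sets of links. The agent names and helper functions are hypothetical and are not part of the ate codebase or of Claude Code. The sketch only illustrates which channels exist in principle under each design.

```python
from itertools import combinations

def hub_and_spoke_links(lead, workers):
    """Links when workers can only talk to the lead (subagent-style)."""
    return {(lead, w) for w in workers}

def symmetric_peer_links(agents):
    """Links when every agent can message every other agent (team-style)."""
    return set(combinations(agents, 2))

# Hypothetical agent names, for illustration only.
agents = ["lead", "a1", "a2", "a3"]

hub = hub_and_spoke_links(agents[0], agents[1:])
mesh = symmetric_peer_links(agents)

# Channels that exist only in the symmetric-peer topology.
peer_only = mesh - hub

print(len(hub), len(mesh), len(peer_only))  # prints: 3 6 3
```

Under this reading, "zero peer-to-peer communication" means the `peer_only` links were available to the team treatments but carried no messages, so the teams behaved like parallel workers.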

First in the ate-series. Successors: ate-features (feature implementation), ate-arch (architecture design).

Results at a Glance

| Treatment | Fix Rate | Mean Time | Peer Messages |
|---|---|---|---|
| 0b (solo control) | 8/8 | 49 min | N/A |
| 2a (2-agent team) | 8/8 | 16 min | 0 |
| 5 (max parallelism) | 8/8 | 9.5 min | 0 |
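Because the fix rate is identical across treatments, time is the only column that separates them. The sketch below derives the speedup of each treatment relative to the solo control from the table's "Mean Time" values. These ratios are simple arithmetic on the reported figures, not a metric reported by the study, and the function name is hypothetical.

```python
# Mean times in minutes, copied from the results table.
mean_minutes = {
    "0b (solo control)": 49.0,
    "2a (2-agent team)": 16.0,
    "5 (max parallelism)": 9.5,
}

def speedup_vs_baseline(times, baseline):
    """Return baseline_time / treatment_time for every treatment."""
    base = times[baseline]
    return {name: base / minutes for name, minutes in times.items()}

speedups = speedup_vs_baseline(mean_minutes, "0b (solo control)")

for name, factor in speedups.items():
    print(f"{name}: {factor:.2f}x")
# 0b (solo control): 1.00x
# 2a (2-agent team): 3.06x
# 5 (max parallelism): 5.16x
```

On these figures, the team treatments finish roughly 3x and 5x faster than the solo control.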

See findings for full analysis and experiment-design.md for the protocol.

Quick Start

uv sync --group dev
uv run ate bugs list
uv run ate treatments list

Built On

  • Claude Code — agentic coding tool
  • Agent Teams — multi-agent collaboration feature under study
  • Subagents — the default hub-and-spoke delegation mechanism

Validation Gates

make test       # Unit tests (162)
make test-int   # Integration tests (requires built Ruff)
make lint       # Ruff linter
make typecheck  # mypy strict
