AutoAgentClaw
🦞 Automatic Multi-Agent System Optimization

Point it at your agent repo. It discovers what's tunable. It optimizes your agents.


AutoResearchClaw writes papers. CORAL evolves code. AutoAgentClaw optimizes agents.


🔥 News

  • [2026-03-24] Skill-based architecture: OpenClaw-native skills, self-improving learnings, skill-creator meta-skill
  • [2026-03-23] 3 verified benchmarks: HotpotQA (+29.3%), Customer Support (+5.1%), GSM8K (+7.1%)
  • [2026-03-22] Enhanced research pipeline: Semantic Scholar + arXiv + GitHub + PyPI search with provenance
  • [2026-03-22] 4 execution modes: per-experiment, autonomous, parallel, parallel-autonomous (CORAL-style)
  • [2026-03-21] 4-level optimization hierarchy (MaAS-informed) with algorithm registry
  • [2026-03-21] Cross-run learning with level-aware skills and algorithm metadata
  • [2026-03-21] Initial release with 12-stage pipeline, protected eval, sentinel watchdog

📊 Verified Results

AutoAgentClaw has been tested on real benchmarks with real LLM-based agent systems. All results use Claude subscription auth ($0 API cost).

| Benchmark | Agents | Data | Baseline | After | Improvement | Key Technique |
|---|---|---|---|---|---|---|
| HotpotQA | 2 (researcher + reasoner) | Real HotpotQA (20 questions) | 0.5597 | 0.7236 | +29.3% | Autonomous LLM + Optuna Bayesian |
| Customer Support | 3 (classifier + responder + reviewer) | Synthetic (10 tickets) | 0.8783 | 0.9233 | +5.1% | Feedback loop + Autonomous LLM |
| GSM8K Math | 3 (decomposer + solver + verifier) | Real GSM8K (15 problems) | 0.9333 | 1.0000 | +7.1% | Autonomous LLM (communication) |

All examples are included in docs/examples/ — clone and run them yourself.

📈 What the optimizer discovered

HotpotQA — The reasoner produced correct but verbose answers (high accuracy, low F1). The optimizer:

  1. (Level 1) Added output format constraints → +2.7%
  2. (Level 2) Optuna found optimal temperature=0.2 + max_tokens=200 → +23.2%
  3. (Level 1) Improved researcher information quality → +3.4%

Customer Support — Classification was perfect but response quality was weak. The optimizer:

  1. (Level 2) Grid searched token budgets → 0% (config wasn't the bottleneck)
  2. (Level 1) Feedback loop refined responder prompt → +2.8%
  3. (Level 3) Autonomous LLM restructured reviewer approval logic → +5.1%

GSM8K Math — 14/15 problems correct, 1 failure on complex multi-step reasoning. The optimizer:

  1. (Level 1) Tried prompt refinement → 0% (prompts weren't the issue)
  2. (Level 2) Grid searched temperature/tokens → 0%
  3. (Level 3) Autonomous LLM improved inter-agent communication → +7.1% (15/15 correct)

⚡ One Command. Optimized Agents.

pip install -e . && autoagent run \
  --target ~/my-agents \
  --eval eval.py \
  --metric accuracy \
  --direction maximize \
  --auto-approve

AutoAgentClaw reads your agent repo, discovers tunable parameters, researches optimization techniques, and runs experiments — all automatically. No rewriting required.


🤔 What Is This?

You have a multi-agent system. It works, but you want it to work better. AutoAgentClaw:

| Step | What Happens | Inspired By |
|---|---|---|
| 🔍 Discover | Reads your repo, finds agents, prompts, configs, topology | Zero-config (novel) |
| 📚 Research | Searches Semantic Scholar + arXiv + web for optimization techniques | AutoResearchClaw |
| 🧠 Strategize | Analyzes baseline, identifies bottleneck, plans level-by-level approach | ARIS + MaAS |
| ⚙️ Optimize | CORAL-style autonomous workers run experiments in parallel worktrees | CORAL |
| 📊 Track | Protected eval, sentinel watchdog, leaderboard, dashboard | CORAL |
| 💡 Learn | Skills accumulate across runs — run N+1 is smarter than run N | MetaClaw |

🦞 OpenClaw Integration

AutoAgentClaw is an OpenClaw-compatible service. Install it in OpenClaw and launch autonomous agent optimization with a single message — or use it standalone via CLI, Claude Code, or any AI coding assistant.

🚀 Use with OpenClaw (Recommended)

If you already use OpenClaw as your AI assistant:

1️⃣ Share the GitHub repo URL with OpenClaw
2️⃣ OpenClaw auto-reads AUTOAGENT_AGENTS.md → understands the optimization pipeline
3️⃣ Say: "Optimize the agents at ~/my-agents using eval.py"
4️⃣ Done — OpenClaw clones, installs, configures, runs, and returns results

That's it. OpenClaw handles git clone, pip install, config setup, and pipeline execution automatically. You just chat.

💡 What happens under the hood
  1. OpenClaw reads AUTOAGENT_AGENTS.md → learns the agent optimizer role
  2. OpenClaw reads README.md → understands installation and pipeline structure
  3. OpenClaw copies config.autoagent.example.yaml → config.autoagent.yaml
  4. Uses your Claude subscription (or asks for API key)
  5. Runs pip install -e . + autoagent run --target <path> --eval <script>
  6. Returns the optimization report, best configs, topology diffs, and leaderboard

🔌 OpenClaw Bridge (Advanced)

For deeper integration, AutoAgentClaw includes a bridge adapter system with 6 optional capabilities:

# config.autoagent.yaml
openclaw_bridge:
  use_cron: true              # ⏰ Scheduled optimization runs (overnight)
  use_message: true           # 💬 Progress notifications (Discord/Slack/Telegram)
  use_memory: true            # 🧠 Cross-session skill persistence
  use_sessions_spawn: true    # 🔀 Parallel optimization workers (CORAL-style)
  use_web_fetch: true         # 🌐 Research optimization techniques
  use_browser: false          # 🖥️ Not needed for optimization

Each flag activates a typed adapter protocol. When OpenClaw provides these capabilities, the adapters consume them without code changes.
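A minimal sketch of what such a typed adapter protocol could look like. The names `MessageAdapter`, `ConsoleMessages`, and `notify` are illustrative assumptions, not the actual bridge API:

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class MessageAdapter(Protocol):
    """Capability behind use_message; satisfied structurally, no inheritance."""
    def send(self, channel: str, text: str) -> None: ...

class ConsoleMessages:
    """Local fallback used when the host does not provide the capability."""
    def send(self, channel: str, text: str) -> None:
        print(f"[{channel}] {text}")

def notify(adapter: MessageAdapter, text: str) -> None:
    adapter.send("autoagent", text)
```

Because the protocol is structural, any host-provided object with a matching `send` method is accepted without code changes — which is the property the flags rely on.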

ACP (Agent Client Protocol)

AutoAgentClaw can use any ACP-compatible coding agent as its LLM backend — no API keys required:

| Agent | Command | Provider | Status |
|---|---|---|---|
| Claude Code | claude | Anthropic | Tested |
| Codex CLI | codex | OpenAI | 🔲 Supported (untested) |
| Copilot CLI | gh | GitHub | 🔲 Supported (untested) |
| Gemini CLI | gemini | Google | 🔲 Supported (untested) |
| OpenCode | opencode | Open Source | 🔲 Supported (untested) |
| Kimi CLI | kimi | Moonshot | 🔲 Supported (untested) |

All benchmarks in this README were tested with Claude Code (subscription auth, $0 cost). Other ACP agents are supported via the same interface but have not been verified yet. Community testing welcome — see CONTRIBUTING.md.


🏗️ How It Works

12-Stage Pipeline

┌──────────────────────────────────────────────────────────────────────┐
│                                                                      │
│  Phase A: Discovery          Phase B: Strategy                       │
│  ┌─────────┐ ┌─────────┐   ┌─────────────┐ ┌──────────┐           │
│  │ 1. Scan │→│ 2. Find │→  │ 4. Research  │→│ 5. Gate  │           │
│  │   Repo  │ │  Agents │   │ SS+arXiv+Web │ │ (human)  │           │
│  └─────────┘ └────┬────┘   └──────┬───────┘ └────┬─────┘           │
│                   │ 3.Gate         │               │                  │
│                   └────────────────┘               │                  │
│                                                    ▼                  │
│  Phase C: Optimization                                                │
│  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌────────────┐ │
│  │ 6. Level 1   │→│ 7. Level 2   │→│ 8. Level 3   │→│ 9. Cross-  │ │
│  │   Behavior   │ │   Config     │ │   Comms      │ │  Validate  │ │
│  │ (parallel    │ │ (parallel    │ │ (parallel    │ │            │ │
│  │  workers)    │ │  workers)    │ │  workers)    │ │            │ │
│  └──────────────┘ └──────────────┘ └──────────────┘ └─────┬──────┘ │
│                                                            │        │
│  Phase D: Analysis           Phase E: Finalize             │        │
│  ┌──────────────┐ ┌────────┐ ┌────────────┐               │        │
│  │10. Extract   │→│11. Gen │→│12. Apply   │◄──────────────┘        │
│  │   Skills     │ │ Report │ │   Gate     │                         │
│  └──────────────┘ └────────┘ └────────────┘                         │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘

| Phase | Stages | What Happens |
|---|---|---|
| A: Discovery | 1-3 | Read repo, discover agents & tunable params, classify by optimization level |
| B: Strategy | 4-5 | Multi-source research (Semantic Scholar + arXiv + web), LLM strategy planning |
| C: Optimization | 6-9 | Level-based optimization with CORAL-style parallel autonomous workers |
| D: Analysis | 10-11 | Extract skills with level + algorithm metadata, generate report |
| E: Finalization | 12 | Human reviews — apply to main, new branch, or reject |

📊 4-Level Optimization Hierarchy (MaAS-Informed)

Not every problem needs all levels. The framework automatically determines which levels are relevant:

| Level | What It Optimizes | Cost | Example |
|---|---|---|---|
| 1: Agent Behavior | Prompts, instructions, output format | 💚 Lowest | "Add conciseness constraint" |
| 2: Agent Configuration | Temperature, max_tokens, model, tools | 🟡 Moderate | "Reduce temperature to 0.3" |
| 3: Inter-Agent Communication | What info is passed, message format | 🟠 Higher | "Filter researcher output" |
| 4: System Topology | Add/remove agents, restructure graph | 🔴 Highest | "Add a review agent with feedback loop" |

The framework optimizes Level 1 first (cheapest, highest ROI), then progresses to higher levels only if lower levels show diminishing returns.
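The escalation policy above can be sketched as a toy loop. This is an illustration only, assuming "diminishing returns" means a per-pass gain below a threshold; `run_level`, `min_gain`, and `max_passes` are hypothetical names, not the framework's API:

```python
def optimize_by_level(run_level, levels=(1, 2, 3, 4), min_gain=0.01, max_passes=20):
    """Run each level until its gain drops below min_gain, then escalate.

    run_level(level) -> relative improvement from one optimization pass.
    Returns the (level, gain) history of all passes.
    """
    history = []
    for level in levels:
        for _ in range(max_passes):   # budget guard per level
            gain = run_level(level)
            history.append((level, gain))
            if gain < min_gain:       # diminishing returns at this level
                break                 # escalate to the next, costlier level
    return history
```

The cheap levels get as many passes as they keep paying for; a costlier level is only entered once the previous one plateaus.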

🔄 4 Execution Modes

execution:
  worker_mode: "autonomous"      # per-experiment | autonomous
  max_parallel_workers: 3        # 1 = sequential, >1 = parallel

| Mode | worker_mode | max_parallel_workers | Description |
|---|---|---|---|
| Sequential | per-experiment | 1 | One LLM call per experiment (fastest for small budgets) |
| Parallel | per-experiment | 3 | Multiple short-lived calls simultaneously |
| Autonomous | autonomous | 1 | CORAL-style long-lived session (recommended) |
| Full CORAL | autonomous | 3 | Parallel long-lived sessions in git worktrees |

📚 Research Phase (Multi-Source)

Before optimizing, the framework researches techniques from multiple sources:

| Source | What It Searches | API |
|---|---|---|
| Semantic Scholar | Academic papers on agent optimization | Free, no key |
| arXiv | Recent preprints | Free, no key |
| Claude Web Search | Blogs, GitHub repos, practical guides | Via Claude CLI |
| Algorithm Registry | Coded algorithms already installed | Local check |

Every finding includes provenance (source URL, paper title, year, citations) for traceability.

Skills as research cache — second run skips research if matching skills exist from a previous run.
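The provenance attached to each finding might look like the following record. The field names here are assumptions for illustration, not the framework's actual schema:

```python
from dataclasses import dataclass

@dataclass
class Finding:
    technique: str    # e.g. "output format constraints"
    source_url: str   # where the technique was found
    title: str        # paper, repo, or post title
    year: int
    citations: int    # citation count at research time, for ranking
```

Carrying the source alongside every technique is what makes a later skill auditable: a reviewer can trace "reduce temperature" back to the paper or post that suggested it.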

🧠 Self-Improving Skill System (OpenClaw-Native)

AutoAgentClaw uses a three-tier knowledge system inspired by OpenClaw skills and the self-improving agent pattern:

| Tier | What | Loaded When | Lifetime |
|---|---|---|---|
| Default Skills | 12 curated optimization techniques (CoT, output format, temperature tuning, etc.) | Always (descriptions in context) | Permanent |
| Project Skills | Research findings + learned principles for THIS project | On demand | Decay over 30 days |
| Learned Skills | Techniques promoted after working on 3+ projects | Always | Permanent until disproven |

Each skill is a proper AgentSkills directory with SKILL.md + optional scripts/, references/, assets/:

autoagent/default_skills/
  L1-chain-of-thought/
    SKILL.md                # When to apply, how to apply, evidence
    references/evidence.md  # Papers, benchmark results
  L2-temperature-tuning/
    SKILL.md
    scripts/temp_sweep.py   # Quick temperature sweep script

Self-improving: After each run, the framework:

  1. Records experiments in LEARNINGS.md (what worked, what failed)
  2. Extracts reusable principles as new skills via the autoagent-skill-creator
  3. Promotes skills to global _learned/ after 3+ project confirmations
  4. Decays confidence on old skills — stale knowledge fades naturally

Project-scoped: Skills from HotpotQA don't leak into Customer Support. Each project has its own skills + learnings directory.
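Confidence decay (step 4 above) can be sketched as a simple multiplicative fade. This assumes the decay_rate from the configuration reference (0.05) is applied per idle day; the actual schedule may differ:

```python
def decayed_confidence(confidence: float, days_idle: int, rate: float = 0.05) -> float:
    """Skill confidence fades the longer it goes without reconfirmation."""
    return confidence * (1.0 - rate) ** days_idle
```

Under this sketch a skill unconfirmed for a month retains under a quarter of its original confidence, so stale knowledge naturally loses out to freshly validated skills.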

🛡️ Protected Evaluation & Sentinel

The eval script is copied to .autoagent/private/ — the optimizer can never modify it.

A Sentinel watchdog continuously monitors for:

  • 🎯 Reward hacking (suspicious score jumps)
  • 💰 Cost anomalies (unexpected API cost spikes)
  • 🔒 Eval tampering (checksum mismatch)
  • 📉 Score regression and oscillation
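The eval-tampering check can be sketched as a checksum comparison against the copy taken at run start. This is a minimal illustration; the function names are hypothetical, not the Sentinel's actual API:

```python
import hashlib
from pathlib import Path

def checksum(path: Path) -> str:
    """SHA-256 of the protected eval script's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def eval_tampered(protected_eval: Path, recorded: str) -> bool:
    """True if the file no longer matches the checksum recorded at run start."""
    return checksum(protected_eval) != recorded
```

Any edit to the protected copy, however small, changes the digest and trips the watchdog.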

🔧 Pluggable Algorithm Registry

Add coded optimization algorithms alongside the LLM-as-optimizer:

from autoagent.algorithms import BaseOptimizer, register_optimizer

class MyOptimizer(BaseOptimizer):
    name = "my-optimizer"
    handles_levels = [1]        # Only Level 1 (behavior)
    handles_types = ["prompt"]

    def can_handle(self, target): return target.param_type == "prompt"
    def optimize(self, target, eval_fn, budget): ...

register_optimizer(MyOptimizer())

The framework handles everything else: eval, tracking, skills, dashboard. Skills remember which algorithm worked at which level for automatic selection in future runs.


🚀 Quick Start

Installation

git clone https://github.com/skyve2012/AutoAgentClaw.git
cd AutoAgentClaw
pip install -e .
autoagent setup    # Verify requirements

Run on the Demo

autoagent run \
  --target docs/examples/hotpotqa-agents \
  --eval eval.py \
  --metric score \
  --direction maximize \
  --max-experiments 8 \
  --auto-approve

The demo includes a deliberately suboptimal researcher + reasoner system that AutoAgentClaw optimizes by discovering better prompts, tuning parameters, and improving output format.

View Dashboard

autoagent dashboard --target docs/examples/hotpotqa-agents
# Opens http://localhost:3000 — topology viz, leaderboard, score chart

Run on Your Own Agents

autoagent run \
  --target ~/my-agents \
  --eval eval.py \
  --metric accuracy \
  --direction maximize

Requirements for your agent repo: an eval.py that outputs a JSON object containing your metric (e.g., {"accuracy": 0.85}). That's it.
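A minimal eval.py sketch, assuming the contract is "emit a JSON object containing the metric on stdout" (the scoring logic here is a placeholder):

```python
# eval.py — placeholder evaluation script
import json

def evaluate() -> float:
    # Run your agents against a held-out set and score them here.
    correct, total = 17, 20  # placeholder numbers
    return correct / total

if __name__ == "__main__":
    print(json.dumps({"accuracy": evaluate()}))
```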


📋 Configuration

autoagent init    # Interactive config wizard

📝 Minimum required config

target:
  path: "~/my-agents"
  eval_script: "eval.py"
  metric: "accuracy"
  direction: "maximize"

📋 Full Configuration Reference

# === LLM Provider ===
llm:
  provider: "openai"
  base_url: "https://api.openai.com/v1"
  api_key_env: "OPENAI_API_KEY"
  primary_model: "gpt-4o"
  fallback_models: ["gpt-4o-mini"]

# === Target Agent System ===
target:
  path: "~/my-agents"
  eval_script: "eval.py"
  metric: "accuracy"
  direction: "maximize"

# === Search Space (auto-discovered if omitted) ===
search_space:
  freeze: []           # Param IDs to NOT optimize
  priority: []         # Optimize these first
  dimensions: {}       # Override algorithm per dimension

# === Optimization Budget ===
budget:
  max_experiments: 100
  max_time_minutes: 180
  max_cost_usd: 50.00
  pilot_experiments: 5

# === Execution ===
execution:
  worker_mode: "autonomous"   # per-experiment | autonomous
  max_parallel_workers: 3
  eval_timeout_sec: 300

# === Pipeline Gates ===
gates:
  auto_approve: false         # Set true for overnight runs

# === Knowledge / Cross-Run Learning ===
knowledge:
  enabled: true
  skills_dir: "~/.autoagent/skills"
  decay_rate: 0.05

# === Notifications ===
notifications:
  channel: "console"          # console | discord | slack
  notify_on: ["new_best_score", "optimization_complete"]

# === Dashboard ===
dashboard:
  port: 3000
  auto_open: true

# === OpenClaw Bridge ===
openclaw_bridge:
  use_cron: true
  use_message: true
  use_memory: true
  use_sessions_spawn: true
  use_web_fetch: true
  use_browser: false

# === Sentinel Watchdog ===
sentinel:
  enabled: true
  detect_reward_hacking: true
  detect_cost_anomaly: true

💻 CLI Reference

autoagent setup                       # Verify system requirements
autoagent init                        # Create config interactively
autoagent run [options]               # Run optimization pipeline
autoagent resume --target <path>      # Resume an interrupted run
autoagent dashboard [options]         # Start web dashboard
autoagent log --target <path>         # View experiment log + leaderboard
autoagent status --target <path>      # Show all runs + skill count
autoagent skills                      # List accumulated skills

🔧 autoagent run options

| Flag | Description | Default |
|---|---|---|
| --target, -t | Path to target agent repo | (required) |
| --eval | Evaluation script path | eval.py |
| --metric, -m | Metric name to optimize | accuracy |
| --direction, -d | maximize or minimize | maximize |
| --config | Config file path | auto-detect |
| --max-experiments | Max experiments to run | 100 |
| --max-cost | Max cost in USD | 50.0 |
| --max-time | Max time in minutes | 180 |
| --auto-approve | Skip all approval gates | false |
| --interactive, -i | Interactive mode | false |

🐍 Python API

from autoagent.pipeline.runner import PipelineRunner
from autoagent.config import load_config
from pathlib import Path

config = load_config("config.autoagent.yaml")
runner = PipelineRunner(config, Path("~/my-agents").expanduser())
result = runner.run()

print(f"Baseline: {result.baseline_score:.4f}")
print(f"Best:     {result.best_score:.4f} (+{result.improvement_pct:.1f}%)")

📂 Output Artifacts

# Per-project (in your agent repo)
.autoagent/
├── runs/
│   └── ao-YYYYMMDD-HHMMSS/
│       ├── discovery.json          # Agents, params, topology graph
│       ├── strategy.json           # Optimization plan (phases, algorithms)
│       ├── attempts.jsonl          # Every experiment (scores, diffs, feedback)
│       ├── leaderboard.json        # Top-10 ranked configurations
│       └── report.md               # Human-readable optimization report
├── private/
│   └── eval.py                     # Protected eval (tamper-proof copy)
└── knowledge/
    ├── attempts/                   # CORAL-style shared attempt records
    └── notes/                      # Worker observations and learnings

# Persistent knowledge (across runs + projects)
~/.autoagent/
├── projects/<project-hash>/
│   ├── skills/                     # Research + learned skills for THIS project
│   │   └── research-textgrad/SKILL.md
│   ├── .learnings/
│   │   ├── LEARNINGS.md            # What worked / failed (structured entries)
│   │   └── ERRORS.md               # What went wrong and why
│   └── lessons/
│       └── round-001.jsonl         # Structured experiment records per round
└── skills/
    ├── _default/                   # 12 curated optimization skills (ship with framework)
    └── _learned/                   # Skills promoted from 3+ projects

🔀 Multiple Ways to Run

| Method | Command | Setup |
|---|---|---|
| 🦞 OpenClaw | "Optimize my agents at ~/my-agents" | Zero — just chat |
| 💻 CLI | autoagent run --target ~/my-agents --eval eval.py | pip install -e . |
| 🐍 Python API | PipelineRunner(config, path).run() | pip install -e . |
| Claude Code | claude in the AutoAgentClaw directory | Clone repo |
| Codex CLI | codex in the AutoAgentClaw directory | Clone repo |
| Any AI Assistant | Point it at AUTOAGENT_AGENTS.md | Clone repo |

📊 Architecture

AutoAgentClaw/
├── autoagent/
│   ├── cli.py                  # CLI: setup, init, run, dashboard, log, status, skills
│   ├── config.py               # YAML config (Pydantic models)
│   ├── models.py               # 4-level data models (MaAS hierarchy)
│   ├── pipeline/
│   │   ├── runner.py           # 12-stage pipeline orchestrator
│   │   └── stages.py          # Stage definitions with gate support
│   ├── agents/
│   │   ├── discovery_agent.py  # Auto-discovers agents, prompts, topology
│   │   ├── research_agent.py   # Multi-source research (SS + arXiv + web)
│   │   ├── strategy_agent.py   # Level-based optimization planning
│   │   ├── worker_agent.py     # CORAL-style LLM-as-optimizer workers
│   │   ├── autonomous_worker.py # Long-lived autonomous sessions
│   │   ├── prompt_opt_agent.py # Evolutionary fallback optimizer
│   │   └── reflect_agent.py   # Heartbeat reflection + convergence
│   ├── evaluator/
│   │   ├── runner.py           # Protected eval runner
│   │   ├── tracker.py          # Experiment tracking + leaderboard
│   │   └── sentinel.py        # Watchdog (reward hacking, cost, tampering)
│   ├── algorithms/
│   │   └── registry.py        # Pluggable algorithm registry
│   ├── knowledge/
│   │   ├── skill_manager.py    # Cross-run learning (extract, inject, decay)
│   │   └── shared.py          # CORAL-style shared knowledge filesystem
│   ├── bridge/
│   │   └── adapters.py        # OpenClaw bridge (6 typed adapter protocols)
│   ├── llm/
│   │   └── providers.py       # Multi-provider (Claude CLI, API, ACP, OpenAI)
│   ├── dashboard/
│   │   ├── server.py           # FastAPI backend (8 API endpoints)
│   │   └── static/index.html  # Dashboard UI (topology, charts, leaderboard)
│   └── workspace.py           # Git worktree manager (parallel workers)
├── AUTOAGENT_AGENTS.md         # OpenClaw service definition
├── CLAUDE.md                   # Claude Code instructions
├── config.autoagent.example.yaml
├── .claude/skills/
│   ├── autoagent/             # Main entry skill (OpenClaw integration)
│   ├── autoagent-skill-creator/ # Meta-skill for creating optimization skills
│   ├── autoagent-grid-search/ # Algorithm skill: systematic parameter search
│   ├── autoagent-feedback-loop/ # Algorithm skill: feedback-driven revision
│   └── autoagent-optuna-search/ # Algorithm skill: Bayesian optimization
├── docs/
│   ├── images/logo.png
│   └── examples/
│       ├── hotpotqa-agents/   # 2-agent QA (Real HotpotQA, +29.3%)
│       ├── math-reasoning/    # 3-agent math (Real GSM8K, +7.1%)
│       ├── customer-support/  # 3-agent support (Synthetic, +5.1%)
│       └── code-generation/   # 1-agent coder (Real HumanEval)
└── tests/                      # 27 unit tests

✨ Key Contributions

| Contribution | Description |
|---|---|
| 🔍 Zero-Config Agent Discovery | Reads any agent repo and auto-discovers tunable parameters — no framework lock-in, no rewriting required |
| 📐 4-Level Optimization Hierarchy | MaAS-informed systematic approach: Behavior → Configuration → Communication → Topology, optimizing cheapest levels first |
| 🔬 Research-Driven Strategy | Multi-source research (Semantic Scholar + arXiv + GitHub + PyPI) before optimization — the first agent optimizer to search for code implementations |
| 🧠 Self-Improving Skill System | OpenClaw-native three-tier knowledge: curated defaults → project-specific learnings → cross-project promotions with confidence decay |
| 🤖 Autonomous LLM-as-Optimizer | CORAL-style long-lived Claude sessions that read code, propose changes, evaluate, and iterate — the technique behind the largest verified gains (+29.3% on HotpotQA) |
| 🔄 Pluggable Algorithm Skills | Algorithms are installable skills with SKILL.md + scripts/ — add a new optimizer by creating a directory |
| 🛡️ Protected Eval + Sentinel | Tamper-proof evaluation with watchdog monitoring for reward hacking, cost anomalies, and score oscillation |
| 📊 Live Dashboard | Real-time topology visualization, performance charts, experiment leaderboard, and dimension heatmaps |
| 🦞 OpenClaw Native | Works with Claude subscription ($0 API cost), supports 6 ACP agents, bridge adapters for parallel execution |

📋 Roadmap

Core Pipeline

  • 12-stage pipeline with human approval gates
  • Zero-config agent discovery (LLM-driven + regex fallback)
  • Multi-source research (Semantic Scholar + arXiv + GitHub + PyPI)
  • 4-level optimization hierarchy (MaAS-informed)
  • Protected evaluator with checksum verification
  • Sentinel watchdog (reward hacking, cost anomaly, score oscillation)
  • Cross-validation with git snapshot restore
  • Optimization report generation

Execution Modes

  • Per-experiment mode (one LLM call per experiment)
  • Autonomous mode (CORAL-style long-lived Claude sessions)
  • Parallel workers via git worktrees
  • SIGINT interrupt-resume heartbeat (CORAL-style mid-session reflection)
  • autoagent resume command for crashed/interrupted runs

Skill System

  • 12 curated default optimization skills
  • OpenClaw-native skill format (SKILL.md + scripts/ + references/)
  • Skill-creator meta-skill for generating new skills
  • Three-tier knowledge: default → project → learned
  • Project-scoped skills (no cross-project leakage)
  • Confidence decay on stale skills
  • Structured learnings (LEARNINGS.md + round-N.jsonl)
  • Skill promotion after 3+ project confirmations
  • Auto-install pip packages discovered during research
  • Generate BaseOptimizer adapters for discovered libraries

Algorithms

  • LLM-as-optimizer (autonomous Claude sessions)
  • Grid search (systematic parameter sweep)
  • Feedback loop (iterative LLM revision)
  • Optuna Bayesian optimization
  • Pluggable algorithm registry with auto-discovery
  • Integration with real TextGrad library
  • Integration with DSPy MIPROv2
  • MCTS-based topology search (AFlow-style)
  • Multi-objective Pareto optimization

Integration & UX

  • OpenClaw integration (AUTOAGENT_AGENTS.md + bridge adapters)
  • ACP support (Claude, Codex, Gemini, OpenCode, Kimi)
  • Claude subscription auth ($0 cost)
  • Web dashboard (topology viz, leaderboard, charts)
  • CLI (setup, init, run, dashboard, log, status, skills)
  • Verify and test non-Claude ACP agents (Codex, Gemini, etc.)
  • Docker/SSH remote execution for heavy evaluations
  • ClaweHub skill publishing
  • Multi-language README (CN, JA, KR)

Benchmarks & Examples

  • HotpotQA — 2-agent QA (real data, +29.3%)
  • GSM8K Math — 3-agent reasoning (real data, +7.1%)
  • Customer Support — 3-agent triage (synthetic, +5.1%)
  • Code Generation — 1-agent coder (real HumanEval data)
  • GAIA benchmark (compare with EvoAgentX results)
  • SWE-bench (software engineering agents)
  • WebArena (web navigation agents)
  • Multi-agent debate optimization benchmark

Advanced Features (Future)

  • Multi-round auto-continue (run N rounds until diminishing returns)
  • Cross-model adversarial review (Claude optimizes, GPT reviews)
  • Pilot experiments before strategy commitment
  • Multi-agent debate for strategy planning
  • Fine-tuning integration (Level 5: RL/SFT on agents)
  • Cost-aware Pareto optimization (quality vs latency vs cost)

🙏 Acknowledgments

AutoAgentClaw builds on ideas from:

  • CORAL — Multi-agent evolution infrastructure (parallel workers, protected eval, shared knowledge, heartbeat)
  • AutoResearchClaw — OpenClaw integration, MetaClaw cross-run learning, staged pipeline, bridge adapters
  • ResearchClaw — Claim/evidence graph, provenance tracking, experiment contracts
  • ARIS — Research-driven strategy, overnight autonomous runs
  • Karpathy's autoresearch — The outer-loop optimizer pattern
  • MaAS — 4-level optimization hierarchy for multi-agent systems
  • EvoAgentX — Pluggable algorithm registry pattern
  • OpenClaw Skill System — AgentSkills standard, self-improving skill pattern

📄 License

MIT


📝 Citation

If you find AutoAgentClaw useful in your research or projects, please cite:

@software{shen2026autoagentclaw,
  title     = {AutoAgentClaw: Automatic Multi-Agent System Optimization},
  author    = {Shen, Hongyu},
  year      = {2026},
  url       = {https://github.com/skyve2012/AutoAgentClaw},
  note      = {A framework for automatically optimizing agent systems through
               zero-config discovery, research-driven strategy, and self-improving
               skills. Built on OpenClaw with CORAL-style parallel execution and
               MaAS-informed 4-level optimization hierarchy.}
}

Built with 🦞 by Hongyu Shen
