AutoAgentClaw
🦞 Automatic Multi-Agent System Optimization

Point it at your agent repo. It discovers what's tunable. It optimizes your agents.


AutoResearchClaw writes papers. CORAL evolves code. AutoAgentClaw optimizes agents.


🔥 News

  • [2026-03-24] Skill-based architecture: OpenClaw-native skills, self-improving learnings, skill-creator meta-skill
  • [2026-03-23] 3 verified benchmarks: HotpotQA (+29.3%), Customer Support (+5.1%), GSM8K (+7.1%)
  • [2026-03-22] Enhanced research pipeline: Semantic Scholar + arXiv + GitHub + PyPI search with provenance
  • [2026-03-22] 4 execution modes: per-experiment, autonomous, parallel, parallel-autonomous (CORAL-style)
  • [2026-03-21] 4-level optimization hierarchy (MaAS-informed) with algorithm registry
  • [2026-03-21] Cross-run learning with level-aware skills and algorithm metadata
  • [2026-03-21] Initial release with 12-stage pipeline, protected eval, sentinel watchdog

📊 Verified Results

AutoAgentClaw has been tested on real benchmarks with real LLM-based agent systems. All results use Claude subscription auth ($0 API cost).

| Benchmark | Agents | Data | Baseline | After | Improvement | Key Technique |
|---|---|---|---|---|---|---|
| HotpotQA | 2 (researcher + reasoner) | Real HotpotQA (20 questions) | 0.5597 | 0.7236 | +29.3% | Autonomous LLM + Optuna Bayesian |
| Customer Support | 3 (classifier + responder + reviewer) | Synthetic (10 tickets) | 0.8783 | 0.9233 | +5.1% | Feedback loop + Autonomous LLM |
| GSM8K Math | 3 (decomposer + solver + verifier) | Real GSM8K (15 problems) | 0.9333 | 1.0000 | +7.1% | Autonomous LLM (communication) |

All examples are included in docs/examples/ — clone and run them yourself.

📈 What the optimizer discovered

HotpotQA — The reasoner produced correct but verbose answers (high accuracy, low F1). The optimizer:

  1. (Level 1) Added output format constraints → +2.7%
  2. (Level 2) Optuna found optimal temperature=0.2 + max_tokens=200 → +23.2%
  3. (Level 1) Improved researcher information quality → +3.4%

Customer Support — Classification was perfect but response quality was weak. The optimizer:

  1. (Level 2) Grid searched token budgets → 0% (config wasn't the bottleneck)
  2. (Level 1) Feedback loop refined responder prompt → +2.8%
  3. (Level 3) Autonomous LLM restructured reviewer approval logic → +5.1%

GSM8K Math — 14/15 problems correct, 1 failure on complex multi-step reasoning. The optimizer:

  1. (Level 1) Tried prompt refinement → 0% (prompts weren't the issue)
  2. (Level 2) Grid searched temperature/tokens → 0%
  3. (Level 3) Autonomous LLM improved inter-agent communication → +7.1% (15/15 correct)

⚡ One Command. Optimized Agents.

pip install -e . && autoagent run \
  --target ~/my-agents \
  --eval eval.py \
  --metric accuracy \
  --direction maximize \
  --auto-approve

AutoAgentClaw reads your agent repo, discovers tunable parameters, researches optimization techniques, and runs experiments — all automatically. No rewriting required.


🤔 What Is This?

You have a multi-agent system. It works, but you want it to work better. AutoAgentClaw:

| Step | What Happens | Inspired By |
|---|---|---|
| 🔍 Discover | Reads your repo, finds agents, prompts, configs, topology | Zero-config (novel) |
| 📚 Research | Searches Semantic Scholar + arXiv + web for optimization techniques | AutoResearchClaw |
| 🧠 Strategize | Analyzes baseline, identifies bottleneck, plans level-by-level approach | ARIS + MaAS |
| ⚙️ Optimize | CORAL-style autonomous workers run experiments in parallel worktrees | CORAL |
| 📊 Track | Protected eval, sentinel watchdog, leaderboard, dashboard | CORAL |
| 💡 Learn | Skills accumulate across runs — run N+1 is smarter than run N | MetaClaw |

🦞 OpenClaw Integration

AutoAgentClaw is an OpenClaw-compatible service. Install it in OpenClaw and launch autonomous agent optimization with a single message — or use it standalone via CLI, Claude Code, or any AI coding assistant.

🚀 Use with OpenClaw (Recommended)

If you already use OpenClaw as your AI assistant:

1️⃣ Share the GitHub repo URL with OpenClaw
2️⃣ OpenClaw auto-reads AUTOAGENT_AGENTS.md → understands the optimization pipeline
3️⃣ Say: "Optimize the agents at ~/my-agents using eval.py"
4️⃣ Done — OpenClaw clones, installs, configures, runs, and returns results

That's it. OpenClaw handles git clone, pip install, config setup, and pipeline execution automatically. You just chat.

💡 What happens under the hood
  1. OpenClaw reads AUTOAGENT_AGENTS.md → learns the agent optimizer role
  2. OpenClaw reads README.md → understands installation and pipeline structure
  3. OpenClaw copies config.autoagent.example.yaml → config.autoagent.yaml
  4. Uses your Claude subscription (or asks for API key)
  5. Runs pip install -e . + autoagent run --target <path> --eval <script>
  6. Returns the optimization report, best configs, topology diffs, and leaderboard

🔌 OpenClaw Bridge (Advanced)

For deeper integration, AutoAgentClaw includes a bridge adapter system with 6 optional capabilities:

# config.autoagent.yaml
openclaw_bridge:
  use_cron: true              # ⏰ Scheduled optimization runs (overnight)
  use_message: true           # 💬 Progress notifications (Discord/Slack/Telegram)
  use_memory: true            # 🧠 Cross-session skill persistence
  use_sessions_spawn: true    # 🔀 Parallel optimization workers (CORAL-style)
  use_web_fetch: true         # 🌐 Research optimization techniques
  use_browser: false          # 🖥️ Not needed for optimization

Each flag activates a typed adapter protocol. When OpenClaw provides these capabilities, the adapters consume them without code changes.
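A minimal sketch of what such a typed adapter protocol could look like. The names `MessageAdapter`, `ConsoleMessages`, and `notify` are illustrative assumptions, not the actual bridge API:

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class MessageAdapter(Protocol):
    """Capability behind use_message; satisfied structurally, no inheritance."""
    def send(self, channel: str, text: str) -> None: ...

class ConsoleMessages:
    """Local fallback used when the host does not provide the capability."""
    def send(self, channel: str, text: str) -> None:
        print(f"[{channel}] {text}")

def notify(adapter: MessageAdapter, text: str) -> None:
    adapter.send("autoagent", text)
```

Because the protocol is structural, any host-provided object with a matching `send` method is accepted without code changes — which is the property the flags rely on.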

ACP (Agent Client Protocol)

AutoAgentClaw can use any ACP-compatible coding agent as its LLM backend — no API keys required:

| Agent | Command | Provider | Status |
|---|---|---|---|
| Claude Code | claude | Anthropic | Tested |
| Codex CLI | codex | OpenAI | 🔲 Supported (untested) |
| Copilot CLI | gh | GitHub | 🔲 Supported (untested) |
| Gemini CLI | gemini | Google | 🔲 Supported (untested) |
| OpenCode | opencode | Open Source | 🔲 Supported (untested) |
| Kimi CLI | kimi | Moonshot | 🔲 Supported (untested) |

All benchmarks in this README were tested with Claude Code (subscription auth, $0 cost). Other ACP agents are supported via the same interface but have not been verified yet. Community testing welcome — see CONTRIBUTING.md.


🏗️ How It Works

12-Stage Pipeline

┌──────────────────────────────────────────────────────────────────────┐
│                                                                      │
│  Phase A: Discovery          Phase B: Strategy                       │
│  ┌─────────┐ ┌─────────┐   ┌─────────────┐ ┌──────────┐           │
│  │ 1. Scan │→│ 2. Find │→  │ 4. Research  │→│ 5. Gate  │           │
│  │   Repo  │ │  Agents │   │ SS+arXiv+Web │ │ (human)  │           │
│  └─────────┘ └────┬────┘   └──────┬───────┘ └────┬─────┘           │
│                   │ 3.Gate         │               │                  │
│                   └────────────────┘               │                  │
│                                                    ▼                  │
│  Phase C: Optimization                                                │
│  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌────────────┐ │
│  │ 6. Level 1   │→│ 7. Level 2   │→│ 8. Level 3   │→│ 9. Cross-  │ │
│  │   Behavior   │ │   Config     │ │   Comms      │ │  Validate  │ │
│  │ (parallel    │ │ (parallel    │ │ (parallel    │ │            │ │
│  │  workers)    │ │  workers)    │ │  workers)    │ │            │ │
│  └──────────────┘ └──────────────┘ └──────────────┘ └─────┬──────┘ │
│                                                            │        │
│  Phase D: Analysis           Phase E: Finalize             │        │
│  ┌──────────────┐ ┌────────┐ ┌────────────┐               │        │
│  │10. Extract   │→│11. Gen │→│12. Apply   │◄──────────────┘        │
│  │   Skills     │ │ Report │ │   Gate     │                         │
│  └──────────────┘ └────────┘ └────────────┘                         │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘

| Phase | Stages | What Happens |
|---|---|---|
| A: Discovery | 1-3 | Read repo, discover agents & tunable params, classify by optimization level |
| B: Strategy | 4-5 | Multi-source research (Semantic Scholar + arXiv + web), LLM strategy planning |
| C: Optimization | 6-9 | Level-based optimization with CORAL-style parallel autonomous workers |
| D: Analysis | 10-11 | Extract skills with level + algorithm metadata, generate report |
| E: Finalization | 12 | Human reviews — apply to main, new branch, or reject |

📊 4-Level Optimization Hierarchy (MaAS-Informed)

Not every problem needs all levels. The framework automatically determines which levels are relevant:

| Level | What It Optimizes | Cost | Example |
|---|---|---|---|
| 1: Agent Behavior | Prompts, instructions, output format | 💚 Lowest | "Add conciseness constraint" |
| 2: Agent Configuration | Temperature, max_tokens, model, tools | 🟡 Moderate | "Reduce temperature to 0.3" |
| 3: Inter-Agent Communication | What info is passed, message format | 🟠 Higher | "Filter researcher output" |
| 4: System Topology | Add/remove agents, restructure graph | 🔴 Highest | "Add a review agent with feedback loop" |

The framework optimizes Level 1 first (cheapest, highest ROI), then progresses to higher levels only if lower levels show diminishing returns.
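The escalation policy above can be sketched as a toy loop. This is an illustration only, assuming "diminishing returns" means a per-pass gain below a threshold; `run_level`, `min_gain`, and `max_passes` are hypothetical names, not the framework's API:

```python
def optimize_by_level(run_level, levels=(1, 2, 3, 4), min_gain=0.01, max_passes=20):
    """Run each level until its gain drops below min_gain, then escalate.

    run_level(level) -> relative improvement from one optimization pass.
    Returns the (level, gain) history of all passes.
    """
    history = []
    for level in levels:
        for _ in range(max_passes):   # budget guard per level
            gain = run_level(level)
            history.append((level, gain))
            if gain < min_gain:       # diminishing returns at this level
                break                 # escalate to the next, costlier level
    return history
```

The cheap levels get as many passes as they keep paying for; a costlier level is only entered once the previous one plateaus.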

🔄 4 Execution Modes

execution:
  worker_mode: "autonomous"      # per-experiment | autonomous
  max_parallel_workers: 3        # 1 = sequential, >1 = parallel

| Mode | worker_mode | max_parallel_workers | Description |
|---|---|---|---|
| Sequential | per-experiment | 1 | One LLM call per experiment (fastest for small budgets) |
| Parallel | per-experiment | 3 | Multiple short-lived calls simultaneously |
| Autonomous | autonomous | 1 | CORAL-style long-lived session (recommended) |
| Full CORAL | autonomous | 3 | Parallel long-lived sessions in git worktrees |

📚 Research Phase (Multi-Source)

Before optimizing, the framework researches techniques from multiple sources:

| Source | What It Searches | API |
|---|---|---|
| Semantic Scholar | Academic papers on agent optimization | Free, no key |
| arXiv | Recent preprints | Free, no key |
| Claude Web Search | Blogs, GitHub repos, practical guides | Via Claude CLI |
| Algorithm Registry | Coded algorithms already installed | Local check |

Every finding includes provenance (source URL, paper title, year, citations) for traceability.

Skills as research cache — second run skips research if matching skills exist from a previous run.
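The provenance attached to each finding might look like the following record. The field names here are assumptions for illustration, not the framework's actual schema:

```python
from dataclasses import dataclass

@dataclass
class Finding:
    technique: str    # e.g. "output format constraints"
    source_url: str   # where the technique was found
    title: str        # paper, repo, or post title
    year: int
    citations: int    # citation count at research time, for ranking
```

Carrying the source alongside every technique is what makes a later skill auditable: a reviewer can trace "reduce temperature" back to the paper or post that suggested it.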

🧠 Self-Improving Skill System (OpenClaw-Native)

AutoAgentClaw uses a three-tier knowledge system inspired by OpenClaw skills and the self-improving agent pattern:

| Tier | What | Loaded When | Lifetime |
|---|---|---|---|
| Default Skills | 12 curated optimization techniques (CoT, output format, temperature tuning, etc.) | Always (descriptions in context) | Permanent |
| Project Skills | Research findings + learned principles for THIS project | On demand | Decay over 30 days |
| Learned Skills | Techniques promoted after working on 3+ projects | Always | Permanent until disproven |

Each skill is a proper AgentSkills directory with SKILL.md + optional scripts/, references/, assets/:

autoagent/default_skills/
  L1-chain-of-thought/
    SKILL.md                # When to apply, how to apply, evidence
    references/evidence.md  # Papers, benchmark results
  L2-temperature-tuning/
    SKILL.md
    scripts/temp_sweep.py   # Quick temperature sweep script

Self-improving: After each run, the framework:

  1. Records experiments in LEARNINGS.md (what worked, what failed)
  2. Extracts reusable principles as new skills via the autoagent-skill-creator
  3. Promotes skills to global _learned/ after 3+ project confirmations
  4. Decays confidence on old skills — stale knowledge fades naturally

Project-scoped: Skills from HotpotQA don't leak into Customer Support. Each project has its own skills + learnings directory.
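Confidence decay (step 4 above) can be sketched as a simple multiplicative fade. This assumes the decay_rate from the configuration reference (0.05) is applied per idle day; the actual schedule may differ:

```python
def decayed_confidence(confidence: float, days_idle: int, rate: float = 0.05) -> float:
    """Skill confidence fades the longer it goes without reconfirmation."""
    return confidence * (1.0 - rate) ** days_idle
```

Under this sketch a skill unconfirmed for a month retains under a quarter of its original confidence, so stale knowledge naturally loses out to freshly validated skills.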

🛡️ Protected Evaluation & Sentinel

The eval script is copied to .autoagent/private/ — the optimizer can never modify it.

A Sentinel watchdog continuously monitors for:

  • 🎯 Reward hacking (suspicious score jumps)
  • 💰 Cost anomalies (unexpected API cost spikes)
  • 🔒 Eval tampering (checksum mismatch)
  • 📉 Score regression and oscillation
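The eval-tampering check can be sketched as a checksum comparison against the copy taken at run start. This is a minimal illustration; the function names are hypothetical, not the Sentinel's actual API:

```python
import hashlib
from pathlib import Path

def checksum(path: Path) -> str:
    """SHA-256 of the protected eval script's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def eval_tampered(protected_eval: Path, recorded: str) -> bool:
    """True if the file no longer matches the checksum recorded at run start."""
    return checksum(protected_eval) != recorded
```

Any edit to the protected copy, however small, changes the digest and trips the watchdog.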

🔧 Pluggable Algorithm Registry

Add coded optimization algorithms alongside the LLM-as-optimizer:

from autoagent.algorithms import BaseOptimizer, register_optimizer

class MyOptimizer(BaseOptimizer):
    name = "my-optimizer"
    handles_levels = [1]        # Only Level 1 (behavior)
    handles_types = ["prompt"]

    def can_handle(self, target): return target.param_type == "prompt"
    def optimize(self, target, eval_fn, budget): ...

register_optimizer(MyOptimizer())

The framework handles everything else: eval, tracking, skills, dashboard. Skills remember which algorithm worked at which level for automatic selection in future runs.


🚀 Quick Start

Installation

git clone https://github.com/skyve2012/AutoAgentClaw.git
cd AutoAgentClaw
pip install -e .
autoagent setup    # Verify requirements

Run on the Demo

autoagent run \
  --target docs/examples/hotpotqa-agents \
  --eval eval.py \
  --metric score \
  --direction maximize \
  --max-experiments 8 \
  --auto-approve

The demo includes a deliberately suboptimal researcher + reasoner system that AutoAgentClaw optimizes by discovering better prompts, tuning parameters, and improving output format.

View Dashboard

autoagent dashboard --target docs/examples/hotpotqa-agents
# Opens http://localhost:3000 — topology viz, leaderboard, score chart

Run on Your Own Agents

autoagent run \
  --target ~/my-agents \
  --eval eval.py \
  --metric accuracy \
  --direction maximize

Requirements for your agent repo: an eval.py that outputs a JSON object containing your metric (e.g., {"accuracy": 0.85}). That's it.
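A minimal eval.py sketch, assuming the contract is "emit a JSON object containing the metric on stdout" (the scoring logic here is a placeholder):

```python
# eval.py — placeholder evaluation script
import json

def evaluate() -> float:
    # Run your agents against a held-out set and score them here.
    correct, total = 17, 20  # placeholder numbers
    return correct / total

if __name__ == "__main__":
    print(json.dumps({"accuracy": evaluate()}))
```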


📋 Configuration

autoagent init    # Interactive config wizard

📝 Minimum required config

target:
  path: "~/my-agents"
  eval_script: "eval.py"
  metric: "accuracy"
  direction: "maximize"

📋 Full Configuration Reference

# === LLM Provider ===
llm:
  provider: "openai"
  base_url: "https://api.openai.com/v1"
  api_key_env: "OPENAI_API_KEY"
  primary_model: "gpt-4o"
  fallback_models: ["gpt-4o-mini"]

# === Target Agent System ===
target:
  path: "~/my-agents"
  eval_script: "eval.py"
  metric: "accuracy"
  direction: "maximize"

# === Search Space (auto-discovered if omitted) ===
search_space:
  freeze: []           # Param IDs to NOT optimize
  priority: []         # Optimize these first
  dimensions: {}       # Override algorithm per dimension

# === Optimization Budget ===
budget:
  max_experiments: 100
  max_time_minutes: 180
  max_cost_usd: 50.00
  pilot_experiments: 5

# === Execution ===
execution:
  worker_mode: "autonomous"   # per-experiment | autonomous
  max_parallel_workers: 3
  eval_timeout_sec: 300

# === Pipeline Gates ===
gates:
  auto_approve: false         # Set true for overnight runs

# === Knowledge / Cross-Run Learning ===
knowledge:
  enabled: true
  skills_dir: "~/.autoagent/skills"
  decay_rate: 0.05

# === Notifications ===
notifications:
  channel: "console"          # console | discord | slack
  notify_on: ["new_best_score", "optimization_complete"]

# === Dashboard ===
dashboard:
  port: 3000
  auto_open: true

# === OpenClaw Bridge ===
openclaw_bridge:
  use_cron: true
  use_message: true
  use_memory: true
  use_sessions_spawn: true
  use_web_fetch: true
  use_browser: false

# === Sentinel Watchdog ===
sentinel:
  enabled: true
  detect_reward_hacking: true
  detect_cost_anomaly: true

💻 CLI Reference

autoagent setup                       # Verify system requirements
autoagent init                        # Create config interactively
autoagent run [options]               # Run optimization pipeline
autoagent resume --target <path>      # Resume an interrupted run
autoagent dashboard [options]         # Start web dashboard
autoagent log --target <path>         # View experiment log + leaderboard
autoagent status --target <path>      # Show all runs + skill count
autoagent skills                      # List accumulated skills

🔧 autoagent run options

| Flag | Description | Default |
|---|---|---|
| --target, -t | Path to target agent repo | (required) |
| --eval | Evaluation script path | eval.py |
| --metric, -m | Metric name to optimize | accuracy |
| --direction, -d | maximize or minimize | maximize |
| --config | Config file path | auto-detect |
| --max-experiments | Max experiments to run | 100 |
| --max-cost | Max cost in USD | 50.0 |
| --max-time | Max time in minutes | 180 |
| --auto-approve | Skip all approval gates | false |
| --interactive, -i | Interactive mode | false |

🐍 Python API

from autoagent.pipeline.runner import PipelineRunner
from autoagent.config import load_config
from pathlib import Path

config = load_config("config.autoagent.yaml")
runner = PipelineRunner(config, Path("~/my-agents").expanduser())
result = runner.run()

print(f"Baseline: {result.baseline_score:.4f}")
print(f"Best:     {result.best_score:.4f} (+{result.improvement_pct:.1f}%)")

📂 Output Artifacts

# Per-project (in your agent repo)
.autoagent/
├── runs/
│   └── ao-YYYYMMDD-HHMMSS/
│       ├── discovery.json          # Agents, params, topology graph
│       ├── strategy.json           # Optimization plan (phases, algorithms)
│       ├── attempts.jsonl          # Every experiment (scores, diffs, feedback)
│       ├── leaderboard.json        # Top-10 ranked configurations
│       └── report.md               # Human-readable optimization report
├── private/
│   └── eval.py                     # Protected eval (tamper-proof copy)
└── knowledge/
    ├── attempts/                   # CORAL-style shared attempt records
    └── notes/                      # Worker observations and learnings

# Persistent knowledge (across runs + projects)
~/.autoagent/
├── projects/<project-hash>/
│   ├── skills/                     # Research + learned skills for THIS project
│   │   └── research-textgrad/SKILL.md
│   ├── .learnings/
│   │   ├── LEARNINGS.md            # What worked / failed (structured entries)
│   │   └── ERRORS.md               # What went wrong and why
│   └── lessons/
│       └── round-001.jsonl         # Structured experiment records per round
└── skills/
    ├── _default/                   # 12 curated optimization skills (ship with framework)
    └── _learned/                   # Skills promoted from 3+ projects

🔀 Multiple Ways to Run

| Method | Command | Setup |
|---|---|---|
| 🦞 OpenClaw | "Optimize my agents at ~/my-agents" | Zero — just chat |
| 💻 CLI | autoagent run --target ~/my-agents --eval eval.py | pip install -e . |
| 🐍 Python API | PipelineRunner(config, path).run() | pip install -e . |
| Claude Code | claude in the AutoAgentClaw directory | Clone repo |
| Codex CLI | codex in the AutoAgentClaw directory | Clone repo |
| Any AI Assistant | Point it at AUTOAGENT_AGENTS.md | Clone repo |

📊 Architecture

AutoAgentClaw/
├── autoagent/
│   ├── cli.py                  # CLI: setup, init, run, dashboard, log, status, skills
│   ├── config.py               # YAML config (Pydantic models)
│   ├── models.py               # 4-level data models (MaAS hierarchy)
│   ├── pipeline/
│   │   ├── runner.py           # 12-stage pipeline orchestrator
│   │   └── stages.py          # Stage definitions with gate support
│   ├── agents/
│   │   ├── discovery_agent.py  # Auto-discovers agents, prompts, topology
│   │   ├── research_agent.py   # Multi-source research (SS + arXiv + web)
│   │   ├── strategy_agent.py   # Level-based optimization planning
│   │   ├── worker_agent.py     # CORAL-style LLM-as-optimizer workers
│   │   ├── autonomous_worker.py # Long-lived autonomous sessions
│   │   ├── prompt_opt_agent.py # Evolutionary fallback optimizer
│   │   └── reflect_agent.py   # Heartbeat reflection + convergence
│   ├── evaluator/
│   │   ├── runner.py           # Protected eval runner
│   │   ├── tracker.py          # Experiment tracking + leaderboard
│   │   └── sentinel.py        # Watchdog (reward hacking, cost, tampering)
│   ├── algorithms/
│   │   └── registry.py        # Pluggable algorithm registry
│   ├── knowledge/
│   │   ├── skill_manager.py    # Cross-run learning (extract, inject, decay)
│   │   └── shared.py          # CORAL-style shared knowledge filesystem
│   ├── bridge/
│   │   └── adapters.py        # OpenClaw bridge (6 typed adapter protocols)
│   ├── llm/
│   │   └── providers.py       # Multi-provider (Claude CLI, API, ACP, OpenAI)
│   ├── dashboard/
│   │   ├── server.py           # FastAPI backend (8 API endpoints)
│   │   └── static/index.html  # Dashboard UI (topology, charts, leaderboard)
│   └── workspace.py           # Git worktree manager (parallel workers)
├── AUTOAGENT_AGENTS.md         # OpenClaw service definition
├── CLAUDE.md                   # Claude Code instructions
├── config.autoagent.example.yaml
├── .claude/skills/
│   ├── autoagent/             # Main entry skill (OpenClaw integration)
│   ├── autoagent-skill-creator/ # Meta-skill for creating optimization skills
│   ├── autoagent-grid-search/ # Algorithm skill: systematic parameter search
│   ├── autoagent-feedback-loop/ # Algorithm skill: feedback-driven revision
│   └── autoagent-optuna-search/ # Algorithm skill: Bayesian optimization
├── docs/
│   ├── images/logo.png
│   └── examples/
│       ├── hotpotqa-agents/   # 2-agent QA (Real HotpotQA, +29.3%)
│       ├── math-reasoning/    # 3-agent math (Real GSM8K, +7.1%)
│       ├── customer-support/  # 3-agent support (Synthetic, +5.1%)
│       └── code-generation/   # 1-agent coder (Real HumanEval)
└── tests/                      # 27 unit tests

✨ Key Contributions

| Contribution | Description |
|---|---|
| 🔍 Zero-Config Agent Discovery | Reads any agent repo and auto-discovers tunable parameters — no framework lock-in, no rewriting required |
| 📐 4-Level Optimization Hierarchy | MaAS-informed systematic approach: Behavior → Configuration → Communication → Topology, optimizing cheapest levels first |
| 🔬 Research-Driven Strategy | Multi-source research (Semantic Scholar + arXiv + GitHub + PyPI) before optimization — the first agent optimizer to search for code implementations |
| 🧠 Self-Improving Skill System | OpenClaw-native three-tier knowledge: curated defaults → project-specific learnings → cross-project promotions with confidence decay |
| 🤖 Autonomous LLM-as-Optimizer | CORAL-style long-lived Claude sessions that read code, propose changes, evaluate, and iterate — the technique behind the largest verified gains (+29.3% on HotpotQA) |
| 🔄 Pluggable Algorithm Skills | Algorithms are installable skills with SKILL.md + scripts/ — add a new optimizer by creating a directory |
| 🛡️ Protected Eval + Sentinel | Tamper-proof evaluation with watchdog monitoring for reward hacking, cost anomalies, and score oscillation |
| 📊 Live Dashboard | Real-time topology visualization, performance charts, experiment leaderboard, and dimension heatmaps |
| 🦞 OpenClaw Native | Works with Claude subscription ($0 API cost), supports 6 ACP agents, bridge adapters for parallel execution |

📋 Roadmap

Core Pipeline

  • 12-stage pipeline with human approval gates
  • Zero-config agent discovery (LLM-driven + regex fallback)
  • Multi-source research (Semantic Scholar + arXiv + GitHub + PyPI)
  • 4-level optimization hierarchy (MaAS-informed)
  • Protected evaluator with checksum verification
  • Sentinel watchdog (reward hacking, cost anomaly, score oscillation)
  • Cross-validation with git snapshot restore
  • Optimization report generation

Execution Modes

  • Per-experiment mode (one LLM call per experiment)
  • Autonomous mode (CORAL-style long-lived Claude sessions)
  • Parallel workers via git worktrees
  • SIGINT interrupt-resume heartbeat (CORAL-style mid-session reflection)
  • autoagent resume command for crashed/interrupted runs

Skill System

  • 12 curated default optimization skills
  • OpenClaw-native skill format (SKILL.md + scripts/ + references/)
  • Skill-creator meta-skill for generating new skills
  • Three-tier knowledge: default → project → learned
  • Project-scoped skills (no cross-project leakage)
  • Confidence decay on stale skills
  • Structured learnings (LEARNINGS.md + round-N.jsonl)
  • Skill promotion after 3+ project confirmations
  • Auto-install pip packages discovered during research
  • Generate BaseOptimizer adapters for discovered libraries

Algorithms

  • LLM-as-optimizer (autonomous Claude sessions)
  • Grid search (systematic parameter sweep)
  • Feedback loop (iterative LLM revision)
  • Optuna Bayesian optimization
  • Pluggable algorithm registry with auto-discovery
  • Integration with real TextGrad library
  • Integration with DSPy MIPROv2
  • MCTS-based topology search (AFlow-style)
  • Multi-objective Pareto optimization

Integration & UX

  • OpenClaw integration (AUTOAGENT_AGENTS.md + bridge adapters)
  • ACP support (Claude, Codex, Gemini, OpenCode, Kimi)
  • Claude subscription auth ($0 cost)
  • Web dashboard (topology viz, leaderboard, charts)
  • CLI (setup, init, run, dashboard, log, status, skills)
  • Verify and test non-Claude ACP agents (Codex, Gemini, etc.)
  • Docker/SSH remote execution for heavy evaluations
  • ClaweHub skill publishing
  • Multi-language README (CN, JA, KR)

Benchmarks & Examples

  • HotpotQA — 2-agent QA (real data, +29.3%)
  • GSM8K Math — 3-agent reasoning (real data, +7.1%)
  • Customer Support — 3-agent triage (synthetic, +5.1%)
  • Code Generation — 1-agent coder (real HumanEval data)
  • GAIA benchmark (compare with EvoAgentX results)
  • SWE-bench (software engineering agents)
  • WebArena (web navigation agents)
  • Multi-agent debate optimization benchmark

Advanced Features (Future)

  • Multi-round auto-continue (run N rounds until diminishing returns)
  • Cross-model adversarial review (Claude optimizes, GPT reviews)
  • Pilot experiments before strategy commitment
  • Multi-agent debate for strategy planning
  • Fine-tuning integration (Level 5: RL/SFT on agents)
  • Cost-aware Pareto optimization (quality vs latency vs cost)

🙏 Acknowledgments

AutoAgentClaw builds on ideas from:

  • CORAL — Multi-agent evolution infrastructure (parallel workers, protected eval, shared knowledge, heartbeat)
  • AutoResearchClaw — OpenClaw integration, MetaClaw cross-run learning, staged pipeline, bridge adapters
  • ResearchClaw — Claim/evidence graph, provenance tracking, experiment contracts
  • ARIS — Research-driven strategy, overnight autonomous runs
  • Karpathy's autoresearch — The outer-loop optimizer pattern
  • MaAS — 4-level optimization hierarchy for multi-agent systems
  • EvoAgentX — Pluggable algorithm registry pattern
  • OpenClaw Skill System — AgentSkills standard, self-improving skill pattern

📄 License

MIT


📝 Citation

If you find AutoAgentClaw useful in your research or projects, please cite:

@software{shen2026autoagentclaw,
  title     = {AutoAgentClaw: Automatic Multi-Agent System Optimization},
  author    = {Shen, Hongyu},
  year      = {2026},
  url       = {https://github.com/skyve2012/AutoAgentClaw},
  note      = {A framework for automatically optimizing agent systems through
               zero-config discovery, research-driven strategy, and self-improving
               skills. Built on OpenClaw with CORAL-style parallel execution and
               MaAS-informed 4-level optimization hierarchy.}
}

Built with 🦞 by Hongyu Shen
