Complementary benchmark for multi-agent group behavior #217
ThinkOffApp started this conversation in General
AgentBench evaluates individual agent capability across 8 environments, which is valuable for understanding single-agent performance. We built OMATS (OpenClaw Multi-Agent Test Suite, https://github.com/ThinkOffApp/openclaw-multi-agent-test-suite) to cover a dimension AgentBench doesn't: how models behave when multiple agents share a room.
OMATS has 28 scripted scenarios across three stages. Stage 3 tests single-agent discipline (loop avoidance, idle management). Stage 4 tests multi-agent communication (stop orders, echo resistance, social pressure, indirect address). Stage 5 tests agent management (task delegation, noise control, conflict resolution).
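A stage/scenario layout like the one above could be represented as follows. This is a hypothetical sketch, not the suite's actual schema; the field names, scenario names, and `by_stage` helper are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    # Hypothetical representation of one scripted OMATS scenario;
    # not the actual data model used by the suite.
    stage: int    # 3 = single-agent discipline, 4 = communication, 5 = management
    name: str
    prompt: str   # scripted multi-agent transcript fed to the model under test

# Illustrative entries only; the real suite has 28 scenarios.
SCENARIOS = [
    Scenario(3, "loop-avoidance", "..."),
    Scenario(4, "stop-order", "..."),
    Scenario(5, "task-delegation", "..."),
]

def by_stage(stage: int) -> list[Scenario]:
    """Filter scenarios down to one stage."""
    return [s for s in SCENARIOS if s.stage == stage]
```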
The failure modes differ from those in single-agent tasks. Models that score well on AgentBench-style benchmarks can still fail badly in group settings: echoing what other agents said, duplicating work, imposing guardrails on each other, and falling into ACK loops.
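One of these failure modes, echoing, can be checked mechanically. As a sketch (this is an assumed heuristic, not how OMATS actually scores echo resistance), word-level Jaccard overlap between a reply and prior messages flags near-verbatim repeats:

```python
def echo_score(reply: str, prior_messages: list[str]) -> float:
    """Highest word-level Jaccard overlap between a reply and any prior
    message; values near 1.0 suggest the agent is merely echoing.
    Illustrative heuristic, not the suite's actual metric."""
    reply_words = set(reply.lower().split())
    best = 0.0
    for msg in prior_messages:
        words = set(msg.lower().split())
        union = reply_words | words
        if not union:
            continue
        best = max(best, len(reply_words & words) / len(union))
    return best
```

A detector like this only catches surface-level repetition; semantic echoes (paraphrases) would need embedding-based comparison.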
We've completed a first testing run of 12 models, with continuous 0.0-1.0 scoring, using LLM APIs to simulate OpenClaw requests. The results differentiate well between model tiers and surface some surprises (e.g., newer model versions sometimes score worse than older ones on group behavior). Next, we will run the tests with actual OpenClaw agents.
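Continuous 0.0-1.0 per-scenario scores need some aggregation into a per-model number. A minimal sketch, assuming an unweighted mean (the post does not describe OMATS's actual weighting):

```python
from statistics import mean

def suite_score(per_scenario: dict[str, float]) -> float:
    """Aggregate continuous 0.0-1.0 scenario scores into one suite score.
    The plain mean here is an assumption; the real suite may weight
    stages or scenarios differently."""
    for name, s in per_scenario.items():
        if not 0.0 <= s <= 1.0:
            raise ValueError(f"score for {name!r} out of range: {s}")
    return mean(per_scenario.values())
```

Range validation matters when scores come from an LLM judge, which can occasionally emit out-of-band values.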