Complementary benchmark for multi-agent group behavior #217
ThinkOffApp started this conversation in General
AgentBench evaluates individual agent capability across 8 environments, which is valuable for understanding single-agent performance. We built OMATS (OpenClaw Multi-Agent Test Suite, https://github.com/ThinkOffApp/openclaw-multi-agent-test-suite) to cover a dimension AgentBench doesn't: how models behave when multiple agents share a room.
OMATS has 28 scripted scenarios across three stages. Stage 3 tests single-agent discipline (loop avoidance, idle management). Stage 4 tests multi-agent communication (stop orders, echo resistance, social pressure, indirect address). Stage 5 tests agent management (task delegation, noise control, conflict resolution).
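A stage/scenario layout like the one above could be represented as follows. This is a hypothetical sketch, not the suite's actual schema; the field names, scenario names, and `by_stage` helper are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    # Hypothetical representation of one scripted OMATS scenario;
    # not the actual data model used by the suite.
    stage: int    # 3 = single-agent discipline, 4 = communication, 5 = management
    name: str
    prompt: str   # scripted multi-agent transcript fed to the model under test

# Illustrative entries only; the real suite has 28 scenarios.
SCENARIOS = [
    Scenario(3, "loop-avoidance", "..."),
    Scenario(4, "stop-order", "..."),
    Scenario(5, "task-delegation", "..."),
]

def by_stage(stage: int) -> list[Scenario]:
    """Filter scenarios down to one stage."""
    return [s for s in SCENARIOS if s.stage == stage]
```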
The failure modes differ from those in single-agent tasks. Models that score well on AgentBench-style benchmarks can still fail badly in group settings: echoing what other agents said, duplicating work, imposing guardrails on each other, and falling into ACK loops.
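One of these failure modes, echoing, can be checked mechanically. As a sketch (this is an assumed heuristic, not how OMATS actually scores echo resistance), word-level Jaccard overlap between a reply and prior messages flags near-verbatim repeats:

```python
def echo_score(reply: str, prior_messages: list[str]) -> float:
    """Highest word-level Jaccard overlap between a reply and any prior
    message; values near 1.0 suggest the agent is merely echoing.
    Illustrative heuristic, not the suite's actual metric."""
    reply_words = set(reply.lower().split())
    best = 0.0
    for msg in prior_messages:
        words = set(msg.lower().split())
        union = reply_words | words
        if not union:
            continue
        best = max(best, len(reply_words & words) / len(union))
    return best
```

A detector like this only catches surface-level repetition; semantic echoes (paraphrases) would need embedding-based comparison.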
We've completed a first testing run of 12 models, with continuous 0.0-1.0 scoring, using LLM APIs to simulate OpenClaw requests. The results differentiate well between model tiers and surface some surprises (e.g., newer model versions sometimes score worse than older ones on group behavior). Next, we will run the tests with actual OpenClaw agents.
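Continuous 0.0-1.0 per-scenario scores need some aggregation into a per-model number. A minimal sketch, assuming an unweighted mean (the post does not describe OMATS's actual weighting):

```python
from statistics import mean

def suite_score(per_scenario: dict[str, float]) -> float:
    """Aggregate continuous 0.0-1.0 scenario scores into one suite score.
    The plain mean here is an assumption; the real suite may weight
    stages or scenarios differently."""
    for name, s in per_scenario.items():
        if not 0.0 <= s <= 1.0:
            raise ValueError(f"score for {name!r} out of range: {s}")
    return mean(per_scenario.values())
```

Range validation matters when scores come from an LLM judge, which can occasionally emit out-of-band values.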