brainqub3/agent-labs

Brainqub3 Agent Labs

Local-first, open-source Agent Arena for prototyping single-agent systems (SAS), multi-agent systems (MAS), evaluator-first experimentation, and paper-aligned scaling analysis.

About Brainqub3

Brainqub3 is an AI agent consultancy focused on production-grade workflow automation. The goal is to ship measurable workflow improvements in weeks, not quarters, integrated into real systems with authentication, logging, guardrails, and ownership handoff.

  • Website: https://brainqub3.com
  • Core delivery model: integrated software outcomes (not workshops, decks, or long transformation programs)
  • Typical engagement cadence:
    • 1-2 sprints: Prototype to Production
    • 2-4 sprints: Workflow Integration Build
    • 1-2 days/week: Fractional Engineer retainer
  • What Brainqub3 delivers:
    • Workflow automation (triage, routing, intake, enrichment, summarisation)
    • Copilots and agents embedded in existing tools (ticketing, CRM, Slack, email, dashboards)
    • Production hardening (evals, guardrails, logging, monitoring, rollback planning)
  • Delivery loop: scope -> integrate -> guardrails -> measure -> ship -> handoff
  • Repository maintenance: this repository is maintained by Brainqub3 founder John Adeojo.
  • Company: Brainqub3 is a consulting brand of DATA-CENTRIC SOLUTIONS LTD (England and Wales, Company No. 14829432), London, UK.

Environment Contract (Required)

Every local run (human or agent-driven) must satisfy:

  1. Python 3.11+
  2. uv installed and on PATH
  3. Project dependencies synced into .venv via uv
  4. Official Python package claude-agent-sdk installed in that environment
  5. ANTHROPIC_API_KEY present in .env or shell environment

Use uv run ... for all project commands.
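
The contract above can be checked offline before any run. The sketch below mirrors those five requirements in plain Python; the actual checks `brainqub3 doctor` performs may differ, and the function name here is illustrative:

```python
# Offline sketch of the environment contract: Python version, uv on PATH,
# claude-agent-sdk installed, ANTHROPIC_API_KEY set. Illustrative only.
import importlib.metadata as metadata
import os
import shutil
import sys


def check_environment() -> list[str]:
    """Return a list of contract violations; empty list means all checks pass."""
    problems: list[str] = []
    if sys.version_info < (3, 11):
        problems.append(f"Python 3.11+ required, found {sys.version.split()[0]}")
    if shutil.which("uv") is None:
        problems.append("uv not found on PATH")
    try:
        metadata.version("claude-agent-sdk")
    except metadata.PackageNotFoundError:
        problems.append("claude-agent-sdk not installed in this environment")
    if not os.environ.get("ANTHROPIC_API_KEY"):
        problems.append("ANTHROPIC_API_KEY not set (.env or shell)")
    return problems
```

An empty return means the contract is satisfied; anything else should stop the run, matching the fail-fast behaviour of the preflight scripts.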

Claude Code Auto Setup (Deterministic)

For first-clone setup in Claude Code, run:

/lab-setup

If your Claude Code UI namespaces project commands, use /project:lab-setup.

For live checks in namespaced UIs, use /project:lab-setup-live.

Modes:

  • /lab-setup: deterministic local bootstrap + offline doctor checks.
  • /lab-setup-live: full live doctor checks (requires ANTHROPIC_API_KEY).

Claude may not be able to read .env due to repository permissions, so before running /lab-setup-live, ensure .env is already present with ANTHROPIC_API_KEY (or export the key in your shell environment).

These commands use the repo preflight scripts and work across Windows, Linux, and macOS.

Fast Path (Recommended)

Run preflight before SAS/MAS runs.

Unix/macOS:

bash scripts/preflight.sh --bootstrap-uv

Windows PowerShell:

./scripts/preflight.ps1 --bootstrap-uv

Preflight does all of the following:

  1. Installs uv if missing (when --bootstrap-uv is used)
  2. Syncs dependencies (uv sync --locked --all-extras if uv.lock exists, else uv sync --all-extras)
  3. Verifies official claude-agent-sdk is installed in the uv environment
  4. Runs brainqub3 doctor (or --offline checks)
  5. Fails fast on any missing requirement

Manual Setup (Step-by-Step)

1) Install Python 3.11

Windows:

py -3.11 --version

Unix/macOS:

python3.11 --version

2) Install uv

Windows:

py -3.11 -m pip install --user uv

Unix/macOS:

python3.11 -m pip install --user uv

3) Sync dependencies

From repo root:

uv --version
uv sync --all-extras

If uv.lock exists and you want strict reproducibility:

uv sync --locked --all-extras

4) Configure API key

Create .env from template and set ANTHROPIC_API_KEY:

cp .env.example .env

Windows PowerShell alternative:

Copy-Item .env.example .env

5) Verify official Agent SDK + environment

uv run python -c "import importlib.metadata as m; print(m.version('claude-agent-sdk'))"
uv run brainqub3 doctor

If doctor fails, do not run SAS/MAS until it passes.

Proxy Note (Common Failure Mode)

Live runs will fail if proxy variables point to a dead proxy. If needed, clear them in your current shell:

Windows PowerShell:

$env:HTTP_PROXY=''; $env:HTTPS_PROXY=''; $env:ALL_PROXY=''
$env:GIT_HTTP_PROXY=''; $env:GIT_HTTPS_PROXY=''

Unix/macOS:

unset HTTP_PROXY HTTPS_PROXY ALL_PROXY GIT_HTTP_PROXY GIT_HTTPS_PROXY
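
The same cleanup can be done in-process before launching a live run from Python; the helper below simply mirrors the shell commands above (variable names taken from this section):

```python
# In-process sketch: remove stale proxy variables so a live run does not
# route through a dead proxy. Mirrors the shell commands in this README.
import os

PROXY_VARS = ("HTTP_PROXY", "HTTPS_PROXY", "ALL_PROXY",
              "GIT_HTTP_PROXY", "GIT_HTTPS_PROXY")


def clear_proxies() -> list[str]:
    """Delete proxy variables (upper- and lower-case) and return what was cleared."""
    cleared: list[str] = []
    for name in PROXY_VARS:
        for candidate in (name, name.lower()):
            if candidate in os.environ:
                del os.environ[candidate]
                cleared.append(candidate)
    return cleared
```

Note this only affects the current process and its children, just as the shell commands only affect the current shell.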

Minimal Validation Runbook

Run evaluator tests first:

uv run pytest brainqub3/tasks/examples/hello_world/tests -q

Live SAS:

uv run brainqub3 run sas --task hello_world --model claude-sonnet-4-5 --instances 3 --require-live

Live MAS:

uv run brainqub3 run mas --task hello_world --arch hybrid --model claude-sonnet-4-5 --n-agents 3 --instances 3 --require-live

By default, run sas and run mas auto-load the HTML dashboard at http://127.0.0.1:8765. Use --no-dashboard to disable auto-loading for non-interactive runs.

Dashboard:

uv run brainqub3 dashboard

By default this starts a lightweight local web server at http://127.0.0.1:8765. You can override runtime options:

uv run brainqub3 dashboard --host 0.0.0.0 --port 9000 --refresh-ms 3000

If you cloned the repo and want dashboard data from committed runs:

uv run brainqub3 run reindex
uv run brainqub3 dashboard

Backend Modes

  • Default behavior is live-first: a missing SDK or key causes a fail-fast error unless mock mode is explicitly allowed.
  • Use --require-live for validation runs; it hard-fails if backend is not live.
  • Use --allow-mock only for explicit offline dry-runs.
  • Do not combine --require-live with --allow-mock.
  • Run artefacts store runtime.backend_mode and runtime.backend_reason.
  • SDK transport buffer defaults to 10MB in Arena runs; override via BRAINQUB3_SDK_MAX_BUFFER_SIZE_BYTES when needed.
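
The live-first rules above resolve to a small decision table. The sketch below is a hypothetical reconstruction of that logic (the function name and reason strings are illustrative, not the repo's API), showing why --require-live and --allow-mock cannot be combined:

```python
# Hypothetical sketch of live-first backend resolution as described above.
# Returns (backend_mode, backend_reason), matching the artefact fields
# runtime.backend_mode / runtime.backend_reason in spirit only.
def resolve_backend_mode(sdk_ok: bool, key_ok: bool,
                         require_live: bool = False,
                         allow_mock: bool = False) -> tuple[str, str]:
    if require_live and allow_mock:
        raise ValueError("--require-live and --allow-mock are mutually exclusive")
    if sdk_ok and key_ok:
        return "live", "sdk_and_key_present"
    # backend cannot go live from here
    if require_live:
        raise RuntimeError("live backend required but SDK/key is missing")
    if allow_mock:
        return "mock", "explicit_allow_mock"
    raise RuntimeError("missing SDK/key; pass --allow-mock for offline dry-runs")
```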

Run Immutability

  • Completed runs are finalized with data/runs/<run_id>/run_manifest.json (content hashes).
  • Finalized runs are immutable: recomputing metrics in-place is blocked.
  • Use soft delete to tombstone a run in the local index while keeping artefacts.
  • Use hard delete to remove both artefacts and index rows.
  • Empirical prediction outputs are written to data/predictions/ to avoid mutating finalized runs.
  • Rebuild the SQLite index from canonical run artefacts with run reindex.
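
A manifest of content hashes is what makes immutability checkable. The sketch below assumes a simple per-file SHA-256 scheme; the repo's actual run_manifest.json layout may differ:

```python
# Sketch of manifest-style content hashing (assumed scheme: SHA-256 per file,
# keyed by path relative to the run directory). Illustrative only.
import hashlib
import json
from pathlib import Path


def build_manifest(run_dir: str) -> dict[str, str]:
    """Hash every file in a run directory except the manifest itself."""
    root = Path(run_dir)
    hashes: dict[str, str] = {}
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.name != "run_manifest.json":
            hashes[str(path.relative_to(root))] = hashlib.sha256(
                path.read_bytes()).hexdigest()
    return hashes


def verify_manifest(run_dir: str) -> bool:
    """True if the stored manifest still matches the on-disk artefacts."""
    root = Path(run_dir)
    manifest = json.loads((root / "run_manifest.json").read_text())
    return manifest == build_manifest(run_dir)
```

With this scheme, any in-place mutation of a finalized run changes a hash and fails verification, which is the property run reindex --verify-manifests relies on.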

Core Commands

  • uv run brainqub3 task init <task_name>
  • uv run brainqub3 eval test <task_name>
  • uv run brainqub3 doctor
  • uv run brainqub3 run sas --task <task> --model <model> --instances <N> --require-live [--no-dashboard]
  • uv run brainqub3 run sas --task <task> --model <model> --instances <N> --allowed-tools Read,Write,Edit,Bash,Glob,Grep,WebSearch,WebFetch --require-live [--no-dashboard]
  • uv run brainqub3 run mas --task <task> --arch <independent|centralised|decentralised|hybrid> --model <model> --n-agents <n> --instances <N> --allowed-tools Read,Write,Edit,Bash,Glob,Grep,WebSearch,WebFetch --require-live [--no-dashboard]
  • uv run brainqub3 run elasticity --task <task> --arch <independent|centralised|decentralised|hybrid> --model <model> --n-agents-grid 3,4 --tool-count-grid 6,8 --batch-id <batch>
  • uv run brainqub3 run delete --run-id <id> [--reason "<text>"]
  • uv run brainqub3 run delete --run-id <id> --hard
  • uv run brainqub3 run reindex [--verify-manifests]
  • uv run brainqub3 metrics compute --run-id <id>
  • uv run brainqub3 model predict --run-id <id>
  • uv run brainqub3 scenario run --scenario data/scenarios/<name>.yaml [--batch-id <batch>]
  • uv run brainqub3 dashboard

Temp Files Policy

  • Store ad-hoc analysis scripts and scratch outputs in temp/.
  • temp/ is git-kept via temp/.gitkeep; all other files under temp/ are ignored.
  • Do not create new scratch files in the repository root.

Scaling Model (Simple)

  • The scaling model learns elasticities from controlled empirical run pairs per MAS architecture and per coordination metric (overhead_pct, message_density_c, redundancy_R, efficiency_Ec, error_amp_Ae).
  • eta_n is estimated from pairs where tool_count_T is fixed and n_agents changes: eta_n = ln(x_j / x_i) / ln(n_j / n_i).
  • eta_T is estimated from pairs where n_agents is fixed and tool_count_T changes: eta_T = ln(x_j / x_i) / ln(T_j / T_i).
  • The model uses the median of valid pairwise slopes as the elasticity estimate for each axis.
  • Each base coordination metric is then scaled with a non-dimensional factor: x_hat = clamp(x_base * (n_agents / n0)^eta_n * (tool_count_T / T0)^eta_T).
  • Scaled coordination metrics are fed into the Table 4 prediction model to produce P_hat; dashboard curves plot SAS-relative deltas (delta_vs_sas = P_architecture - P_SAS).
  • In the dashboard Scaling Laws tab, select the explicit elasticity calibration batch for scaling analysis (not a SAS-vs-MAS comparison-only batch).
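
The estimation steps above can be written out directly. The sketch below implements the pairwise log-slope formula and the median aggregation exactly as stated, plus the non-dimensional scaling with a clamp; function names and the clamp bounds are illustrative, not the repo's API:

```python
# Worked sketch of the elasticity estimate: eta is the median of pairwise
# log-log slopes over controlled run pairs where the other axis is fixed.
import math
from statistics import median


def estimate_eta(pairs: list[tuple[float, float, float, float]]) -> float:
    """pairs: (x_i, x_j, s_i, s_j) where s is the varying scale axis
    (n_agents or tool_count_T) and x is the coordination metric value."""
    slopes = []
    for x_i, x_j, s_i, s_j in pairs:
        # a pair is valid only if all quantities are positive and s changed
        if min(x_i, x_j, s_i, s_j) > 0 and s_i != s_j:
            slopes.append(math.log(x_j / x_i) / math.log(s_j / s_i))
    if not slopes:
        raise ValueError("no valid controlled pairs")
    return median(slopes)


def scale_metric(x_base: float, n: float, n0: float, eta_n: float,
                 T: float, T0: float, eta_T: float,
                 lo: float = 0.0, hi: float = float("inf")) -> float:
    """x_hat = clamp(x_base * (n/n0)^eta_n * (T/T0)^eta_T); bounds are assumed."""
    x_hat = x_base * (n / n0) ** eta_n * (T / T0) ** eta_T
    return min(max(x_hat, lo), hi)
```

For example, a pair where the metric doubles when n_agents doubles (tool count fixed) yields eta_n = ln(2)/ln(2) = 1, i.e. linear scaling on that axis.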

Paper reference:

Intelligence index source:

Scaling guardrails:

  • Hypothetical MAS scaling analysis now fails fast when required elasticity support is missing (for example source=default_zero / zero controlled pairs).
  • Calibrate first with shared-batch elasticity sweeps: uv run brainqub3 run elasticity --task <task> --arch <arch> --batch-id <batch> --n-agents-grid 3,4 --tool-count-grid 6,8.
  • Elasticity tool policy uses core-vs-full by default: core is Read, Write, Edit, Bash, Glob, Grep; full adds WebSearch and WebFetch.
  • Live runs now enforce allowed_tools at runtime through claude-agent-sdk permission callbacks (not metadata-only). Tool attempts are recorded in telemetry tool_called events.
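
The core-vs-full tool policy and the allow/deny decision can be summarised in a few lines. This is only a sketch of the policy shape; real enforcement goes through claude-agent-sdk permission callbacks, which this does not reproduce, and the telemetry record layout here is an assumption:

```python
# Illustrative core-vs-full tool policy from this README, with a stand-in
# allow/deny check that records every attempt as a tool_called-style event.
CORE_TOOLS = {"Read", "Write", "Edit", "Bash", "Glob", "Grep"}
FULL_TOOLS = CORE_TOOLS | {"WebSearch", "WebFetch"}


def permit_tool(tool_name: str, allowed_tools: set[str],
                telemetry: list[dict]) -> bool:
    """Decide whether a tool call is permitted, logging the attempt either way."""
    decision = tool_name in allowed_tools
    telemetry.append({"event": "tool_called",
                      "tool": tool_name,
                      "allowed": decision})
    return decision
```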
