Summary

  • Fixed performance bug where RNAD/NFSP agents were being re-instantiated on every game iteration
  • Agents are now created once before the game loop and reused

Problem

The `RNaDAgentWrapper` and `NFSPAgentWrapper` were created inside the `for gi in range(games)` loop in `pyspiel_runner.py`. This caused:

  • JAX/Haiku model re-initialization on every game
  • Repeated JIT compilation overhead
  • 50 games × 2 agents = 100 model loads instead of 2

On GitHub Actions (AMD EPYC CPUs), JAX compilation is slow, causing runs to take hours instead of minutes.

Solution

Move agent instantiation before the game loop so models are loaded once and reused.
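
For illustration, a minimal sketch of the before/after structure (the constructor factories and `play_one_game` helper are stand-ins, not the actual signatures in `pyspiel_runner.py`):

```python
# Hypothetical sketch: hoist the expensive agent construction out of the game
# loop. Constructing the RNaD/NFSP wrappers triggers JAX/Haiku initialization
# and JIT compilation, so it should happen exactly once.
def run_games(games, make_rnad_agent, make_nfsp_agent, play_one_game):
    # Before this fix, both agents were built inside the loop: 2 * games loads.
    rnad_agent = make_rnad_agent()   # one JAX/Haiku init + JIT compile
    nfsp_agent = make_nfsp_agent()   # one JAX/Haiku init + JIT compile

    results = []
    for gi in range(games):
        # Agents are reused across all games.
        results.append(play_one_game(gi, rnad_agent, nfsp_agent))
    return results
```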

Test plan

  • Tested locally with docker compose - games progress smoothly
  • RNAD models now load only once at startup
  • Rebuild Docker image and retest on GitHub Actions


gsmithline and others added 30 commits November 29, 2025 14:25
…ndle non-JSON responses in remote negotiator
- Fix syntax error in nfsp.py (class indentation)
- Fix allocation logic in pyspiel_runner.py (correct who gets what on accept)
- Fix indentation in run_entire_matrix.py
- Remove dead code in pyspiel_runner.py
- Add missing __init__.py files for proper package structure
- Add GitHub Actions workflow for Docker image publishing to GHCR
- Add meta-game framework documentation to README
- Add SUBMISSION.md for competition abstract
- Update .gitignore to exclude runtime data directories

Simplifies Docker build by using pre-built wheel from PyPI.

- abseil-cpp -> open_spiel/abseil-cpp
- pybind11 -> pybind11
- pybind11_abseil -> open_spiel/pybind11_abseil

- Add double_dummy_solver clone from jblespiau/dds repo
- Add pybind11 Python binding source files
- Fix PYTHONPATH to remove undefined variable warning

- Rewrite main README to focus on meta-game evaluation framework
- Add quick start with local, Cloud Run, and platform options
- Document assessment configuration and purple agent requirements
- Enhance SUBMISSION.md with competition compliance details
- Add evaluation metrics explanation and result format

- Correct regret formula: Regret(π) = max(0, u(π, σ*) - u(σ*))
  (matches code implementation in mene_solver.py)
- Clarify M[i][j] = agent i's average payoff against agent j
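
A hedged sketch of how the corrected formula could be computed from the payoff matrix, assuming `sigma` is the MENE mixture and `u(σ*)` means the mixture's payoff against itself; the authoritative version is in mene_solver.py:

```python
import numpy as np

def regret_of_pure_strategy(M: np.ndarray, sigma: np.ndarray, i: int) -> float:
    """Regret(pi_i) = max(0, u(pi_i, sigma*) - u(sigma*)).

    M[i][j] is agent i's average payoff against agent j; sigma is the
    equilibrium mixture over agents. Illustrative only.
    """
    u_deviate = float(M[i] @ sigma)           # pure strategy i against sigma*
    u_equilibrium = float(sigma @ M @ sigma)  # sigma* against itself
    return max(0.0, u_deviate - u_equilibrium)
```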

Added required dependencies for MENE solver:
- cvxpy>=1.4.0
- numpy>=1.26.0
- clarabel (cvxpy solver backend)
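
As a quick sanity check that the Clarabel backend is reachable through cvxpy (the LP below is a toy problem, not the MENE program from mene_solver.py):

```python
import cvxpy as cp
import numpy as np

# Toy LP: maximize sum(x) with x nonnegative and summing to at most 1.
x = cp.Variable(3, nonneg=True)
problem = cp.Problem(cp.Maximize(np.ones(3) @ x), [cp.sum(x) <= 1])
problem.solve(solver=cp.CLARABEL)  # exercises the clarabel backend added above
print(problem.status, x.value)
```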

Removed the 'Interpreting Results' section from the README.

- Compute standard error = std / sqrt(n) for regrets and agent metrics
- Add SE to per-agent output in run_metagame_analysis
- Update SUBMISSION.md with CI format in result schema
- Following paper methodology from Wiedenbeck et al. 2014

Results now include mean±SE format matching IJCAI paper Figure 1.
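
A minimal sketch of the SE computation described above (the helper name and sample numbers are made up):

```python
import numpy as np

def mean_and_se(samples):
    """Mean and standard error of the mean for independent samples."""
    samples = np.asarray(samples, dtype=float)
    se = samples.std(ddof=1) / np.sqrt(len(samples))  # SE = std / sqrt(n)
    return samples.mean(), se

# Example with made-up per-game regrets for one agent:
mean, se = mean_and_se([0.12, 0.08, 0.15, 0.10])
print(f"{mean:.3f} ± {se:.3f}")
```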

Bootstrap standard error is the standard deviation of the bootstrap
distribution, not divided by sqrt(n). The division by sqrt(n) is for
standard error of the mean with independent samples, but in bootstrap
resampling the std of bootstrap estimates directly estimates the SE.
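
A sketch of the corrected bootstrap SE, using the std of the bootstrap estimates directly (function name and defaults are illustrative):

```python
import numpy as np

def bootstrap_se(samples, n_boot=1000, statistic=np.mean, seed=0):
    """SE via bootstrap: the std of the bootstrap distribution itself,
    with no extra division by sqrt(n)."""
    rng = np.random.default_rng(seed)
    samples = np.asarray(samples, dtype=float)
    estimates = [
        statistic(rng.choice(samples, size=len(samples), replace=True))
        for _ in range(n_boot)
    ]
    return float(np.std(estimates, ddof=1))
```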

- Add results/ directory with baseline_evaluation.json containing
  metrics for 6 baseline agents (soft, tough, aspiration, walk, nfsp, rnad)
- Add scenario.toml for leaderboard configuration
- Results include MENE regret, welfare metrics (UW, NW, NWA), and EF1

- Remove nested results structure that caused DuckDB binding errors
- Create separate JSON file per agent with flat structure
- All fields (agent_name, mene_regret, etc.) now at top level
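
A hedged sketch of the flat per-agent layout (field names beyond agent_name and mene_regret, and the example values, are made up):

```python
import json
from pathlib import Path

def write_agent_result(out_dir: Path, agent_name: str, metrics: dict) -> Path:
    """Write one flat JSON file per agent; every field sits at the top level."""
    record = {"agent_name": agent_name, **metrics}  # no nested "results" key
    path = out_dir / f"{agent_name}.json"
    path.write_text(json.dumps(record, indent=2))
    return path

# write_agent_result(Path("results"), "rnad", {"mene_regret": 0.02})  # made-up values
```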

…ated results

Results should be generated by CI workflow in a dedicated leaderboard repository,
not manually committed to the agent repo.

gsmithline and others added 10 commits January 13, 2026 10:45
The MILP solver can return solutions that are very close to Nash
equilibria but fail the strict 1e-6 regret check due to numerical
precision issues. Increasing tolerance to 1e-4 provides more robustness.

Instead of failing when regret exceeds tolerance, warn and return
the best solution found. This handles numerical precision issues
without failing the entire evaluation pipeline.
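
Taken together with the tolerance change above, the behavior is roughly this (names are illustrative, not the solver's actual API):

```python
import warnings

REGRET_TOLERANCE = 1e-4  # relaxed from 1e-6 to absorb MILP numerical noise

def accept_best_solution(best_sigma, best_regret, tol=REGRET_TOLERANCE):
    """Warn, rather than raise, when the best solution's regret exceeds tol."""
    if best_regret > tol:
        warnings.warn(
            f"best regret {best_regret:.2e} exceeds tolerance {tol:.0e}; "
            "returning the best solution found instead of failing"
        )
    return best_sigma
```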

When a remote agent named "challenger" raises RemoteNegotiatorError
(e.g., by returning {"action": "WALK"} without an allocation), the
fallback behavior now correctly treats it as a walk policy instead
of the default balanced policy.

This fixes a bug where challenger agents would inadvertently make
offers and accept deals when the remote agent failed to provide a
valid allocation, leading to incorrect NWA% and EF1% metrics.

The previous fix only checked for "challenger" in _propose_allocation
and _accepts, but _policy_kind("challenger") returned "balanced"
because none of the conditions matched.

Now _policy_kind recognizes "challenger" as a walk policy, so when
the remote agent fails, the fallback uses walk behavior (always reject,
zero allocation).
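
A rough sketch of the classification described here; the real _policy_kind may differ, the non-challenger branches are placeholders, and a later commit in this PR removes "challenger" from the walk branch so it plays real games instead:

```python
def _policy_kind(agent_name: str) -> str:
    """Classify the fallback policy used when a remote agent errors out (sketch)."""
    name = agent_name.lower()
    if "walk" in name or name == "challenger":
        # Walk fallback: always reject, zero allocation -- not the default
        # balanced policy.
        return "walk"
    if "tough" in name:
        return "tough"
    if "soft" in name:
        return "soft"
    return "balanced"
```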

The analysis code in main.py looks for traces at {pair_key}.jsonl,
but walk baseline traces were written to walk_baseline.jsonl only.
This caused all metrics (NW, UW, MENE regret) to show 0% for walk-type
agents because load_records() couldn't find the files.

Fix: Create symlinks from pair-specific paths to walk_baseline.jsonl
so the analysis code can find and process the walk traces correctly.
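
A minimal sketch of the symlink fix (helper name and arguments are illustrative):

```python
from pathlib import Path

def link_walk_traces(trace_dir: Path, pair_keys: list[str]) -> None:
    """Point each {pair_key}.jsonl at the shared walk_baseline.jsonl."""
    target = trace_dir / "walk_baseline.jsonl"
    for pair_key in pair_keys:
        link = trace_dir / f"{pair_key}.jsonl"
        if not link.exists():
            link.symlink_to(target)  # analysis code can now find the traces
```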

- Remove 'challenger' from walk policy detection in _policy_kind()
- Challenger now goes through run_pyspiel_pair_nfsp_with_traces() for real games
- This enables validation with actual game data from the reject-agent
- Check if agents are in remote_agent_urls before using walk baseline
- This ensures challenger/purple agents play actual games via A2A protocol
- Synthetic walk baseline only used for local walk agent vs local agents
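
A hedged sketch of the routing decision described in these two commits; remote_agent_urls and run_pyspiel_pair_nfsp_with_traces are named in the messages above, but the wiring below is a guess:

```python
def run_pair(agent_a: str, agent_b: str, remote_agent_urls: dict,
             run_real_games, run_walk_baseline):
    """Route a pairing to real games if either agent is a remote (A2A) agent."""
    if agent_a in remote_agent_urls or agent_b in remote_agent_urls:
        # Challenger / purple agents play actual games via the A2A protocol.
        return run_real_games(agent_a, agent_b)
    # Synthetic walk baseline only for a local walk agent vs local agents.
    return run_walk_baseline(agent_a, agent_b)
```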
Previously, RNaDAgentWrapper and NFSPAgentWrapper were instantiated
inside the `for gi in range(games)` loop, causing model re-initialization
on every game. This triggered repeated JAX/Haiku compilation, which is
extremely slow on certain CPUs (e.g., GitHub Actions AMD EPYC runners).

With 50 games, this meant 100 model loads instead of 2, causing runs
that should take minutes to take hours.

The fix moves agent creation before the loop so models are loaded once
and reused across all games.
