An experimental platform that orchestrates structured debates between different AI models in a round-robin matrix format. Models debate both sides of issues against each other, providing insights into their reasoning capabilities, argumentation styles, and philosophical stances.
Each debate topic runs as a full matrix where every AI model debates every other model on both sides of the issue:
Topic: "Social media does more harm than good"
|        | Claude | Gemini | GPT | Grok |
|--------|--------|--------|-----|------|
| Claude | -      | A/N    | A/N | A/N  |
| Gemini | N/A    | -      | A/N | A/N  |
| GPT    | N/A    | N/A    | -   | A/N  |
| Grok   | N/A    | N/A    | N/A | -    |

A = Affirmative, N = Negative. Each cell represents two debates (each model argues both sides).
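The matrix above is just every ordered pairing of models. A minimal sketch of how the pairings could be generated (a hypothetical helper, not necessarily how the project's engine does it):

```python
from itertools import permutations

MODELS = ["claude", "gemini", "gpt", "grok"]

def matrix_pairings(models):
    """Yield (affirmative, negative) pairs covering every ordered matchup.

    Each unordered pair {a, b} appears twice -- once with each model on
    the affirmative side -- so every model argues both sides of the
    topic against every opponent.
    """
    yield from permutations(models, 2)

pairings = list(matrix_pairings(MODELS))
# 4 models -> 4 * 3 = 12 debates per topic
```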
We use a modified Lincoln-Douglas format optimized for AI, combined with Oxford-style judging:
| Phase | Speaker | Duration | Purpose |
|---|---|---|---|
| 1. Affirmative Constructive | Affirmative | ~800 words | Present case and framework |
| 2. Cross-Examination | Negative | 3 questions | Challenge the affirmative's case |
| 3. Negative Constructive | Negative | ~1000 words | Present counter-case and rebuttals |
| 4. Cross-Examination | Affirmative | 3 questions | Challenge the negative's case |
| 5. Affirmative Rebuttal | Affirmative | ~500 words | Address attacks, rebuild case |
| 6. Negative Rebuttal | Negative | ~700 words | Final attacks and summation |
| 7. Affirmative Rejoinder | Affirmative | ~400 words | Final defense and voting issues |
- Lincoln-Douglas base: the 1v1 format suits AI-to-AI debates and emphasizes values and philosophy over pure evidence
- Cross-examination: Tests models' ability to respond under pressure and defend positions
- Affirmative speaks first and last: Standard debate convention that balances burden of proof
- Oxford-style judging: Audience/judge voting before and after measures persuasion, not just "who won"
Each debate is evaluated by:

- AI Judge Panel (3 different models not participating in the debate)
  - Score on: argumentation, evidence use, rebuttal quality, rhetorical effectiveness
  - Provide reasoning for decisions
- Human Voting (future)
  - Pre-debate stance poll
  - Post-debate stance poll
  - Winner = side that moved more voters
- Metrics Tracked
  - Win/loss record per model
  - Win rate by position (Affirmative vs. Negative)
  - Persuasion delta (stance change induced)
  - Head-to-head records
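The persuasion delta is the core Oxford-style metric: which side moved more voters. A minimal sketch of how it could be computed (a hypothetical helper; vote-tally shapes are assumed):

```python
def persuasion_delta(pre_votes, post_votes, side="affirmative"):
    """Stance change induced by a debate, as a change in vote share.

    pre_votes / post_votes: dicts like {"affirmative": 12, "negative": 8}.
    Returns the change in the given side's share of the vote; the
    Oxford-style winner is whichever side has the larger positive delta.
    """
    def share(votes):
        total = sum(votes.values())
        return votes[side] / total if total else 0.0
    return share(post_votes) - share(pre_votes)

pre = {"affirmative": 10, "negative": 10}
post = {"affirmative": 13, "negative": 7}
# affirmative moved from a 50% share to a 65% share: delta of +0.15
```

Using vote *share* rather than raw counts keeps the metric comparable when different numbers of voters respond to the pre- and post-debate polls.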
Initial lineup (latest flagship reasoning models as of Jan 2026):

- Claude (Anthropic) - `claude-opus-4-5-20251101` - Anthropic's most intelligent model
- GPT-5.2 (OpenAI) - `gpt-5.2` - OpenAI's flagship model (supersedes o3)
- Gemini (Google) - `gemini-3-pro-preview` - Google's most capable reasoning model (in preview)
- Grok (xAI) - `grok-4` - xAI's flagship model with native tool use
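Supporting four providers behind one interface suggests a small abstract base class. A sketch of what `src/models/base.py` might look like (the class and method names here are assumptions; the real interface may differ):

```python
from abc import ABC, abstractmethod

class DebaterModel(ABC):
    """Provider-agnostic interface each model integration implements."""

    def __init__(self, model_id: str):
        self.model_id = model_id  # e.g. "claude-opus-4-5-20251101"

    @abstractmethod
    def generate(self, system_prompt: str, transcript: str, word_limit: int) -> str:
        """Produce the next speech given the debate transcript so far."""

class EchoModel(DebaterModel):
    """Trivial stub, useful for testing the engine without API keys."""

    def generate(self, system_prompt, transcript, word_limit):
        return f"[{self.model_id}] stub speech (<= {word_limit} words)"
```

Each concrete integration (`anthropic.py`, `openai.py`, `google.py`, `xai.py`) would then only need to translate this one call into its provider's API.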
Topics follow the resolution format common in competitive debate, drawn from major policy debates of 2025.
AI & Technology
- "Resolved: AI regulation should be handled at the federal level, preempting state laws like the Colorado AI Act"
- "Resolved: Companies should be required to disclose when AI systems are used in hiring, pricing, or lending decisions"
- "Resolved: The benefits of AI job automation outweigh the costs to displaced workers"
- "Resolved: Military integration of commercial AI systems (like Grok in the Pentagon) poses unacceptable risks"
Economy & Trade
- "Resolved: The 2025 tariff increases have done more harm than good to the U.S. economy"
- "Resolved: Protectionist trade policies are justified to counter China's economic influence"
- "Resolved: Cryptocurrency should be regulated as a commodity rather than a security"
- "Resolved: The Federal Reserve should issue a Central Bank Digital Currency (CBDC)"
Society & Governance
- "Resolved: Social media platforms should be legally liable for algorithmic amplification of misinformation"
- "Resolved: Deepfake technology should be banned for non-entertainment purposes"
- "Resolved: Mass deportation policies cause more harm than benefit to the United States"
- "Resolved: DOGE-style government efficiency initiatives are an appropriate approach to reducing federal spending"
Privacy & Ethics
- "Resolved: Individuals should have the right to delete all personal data held by corporations"
- "Resolved: Facial recognition technology should be banned in public spaces"
- "Resolved: AI companies should be required to pay royalties to creators whose work was used in training data"
- Polling platform for topic suggestions
- Voting by humans and AI agents
- Weekly/monthly featured debates
ai-debate/
├── src/
│ ├── debate/
│ │ ├── engine.py # Core debate orchestration
│ │ ├── formats.py # Debate format definitions
│ │ └── prompts.py # System prompts for debaters
│ ├── models/
│ │ ├── base.py # Abstract model interface
│ │ ├── anthropic.py # Claude integration
│ │ ├── openai.py # GPT integration
│ │ ├── google.py # Gemini integration
│ │ └── xai.py # Grok integration
│ ├── judging/
│ │ ├── panel.py # AI judge panel logic
│ │ ├── scoring.py # Scoring criteria and rubrics
│ │ └── metrics.py # Statistics and analysis
│ ├── storage/
│ │ ├── debates.py # Debate transcript storage
│ │ └── results.py # Results and leaderboards
│ └── cli.py # Command-line interface
├── debates/ # Stored debate transcripts
├── results/ # Analysis and leaderboards
├── tests/
└── pyproject.toml
# Run a single debate
ai-debate run --topic "Social media does more harm than good" \
--affirmative claude --negative gpt
# Run full matrix for a topic
ai-debate matrix --topic "Privacy vs security" --models claude,gpt,gemini,grok
# View leaderboard
ai-debate leaderboard
# Export debate transcript
ai-debate export --debate-id <id> --format markdown

# Clone and install
git clone https://github.com/harper/ai-debate.git
cd ai-debate
uv sync
# Configure API keys
export ANTHROPIC_API_KEY=...
export OPENAI_API_KEY=...
export GOOGLE_API_KEY=...
export XAI_API_KEY=...

- Core debate engine with LD format
- Model integrations (Claude, GPT, Gemini, Grok)
- AI judge panel
- CLI for running debates
- Transcript storage and export
- Basic leaderboard
- Web viewer for debate transcripts
- Human voting integration
- Community topic suggestions
- Real-time debate streaming
This project aims to explore:
- Do models argue differently when defending positions they might "disagree" with?
- Which models are most persuasive? Most logically rigorous?
- Are there systematic differences in argumentation style between model families?
- How do models perform in adversarial reasoning against each other?
- Can models identify and exploit weaknesses in other models' arguments?
- Bias detection: Since models argue both sides of each issue, can we detect underlying biases by comparing argument quality/effort across positions? Do judges show favoritism toward their own model family?
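Because each model argues both sides, one concrete bias signal is the gap between a model's affirmative and negative win rates. A sketch of that metric (a hypothetical helper; the result-record shape is assumed):

```python
from collections import defaultdict

def position_bias(results):
    """Per-model gap between affirmative and negative win rates.

    results: list of dicts like
        {"model": "claude", "side": "affirmative", "won": True}.
    A large gap suggests the model argues one side more effectively --
    a possible signal of an underlying stance bias.
    """
    tallies = defaultdict(lambda: {"affirmative": [0, 0], "negative": [0, 0]})
    for r in results:
        wins, total = tallies[r["model"]][r["side"]]
        tallies[r["model"]][r["side"]] = [wins + bool(r["won"]), total + 1]
    return {
        model: sides["affirmative"][0] / sides["affirmative"][1]
               - sides["negative"][0] / sides["negative"][1]
        for model, sides in tallies.items()
        if sides["affirmative"][1] and sides["negative"][1]
    }
```

A bias of 0.0 means the model wins equally often from either side; values near +1.0 or -1.0 indicate it almost only wins when assigned one particular position.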
MIT
Debate format research: