Adaptive Teaching Evaluation (ATE) is an open, reproducible AgentBeats benchmark (Green Agent) designed to rigorously evaluate how effectively teaching agents adapt to simulated learners with realistic learning patterns, misconceptions, and behavioral challenges.
What makes ATE unique:
- Concept-Level Curriculum Tracking: Enforces prerequisite chains (can't teach F=ma before Newton's 1st Law)
- Misconception Detection: Tracks 5 specific misconceptions and penalizes agents who reinforce incorrect beliefs
- Adversarial Learner Profiles: Tests agents against challenging learners (persistent misconceptions, plateau learners, distracted learners)
- Reasoning Depth Analysis: LLM-based scoring that rewards explanations of WHY, not just WHAT
- Anti-Gaming Measures: Detects and penalizes empty encouragement, generic responses, and avoidance tactics
This project is part of the AgentX–AgentBeats competition (Phase 1) at UC Berkeley RDI, contributing a benchmark that goes beyond basic metrics to measure true teaching quality.
Green Agent = Evaluator (orchestrates the entire evaluation)
- Runs the experiment with configurable learner profiles
- Controls learner model with concept-level knowledge tracking
- Analyzes agent responses for reasoning depth and pedagogical quality
- Computes 15+ comprehensive metrics
- Detects curriculum violations and misconception handling
Purple Agent = Teaching agent being tested
- Receives prompts from Green Agent about learner state
- Generates instructional responses (explain, quiz, clarify, scaffold, etc.)
- Must adapt to learner patterns and address misconceptions
- Scored on curriculum adherence, reasoning depth, and effectiveness
Learner Model = Realistic learner with deterministic state + LLM responses
- Tracks concept-level knowledge (Newton's 1st, 2nd, 3rd Law, vectors, F=ma calculations)
- Models 5 common misconceptions (heavier objects fall faster, force causes velocity, etc.)
- Enforces prerequisite chains (must master basics before advanced topics)
- Generates natural language responses via LLM reflecting current cognitive state
- Supports 3 adversarial profiles: misconception-prone, plateau, distracted
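As a rough illustration, the learner's concept-level state might be represented like the following sketch (field names and defaults are illustrative, not the exact student_simulator.py data model):

```python
# Illustrative sketch only -- field names and defaults are assumptions,
# not the actual student_simulator.py data model.
from dataclasses import dataclass, field

CONCEPTS = ["newton_1st", "newton_2nd", "newton_3rd", "vectors", "fma_calculations"]

MISCONCEPTIONS = [
    "heavier_objects_fall_faster",
    "force_causes_velocity",
    "motion_requires_constant_force",
    "action_reaction_cancels",
    "friction_always_opposes_motion",
]

@dataclass
class LearnerState:
    # Per-concept mastery in [0, 1]
    knowledge: dict = field(default_factory=lambda: {c: 0.0 for c in CONCEPTS})
    # Per-misconception strength in [0, 1]; > 0 means the belief is active
    misconceptions: dict = field(default_factory=lambda: {m: 0.5 for m in MISCONCEPTIONS})
    confusion: float = 0.2
    engagement: float = 0.8
    profile: str = "normal"  # or "misconception_prone", "plateau", "distracted"
```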
```
AgentBeats Platform
│
├── 🟢 Green Agent (Evaluator): tutor_eval.py
│     - Initializes learner with concept knowledge + misconceptions
│     - Sends instructional scenarios to the Purple Agent
│     - Analyzes responses for reasoning depth & quality
│     - Tracks curriculum violations (teaching out of order)
│     - Detects misconception handling (identify & correct)
│     - Checks for gaming behaviors (empty praise, avoiding tests)
│     - Computes 15 metrics across 4 dimensions
│
├── 🟣 Purple Agent (Teaching Agent Being Tested): tutor_agent.py
│     - Receives prompts about learner understanding
│     - Must explain concepts with causal reasoning
│     - Should follow prerequisite chains
│     - Identifies and corrects misconceptions
│     - Adapts instructional strategy to learner profile
│
└── 📚 Learner Model: student_simulator.py
      - 9D state: knowledge, confusion, engagement, misconception strength
      - Concept-level tracking: newton_1st, newton_2nd, vectors
      - 5 specific misconceptions learners may hold
      - Prerequisite enforcement (e.g., newton_2nd requires newton_1st)
      - Profile-based behaviors (normal, misconception-prone, etc.)
      - LLM-generated responses reflecting cognitive state
```
```bash
git clone git@github.com:ashworks1706/ATeX.git
cd ATeX
uv sync
cp sample.env .env
```
Add your LLM API key (Google Gemini, OpenAI, or OpenRouter) to .env.
ATE uses 15 comprehensive metrics organized into 4 dimensions:
- Learning Gain (0.18): Post-test - pre-test knowledge improvement
- Misconception Handling (0.15): Effectiveness at identifying and correcting incorrect beliefs
- Curriculum Adherence (0.12): Following prerequisite chains (e.g., teach newton_1st before newton_2nd)
- Reasoning Depth (0.10): Explains why with causality, not just what (facts)
- Personalization Efficiency (0.08): Learning gain per interaction
- Adaptive Pacing (0.08): Matching teaching complexity to student readiness
- Confusion Repair (0.07): Reduction in student confusion
- Question Quality (0.06): Use of diagnostic probes and well-designed quizzes
- Scaffolding Effectiveness (0.05): Quality of step-by-step guidance
- Robustness (0.05): Consistency across interactions (low variance)
- Stability (0.03): Reproducibility across evaluations
- Pedagogical Soundness (0.03): Overall teaching quality
- Gaming Detection: Penalizes empty encouragement (-0.2), generic responses (-0.25), avoiding assessment (-0.3)
- Misconception Reinforcement: Zero score if tutor reinforces incorrect beliefs
- Curriculum Violations: -0.2 per violation (teaching advanced topics before prerequisites)
Composite Score: Weighted average of all metrics (0-1 scale), designed to discriminate between teaching strategies.
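For illustration, the composite could be assembled roughly like this sketch (the metric names and weights mirror the list above; the penalty bookkeeping and clamping are assumptions):

```python
# Sketch of the weighted composite with penalties; the weights mirror the
# list above, the penalty bookkeeping is illustrative.
WEIGHTS = {
    "learning_gain": 0.18, "misconception_handling": 0.15,
    "curriculum_adherence": 0.12, "reasoning_depth": 0.10,
    "personalization_efficiency": 0.08, "adaptive_pacing": 0.08,
    "confusion_repair": 0.07, "question_quality": 0.06,
    "scaffolding_effectiveness": 0.05, "robustness": 0.05,
    "stability": 0.03, "pedagogical_soundness": 0.03,
}

def composite_score(metrics: dict, penalties: list, reinforced_misconception: bool) -> float:
    """Weighted average of per-metric scores (each in [0, 1]) plus penalties."""
    if reinforced_misconception:
        return 0.0  # hard zero when the tutor reinforces an incorrect belief
    base = sum(WEIGHTS[name] * metrics.get(name, 0.0) for name in WEIGHTS)
    # Penalties (e.g. -0.2 per curriculum violation) are negative values
    return max(0.0, min(1.0, base + sum(penalties)))
```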
Unlike benchmarks with global "knowledge" scores, ATE tracks mastery of specific concepts:
- newton_1st: Inertia and balanced forces
- newton_2nd: F=ma and acceleration
- newton_3rd: Action-reaction pairs
- vectors: Magnitude and direction
- fma_calculations: Numerical problem-solving
Prerequisite enforcement: Can't teach newton_2nd before student masters newton_1st and vectors
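A minimal sketch of how such a check can work (the newton_2nd edge follows the rule above; the remaining edges and the 0.7 mastery threshold are assumptions):

```python
# Prerequisite graph; the newton_2nd entry follows the rule stated above,
# the remaining edges and the threshold are plausible assumptions.
PREREQUISITES = {
    "newton_1st": [],
    "vectors": [],
    "newton_2nd": ["newton_1st", "vectors"],
    "newton_3rd": ["newton_1st"],
    "fma_calculations": ["newton_2nd"],
}

MASTERY_THRESHOLD = 0.7  # assumed cut-off for "mastered"

def curriculum_violation(concept: str, knowledge: dict) -> bool:
    """True if the tutor teaches `concept` before its prerequisites are mastered."""
    return any(knowledge.get(p, 0.0) < MASTERY_THRESHOLD
               for p in PREREQUISITES.get(concept, []))
```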
Five common physics misconceptions with detection and correction tracking:
- Heavier objects fall faster
- Force causes velocity (not acceleration)
- Motion requires constant force
- Action-reaction pairs cancel out
- Friction always opposes motion direction
Three challenging student types to test tutor robustness:
- Misconception-Prone: Persistent incorrect beliefs, resistant to correction (confusion_sensitivity: 1.3x)
- Plateau: Stops learning at ~50% knowledge, requires strategy change
- Distracted: Loses focus, asks off-topic questions (30% probability), volatile engagement
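For reference, the three profiles could be parameterized roughly like this (the 1.3x sensitivity, ~50% ceiling, and 30% off-topic probability come from the descriptions above; the remaining fields are assumptions):

```python
# Profile parameters; the 1.3x sensitivity, 0.5 ceiling and 0.3 off-topic
# probability come from the descriptions above, everything else is assumed.
LEARNER_PROFILES = {
    "misconception_prone": {
        "confusion_sensitivity": 1.3,   # reacts more strongly to unclear teaching
        "misconception_decay": 0.3,     # corrections only partially stick
    },
    "plateau": {
        "knowledge_ceiling": 0.5,       # learning stalls around 50% mastery
        "ceiling_escape_bonus": 0.1,    # a strategy change can break the plateau
    },
    "distracted": {
        "off_topic_probability": 0.3,   # asks an off-topic question 30% of the time
        "engagement_volatility": 0.25,  # engagement swings from turn to turn
    },
}
```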
LLM-based scoring that rewards:
- Causal explanations (WHY things happen)
- Analogies and concrete examples
- Connecting to prior knowledge
- Proactive misconception addressing
Identifies and penalizes low-effort strategies:
- Empty encouragement without substance
- Generic copy-paste responses
- Avoiding quizzes when appropriate
- Excessive praise to inflate engagement
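A rough sketch of what such checks can look like (ATE's actual detection is LLM-assisted; these string heuristics and thresholds are illustrative only):

```python
# Rough heuristic sketch of gaming detection; ATE's real checks are
# LLM-assisted, so treat these string rules and thresholds as illustrative.
EMPTY_PRAISE = ("great job", "you're doing amazing", "keep it up")

def gaming_penalties(responses: list, quizzes_given: int, turns: int) -> list:
    penalties = []
    # Empty encouragement: praise with no substantive content
    if any(r.lower().strip() in EMPTY_PRAISE for r in responses):
        penalties.append(-0.2)
    # Generic responses: near-duplicate answers across turns
    if len(set(r.strip().lower() for r in responses)) < len(responses) * 0.5:
        penalties.append(-0.25)
    # Avoiding assessment: long sessions with no quiz at all
    if turns >= 5 and quizzes_given == 0:
        penalties.append(-0.3)
    return penalties
```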
```bash
uv run agentbeats-run scenarios/adaptive_tutor/scenario.toml --show-logs
```
This will:
- Start the Green Agent (Evaluator) and the Purple Agent (Teaching Agent)
- Initialize a simulated learner with realistic learning dynamics
- Run a pre-test, adaptive instruction loop, and post-test
- Use LLM-powered analysis to classify and score agent responses
- Output real-time logs of the instructional interaction
- Generate evaluation charts and a markdown report in results/
Orchestrates the entire instructional episode with LLM-powered analysis:
- Runs pre-test → adaptive loop → post-test sequence
- Uses Gemini LLM to classify agent responses (explain, quiz, clarify, scaffold, probe, etc.)
- Scores response quality based on pedagogical rubrics
- Calls the Purple Agent for each interaction step
- Collects comprehensive metrics:
- Learning Gain — improvement in simulated knowledge
- Adaptive Pacing — quality of responses matched to learner readiness
- Misconception Handling — effectiveness at correcting learner errors
- Question Quality — use of diagnostic probes and quizzes
- Scaffolding Effectiveness — step-by-step guidance quality
- Engagement Maintenance — sustained learner interest
- Confusion Repair — reduction in confusion
- Plus: Personalization Efficiency, Robustness, Stability
- Generates visual charts and detailed markdown report
Implements a sophisticated, deterministic learner model:
- Maintains rich internal state: {knowledge, confusion, engagement, misconception_strength, cognitive_load, mastery_confidence}
- Updates state based on agent action type and response quality
- Tracks quiz performance (correct_responses, total_attempts)
- Generates realistic learner utterances based on current state
- Deterministic via seeded RNG for reproducibility
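A minimal sketch of a seeded, deterministic state update (the deltas and action handling are assumptions; only the seeded-RNG idea mirrors the simulator's behavior):

```python
import random

# Sketch of a deterministic state update; the deltas are assumptions,
# only the seeded-RNG idea mirrors the description above.
class SimulatedLearner:
    def __init__(self, seed: int = 42, profile: str = "normal"):
        self.rng = random.Random(seed)   # same seed -> identical trajectories
        self.knowledge = 0.2
        self.confusion = 0.3
        self.profile = profile

    def update(self, action: str, quality: float) -> None:
        """Apply the effect of one tutor turn (action type + LLM quality score)."""
        if action == "explain":
            self.knowledge = min(1.0, self.knowledge + 0.1 * quality)
            self.confusion = max(0.0, self.confusion - 0.05 * quality)
        elif action == "quiz":
            # "Noise" drawn from the seeded RNG stays reproducible
            self.knowledge = min(1.0, self.knowledge + 0.05 * quality
                                 + self.rng.uniform(-0.01, 0.01))
```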
Uses Gemini to analyze agent responses:
- Classifies response type (explain, quiz, clarify, summarize, encourage, scaffold, probe)
- Scores quality on 4 dimensions: appropriateness, clarity, pedagogical soundness, engagement
- Context-aware: considers the learner's knowledge, confusion, and recent behavior
- Automatic fallback to heuristics if the LLM is unavailable
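Conceptually, the classification step looks something like this sketch (call_llm stands in for the actual Gemini client; the prompt wording and fallback heuristic are assumptions):

```python
import json

# Sketch of the response classifier; `call_llm` is a placeholder for whatever
# Gemini client the evaluator uses, and the prompt/rubric wording is assumed.
ACTIONS = ["explain", "quiz", "clarify", "summarize", "encourage", "scaffold", "probe"]

def classify_response(tutor_text: str, learner_state: dict, call_llm) -> dict:
    prompt = (
        "Classify the tutoring response and score it 0-1 on appropriateness, "
        "clarity, pedagogical_soundness and engagement. Return JSON with keys "
        f"'action' (one of {ACTIONS}) and 'scores'.\n"
        f"Learner state: {json.dumps(learner_state)}\n"
        f"Tutor response: {tutor_text}"
    )
    try:
        return json.loads(call_llm(prompt))
    except (ValueError, TypeError):
        # Heuristic fallback when the LLM is unavailable or returns bad JSON
        action = "quiz" if "?" in tutor_text else "explain"
        return {"action": action,
                "scores": {k: 0.5 for k in
                           ("appropriateness", "clarity",
                            "pedagogical_soundness", "engagement")}}
```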
Creates visualizations and reports after evaluation:
- Generates 4-panel chart: learning progress, core metrics, advanced metrics, composite score
- Creates detailed markdown report with transcript and interpretation
- Identifies strengths and weaknesses
- Saves to the results/ directory with a timestamp
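A sketch of how the 4-panel chart could be assembled with matplotlib (the results dict and file name are placeholders, not the evaluator's actual output types):

```python
import matplotlib.pyplot as plt

# Sketch of the 4-panel summary figure; the plotting calls are illustrative,
# and `results` is a placeholder dict, not the evaluator's actual output type.
def save_report_chart(results: dict, path: str = "results/ate_report.png") -> None:
    fig, axes = plt.subplots(2, 2, figsize=(12, 8))

    axes[0, 0].bar(["pre-test", "post-test"],
                   [results["pre_test"], results["post_test"]])
    axes[0, 0].set_title("Learning Progress")

    core = results["core_metrics"]          # e.g. {"learning_gain": 0.6, ...}
    axes[0, 1].barh(list(core), list(core.values()))
    axes[0, 1].set_title("Core Metrics")

    adv = results["advanced_metrics"]
    axes[1, 0].barh(list(adv), list(adv.values()))
    axes[1, 0].set_title("Advanced Metrics")

    axes[1, 1].bar(["composite"], [results["composite"]])
    axes[1, 1].set_title("Composite Score")

    fig.tight_layout()
    fig.savefig(path)
```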
Baseline teaching agent that interacts with the learner:
- Reads current prompt/state.
- Generates helpful explanations or quizzes via LLM API.
- Adapts tone or detail level based on prior feedback (optional).
Developers can later replace this with their own adaptive teaching models to compete on ATE.
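A minimal sketch of what a custom Purple Agent's turn handler might look like (the AgentBeats/A2A message plumbing is omitted; call_llm is a placeholder for your own model client):

```python
# Minimal sketch of a Purple Agent turn handler; the AgentBeats/A2A plumbing
# is omitted and `call_llm` is a placeholder for your model client.
def teaching_turn(learner_prompt: str, history: list, call_llm) -> str:
    """Produce one instructional response given the evaluator's prompt."""
    system = (
        "You are a physics tutor. Explain WHY, not just WHAT, follow "
        "prerequisites (Newton's 1st law and vectors before F=ma), and "
        "correct misconceptions explicitly."
    )
    context = "\n".join(history[-6:])  # keep the last few turns for context
    return call_llm(f"{system}\n\nRecent turns:\n{context}\n\nLearner: {learner_prompt}")
```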
Defines which agents participate and how the assessment runs:
```toml
[assessment]
name = "Adaptive Teaching Evaluation"
green_agent = "python scenarios/adaptive_tutor/tutor_eval.py"
participants = [
    { id = "teaching_agent", endpoint = "python scenarios/adaptive_tutor/tutor_agent.py" }
]
```
The evaluation combines the following weighted metrics to assess tutor performance:

Core metrics:
| Metric | Description | Weight |
|---|---|---|
| learning_gain | Improvement in test scores (0→1 normalized) | 25% |
| appropriateness | Response quality from LLM classifier | 15% |
| clarity | Clarity score from LLM classifier | 10% |
| pedagogical_soundness | Pedagogical effectiveness score | 10% |

Advanced metrics:

| Metric | Description | Weight |
|---|---|---|
| adaptive_pacing | Balances explanations, quizzes, and clarifications | 12% |
| misconception_handling | Detects and addresses misconceptions | 13% |
| question_quality | Uses probing and scaffolding questions | 6% |
| scaffolding_effectiveness | Gradual complexity increase effectiveness | 4% |
| engagement_maintenance | Sustains student engagement | 5% |
Composite Score = Weighted sum of the metrics above (0-100 scale)
After each evaluation run, the system generates:
4-panel chart showing:
- Learning Progress: Pre-test → Post-test scores with confidence bands
- Core Metrics: Horizontal bar chart of 4 primary metrics
- Advanced Metrics: Horizontal bar chart of 6 secondary metrics
- Composite Score: Overall performance (0-100 scale)
Detailed analysis including:
- Complete conversation transcript with action classification
- Metric interpretation guide
- Identified strengths and weaknesses
- Per-turn quality and engagement tracking
Files are timestamped to avoid overwriting previous runs.
| Agent | Description |
|---|---|
| Adaptive Agent | Main agent with full adaptive teaching (LLM-powered analysis) |
| Static Agent | Fixed explain→quiz loop (no adaptation) |
| Bandit Agent | Random exploration of teaching actions |
| RAG Agent | Retrieval-augmented responses from physics knowledge base |
Run comparison:
```bash
python scenarios/adaptive_tutor/compare_agents.py
```
Verify deterministic evaluation:
```bash
python verify_reproducibility.py
```
This runs the same scenario 5 times and confirms all metrics are identical.
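The underlying idea is simply to re-run the scenario with a fixed seed and compare metrics, roughly as in this sketch (run_evaluation is a hypothetical helper, not the script's real API):

```python
# Sketch of the reproducibility idea behind verify_reproducibility.py;
# `run_evaluation` is a hypothetical helper, not the script's real API.
def verify(run_evaluation, seed: int = 42, runs: int = 5) -> bool:
    """Re-run the scenario with a fixed seed and confirm identical metrics."""
    baseline = run_evaluation(seed=seed)
    return all(run_evaluation(seed=seed) == baseline for _ in range(runs - 1))
```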
- Multi-agent Sessions: evaluate multiple teaching agents collaborating/competing.
- More subjects: Expand beyond Physics to Math, Chemistry, etc.
- Advanced metrics: Add reasoning chain analysis, multi-turn coherence scoring.
Contributions are welcome! Submit pull requests for:
- new scoring metrics,
- learner model improvements, or
- baseline agent implementations.
MIT License © 2025 Aishwarya Srivastava & Contributors.
Part of the AgentBeats / AgentX Competition open benchmark ecosystem.