Adaptive Teaching Evaluation (ATE) is an open, reproducible AgentBeats benchmark (Green Agent) designed to rigorously evaluate how effectively teaching agents adapt to simulated learners with realistic learning patterns, misconceptions, and behavioral challenges.
What makes ATE unique:
- Concept-Level Curriculum Tracking: Enforces prerequisite chains (can't teach F=ma before Newton's 1st Law)
- Misconception Detection: Tracks 5 specific misconceptions and penalizes agents who reinforce incorrect beliefs
- Adversarial Learner Profiles: Tests agents against challenging learners (persistent misconceptions, plateau learners, distracted learners)
- Reasoning Depth Analysis: LLM-based scoring that rewards explanations of WHY, not just WHAT
- Anti-Gaming Measures: Detects and penalizes empty encouragement, generic responses, and avoidance tactics
This project is part of the AgentX–AgentBeats competition (Phase 1) at UC Berkeley RDI, contributing a benchmark that goes beyond basic metrics to measure true teaching quality.
Green Agent = Evaluator (orchestrates the entire evaluation)
- Runs the experiment with configurable learner profiles
- Controls learner model with concept-level knowledge tracking
- Analyzes agent responses for reasoning depth and pedagogical quality
- Computes 15+ comprehensive metrics
- Detects curriculum violations and misconception handling
Purple Agent = Teaching agent being tested
- Receives prompts from Green Agent about learner state
- Generates instructional responses (explain, quiz, clarify, scaffold, etc.)
- Must adapt to learner patterns and address misconceptions
- Scored on curriculum adherence, reasoning depth, and effectiveness
Learner Model = Realistic learner with deterministic state + LLM responses
- Tracks concept-level knowledge (Newton's 1st, 2nd, 3rd Law, vectors, F=ma calculations)
- Models 5 common misconceptions (heavier objects fall faster, force causes velocity, etc.)
- Enforces prerequisite chains (must master basics before advanced topics)
- Generates natural language responses via LLM reflecting current cognitive state
- Supports 3 adversarial profiles: misconception-prone, plateau, distracted
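As a rough illustration, the learner's concept-level state might be represented like the following sketch (field names and defaults are illustrative, not the exact student_simulator.py data model):

```python
# Illustrative sketch only -- field names and defaults are assumptions,
# not the actual student_simulator.py data model.
from dataclasses import dataclass, field

CONCEPTS = ["newton_1st", "newton_2nd", "newton_3rd", "vectors", "fma_calculations"]

MISCONCEPTIONS = [
    "heavier_objects_fall_faster",
    "force_causes_velocity",
    "motion_requires_constant_force",
    "action_reaction_cancels",
    "friction_always_opposes_motion",
]

@dataclass
class LearnerState:
    # Per-concept mastery in [0, 1]
    knowledge: dict = field(default_factory=lambda: {c: 0.0 for c in CONCEPTS})
    # Per-misconception strength in [0, 1]; > 0 means the belief is active
    misconceptions: dict = field(default_factory=lambda: {m: 0.5 for m in MISCONCEPTIONS})
    confusion: float = 0.2
    engagement: float = 0.8
    profile: str = "normal"  # or "misconception_prone", "plateau", "distracted"
```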
```
AgentBeats Platform
│
├── 🟢 Green Agent (Evaluator): tutor_eval.py
│     - Initializes learner with concept knowledge + misconceptions
│     - Sends instructional scenarios to the Purple Agent
│     - Analyzes responses for reasoning depth & quality
│     - Tracks curriculum violations (teaching out of order)
│     - Detects misconception handling (identify & correct)
│     - Checks for gaming behaviors (empty praise, avoiding tests)
│     - Computes 15 metrics across 4 dimensions
│
├── 🟣 Purple Agent (Teaching Agent Being Tested): tutor_agent.py
│     - Receives prompts about learner understanding
│     - Must explain concepts with causal reasoning
│     - Should follow prerequisite chains
│     - Identifies and corrects misconceptions
│     - Adapts instructional strategy to learner profile
│
└── 📚 Learner Model: student_simulator.py
      - 9D state: knowledge, confusion, engagement, misconception strength
      - Concept-level tracking: newton_1st, newton_2nd, vectors
      - 5 specific misconceptions learners may hold
      - Prerequisite enforcement (e.g., newton_2nd requires newton_1st)
      - Profile-based behaviors (normal, misconception-prone, etc.)
      - LLM-generated responses reflecting cognitive state
```
```bash
git clone git@github.com:ashworks1706/ATeX.git
cd ATeX
uv sync
cp sample.env .env
```
Add your LLM API key (Google Gemini, OpenAI, or OpenRouter) to .env.
ATE uses 15 comprehensive metrics organized into 4 dimensions:
- Learning Gain (0.18): Post-test - pre-test knowledge improvement
- Misconception Handling (0.15): Effectiveness at identifying and correcting incorrect beliefs
- Curriculum Adherence (0.12): Following prerequisite chains (e.g., teach newton_1st before newton_2nd)
- Reasoning Depth (0.10): Explains why with causality, not just what (facts)
- Personalization Efficiency (0.08): Learning gain per interaction
- Adaptive Pacing (0.08): Matching teaching complexity to student readiness
- Confusion Repair (0.07): Reduction in student confusion
- Question Quality (0.06): Use of diagnostic probes and well-designed quizzes
- Scaffolding Effectiveness (0.05): Quality of step-by-step guidance
- Robustness (0.05): Consistency across interactions (low variance)
- Stability (0.03): Reproducibility across evaluations
- Pedagogical Soundness (0.03): Overall teaching quality
- Gaming Detection: Penalizes empty encouragement (-0.2), generic responses (-0.25), avoiding assessment (-0.3)
- Misconception Reinforcement: Zero score if tutor reinforces incorrect beliefs
- Curriculum Violations: -0.2 per violation (teaching advanced topics before prerequisites)
Composite Score: Weighted average of all metrics (0-1 scale), designed to discriminate between teaching strategies.
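For illustration, the composite could be assembled roughly like this sketch (the metric names and weights mirror the list above; the penalty bookkeeping and clamping are assumptions):

```python
# Sketch of the weighted composite with penalties; the weights mirror the
# list above, the penalty bookkeeping is illustrative.
WEIGHTS = {
    "learning_gain": 0.18, "misconception_handling": 0.15,
    "curriculum_adherence": 0.12, "reasoning_depth": 0.10,
    "personalization_efficiency": 0.08, "adaptive_pacing": 0.08,
    "confusion_repair": 0.07, "question_quality": 0.06,
    "scaffolding_effectiveness": 0.05, "robustness": 0.05,
    "stability": 0.03, "pedagogical_soundness": 0.03,
}

def composite_score(metrics: dict, penalties: list, reinforced_misconception: bool) -> float:
    """Weighted average of per-metric scores (each in [0, 1]) plus penalties."""
    if reinforced_misconception:
        return 0.0  # hard zero when the tutor reinforces an incorrect belief
    base = sum(WEIGHTS[name] * metrics.get(name, 0.0) for name in WEIGHTS)
    # Penalties (e.g. -0.2 per curriculum violation) are negative values
    return max(0.0, min(1.0, base + sum(penalties)))
```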
Unlike benchmarks with global "knowledge" scores, ATE tracks mastery of specific concepts:
- newton_1st: Inertia and balanced forces
- newton_2nd: F=ma and acceleration
- newton_3rd: Action-reaction pairs
- vectors: Magnitude and direction
- fma_calculations: Numerical problem-solving
Prerequisite enforcement: Can't teach newton_2nd before student masters newton_1st and vectors
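A minimal sketch of how such a check can work (the newton_2nd edge follows the rule above; the remaining edges and the 0.7 mastery threshold are assumptions):

```python
# Prerequisite graph; the newton_2nd entry follows the rule stated above,
# the remaining edges and the threshold are plausible assumptions.
PREREQUISITES = {
    "newton_1st": [],
    "vectors": [],
    "newton_2nd": ["newton_1st", "vectors"],
    "newton_3rd": ["newton_1st"],
    "fma_calculations": ["newton_2nd"],
}

MASTERY_THRESHOLD = 0.7  # assumed cut-off for "mastered"

def curriculum_violation(concept: str, knowledge: dict) -> bool:
    """True if the tutor teaches `concept` before its prerequisites are mastered."""
    return any(knowledge.get(p, 0.0) < MASTERY_THRESHOLD
               for p in PREREQUISITES.get(concept, []))
```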
Five common physics misconceptions with detection and correction tracking:
- Heavier objects fall faster
- Force causes velocity (not acceleration)
- Motion requires constant force
- Action-reaction pairs cancel out
- Friction always opposes motion direction
Three challenging student types to test tutor robustness:
- Misconception-Prone: Persistent incorrect beliefs, resistant to correction (confusion_sensitivity: 1.3x)
- Plateau: Stops learning at ~50% knowledge, requires strategy change
- Distracted: Loses focus, asks off-topic questions (30% probability), volatile engagement
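For reference, the three profiles could be parameterized roughly like this (the 1.3x sensitivity, ~50% ceiling, and 30% off-topic probability come from the descriptions above; the remaining fields are assumptions):

```python
# Profile parameters; the 1.3x sensitivity, 0.5 ceiling and 0.3 off-topic
# probability come from the descriptions above, everything else is assumed.
LEARNER_PROFILES = {
    "misconception_prone": {
        "confusion_sensitivity": 1.3,   # reacts more strongly to unclear teaching
        "misconception_decay": 0.3,     # corrections only partially stick
    },
    "plateau": {
        "knowledge_ceiling": 0.5,       # learning stalls around 50% mastery
        "ceiling_escape_bonus": 0.1,    # a strategy change can break the plateau
    },
    "distracted": {
        "off_topic_probability": 0.3,   # asks an off-topic question 30% of the time
        "engagement_volatility": 0.25,  # engagement swings from turn to turn
    },
}
```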
LLM-based scoring that rewards:
- Causal explanations (WHY things happen)
- Analogies and concrete examples
- Connecting to prior knowledge
- Proactive misconception addressing
Identifies and penalizes low-effort strategies:
- Empty encouragement without substance
- Generic copy-paste responses
- Avoiding quizzes when appropriate
- Excessive praise to inflate engagement
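A rough sketch of what such checks can look like (ATE's actual detection is LLM-assisted; these string heuristics and thresholds are illustrative only):

```python
# Rough heuristic sketch of gaming detection; ATE's real checks are
# LLM-assisted, so treat these string rules and thresholds as illustrative.
EMPTY_PRAISE = ("great job", "you're doing amazing", "keep it up")

def gaming_penalties(responses: list, quizzes_given: int, turns: int) -> list:
    penalties = []
    # Empty encouragement: praise with no substantive content
    if any(r.lower().strip() in EMPTY_PRAISE for r in responses):
        penalties.append(-0.2)
    # Generic responses: near-duplicate answers across turns
    if len(set(r.strip().lower() for r in responses)) < len(responses) * 0.5:
        penalties.append(-0.25)
    # Avoiding assessment: long sessions with no quiz at all
    if turns >= 5 and quizzes_given == 0:
        penalties.append(-0.3)
    return penalties
```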
```bash
uv run agentbeats-run scenarios/adaptive_tutor/scenario.toml --show-logs
```
This will:
- Start the Green Agent (Evaluator) and the Purple Agent (Teaching Agent)
- Initialize a simulated learner with realistic learning dynamics
- Run a pre-test, adaptive instruction loop, and post-test
- Use LLM-powered analysis to classify and score agent responses
- Output real-time logs of the instructional interaction
- Generate evaluation charts and a markdown report in results/
Orchestrates the entire instructional episode with LLM-powered analysis:
- Runs pre-test → adaptive loop → post-test sequence
- Uses Gemini LLM to classify agent responses (explain, quiz, clarify, scaffold, probe, etc.)
- Scores response quality based on pedagogical rubrics
- Calls the Purple Agent for each interaction step
- Collects comprehensive metrics:
- Learning Gain — improvement in simulated knowledge
- Adaptive Pacing — quality of responses matched to learner readiness
- Misconception Handling — effectiveness at correcting learner errors
- Question Quality — use of diagnostic probes and quizzes
- Scaffolding Effectiveness — step-by-step guidance quality
- Engagement Maintenance — sustained learner interest
- Confusion Repair — reduction in confusion
- Plus: Personalization Efficiency, Robustness, Stability
- Generates visual charts and detailed markdown report
Implements a sophisticated, deterministic learner model:
- Maintains rich internal state: {knowledge, confusion, engagement, misconception_strength, cognitive_load, mastery_confidence}
- Updates state based on agent action type and response quality
- Tracks quiz performance (correct_responses, total_attempts)
- Generates realistic learner utterances based on current state
- Deterministic via seeded RNG for reproducibility
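A minimal sketch of a seeded, deterministic state update (the deltas and action handling are assumptions; only the seeded-RNG idea mirrors the simulator's behavior):

```python
import random

# Sketch of a deterministic state update; the deltas are assumptions,
# only the seeded-RNG idea mirrors the description above.
class SimulatedLearner:
    def __init__(self, seed: int = 42, profile: str = "normal"):
        self.rng = random.Random(seed)   # same seed -> identical trajectories
        self.knowledge = 0.2
        self.confusion = 0.3
        self.profile = profile

    def update(self, action: str, quality: float) -> None:
        """Apply the effect of one tutor turn (action type + LLM quality score)."""
        if action == "explain":
            self.knowledge = min(1.0, self.knowledge + 0.1 * quality)
            self.confusion = max(0.0, self.confusion - 0.05 * quality)
        elif action == "quiz":
            # "Noise" drawn from the seeded RNG stays reproducible
            self.knowledge = min(1.0, self.knowledge + 0.05 * quality
                                 + self.rng.uniform(-0.01, 0.01))
```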
Uses Gemini to analyze agent responses:
- Classifies response type (explain, quiz, clarify, summarize, encourage, scaffold, probe)
- Scores quality on 4 dimensions: appropriateness, clarity, pedagogical soundness, engagement
- Context-aware: considers the learner's knowledge, confusion, and recent behavior
- Automatic fallback to heuristics if the LLM is unavailable
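Conceptually, the classification step looks something like this sketch (call_llm stands in for the actual Gemini client; the prompt wording and fallback heuristic are assumptions):

```python
import json

# Sketch of the response classifier; `call_llm` is a placeholder for whatever
# Gemini client the evaluator uses, and the prompt/rubric wording is assumed.
ACTIONS = ["explain", "quiz", "clarify", "summarize", "encourage", "scaffold", "probe"]

def classify_response(tutor_text: str, learner_state: dict, call_llm) -> dict:
    prompt = (
        "Classify the tutoring response and score it 0-1 on appropriateness, "
        "clarity, pedagogical_soundness and engagement. Return JSON with keys "
        f"'action' (one of {ACTIONS}) and 'scores'.\n"
        f"Learner state: {json.dumps(learner_state)}\n"
        f"Tutor response: {tutor_text}"
    )
    try:
        return json.loads(call_llm(prompt))
    except (ValueError, TypeError):
        # Heuristic fallback when the LLM is unavailable or returns bad JSON
        action = "quiz" if "?" in tutor_text else "explain"
        return {"action": action,
                "scores": {k: 0.5 for k in
                           ("appropriateness", "clarity",
                            "pedagogical_soundness", "engagement")}}
```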
Creates visualizations and reports after evaluation:
- Generates 4-panel chart: learning progress, core metrics, advanced metrics, composite score
- Creates detailed markdown report with transcript and interpretation
- Identifies strengths and weaknesses
- Saves to the results/ directory with a timestamp
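A sketch of how the 4-panel chart could be assembled with matplotlib (the results dict and file name are placeholders, not the evaluator's actual output types):

```python
import matplotlib.pyplot as plt

# Sketch of the 4-panel summary figure; the plotting calls are illustrative,
# and `results` is a placeholder dict, not the evaluator's actual output type.
def save_report_chart(results: dict, path: str = "results/ate_report.png") -> None:
    fig, axes = plt.subplots(2, 2, figsize=(12, 8))

    axes[0, 0].bar(["pre-test", "post-test"],
                   [results["pre_test"], results["post_test"]])
    axes[0, 0].set_title("Learning Progress")

    core = results["core_metrics"]          # e.g. {"learning_gain": 0.6, ...}
    axes[0, 1].barh(list(core), list(core.values()))
    axes[0, 1].set_title("Core Metrics")

    adv = results["advanced_metrics"]
    axes[1, 0].barh(list(adv), list(adv.values()))
    axes[1, 0].set_title("Advanced Metrics")

    axes[1, 1].bar(["composite"], [results["composite"]])
    axes[1, 1].set_title("Composite Score")

    fig.tight_layout()
    fig.savefig(path)
```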
Baseline teaching agent that interacts with the learner:
- Reads current prompt/state.
- Generates helpful explanations or quizzes via LLM API.
- Adapts tone or detail level based on prior feedback (optional).
Developers can later replace this with their own adaptive teaching models to compete on ATE.
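A minimal sketch of what a custom Purple Agent's turn handler might look like (the AgentBeats/A2A message plumbing is omitted; call_llm is a placeholder for your own model client):

```python
# Minimal sketch of a Purple Agent turn handler; the AgentBeats/A2A plumbing
# is omitted and `call_llm` is a placeholder for your model client.
def teaching_turn(learner_prompt: str, history: list, call_llm) -> str:
    """Produce one instructional response given the evaluator's prompt."""
    system = (
        "You are a physics tutor. Explain WHY, not just WHAT, follow "
        "prerequisites (Newton's 1st law and vectors before F=ma), and "
        "correct misconceptions explicitly."
    )
    context = "\n".join(history[-6:])  # keep the last few turns for context
    return call_llm(f"{system}\n\nRecent turns:\n{context}\n\nLearner: {learner_prompt}")
```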
Defines which agents participate and how the assessment runs:
```toml
[assessment]
name = "Adaptive Teaching Evaluation"
green_agent = "python scenarios/adaptive_tutor/tutor_eval.py"
participants = [
    { id = "teaching_agent", endpoint = "python scenarios/adaptive_tutor/tutor_agent.py" }
]
```
The evaluation combines the following weighted metrics to assess tutor performance:

Core metrics:
| Metric | Description | Weight |
|---|---|---|
| learning_gain | Improvement in test scores (0→1 normalized) | 25% |
| appropriateness | Response quality from LLM classifier | 15% |
| clarity | Clarity score from LLM classifier | 10% |
| pedagogical_soundness | Pedagogical effectiveness score | 10% |

Advanced metrics:

| Metric | Description | Weight |
|---|---|---|
| adaptive_pacing | Balances explanations, quizzes, and clarifications | 12% |
| misconception_handling | Detects and addresses misconceptions | 13% |
| question_quality | Uses probing and scaffolding questions | 6% |
| scaffolding_effectiveness | Gradual complexity increase effectiveness | 4% |
| engagement_maintenance | Sustains student engagement | 5% |
Composite Score = Weighted sum of the metrics above (0-100 scale)
After each evaluation run, the system generates:
4-panel chart showing:
- Learning Progress: Pre-test → Post-test scores with confidence bands
- Core Metrics: Horizontal bar chart of 4 primary metrics
- Advanced Metrics: Horizontal bar chart of 6 secondary metrics
- Composite Score: Overall performance (0-100 scale)
Detailed analysis including:
- Complete conversation transcript with action classification
- Metric interpretation guide
- Identified strengths and weaknesses
- Per-turn quality and engagement tracking
Files are timestamped to avoid overwriting previous runs.
| Agent | Description |
|---|---|
| Adaptive Agent | Main agent with full adaptive teaching (LLM-powered analysis) |
| Static Agent | Fixed explain→quiz loop (no adaptation) |
| Bandit Agent | Random exploration of teaching actions |
| RAG Agent | Retrieval-augmented responses from physics knowledge base |
Run comparison:
```bash
python scenarios/adaptive_tutor/compare_agents.py
```
Verify deterministic evaluation:
```bash
python verify_reproducibility.py
```
This runs the same scenario 5 times and confirms all metrics are identical.
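The underlying idea is simply to re-run the scenario with a fixed seed and compare metrics, roughly as in this sketch (run_evaluation is a hypothetical helper, not the script's real API):

```python
# Sketch of the reproducibility idea behind verify_reproducibility.py;
# `run_evaluation` is a hypothetical helper, not the script's real API.
def verify(run_evaluation, seed: int = 42, runs: int = 5) -> bool:
    """Re-run the scenario with a fixed seed and confirm identical metrics."""
    baseline = run_evaluation(seed=seed)
    return all(run_evaluation(seed=seed) == baseline for _ in range(runs - 1))
```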
- Multi-agent Sessions: evaluate multiple teaching agents collaborating/competing.
- More subjects: Expand beyond Physics to Math, Chemistry, etc.
- Advanced metrics: Add reasoning chain analysis, multi-turn coherence scoring.
Contributions are welcome! Submit pull requests for:
- new scoring metrics,
- learner model improvements, or
- baseline agent implementations.
MIT License © 2025 Aishwarya Srivastava & Contributors.
Part of the AgentBeats / AgentX Competition open benchmark ecosystem.