🧠 Adaptive Teaching Evaluation (ATE)

A Competition-Grade Benchmark for Measuring Adaptive Teaching in AI Agents


🌍 Overview

Adaptive Teaching Evaluation (ATE) is an open, reproducible AgentBeats benchmark (Green Agent) designed to rigorously evaluate how effectively teaching agents adapt to simulated learners with realistic learning patterns, misconceptions, and behavioral challenges.

What makes ATE unique:

  • Concept-Level Curriculum Tracking: Enforces prerequisite chains (can't teach F=ma before Newton's 1st Law)
  • Misconception Detection: Tracks 5 specific misconceptions and penalizes agents who reinforce incorrect beliefs
  • Adversarial Learner Profiles: Tests agents against challenging learners (persistent misconceptions, plateau learners, distracted learners)
  • Reasoning Depth Analysis: LLM-based scoring that rewards explanations of WHY, not just WHAT
  • Anti-Gaming Measures: Detects and penalizes empty encouragement, generic responses, and avoidance tactics

This project is part of the AgentX–AgentBeats competition (Phase 1) at UC Berkeley RDI, contributing a benchmark that goes beyond basic metrics to measure true teaching quality.

Architecture

Green Agent = Evaluator (orchestrates the entire evaluation)

  • Runs the experiment with configurable learner profiles
  • Controls learner model with concept-level knowledge tracking
  • Analyzes agent responses for reasoning depth and pedagogical quality
  • Computes 15+ comprehensive metrics
  • Detects curriculum violations and misconception handling

Purple Agent = Teaching agent being tested

  • Receives prompts from Green Agent about learner state
  • Generates instructional responses (explain, quiz, clarify, scaffold, etc.)
  • Must adapt to learner patterns and address misconceptions
  • Scored on curriculum adherence, reasoning depth, and effectiveness

Learner Model = Realistic learner with deterministic state + LLM responses

  • Tracks concept-level knowledge (Newton's 1st, 2nd, 3rd Law, vectors, F=ma calculations)
  • Models 5 common misconceptions (heavier objects fall faster, force causes velocity, etc.)
  • Enforces prerequisite chains (must master basics before advanced topics)
  • Generates natural language responses via LLM reflecting current cognitive state
  • Supports 3 adversarial profiles: misconception-prone, plateau, distracted
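
For concreteness, the learner's concept-level state might be represented roughly as below. This is a minimal sketch built from the concepts and misconceptions listed above; the attribute names are illustrative, not the actual fields in student_simulator.py.

from dataclasses import dataclass, field

CONCEPTS = ["newton_1st", "newton_2nd", "newton_3rd", "vectors", "fma_calculations"]
MISCONCEPTIONS = [
    "heavier_objects_fall_faster",
    "force_causes_velocity",
    "motion_requires_constant_force",
    "action_reaction_pairs_cancel",
    "friction_always_opposes_motion",
]

@dataclass
class LearnerState:
    profile: str = "normal"   # or "misconception_prone", "plateau", "distracted"
    seed: int = 42            # seeded RNG keeps each evaluation run reproducible
    confusion: float = 0.2
    engagement: float = 0.8
    knowledge: dict = field(default_factory=lambda: {c: 0.0 for c in CONCEPTS})
    misconception_strength: dict = field(default_factory=lambda: {m: 0.5 for m in MISCONCEPTIONS})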

🧩 Evaluation Flow

┌────────────────────────────────────────────────────────────────┐
│                        AgentBeats Platform                     │
├────────────────────────────────────────────────────────────────┤
│ 🟢 Green Agent (Evaluator): tutor_eval.py                      │
│   - Initializes learner with concept knowledge + misconceptions│
│   - Sends instructional scenarios to Purple Agent              │
│   - Analyzes responses for reasoning depth & quality           │
│   - Tracks curriculum violations (teaching out of order)       │
│   - Detects misconception handling (identify & correct)        │
│   - Checks for gaming behaviors (empty praise, avoiding tests) │
│   - Computes 15 metrics across 4 dimensions                    │
│                                                                │
│ 🟣 Purple Agent (Teaching Agent Being Tested): tutor_agent.py  │
│   - Receives prompts about learner understanding               │
│   - Must explain concepts with causal reasoning                │
│   - Should follow prerequisite chains                          │
│   - Identify and correct misconceptions                        │
│   - Adapt instructional strategy to learner profile            │
│                                                                │
│ 📚 Learner Model: student_simulator.py                         │
│   - 9D state: knowledge, confusion, engagement, misc strength  │
│   - Concept-level tracking: newton_1st, newton_2nd, vectors    │
│   - 5 specific misconceptions learners may hold                │
│   - Prerequisite enforcement (e.g., newton_2nd requires 1st)   │
│   - Profile-based behaviors (normal, misconception-prone, etc.)│
│   - LLM-generated responses reflecting cognitive state         │
└────────────────────────────────────────────────────────────────┘
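
Stripped of the AgentBeats plumbing, the flow above reduces to a short loop. The sketch below is structural only: the learner methods and the `teach`/`classify` callables stand in for the real tutor_eval.py internals.

def run_episode(learner, teach, classify, n_turns=10):
    """Structural sketch: pre-test, adaptive instruction loop, post-test.
    `teach` calls the Purple Agent; `classify` is the LLM-based response analyzer."""
    pre = learner.take_test()                   # baseline knowledge per concept
    for _ in range(n_turns):
        prompt = learner.describe_state()       # knowledge, confusion, active misconceptions
        reply = teach(prompt)                   # Purple Agent's teaching move
        analysis = classify(reply, learner)     # action type + pedagogical quality scores
        learner.update(analysis)                # deterministic state transition
    post = learner.take_test()
    return post - pre                           # raw learning gain, before the other metrics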

⚙️ Setup & Installation

1. Clone the repo

git clone git@github.com:ashworks1706/ATeX.git
cd ATeX

2. Install dependencies

uv sync

3. Configure environment

cp sample.env .env

Add your LLM API key (Google Gemini, OpenAI, or OpenRouter) in .env.
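
The exact variable names are defined in sample.env; a typical .env looks something like the snippet below. The key names here are illustrative only, and a single provider is enough.

# Illustrative key names — check sample.env for the exact variables; one provider is enough.
GEMINI_API_KEY=your-gemini-key
OPENAI_API_KEY=your-openai-key
OPENROUTER_API_KEY=your-openrouter-key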


📊 Evaluation Metrics

ATE uses 15 comprehensive metrics organized into 4 dimensions:

Core Learning Metrics (55% weight)

  • Learning Gain (0.18): Post-test - pre-test knowledge improvement
  • Misconception Handling (0.15): Effectiveness at identifying and correcting incorrect beliefs
  • Curriculum Adherence (0.12): Following prerequisite chains (e.g., teach newton_1st before newton_2nd)
  • Reasoning Depth (0.10): Explains why with causality, not just what (facts)

Teaching Quality (34% weight)

  • Personalization Efficiency (0.08): Learning gain per interaction
  • Adaptive Pacing (0.08): Matching teaching complexity to student readiness
  • Confusion Repair (0.07): Reduction in student confusion
  • Question Quality (0.06): Use of diagnostic probes and well-designed quizzes
  • Scaffolding Effectiveness (0.05): Quality of step-by-step guidance

Robustness & Reliability (11% weight)

  • Robustness (0.05): Consistency across interactions (low variance)
  • Stability (0.03): Reproducibility across evaluations
  • Pedagogical Soundness (0.03): Overall teaching quality

Anti-Gaming Safeguards

  • Gaming Detection: Penalizes empty encouragement (-0.2), generic responses (-0.25), avoiding assessment (-0.3)
  • Misconception Reinforcement: Zero score if tutor reinforces incorrect beliefs
  • Curriculum Violations: -0.2 per violation (teaching advanced topics before prerequisites)

Composite Score: Weighted average of all metrics (0-1 scale), designed to discriminate between teaching strategies.
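
Conceptually, the composite reduces to a weighted sum minus penalties. The sketch below reuses the per-metric weights listed above (they sum to 1.0); the penalty handling is a simplification, not the exact tutor_eval.py logic.

WEIGHTS = {
    "learning_gain": 0.18, "misconception_handling": 0.15, "curriculum_adherence": 0.12,
    "reasoning_depth": 0.10, "personalization_efficiency": 0.08, "adaptive_pacing": 0.08,
    "confusion_repair": 0.07, "question_quality": 0.06, "scaffolding_effectiveness": 0.05,
    "robustness": 0.05, "stability": 0.03, "pedagogical_soundness": 0.03,
}

def composite_score(metrics: dict, curriculum_violations: int = 0, gaming_penalty: float = 0.0) -> float:
    """Weighted average of per-metric scores in [0, 1], minus curriculum and gaming penalties."""
    base = sum(weight * metrics.get(name, 0.0) for name, weight in WEIGHTS.items())
    penalty = 0.2 * curriculum_violations + gaming_penalty
    return max(0.0, min(1.0, base - penalty))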


🎯 Unique Features

1. Concept-Level Curriculum Tracking

Unlike benchmarks with global "knowledge" scores, ATE tracks mastery of specific concepts:

  • newton_1st: Inertia and balanced forces
  • newton_2nd: F=ma and acceleration
  • newton_3rd: Action-reaction pairs
  • vectors: Magnitude and direction
  • fma_calculations: Numerical problem-solving

Prerequisite enforcement: newton_2nd cannot be taught until the learner has mastered newton_1st and vectors.
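
A minimal sketch of how such a prerequisite check can work. Only the newton_2nd edge is stated above, so the remaining edges and the mastery threshold are illustrative assumptions.

PREREQUISITES = {
    "newton_1st": [],
    "vectors": [],
    "newton_2nd": ["newton_1st", "vectors"],   # stated above
    "newton_3rd": ["newton_2nd"],              # assumed edge, for illustration
    "fma_calculations": ["newton_2nd"],        # assumed edge, for illustration
}
MASTERY_THRESHOLD = 0.7  # illustrative cutoff for "mastered"

def curriculum_violation(taught_concept: str, knowledge: dict) -> bool:
    """True if the agent taught a concept whose prerequisites are not yet mastered."""
    return any(knowledge.get(prereq, 0.0) < MASTERY_THRESHOLD
               for prereq in PREREQUISITES.get(taught_concept, []))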

2. Specific Misconception Modeling

Five common physics misconceptions with detection and correction tracking:

  • Heavier objects fall faster
  • Force causes velocity (not acceleration)
  • Motion requires constant force
  • Action-reaction pairs cancel out
  • Friction always opposes motion direction

3. Adversarial Student Profiles

Three challenging student types to test tutor robustness:

  • Misconception-Prone: Persistent incorrect beliefs, resistant to correction (confusion_sensitivity: 1.3x)
  • Plateau: Stops learning at ~50% knowledge, requires strategy change
  • Distracted: Loses focus, asks off-topic questions (30% probability), volatile engagement
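
In configuration terms, these profiles might be parameterized roughly as below; the numbers echo the list above, while the parameter names themselves are illustrative.

PROFILES = {
    "normal":              {"confusion_sensitivity": 1.0, "knowledge_ceiling": 1.0, "off_topic_prob": 0.0},
    "misconception_prone": {"confusion_sensitivity": 1.3, "knowledge_ceiling": 1.0, "off_topic_prob": 0.0},
    "plateau":             {"confusion_sensitivity": 1.0, "knowledge_ceiling": 0.5, "off_topic_prob": 0.0},
    "distracted":          {"confusion_sensitivity": 1.0, "knowledge_ceiling": 1.0, "off_topic_prob": 0.3},
}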

4. Reasoning Depth Analysis

LLM-based scoring that rewards:

  • Causal explanations (WHY things happen)
  • Analogies and concrete examples
  • Connecting to prior knowledge
  • Proactive misconception addressing

5. Anti-Gaming Detection

Identifies and penalizes low-effort strategies:

  • Empty encouragement without substance
  • Generic copy-paste responses
  • Avoiding quizzes when appropriate
  • Excessive praise to inflate engagement
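
One way to catch these patterns is a lightweight heuristic pass over the transcript. The sketch below reuses the penalty magnitudes from the metrics section; the phrase list and thresholds are illustrative, not the benchmark's actual detector.

GENERIC_PRAISE = {"great job!", "keep it up!", "you're doing great!", "good effort!"}

def gaming_penalty(responses: list, quizzes_given: int) -> float:
    """Heuristic penalties for empty praise, generic repetition, and avoided assessment."""
    penalty = 0.0
    if any(r.strip().lower() in GENERIC_PRAISE for r in responses):
        penalty += 0.2                                  # empty encouragement
    if len(set(responses)) < 0.6 * len(responses):
        penalty += 0.25                                 # many near-identical, generic responses
    if quizzes_given == 0 and len(responses) >= 5:
        penalty += 0.3                                  # avoided assessment entirely
    return penalty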

Quick Start

Run the Adaptive Teaching benchmark

uv run agentbeats-run scenarios/adaptive_tutor/scenario.toml --show-logs

This will:

  • Start the Green Agent (Evaluator) and the Purple Agent (Teaching Agent)
  • Initialize a simulated learner with realistic learning dynamics
  • Run a pre-test, adaptive instruction loop, and post-test
  • Use LLM-powered analysis to classify and score agent responses
  • Output real-time logs of the instructional interaction
  • Generate evaluation charts and markdown report in results/

🧠 Core Components

1️⃣ Green Agent — tutor_eval.py

Orchestrates the entire instructional episode with LLM-powered analysis:

  • Runs the pre-test → adaptive loop → post-test sequence
  • Uses Gemini LLM to classify agent responses (explain, quiz, clarify, scaffold, probe, etc.)
  • Scores response quality based on pedagogical rubrics
  • Calls the Purple Agent for each interaction step
  • Collects comprehensive metrics:
    • Learning Gain — improvement in simulated knowledge
    • Adaptive Pacing — quality of responses matched to learner readiness
    • Misconception Handling — effectiveness at correcting learner errors
    • Question Quality — use of diagnostic probes and quizzes
    • Scaffolding Effectiveness — step-by-step guidance quality
    • Engagement Maintenance — sustained learner interest
    • Confusion Repair — reduction in confusion
    • Plus: Personalization Efficiency, Robustness, Stability
  • Generates visual charts and detailed markdown report

2️⃣ Learner Model — student_simulator.py

Implements a sophisticated, deterministic learner model:

  • Maintains rich internal state: {knowledge, confusion, engagement, misconception_strength, cognitive_load, mastery_confidence}
  • Updates state based on agent action type and response quality
  • Tracks quiz performance (correct_responses, total_attempts)
  • Generates realistic learner utterances based on current state
  • Deterministic via seeded RNG for reproducibility
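
As an illustration of the state-update idea (not the simulator's actual transition function): a good explanation raises knowledge and lowers confusion, a quiz exercises mastery, and engagement drifts with response quality.

import random

def update_state(state: dict, action_type: str, quality: float, rng: random.Random) -> dict:
    """Illustrative transition; every stochastic choice goes through the seeded RNG."""
    if action_type == "explain":
        state["knowledge"] = min(1.0, state["knowledge"] + 0.10 * quality)
        state["confusion"] = max(0.0, state["confusion"] - 0.05 * quality)
    elif action_type == "quiz":
        correct = rng.random() < state["knowledge"]
        state["total_attempts"] += 1
        state["correct_responses"] += int(correct)
    state["engagement"] = max(0.0, min(1.0, state["engagement"] - 0.02 + 0.04 * quality))
    return state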

3️⃣ LLM Classifier — llm_classifier.py

Uses Gemini to analyze agent responses:

  • Classifies response type (explain, quiz, clarify, summarize, encourage, scaffold, probe)
  • Scores quality on 4 dimensions: appropriateness, clarity, pedagogical soundness, engagement
  • Context-aware: considers the learner's knowledge, confusion, and recent behavior
  • Automatic fallback to heuristics if the LLM is unavailable
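
A hedged sketch of such a classifier call using the google-generativeai client. The model name, environment variable, prompt, and fallback heuristic are assumptions, not llm_classifier.py's actual implementation.

import json, os
import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])   # env var name is an assumption
model = genai.GenerativeModel("gemini-1.5-flash")        # illustrative model choice

def classify_response(agent_reply: str, learner_summary: str) -> dict:
    """Ask the LLM for an action label plus 0-1 quality scores; fall back to a heuristic."""
    prompt = (
        "Classify this tutoring response as one of: explain, quiz, clarify, summarize, "
        "encourage, scaffold, probe. Then score appropriateness, clarity, "
        "pedagogical_soundness, and engagement from 0 to 1 given the learner state. "
        "Reply with JSON only.\n\n"
        f"Learner: {learner_summary}\nResponse: {agent_reply}"
    )
    try:
        return json.loads(model.generate_content(prompt).text)
    except Exception:
        # Heuristic fallback when the LLM is unavailable or returns malformed output.
        action = "quiz" if "?" in agent_reply else "explain"
        return {"action": action, "appropriateness": 0.5, "clarity": 0.5,
                "pedagogical_soundness": 0.5, "engagement": 0.5}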

4️⃣ Report Generator — report_generator.py

Creates visualizations and reports after evaluation:

  • Generates 4-panel chart: learning progress, core metrics, advanced metrics, composite score
  • Creates detailed markdown report with transcript and interpretation
  • Identifies strengths and weaknesses
  • Saves to results/ directory with timestamp

5️⃣ Purple Agent — tutor_agent.py

Baseline teaching agent that interacts with the learner:

  • Reads current prompt/state.
  • Generates helpful explanations or quizzes via LLM API.
  • Adapts tone or detail level based on prior feedback (optional).

Developers can later replace this with their own adaptive teaching models to compete on ATE.
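
The heart of a replacement Purple Agent is just a policy from learner state to a teaching move. The toy policy below is entirely hypothetical (the field names and decision rules are ours, not tutor_agent.py's interface) and ignores the AgentBeats transport layer.

def choose_teaching_move(learner_summary: dict) -> str:
    """Toy policy: correct misconceptions first, quiz when mastery looks high,
    otherwise explain the next unmastered concept. A real agent would generate
    the full response with an LLM and follow the AgentBeats protocol."""
    if learner_summary.get("active_misconceptions"):
        target = learner_summary["active_misconceptions"][0]
        return f"Let's revisit why '{target}' doesn't hold, starting from a concrete counterexample."
    if learner_summary.get("mastery_confidence", 0.0) > 0.7:
        return "Quick check: a 2 kg cart experiences a 6 N net force. What is its acceleration, and why?"
    concept = learner_summary.get("next_concept", "newton_1st")
    return f"Let me explain {concept} step by step, connecting it to what you already know."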

6️⃣ Scenario Configuration — scenario.toml

Defines which agents participate and how the assessment runs:

[assessment]
name = "Adaptive Teaching Evaluation"
green_agent = "python scenarios/adaptive_tutor/tutor_eval.py"
participants = [
  { id = "teaching_agent", endpoint = "python scenarios/adaptive_tutor/tutor_agent.py" }
]


📈 Scoring & Metrics

The evaluation computes 10 metrics to assess tutor performance:

Core Metrics (60% weight)

Metric                 | Description                                  | Weight
learning_gain          | Improvement in test scores (0→1 normalized)  | 25%
appropriateness        | Response quality from LLM classifier         | 15%
clarity                | Clarity score from LLM classifier            | 10%
pedagogical_soundness  | Pedagogical effectiveness score              | 10%

Advanced Metrics (40% weight)

Metric                     | Description                                          | Weight
adaptive_pacing            | Balances explanations, quizzes, and clarifications   | 12%
misconception_handling     | Detects and addresses misconceptions                 | 13%
question_quality           | Uses probing and scaffolding questions               | 6%
scaffolding_effectiveness  | Gradual complexity increase effectiveness            | 4%
engagement_maintenance     | Sustains student engagement                          | 5%

Composite Score = Weighted sum of all 10 metrics (0-100 scale)


📊 Output & Results

After each evaluation run, the system generates:

Visualization (results/evaluation_YYYYMMDD_HHMMSS.png)

4-panel chart showing:

  • Learning Progress: Pre-test → Post-test scores with confidence bands
  • Core Metrics: Horizontal bar chart of 4 primary metrics
  • Advanced Metrics: Horizontal bar chart of 6 secondary metrics
  • Composite Score: Overall performance (0-100 scale)

Markdown Report (results/report_YYYYMMDD_HHMMSS.md)

Detailed analysis including:

  • Complete conversation transcript with action classification
  • Metric interpretation guide
  • Identified strengths and weaknesses
  • Per-turn quality and engagement tracking

Files are timestamped to avoid overwriting previous runs.


🧪 Baselines

Agent           | Description
Adaptive Agent  | Main agent with full adaptive teaching (LLM-powered analysis)
Static Agent    | Fixed explain→quiz loop (no adaptation)
Bandit Agent    | Random exploration of teaching actions
RAG Agent       | Retrieval-augmented responses from a physics knowledge base

Run comparison:

python scenarios/adaptive_tutor/compare_agents.py

🔬 Reproducibility

Verify deterministic evaluation:

python verify_reproducibility.py

This runs the same scenario 5 times and confirms all metrics are identical.


💡 Extensions

  • Multi-agent Sessions: evaluate multiple teaching agents collaborating/competing.
  • More subjects: Expand beyond Physics to Math, Chemistry, etc.
  • Advanced metrics: Add reasoning chain analysis, multi-turn coherence scoring.

🤝 Contributing

Contributions are welcome! Submit pull requests for:

  • new scoring metrics,
  • learner model improvements, or
  • baseline agent implementations.

🛡 License

MIT License © 2025 Aishwarya Srivastava & Contributors. Part of the AgentBeats / AgentX Competition open benchmark ecosystem.
