AgentDebug v4.0 is a comprehensive, self-contained debugging framework for AI agent trajectories. It combines fine-grained error analysis, critical error detection, and reinforcement learning (RL)-guided iterative debugging to autonomously identify, prioritize, and correct failures in agent execution traces.
Built with First Principles reasoning, the system models agent behavior as modular steps (planning, tool use, memory, system) and uses a Q-learning agent to learn optimal debugging strategies over time. Performance is rigorously benchmarked across time, memory, success rate, and learning convergence.
| Feature | Description |
|---|---|
| Error Taxonomy | Probabilistic error injection based on real-world failure modes |
| Trajectory Modeling | Structured representation of agent steps with state, action, module |
| Fine-Grained Analysis | Stochastic error detection with context-aware heuristics |
| Critical Error Detection | Prioritizes planning/tool failures (high-impact) |
| RL-Guided Debugging | Q-learning agent learns which modules to fix first |
| Performance Benchmarking | Time, memory, success rate, Q-value convergence |
| Cross-Platform Memory Tracking | resource (Unix) / tracemalloc (Windows) |
| Extensible Tools & Agents | Plug-in real tools (web search, code analysis) |
| Unit Tested | Full test suite with edge cases |
```
[Agent Execution] → [Trajectory] → [Fine-Grained Analysis]
        ↓
[Critical Error Detection] → [RL Agent Chooses Module]
        ↓
[Correct Step] → [Update Q-Table] → [Repeat until Success]
```
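The loop above can be sketched in a few lines of Python. `StubRLAgent` and `debug_loop` are illustrative stand-ins, not the framework's real classes; the sketch assumes steps are dicts with `module` and `error` keys:

```python
class StubRLAgent:
    """Minimal stand-in for DebuggingAgent (illustrative only)."""
    def __init__(self, modules):
        self.q = {m: 0.0 for m in modules}  # one Q-value per module

    def choose_module(self, errors):
        # Greedy over Q-values, restricted to modules that currently have errors
        candidates = {s["module"] for s in errors}
        return max(candidates, key=lambda m: self.q[m])

    def update(self, module, reward):
        # Simple exponential moving average toward the reward
        self.q[module] += 0.1 * (reward - self.q[module])


def debug_loop(trajectory, agent, max_iterations=10):
    """Fix one module's errors per iteration until the trajectory is clean."""
    for _ in range(max_iterations):
        errors = [s for s in trajectory if s["error"] is not None]
        if not errors:
            return True  # trajectory succeeds once no step carries an error
        module = agent.choose_module(errors)
        for s in errors:
            if s["module"] == module:
                s["error"] = None  # "correct" every step in the chosen module
        agent.update(module, reward=1.0)  # reinforce the choice that helped
    return False
```

The real framework replaces the stub's greedy rule with epsilon-greedy exploration and a full Q-table update.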
- `Step` – Atomic unit: `(state, action, module)`
- `Trajectory` – Sequence of steps; evaluates success
- `DebuggingAgent` – Q-learning agent over error states
- `Benchmarker` – Measures time/memory/success across runs
- `SimpleAgent` – Simulates a real agent with tools + memory
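The first two components might be modeled roughly as follows; the field names are assumptions drawn from the descriptions above, not the actual class definitions:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Step:
    """Atomic unit of a trajectory: what the agent did, and in which module."""
    state: str
    action: str
    module: str                  # "planning" | "tool" | "memory" | "system"
    error: Optional[str] = None  # populated by error injection / analysis

@dataclass
class Trajectory:
    """Ordered sequence of steps; success means no step carries an error."""
    steps: List[Step] = field(default_factory=list)

    def is_successful(self) -> bool:
        return all(step.error is None for step in self.steps)
```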
| Module | Error Type | Probability |
|---|---|---|
| planning | reasoning_loop | 0.4 |
| planning | incomplete_plan | 0.6 |
| tool | tool_selection_error | 0.5 |
| tool | invalid_tool_input | 0.5 |
| memory | retrieval_failure | 0.7 |
| memory | context_overflow | 0.3 |
| system | execution_timeout | 0.6 |
| system | resource_limit_exceeded | 0.4 |
Errors are injected stochastically during simulation and analysis.
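Stochastic injection can be sketched directly from the table: given a module's error distribution and an overall injection rate, draw an error with `random.choices`. The `ERROR_TAXONOMY` layout mirrors the extension example later in this README; `inject_error` itself and its `injection_rate` parameter are illustrative:

```python
import random

ERROR_TAXONOMY = {
    "planning": {"reasoning_loop": 0.4, "incomplete_plan": 0.6},
    "tool":     {"tool_selection_error": 0.5, "invalid_tool_input": 0.5},
    "memory":   {"retrieval_failure": 0.7, "context_overflow": 0.3},
    "system":   {"execution_timeout": 0.6, "resource_limit_exceeded": 0.4},
}

def inject_error(module, injection_rate=0.3, rng=random):
    """With probability `injection_rate`, draw an error for `module`
    weighted by the taxonomy probabilities; otherwise return None."""
    if rng.random() >= injection_rate:
        return None
    errors = ERROR_TAXONOMY[module]
    return rng.choices(list(errors), weights=list(errors.values()))[0]
```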
| Tool | Function |
|---|---|
| `web_search` | Simulates search: `"Web search results for: {query}"` |
| `code_analysis` | Returns: `"Code analysis results for: {query}"` |

Easily extendable via `Tool(name, func, description)`.
| Scenario | Trajectory Size | Query |
|---|---|---|
| Small | 5 | "Find AI info" |
| Medium | 100 | "Find AI info" |
| Large | 1000 | "Complex query for AI debugging analysis" |
- 10 runs per scenario
- Metrics logged to `benchmark_results.txt`
- Includes std dev, peak memory, Q-value change
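A minimal sketch of what the Benchmarker measures per run, using only `time.perf_counter` and `tracemalloc`; the real class also aggregates success rate and Q-value convergence, and this helper name is an assumption:

```python
import statistics
import time
import tracemalloc

def benchmark(fn, runs=10):
    """Run `fn` repeatedly, recording wall time and peak traced memory per run."""
    times, peaks = [], []
    for _ in range(runs):
        tracemalloc.start()
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
        _, peak = tracemalloc.get_traced_memory()  # (current, peak) in bytes
        peaks.append(peak)
        tracemalloc.stop()
    return {
        "mean_time_s": statistics.mean(times),
        "std_time_s": statistics.stdev(times) if runs > 1 else 0.0,
        "peak_memory_bytes": max(peaks),
    }
```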
Initial Trajectory Analysis:
```
+--------+----------+---------------------+-------------------+
| Step   | Module   | Action              | Error             |
+========+==========+=====================+===================+
| 1      | planning | reason_query        | None              |
| 2      | memory   | retrieve_context    | retrieval_failure |
| 3      | tool     | web_search          | None              |
| 4      | planning | generate_response   | None              |
| 5      | memory   | store_context       | None              |
+--------+----------+---------------------+-------------------+
```
Iteration 1: Refined feedback: Focused on memory. Critical error at step 2 (memory): retrieval_failure...
Trajectory successful after debugging!
- `benchmark_results.txt` – Full benchmark logs
- Console output – Real-time trajectory + debugging steps
```
python AgentDebugv4.0.txt
```

Runs:
- Single agent simulation
- 3-step iterative debugging
- Final trajectory evaluation
```
python AgentDebugv4.0.txt
```

Automatically runs:
- 3 scenarios × 10 runs
- Aggregated metrics
- Logs to `benchmark_results.txt`
```
python -m unittest AgentDebugv4.0.txt
```

Or let the script run them automatically at the end.
- Python 3.8+
- Standard-library modules: `random`, `logging`, `unittest`, `typing`, `collections`, `time`, `platform`, `resource` (Unix), `tracemalloc` (Windows)
- Third-party: `numpy`, `tabulate`

No other external dependencies.
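The cross-platform memory tracking noted in the feature table can be sketched as one helper that picks a backend by platform. This is an assumed approach, not the framework's exact code; note that `ru_maxrss` reports KiB on Linux but bytes on macOS, and that `tracemalloc` measures Python allocations rather than process RSS:

```python
import platform

def peak_memory_kb():
    """Return a peak-memory figure for the current process in KiB,
    choosing the backend by platform (illustrative sketch)."""
    if platform.system() != "Windows":
        import resource  # Unix-only module
        usage = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        # ru_maxrss is KiB on Linux, bytes on macOS
        return usage // 1024 if platform.system() == "Darwin" else usage
    import tracemalloc  # fallback where `resource` is unavailable
    if not tracemalloc.is_tracing():
        tracemalloc.start()
    _, peak = tracemalloc.get_traced_memory()
    return peak // 1024
```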
Add a new error module:

```python
ERROR_TAXONOMY["new_module"] = {"error_a": 0.7, "error_b": 0.3}
```

Add a custom tool:

```python
def my_tool(query):
    return f"Result: {query}"

new_tool = Tool("my_tool", my_tool, "Does something useful")
agent = SimpleAgent(tools=[..., new_tool])
```

Tune the RL agent:

```python
rl_agent = DebuggingAgent(
    modules=[...],
    learning_rate=0.2,
    discount_factor=0.95,
    epsilon=0.3,
)
```

- All agent failures are observable in the trajectory
- Critical errors (planning/tool) block success
- Debugging is a sequential decision problem → RL
- Measure everything: time, memory, learning
- Self-contained, reproducible, extensible
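"Debugging is a sequential decision problem" is what the Q-learning machinery operationalizes. A minimal tabular sketch using the hyperparameters shown above; the class and method names are illustrative, not the framework's real API:

```python
import random

class QTable:
    """Tabular Q-learning over (error_state, module) pairs (sketch only)."""
    def __init__(self, modules, learning_rate=0.2, discount_factor=0.95, epsilon=0.3):
        self.modules = modules
        self.alpha, self.gamma, self.epsilon = learning_rate, discount_factor, epsilon
        self.q = {}  # maps (state, module) -> estimated value

    def choose(self, state, rng=random):
        # Epsilon-greedy: explore with probability epsilon, else exploit
        if rng.random() < self.epsilon:
            return rng.choice(self.modules)
        return max(self.modules, key=lambda m: self.q.get((state, m), 0.0))

    def update(self, state, module, reward, next_state):
        # Standard Q-learning: Q += alpha * (r + gamma * max_a' Q(s', a') - Q)
        best_next = max(self.q.get((next_state, m), 0.0) for m in self.modules)
        old = self.q.get((state, module), 0.0)
        self.q[(state, module)] = old + self.alpha * (reward + self.gamma * best_next - old)
```

With `epsilon=0` the agent always exploits; raising it toward `0.3` (the default above) trades repeatability for exploration of less-tried modules.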
- Persistent Q-table across runs
- Multi-agent collaborative debugging
- Real LLM integration (via API)
- Visualization dashboard (Matplotlib/Plotly)
- Confidence scoring per correction
numbnut – Built with First Principles reasoning and better axioms.
"If you cannot debug it, you cannot trust it."
MIT License – Free to use, modify, and extend.
AgentDebug v4.0 – Because perfect agents don’t exist. Perfect debuggers do.