A production-grade AI agent system implementing ReAct (Reasoning + Acting) and Reflexion (Self-Correction) architectures from first principles.
Built for modularity, low latency, and token-cost efficiency.
An autonomous AI agent that combines:
- System 1 Thinking (ReAct): Fast, reactive tool execution for straightforward tasks
- System 2 Thinking (Reflexion): Self-reflective retry strategies for complex, multi-step problems
Unlike black-box frameworks (LangChain, AutoGPT), this project gives you full control over:
- Context window management
- Prompt construction logic
- Error handling and retry strategies
- Cost-latency trade-offs
Why Build This? To bridge the gap between basic LLM tool use and robust, fault-tolerant problem-solving while maintaining transparency and control over every architectural decision.
Task: "What is 15% of Apple's current stock price?"
Agent: ReAct (System 1 - Fast execution)
What happens:
- 🔍 Searches web for Apple stock price
- 🧮 Uses calculator to compute 15%
- ⚡ Completes in ~4.4 seconds
⚡ Execution Time: ~4.4s | 💰 Est. Cost: $0.008 | 🔄 Trials: 1
Use case: Quick, straightforward tasks where speed matters
Task: "Who won the latest Formula 1 race?"
Agent: Reflexion (System 2 - Self-correcting)
What happens:
- Trial 1: Actor searches for latest F1 race results
- Evaluator: Validates the information is current and accurate
- ✅ Success on first try - Answer verified and returned
⚡ Execution Time: ~10.5s | 💰 Est. Cost: $0.014 | 🔄 Trials: 1 ✅
Note: Even though Reflexion succeeded on the first trial, the evaluation step still ran to verify correctness, which is why it's slower than pure ReAct but more reliable.
Use case: Complex queries requiring verification and multi-step reasoning
| Metric | ReAct (demo.gif) | Reflexion (demo1.gif) | Trade-off |
|---|---|---|---|
| Speed | 4.4s | 10.5s | Reflexion ~2.4× slower |
| Cost | $0.008 | $0.014 | Reflexion +75% costlier |
| Reliability | Medium | High | Reflexion has self-correction |
| Best For | Simple queries | Complex/ambiguous tasks | — |
ReAct Flow (demo.gif):
- Actor (Groq/Llama 3.3): Executes Thought-Action-Observation loop
- Fast tool execution (<2s per step)
- Returns answer when task complete (see the loop sketch below)
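A minimal sketch of that Thought-Action-Observation loop. The names used here (`llm.chat`, the `tools` dict, the JSON action schema) are illustrative assumptions; the actual implementation lives in `src/architectures/react.py`.

```python
import json

def react_loop(task: str, llm, tools: dict, max_steps: int = 10) -> str:
    """Illustrative ReAct loop, not the exact src/architectures/react.py API."""
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        # Thought + Action: the model emits its next step as JSON,
        # e.g. {"tool": "search", "input": "Apple stock price"}
        raw = llm.chat(history, max_tokens=1000)
        action = json.loads(raw)  # the real code routes this through the hardened parser
        if action["tool"] == "final_answer":
            return action["input"]
        # Observation: execute the chosen tool and feed the result back into context
        observation = tools[action["tool"]].execute(action["input"])
        history += [
            {"role": "assistant", "content": raw},
            {"role": "user", "content": f"Observation: {observation}"},
        ]
    return "Max steps reached without a final answer"
```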
Reflexion Flow (demo1.gif):
- Actor: Executes ReAct loop
- Evaluator (Gemini 2.5): Judges if trajectory is correct
- Reflector: On failure, performs root cause analysis
- Memory: Injects lessons into next retry attempt
- Returns verified answer or admits failure after max trials
We treat model selection as an architectural decision, separating fast execution from deep reasoning.
```
┌─────────────────────────────────────────────┐
│            REFLEXION ORCHESTRATOR           │
│   (Manages trial loops & memory injection)  │
└──────────────────────┬──────────────────────┘
                       │
                       ▼
           ┌──────────────────────┐
           │    Actor (ReAct)     │◄─────────────┐
           │    Groq/Llama 3.3    │              │
           │    (Fast & Cheap)    │              │
           └───────────┬──────────┘              │
                       │                         │
                       │ Trajectory              │
                       ▼                         │
           ┌──────────────────────┐              │
           │      Evaluator       │              │
           │   Gemini 2.5 Flash   │              │
           │    (Slow & Smart)    │              │
           └───────────┬──────────┘              │
                       │                         │
          ┌────────────┴──────────┐              │
      (Success)               (Failure)          │
          │                       │              │
          ▼                       ▼              │
    ┌────────────┐     ┌─────────────────────┐   │
    │ Return Ans │     │      Reflector      │   │
    └────────────┘     │  Root Cause + Plan  │   │
                       └──────────┬──────────┘   │
                                  │              │
                                  ▼              │
                       ┌─────────────────────┐   │
                       │   Episodic Memory   │───┘
                       │ (Injected Next Try) │
                       └─────────────────────┘
```
| Component | Role | Implementation |
|---|---|---|
| Actor | Executes Thought-Action-Observation loops | Groq API (Llama 3.3-70B, 300+ tok/s) |
| Evaluator | Judges success/failure of trajectories | Google Gemini 2.5 (Native SDK) |
| Reflector | Performs root cause analysis on failures | Structured prompt templates |
| Memory | Stores lessons across retry attempts | Episodic buffer (v0.3: Vector DB) |
| Tools | External capabilities (search, math, web scraping) | Abstract base class for extensibility (sketched below) |
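A minimal sketch of what that abstract tool interface might look like. The real one lives in `src/tools/base.py`; the attribute and method names here are assumptions.

```python
from abc import ABC, abstractmethod


class Tool(ABC):
    """Sketch of the tool interface (see src/tools/base.py); names are illustrative."""

    name: str
    description: str  # surfaced to the LLM so it knows when to call this tool

    @abstractmethod
    def execute(self, input: str) -> str:
        """Run the tool and return a plain-text observation."""


class Calculator(Tool):
    name = "calculator"
    description = "Evaluates basic arithmetic, e.g. '250.5 * 0.15'."

    def execute(self, input: str) -> str:
        # eval() keeps the sketch short; a real tool should validate/sandbox the expression
        return str(eval(input))
```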
Building an agent is easy. Making it reliable is hard. Here's what we learned shipping this to real workloads.
Challenge: Migrating from local Llama to cloud APIs revealed "chatter": models would output `<think>` blocks, markdown fences, or conversational asides that broke JSON parsers.
War Story: On Oct 15, 2025, we hit a 100% crash rate when Llama 3.3 started prefixing responses with "Let me think through this..." instead of valid JSON.
Solution: Built a battle-hardened parser with regex stripping:

```python
import json
import re

# Strips: <think>, ```json, markdown, extra whitespace
def extract_action(text: str) -> dict:
    text = re.sub(r'<think>.*?</think>', '', text, flags=re.DOTALL)
    text = re.sub(r'```json\s*|\s*```', '', text)
    # ... 6 more sanitization layers
    return json.loads(text.strip())  # finally parse the cleaned JSON
```

Result: Zero parser crashes in 2000+ test runs across 4 model providers.
Challenge: Tools like web_browse would dump 50k+ tokens of raw HTML into context, causing:
- "Lost in the Middle" syndrome (LLM ignores key facts)
- Cost explosions ($0.50/query → $4.20/query)
- Context limit crashes on 8k window models
Solution: Implemented strict truncation at the tool layer:

```python
def execute(self, url: str) -> str:
    html = fetch(url)
    text = strip_tags(html)
    return text[:5000]  # Hard limit: 5k chars
```

Evolution (v0.3 Roadmap):
- Use a cheap 8B model to summarize tool output before feeding it to the main agent (see the sketch below)
- Adaptive truncation based on the remaining context budget
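A hedged sketch of that planned v0.3 direction: compress verbose tool output with a small, cheap model before it reaches the main agent. The `llm_small.chat` call and the character budget are assumptions for illustration, not shipped code.

```python
def compress_observation(text: str, llm_small, budget_chars: int = 2000) -> str:
    """Planned v0.3 behaviour (sketch): summarize long tool output with a cheap model."""
    if len(text) <= budget_chars:
        return text  # short outputs pass through untouched
    prompt = (
        "Summarize the following page for an AI agent. "
        "Keep numbers, names, and dates exact.\n\n" + text[:20000]
    )
    summary = llm_small.chat([{"role": "user", "content": prompt}], max_tokens=400)
    return summary[:budget_chars]  # still enforce the hard context budget
```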
Metrics:
- Cost reduced by 73% on web-heavy tasks
- Avg context usage: 6.2k → 2.8k tokens
Challenge:
- Hit 429 rate limits on Groq during peak hours
- Google silently deprecated Gemini 1.5 endpoints (404 errors)
- Local Ollama server crashes on OOM
Solutions:
- Exponential backoff with jitter (3 retries, max 16s wait); see the sketch below
- Provider abstraction layer for hot-swapping: `llm = LLMFactory.create(provider="groq")  # Or "google", "ollama"`
- Native SDK migration for Google (decoupled from the OpenAI compatibility shim)
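A minimal sketch of that retry policy (3 retries, capped at 16s, with jitter). The broad `except Exception` is a placeholder; the real code should narrow it to rate-limit and transient network errors.

```python
import random
import time

def call_with_backoff(fn, retries: int = 3, base_delay: float = 2.0, max_delay: float = 16.0):
    """Retry a flaky API call with exponential backoff plus jitter."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:  # placeholder: narrow to 429 / transient errors in practice
            if attempt == retries:
                raise
            delay = min(base_delay * (2 ** attempt), max_delay)
            time.sleep(delay + random.uniform(0, 1))  # jitter avoids synchronized retries
```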
Learning: Never assume external APIs are stable. Always have a fallback path.
Benchmark Discovery: Running full Reflexion on simple tasks is wasteful.
| Task | ReAct (System 1) | Reflexion (System 2) | Overhead |
|---|---|---|---|
| `2 + 2` | 1.7s, $0.001 | 7.3s, $0.004 | 4.3× slower, 4× costlier |
| Complex Finance | 32.8s, $0.02 | 43.7s, $0.03 | 1.3× slower, 1.5× costlier |
Key Insight: The evaluation step adds ~2-11s of latency. For production, we need adaptive routing:
- Route simple queries → Fast ReAct path
- Route ambiguous/complex queries → Reflexion path
Planned for v0.3: Task complexity classifier using embedding similarity (sketched below).
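A hedged sketch of what that router could look like: embed the incoming query, compare it against a handful of known-simple exemplar tasks, and only pay for Reflexion when nothing matches. The `embed` callable and the 0.75 threshold are illustrative assumptions.

```python
import numpy as np

def route(query: str, embed, simple_examples: list[str], threshold: float = 0.75) -> str:
    """Planned v0.3 router (sketch): cosine similarity against known-simple tasks."""
    q = np.asarray(embed(query))
    best = max(
        float(np.dot(q, e) / (np.linalg.norm(q) * np.linalg.norm(e)))
        for e in (np.asarray(embed(x)) for x in simple_examples)
    )
    # Close to a simple exemplar -> fast ReAct path; otherwise run Reflexion
    return "react" if best >= threshold else "reflexion"
```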
Tested on M1 MacBook Air, averaging 10 runs per task.
| Task | ReAct (System 1) | Reflexion (System 2) | Notes |
|---|---|---|---|
| Simple Math | 1.71s | 3.49s | Reflexion adds ~2s eval overhead |
| Complex Finance | 32.77s | 43.67s | Reflexion adds ~11s eval overhead |
| Multi-Step Web | 8.2s | 14.6s | Network I/O dominates |
| Error Type | ReAct Recovery | Reflexion Recovery |
|---|---|---|
| Tool misuse | ❌ (crashes) | ✅ (reflects + retries) |
| Malformed JSON | ❌ (parser error) | ✅ (hardened parser) |
| API rate limits | ✅ (exponential backoff) | ✅ (exponential backoff) |
| Task Complexity | ReAct | Reflexion | Breakdown |
|---|---|---|---|
| Simple (1-2 steps) | ~$0.001 | ~$0.003 | Eval costs 2× input tokens |
| Medium (3-5 steps) | ~$0.008 | ~$0.015 | Memory injection adds context |
| Complex (6+ steps) | ~$0.025 | ~$0.040 | Multiple retry loops possible |
Recommendation: Use Reflexion for high-stakes tasks where correctness > speed. Use ReAct for low-latency, high-volume workloads.
- Python 3.9+
- API Keys: Groq (free tier: 30 req/min), Google AI Studio (free tier available)
```bash
# Clone repository
git clone https://github.com/yourusername/ai-agent-framework.git
cd ai-agent-framework

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

Create a `.env` file in the project root:
```
GROQ_API_KEY=gsk_your_key_here
GOOGLE_API_KEY=AIza_your_key_here
```

1. Fast ReAct Agent (For Simple Tasks)

```bash
python run_agent.py --agent react "What is 15% of Apple's current stock price?"
```

2. Smart Reflexion Agent (For Complex Logic)

```bash
python run_agent.py --agent reflexion "Who is older: King Charles III or Donald Trump?"
```

3. Run A/B Benchmarks

```bash
python run_comparison.py
```

Outputs detailed latency/cost metrics to `benchmarks/results.json`.
```
.
├── README.md                    # You are here
├── requirements.txt             # Dependencies (requests, groq, google-generativeai)
├── .env.example                 # Template for API keys
├── run_agent.py                 # CLI entry point
├── run_comparison.py            # A/B testing harness
├── benchmarks/
│   └── results/                 # Performance logs (JSON + charts)
└── src/
    ├── architectures/           # Core agent logic
    │   ├── react.py             # ReAct loop (Actor)
    │   └── reflexion.py         # Reflexion orchestrator
    ├── components/              # Pluggable strategy modules
    │   ├── evaluators/          # Trajectory success/failure judges
    │   ├── reflectors/          # Root cause analysis generators
    │   └── memory/              # Episodic memory (key-value store)
    ├── llm/                     # Unified LLM interface
    │   ├── factory.py           # Provider abstraction
    │   ├── groq_interface.py    # Llama 3.3 / Qwen (OpenAI SDK)
    │   └── google_interface.py  # Gemini 2.5 (Native SDK)
    ├── tools/                   # External capabilities
    │   ├── base.py              # Abstract tool interface
    │   ├── general_tools.py     # Calculator, search, datetime
    │   ├── web_tools.py         # Web scraper (BeautifulSoup)
    │   └── finance_tools.py     # Stock price API (Alpha Vantage)
    └── utils/
        ├── parser.py            # Hardened JSON extractor
        └── logger.py            # Structured logging
```
- Sequential ReAct agent implementation
- Tool abstraction layer (Calculator, Search, Web Browse)
- LLM interface with provider switching
- Basic error handling
- Hybrid architecture (Groq + Google)
- Reflexion orchestrator with trial loops
- Battle-hardened parser (markdown/chatter stripping)
- Benchmarking suite with cost tracking
- Exponential backoff for API resilience
- Long-Term Memory: RAG with Pinecone/Weaviate for cross-session learning
- Adaptive Routing: Complexity classifier (embedding-based) to choose ReAct vs Reflexion
- Semantic Caching: Redis layer to avoid redundant LLM calls
- Tool Output Compression: Use Llama 3.2-3B to summarize verbose tool responses
- Observability: OpenTelemetry tracing for latency profiling
- Horizontal scaling: Multi-agent task distribution
- Tool marketplace: Dynamic tool loading (50+ tools)
- Automated evaluation: DSPy-based scoring (replace manual validation)
- Deployment: Docker + FastAPI + Kubernetes manifests
- Security: Input sanitization, rate limiting, audit logs
Philosophy: Zero abstraction debt.
LangChain is powerful but introduces:
- Magic: Hidden context window management can cause silent truncation
- Versioning Hell: Breaking changes between 0.0.x releases
- Debugging Opacity: Nested callbacks make error traces unreadable
Our approach:

```python
# Explicit control over every token
messages = [
    {"role": "system", "content": system_prompt},
    *history,  # Full visibility into what's in context
    {"role": "user", "content": user_query},
]
response = llm.chat(messages, max_tokens=1000)
```

Trade-off: We write more boilerplate but gain:
- Precise token budgets (critical for cost optimization)
- Deterministic behavior (no "framework magic" failures)
- Easy migration between providers (uniform interface; sketched below)
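To make the uniform-interface claim concrete, here is a minimal sketch of how the provider abstraction could be shaped. `LLMFactory.create` appears earlier in this README; the registry mechanics and the `BaseLLM.chat` signature shown here are illustrative assumptions (see `src/llm/factory.py` for the real thing).

```python
from abc import ABC, abstractmethod


class BaseLLM(ABC):
    """Every provider exposes the same chat() signature, so agents never care who is behind it."""

    @abstractmethod
    def chat(self, messages: list[dict], max_tokens: int = 1000) -> str:
        ...


class LLMFactory:
    """Sketch of a provider registry behind LLMFactory.create()."""

    _registry: dict[str, type] = {}

    @classmethod
    def register(cls, name: str, impl: type) -> None:
        cls._registry[name] = impl

    @classmethod
    def create(cls, provider: str) -> BaseLLM:
        # e.g. provider = "groq", "google", or "ollama"
        return cls._registry[provider]()
```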
Core Idea: If an agent fails, make it explain why it failed and how to fix it, then retry with that knowledge.
Pseudocode:

```python
memory = []
for trial in range(max_trials):
    trajectory = actor.run(task, memory)        # ReAct loop
    if evaluator.is_correct(trajectory):
        return trajectory.final_answer
    reflection = reflector.analyze(trajectory)  # Root cause
    memory.append(reflection)                   # Inject into next trial
return "Failed after all trials"
```

Example Reflection:
```
Trial 1 failed because:
- I searched "King Charles age" instead of "birth date"
- Age changes yearly; birth date is static
Next trial: Search for exact birth dates to enable comparison
```
Why This Works: The reflection acts as a "debugger" that teaches the agent from its own mistakes.
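One way the Memory component can inject those reflections into the next trial is by prepending them to the Actor's system prompt. A hedged sketch; `build_actor_prompt` is a hypothetical helper, not the repository's actual function name.

```python
def build_actor_prompt(base_system_prompt: str, memory: list[str]) -> str:
    """Sketch of memory injection: prepend past reflections so the next trial starts with the lessons learned."""
    if not memory:
        return base_system_prompt
    lessons = "\n".join(f"- {m}" for m in memory)
    return (
        f"{base_system_prompt}\n\n"
        "Reflections from previous failed attempts (do not repeat these mistakes):\n"
        f"{lessons}"
    )
```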
Contributions welcome! Focus areas:
- New Tools: Add domain-specific tools (SQL, file I/O, API clients)
- Evaluators: Improve correctness checking (semantic similarity, fact verification)
- Benchmarks: Add new test cases (especially adversarial ones)
Process:
- Fork the repo
- Create a feature branch (`git checkout -b feature/amazing-tool`)
- Write tests in `tests/` (we use pytest)
- Commit with conventional commits (`feat: add SQL query tool`)
- Open a PR with benchmark results
Papers:
- ReAct: Synergizing Reasoning and Acting in Language Models (Yao et al., 2022)
- Reflexion: Language Agents with Verbal Reinforcement Learning (Shinn et al., 2023)
Projects:
MIT License - See LICENSE for full details.
TL;DR: Use this commercially, modify freely, just keep the license notice.
Attribution:
If you use, modify, or redistribute any part of this project, please retain attribution to the original author:
AI Agent Framework – Hybrid Architecture
Author: Riya Sangwan
Repository: https://github.com/ria-19/ai_agent
Author: Riya Sangwan
Email: riya.sangwandec19@example.com
LinkedIn: linkedin.com/in/riyasangwan/
Looking for Work? I'm currently seeking roles in Applied AI/ML Engineering at startups, unicorns, and top-tier tech companies. This project demonstrates:
- Production ML system design
- LLM orchestration & prompt engineering
- Cost-performance optimization
- API resilience & error handling
- Technical writing & documentation
Open to: Full-time roles, contract work, or technical consulting in the AI agent space.
- Groq team for LPU access and responsive API support
- Google AI Studio for generous free tier
- The open-source community for inspiration
Star this repo if you found it useful! ⭐
Watch releases to get notified about v0.3 (Memory + Routing).

