A production-grade AI agent system implementing ReAct (Reasoning + Acting) and Reflexion (Self-Correction) architectures from first principles.
Built for modularity, low latency, and token-cost efficiency.
An autonomous AI agent that combines:
- System 1 Thinking (ReAct): Fast, reactive tool execution for straightforward tasks
- System 2 Thinking (Reflexion): Self-reflective retry strategies for complex, multi-step problems
Unlike black-box frameworks (LangChain, AutoGPT), this project gives you full control over:
- Context window management
- Prompt construction logic
- Error handling and retry strategies
- Cost-latency trade-offs
Why Build This? To bridge the gap between basic LLM tool use and robust, fault-tolerant problem-solving while maintaining transparency and control over every architectural decision.
Task: "What is 15% of Apple's current stock price?"
Agent: ReAct (System 1 - Fast execution)
What happens:
- 🔍 Searches web for Apple stock price
- 🧮 Uses calculator to compute 15%
- ⚡ Completes in ~4.4 seconds
⚡ Execution Time: ~4.4s | 💰 Est. Cost: $0.008 | 🔄 Trials: 1
Use case: Quick, straightforward tasks where speed matters
Task: "Who won the latest Formula 1 race?"
Agent: Reflexion (System 2 - Self-correcting)
What happens:
- Trial 1: Actor searches for latest F1 race results
- Evaluator: Validates the information is current and accurate
- ✅ Success on first try - Answer verified and returned
⚡ Execution Time: ~10.5s | 💰 Est. Cost: $0.014 | 🔄 Trials: 1 ✅
Note: Even though Reflexion succeeded on the first trial, the evaluation step still ran to verify correctness, which is why it's slower than pure ReAct but more reliable.
Use case: Complex queries requiring verification and multi-step reasoning
| Metric | ReAct (demo.gif) | Reflexion (demo1.gif) | Trade-off |
|---|---|---|---|
| Speed | 4.4s | 10.5s | Reflexion ~2.4× slower |
| Cost | $0.008 | $0.014 | Reflexion +75% costlier |
| Reliability | Medium | High | Reflexion has self-correction |
| Best For | Simple queries | Complex/ambiguous tasks | — |
ReAct Flow (demo.gif):
- Actor (Groq/Llama 3.3): Executes Thought-Action-Observation loop
- Fast tool execution (<2s per step)
- Returns answer when task complete (see the loop sketch below)
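A minimal sketch of that Thought-Action-Observation loop. The names used here (`llm.chat`, the `tools` dict, the JSON action schema) are illustrative assumptions; the actual implementation lives in `src/architectures/react.py`.

```python
import json

def react_loop(task: str, llm, tools: dict, max_steps: int = 10) -> str:
    """Illustrative ReAct loop, not the exact src/architectures/react.py API."""
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        # Thought + Action: the model emits its next step as JSON,
        # e.g. {"tool": "search", "input": "Apple stock price"}
        raw = llm.chat(history, max_tokens=1000)
        action = json.loads(raw)  # the real code routes this through the hardened parser
        if action["tool"] == "final_answer":
            return action["input"]
        # Observation: execute the chosen tool and feed the result back into context
        observation = tools[action["tool"]].execute(action["input"])
        history += [
            {"role": "assistant", "content": raw},
            {"role": "user", "content": f"Observation: {observation}"},
        ]
    return "Max steps reached without a final answer"
```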
Reflexion Flow (demo1.gif):
- Actor: Executes ReAct loop
- Evaluator (Gemini 2.5): Judges if trajectory is correct
- Reflector: On failure, performs root cause analysis
- Memory: Injects lessons into next retry attempt
- Returns verified answer or admits failure after max trials
We treat model selection as an architectural decision, separating fast execution from deep reasoning.
```
┌─────────────────────────────────────────────┐
│            REFLEXION ORCHESTRATOR           │
│   (Manages trial loops & memory injection)  │
└──────────────────────┬──────────────────────┘
                       │
                       ▼
           ┌──────────────────────┐
           │    Actor (ReAct)     │◄─────────────┐
           │    Groq/Llama 3.3    │              │
           │    (Fast & Cheap)    │              │
           └───────────┬──────────┘              │
                       │                         │
                       │ Trajectory              │
                       ▼                         │
           ┌──────────────────────┐              │
           │      Evaluator       │              │
           │   Gemini 2.5 Flash   │              │
           │    (Slow & Smart)    │              │
           └───────────┬──────────┘              │
                       │                         │
          ┌────────────┴──────────┐              │
      (Success)               (Failure)          │
          │                       │              │
          ▼                       ▼              │
    ┌────────────┐     ┌─────────────────────┐   │
    │ Return Ans │     │      Reflector      │   │
    └────────────┘     │  Root Cause + Plan  │   │
                       └──────────┬──────────┘   │
                                  │              │
                                  ▼              │
                       ┌─────────────────────┐   │
                       │   Episodic Memory   │───┘
                       │ (Injected Next Try) │
                       └─────────────────────┘
```
| Component | Role | Implementation |
|---|---|---|
| Actor | Executes Thought-Action-Observation loops | Groq API (Llama 3.3-70B, 300+ tok/s) |
| Evaluator | Judges success/failure of trajectories | Google Gemini 2.5 (Native SDK) |
| Reflector | Performs root cause analysis on failures | Structured prompt templates |
| Memory | Stores lessons across retry attempts | Episodic buffer (v0.3: Vector DB) |
| Tools | External capabilities (search, math, web scraping) | Abstract base class for extensibility (sketched below) |
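A minimal sketch of what that abstract tool interface might look like. The real one lives in `src/tools/base.py`; the attribute and method names here are assumptions.

```python
from abc import ABC, abstractmethod


class Tool(ABC):
    """Sketch of the tool interface (see src/tools/base.py); names are illustrative."""

    name: str
    description: str  # surfaced to the LLM so it knows when to call this tool

    @abstractmethod
    def execute(self, input: str) -> str:
        """Run the tool and return a plain-text observation."""


class Calculator(Tool):
    name = "calculator"
    description = "Evaluates basic arithmetic, e.g. '250.5 * 0.15'."

    def execute(self, input: str) -> str:
        # eval() keeps the sketch short; a real tool should validate/sandbox the expression
        return str(eval(input))
```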
Building an agent is easy. Making it reliable is hard. Here's what we learned shipping this to real workloads.
Challenge: Migrating from local Llama to cloud APIs revealed "chatter": models would output `<think>` blocks, markdown fences, or conversational asides that broke JSON parsers.
War Story: On Oct 15, 2025, we hit a 100% crash rate when Llama 3.3 started prefixing responses with "Let me think through this..." instead of valid JSON.
Solution: Built a battle-hardened parser with regex stripping:

```python
import json
import re

# Strips: <think>, ```json, markdown, extra whitespace
def extract_action(text: str) -> dict:
    text = re.sub(r'<think>.*?</think>', '', text, flags=re.DOTALL)
    text = re.sub(r'```json\s*|\s*```', '', text)
    # ... 6 more sanitization layers
    return json.loads(text.strip())  # finally parse the cleaned JSON
```

Result: Zero parser crashes in 2000+ test runs across 4 model providers.
Challenge: Tools like web_browse would dump 50k+ tokens of raw HTML into context, causing:
- "Lost in the Middle" syndrome (LLM ignores key facts)
- Cost explosions ($0.50/query → $4.20/query)
- Context limit crashes on 8k window models
Solution: Implemented strict truncation at the tool layer:

```python
def execute(self, url: str) -> str:
    html = fetch(url)
    text = strip_tags(html)
    return text[:5000]  # Hard limit: 5k chars
```

Evolution (v0.3 Roadmap):
- Use a cheap 8B model to summarize tool output before feeding it to the main agent (see the sketch below)
- Adaptive truncation based on the remaining context budget
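A hedged sketch of that planned v0.3 direction: compress verbose tool output with a small, cheap model before it reaches the main agent. The `llm_small.chat` call and the character budget are assumptions for illustration, not shipped code.

```python
def compress_observation(text: str, llm_small, budget_chars: int = 2000) -> str:
    """Planned v0.3 behaviour (sketch): summarize long tool output with a cheap model."""
    if len(text) <= budget_chars:
        return text  # short outputs pass through untouched
    prompt = (
        "Summarize the following page for an AI agent. "
        "Keep numbers, names, and dates exact.\n\n" + text[:20000]
    )
    summary = llm_small.chat([{"role": "user", "content": prompt}], max_tokens=400)
    return summary[:budget_chars]  # still enforce the hard context budget
```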
Metrics:
- Cost reduced by 73% on web-heavy tasks
- Avg context usage: 6.2k → 2.8k tokens
Challenge:
- Hit 429 rate limits on Groq during peak hours
- Google silently deprecated Gemini 1.5 endpoints (404 errors)
- Local Ollama server crashes on OOM
Solutions:
- Exponential backoff with jitter (3 retries, max 16s wait); see the sketch below
- Provider abstraction layer for hot-swapping: `llm = LLMFactory.create(provider="groq")  # Or "google", "ollama"`
- Native SDK migration for Google (decoupled from the OpenAI compatibility shim)
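A minimal sketch of that retry policy (3 retries, capped at 16s, with jitter). The broad `except Exception` is a placeholder; the real code should narrow it to rate-limit and transient network errors.

```python
import random
import time

def call_with_backoff(fn, retries: int = 3, base_delay: float = 2.0, max_delay: float = 16.0):
    """Retry a flaky API call with exponential backoff plus jitter."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:  # placeholder: narrow to 429 / transient errors in practice
            if attempt == retries:
                raise
            delay = min(base_delay * (2 ** attempt), max_delay)
            time.sleep(delay + random.uniform(0, 1))  # jitter avoids synchronized retries
```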
Learning: Never assume external APIs are stable. Always have a fallback path.
Benchmark Discovery: Running full Reflexion on simple tasks is wasteful.
| Task | ReAct (System 1) | Reflexion (System 2) | Overhead |
|---|---|---|---|
| `2 + 2` | 1.7s, $0.001 | 7.3s, $0.004 | 4.3× slower, 4× costlier |
| Complex Finance | 32.8s, $0.02 | 43.7s, $0.03 | 1.3× slower, 1.5× costlier |
Key Insight: The evaluation step adds ~2-11s of latency. For production, we need adaptive routing:
- Route simple queries → Fast ReAct path
- Route ambiguous/complex queries → Reflexion path
Planned for v0.3: Task complexity classifier using embedding similarity (sketched below).
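A hedged sketch of what that router could look like: embed the incoming query, compare it against a handful of known-simple exemplar tasks, and only pay for Reflexion when nothing matches. The `embed` callable and the 0.75 threshold are illustrative assumptions.

```python
import numpy as np

def route(query: str, embed, simple_examples: list[str], threshold: float = 0.75) -> str:
    """Planned v0.3 router (sketch): cosine similarity against known-simple tasks."""
    q = np.asarray(embed(query))
    best = max(
        float(np.dot(q, e) / (np.linalg.norm(q) * np.linalg.norm(e)))
        for e in (np.asarray(embed(x)) for x in simple_examples)
    )
    # Close to a simple exemplar -> fast ReAct path; otherwise run Reflexion
    return "react" if best >= threshold else "reflexion"
```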
Tested on M1 MacBook Air, averaging 10 runs per task.
| Task | ReAct (System 1) | Reflexion (System 2) | Notes |
|---|---|---|---|
| Simple Math | 1.71s | 3.49s | Reflexion adds ~2s eval overhead |
| Complex Finance | 32.77s | 43.67s | Reflexion adds ~11s eval overhead |
| Multi-Step Web | 8.2s | 14.6s | Network I/O dominates |
| Error Type | ReAct Recovery | Reflexion Recovery |
|---|---|---|
| Tool misuse | ❌ (crashes) | ✅ (reflects + retries) |
| Malformed JSON | ❌ (parser error) | ✅ (hardened parser) |
| API rate limits | ✅ (exponential backoff) | ✅ (exponential backoff) |
| Task Complexity | ReAct | Reflexion | Breakdown |
|---|---|---|---|
| Simple (1-2 steps) | ~$0.001 | ~$0.003 | Eval costs 2× input tokens |
| Medium (3-5 steps) | ~$0.008 | ~$0.015 | Memory injection adds context |
| Complex (6+ steps) | ~$0.025 | ~$0.040 | Multiple retry loops possible |
Recommendation: Use Reflexion for high-stakes tasks where correctness > speed. Use ReAct for low-latency, high-volume workloads.
- Python 3.9+
- API Keys: Groq (free tier: 30 req/min), Google AI Studio (free tier available)
```bash
# Clone repository
git clone https://github.com/yourusername/ai-agent-framework.git
cd ai-agent-framework

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

Create a `.env` file in the project root:
```
GROQ_API_KEY=gsk_your_key_here
GOOGLE_API_KEY=AIza_your_key_here
```

1. Fast ReAct Agent (For Simple Tasks)

```bash
python run_agent.py --agent react "What is 15% of Apple's current stock price?"
```

2. Smart Reflexion Agent (For Complex Logic)

```bash
python run_agent.py --agent reflexion "Who is older: King Charles III or Donald Trump?"
```

3. Run A/B Benchmarks

```bash
python run_comparison.py
```

Outputs detailed latency/cost metrics to `benchmarks/results.json`.
```
.
├── README.md                    # You are here
├── requirements.txt             # Dependencies (requests, groq, google-generativeai)
├── .env.example                 # Template for API keys
├── run_agent.py                 # CLI entry point
├── run_comparison.py            # A/B testing harness
├── benchmarks/
│   └── results/                 # Performance logs (JSON + charts)
└── src/
    ├── architectures/           # Core agent logic
    │   ├── react.py             # ReAct loop (Actor)
    │   └── reflexion.py         # Reflexion orchestrator
    ├── components/              # Pluggable strategy modules
    │   ├── evaluators/          # Trajectory success/failure judges
    │   ├── reflectors/          # Root cause analysis generators
    │   └── memory/              # Episodic memory (key-value store)
    ├── llm/                     # Unified LLM interface
    │   ├── factory.py           # Provider abstraction
    │   ├── groq_interface.py    # Llama 3.3 / Qwen (OpenAI SDK)
    │   └── google_interface.py  # Gemini 2.5 (Native SDK)
    ├── tools/                   # External capabilities
    │   ├── base.py              # Abstract tool interface
    │   ├── general_tools.py     # Calculator, search, datetime
    │   ├── web_tools.py         # Web scraper (BeautifulSoup)
    │   └── finance_tools.py     # Stock price API (Alpha Vantage)
    └── utils/
        ├── parser.py            # Hardened JSON extractor
        └── logger.py            # Structured logging
```
- Sequential ReAct agent implementation
- Tool abstraction layer (Calculator, Search, Web Browse)
- LLM interface with provider switching
- Basic error handling
- Hybrid architecture (Groq + Google)
- Reflexion orchestrator with trial loops
- Battle-hardened parser (markdown/chatter stripping)
- Benchmarking suite with cost tracking
- Exponential backoff for API resilience
- Long-Term Memory: RAG with Pinecone/Weaviate for cross-session learning
- Adaptive Routing: Complexity classifier (embedding-based) to choose ReAct vs Reflexion
- Semantic Caching: Redis layer to avoid redundant LLM calls
- Tool Output Compression: Use Llama 3.2-3B to summarize verbose tool responses
- Observability: OpenTelemetry tracing for latency profiling
- Horizontal scaling: Multi-agent task distribution
- Tool marketplace: Dynamic tool loading (50+ tools)
- Automated evaluation: DSPy-based scoring (replace manual validation)
- Deployment: Docker + FastAPI + Kubernetes manifests
- Security: Input sanitization, rate limiting, audit logs
Philosophy: Zero abstraction debt.
LangChain is powerful but introduces:
- Magic: Hidden context window management can cause silent truncation
- Versioning Hell: Breaking changes between 0.0.x releases
- Debugging Opacity: Nested callbacks make error traces unreadable
Our approach:

```python
# Explicit control over every token
messages = [
    {"role": "system", "content": system_prompt},
    *history,  # Full visibility into what's in context
    {"role": "user", "content": user_query},
]
response = llm.chat(messages, max_tokens=1000)
```

Trade-off: We write more boilerplate but gain:
- Precise token budgets (critical for cost optimization)
- Deterministic behavior (no "framework magic" failures)
- Easy migration between providers (uniform interface; sketched below)
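To make the uniform-interface claim concrete, here is a minimal sketch of how the provider abstraction could be shaped. `LLMFactory.create` appears earlier in this README; the registry mechanics and the `BaseLLM.chat` signature shown here are illustrative assumptions (see `src/llm/factory.py` for the real thing).

```python
from abc import ABC, abstractmethod


class BaseLLM(ABC):
    """Every provider exposes the same chat() signature, so agents never care who is behind it."""

    @abstractmethod
    def chat(self, messages: list[dict], max_tokens: int = 1000) -> str:
        ...


class LLMFactory:
    """Sketch of a provider registry behind LLMFactory.create()."""

    _registry: dict[str, type] = {}

    @classmethod
    def register(cls, name: str, impl: type) -> None:
        cls._registry[name] = impl

    @classmethod
    def create(cls, provider: str) -> BaseLLM:
        # e.g. provider = "groq", "google", or "ollama"
        return cls._registry[provider]()
```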
Core Idea: If an agent fails, make it explain why it failed and how to fix it, then retry with that knowledge.
Pseudocode:

```python
memory = []
for trial in range(max_trials):
    trajectory = actor.run(task, memory)        # ReAct loop
    if evaluator.is_correct(trajectory):
        return trajectory.final_answer
    reflection = reflector.analyze(trajectory)  # Root cause
    memory.append(reflection)                   # Inject into next trial
return "Failed after all trials"
```

Example Reflection:
```
Trial 1 failed because:
- I searched "King Charles age" instead of "birth date"
- Age changes yearly; birth date is static
Next trial: Search for exact birth dates to enable comparison
```
Why This Works: The reflection acts as a "debugger" that teaches the agent from its own mistakes.
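One way the Memory component can inject those reflections into the next trial is by prepending them to the Actor's system prompt. A hedged sketch; `build_actor_prompt` is a hypothetical helper, not the repository's actual function name.

```python
def build_actor_prompt(base_system_prompt: str, memory: list[str]) -> str:
    """Sketch of memory injection: prepend past reflections so the next trial starts with the lessons learned."""
    if not memory:
        return base_system_prompt
    lessons = "\n".join(f"- {m}" for m in memory)
    return (
        f"{base_system_prompt}\n\n"
        "Reflections from previous failed attempts (do not repeat these mistakes):\n"
        f"{lessons}"
    )
```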
Contributions welcome! Focus areas:
- New Tools: Add domain-specific tools (SQL, file I/O, API clients)
- Evaluators: Improve correctness checking (semantic similarity, fact verification)
- Benchmarks: Add new test cases (especially adversarial ones)
Process:
- Fork the repo
- Create a feature branch (`git checkout -b feature/amazing-tool`)
- Write tests in `tests/` (we use pytest)
- Commit with conventional commits (`feat: add SQL query tool`)
- Open a PR with benchmark results
Papers:
- ReAct: Synergizing Reasoning and Acting in Language Models (Yao et al., 2022)
- Reflexion: Language Agents with Verbal Reinforcement Learning (Shinn et al., 2023)
Projects:
MIT License - See LICENSE for full details.
TL;DR: Use this commercially, modify freely, just keep the license notice.
Attribution:
If you use, modify, or redistribute any part of this project, please retain attribution to the original author:
AI Agent Framework – Hybrid Architecture
Author: Riya Sangwan
Repository: https://github.com/ria-19/ai_agent
Author: Riya Sangwan
Email: riya.sangwandec19@example.com
LinkedIn: linkedin.com/in/riyasangwan/
Looking for Work? I'm currently seeking roles in Applied AI/ML Engineering at startups, unicorns, and top-tier tech companies. This project demonstrates:
- Production ML system design
- LLM orchestration & prompt engineering
- Cost-performance optimization
- API resilience & error handling
- Technical writing & documentation
Open to: Full-time roles, contract work, or technical consulting in the AI agent space.
- Groq team for LPU access and responsive API support
- Google AI Studio for generous free tier
- The open-source community for inspiration
Star this repo if you found it useful! ⭐
Watch releases to get notified about v0.3 (Memory + Routing).

