AI Agent Framework - Hybrid Architecture

A production-grade AI agent system implementing ReAct (Reasoning + Acting) and Reflexion (Self-Correction) architectures from first principles.
Built for modularity, low latency, and token-cost efficiency.

Python 3.9+ | MIT License | Status: v0.2 (Stable)


🎯 What This Is

An autonomous AI agent that combines:

  • System 1 Thinking (ReAct): Fast, reactive tool execution for straightforward tasks
  • System 2 Thinking (Reflexion): Self-reflective retry strategies for complex, multi-step problems

Unlike black-box frameworks (LangChain, AutoGPT), this project gives you full control over:

  • Context window management
  • Prompt construction logic
  • Error handling and retry strategies
  • Cost-latency trade-offs

Why Build This? To bridge the gap between basic LLM tool use and robust, fault-tolerant problem-solving while maintaining transparency and control over every architectural decision.


🎬 See It In Action

Example 1: Fast ReAct Agent - Financial Calculation

ReAct Agent Demo (demo.gif)

Task: "What is 15% of Apple's current stock price?"

Agent: ReAct (System 1 - Fast execution)

What happens:

  • 🔍 Searches web for Apple stock price
  • 🧮 Uses calculator to compute 15%
  • ⚡ Completes in ~4.4 seconds

Execution Time: ~4.4s | 💰 Est. Cost: $0.008 | 🔄 Trials: 1

Use case: Quick, straightforward tasks where speed matters


Example 2: Smart Reflexion Agent - Current Events

Reflexion Agent Demo (demo1.gif)

Task: "Who won the latest Formula 1 race?"

Agent: Reflexion (System 2 - Self-correcting)

What happens:

  1. Trial 1: Actor searches for latest F1 race results
  2. Evaluator: Validates the information is current and accurate
  3. Success on first try - Answer verified and returned

Execution Time: ~10.5s | 💰 Est. Cost: $0.014 | 🔄 Trials: 1 ✅

Note: Even though Reflexion succeeded on the first trial, the evaluation step still ran to verify correctness. That extra check is why it's slower than pure ReAct but more reliable.

Use case: Complex queries requiring verification and multi-step reasoning


Comparison: ReAct vs Reflexion

| Metric      | ReAct (demo.gif) | Reflexion (demo1.gif)   | Trade-off                     |
| ----------- | ---------------- | ----------------------- | ----------------------------- |
| Speed       | 4.4s             | 10.5s                   | Reflexion ~2.4× slower        |
| Cost        | $0.008           | $0.014                  | Reflexion +75% costlier       |
| Reliability | Medium           | High                    | Reflexion has self-correction |
| Best For    | Simple queries   | Complex/ambiguous tasks |                               |

What's Happening Behind the Scenes

ReAct Flow (demo.gif):

  1. Actor (Groq/Llama 3.3): Executes Thought-Action-Observation loop
  2. Fast tool execution (<2s per step)
  3. Returns answer when task complete

Reflexion Flow (demo1.gif):

  1. Actor: Executes ReAct loop
  2. Evaluator (Gemini 2.5): Judges if trajectory is correct
  3. Reflector: On failure, performs root cause analysis
  4. Memory: Injects lessons into next retry attempt
  5. Returns verified answer or admits failure after max trials

🏗️ System Architecture: The "Dual-Brain" Design

We treat model selection as an architectural decision, separating fast execution from deep reasoning.

   ┌─────────────────────────────────────────────────────────────┐
   │                   REFLEXION ORCHESTRATOR                    │
   │          (Manages trial loops & memory injection)           │
   └──────────────────────────────┬──────────────────────────────┘
                                  │
                                  ▼
                       ┌─────────────────────┐
                       │   Actor (ReAct)     │◄───────┐
                       │  Groq/Llama 3.3     │        │
                       │  (Fast & Cheap)     │        │
                       └──────────┬──────────┘        │
                                  │                   │
                                  │ Trajectory        │
                                  ▼                   │
                       ┌─────────────────────┐        │
                       │     Evaluator       │        │
                       │   Gemini 2.5 Flash  │        │
                       │  (Slow & Smart)     │        │
                       └──────────┬──────────┘        │
                                  │                   │
                  ┌───────────────┴───────────────┐   │
              (Success)                       (Failure)│
                  │                               │   │
                  ▼                               ▼   │
           ┌────────────┐              ┌───────────────────┐
           │ Return Ans │              │    Reflector      │
           └────────────┘              │ Root Cause + Plan │
                                       └─────────┬─────────┘
                                                 │
                                                 ▼
                                       ┌───────────────────┐
                                       │ Episodic Memory   │
                                       │ (Injected Next Try)│
                                       └───────────────────┘

Key Components

| Component | Role                                              | Implementation                         |
| --------- | ------------------------------------------------- | -------------------------------------- |
| Actor     | Executes Thought-Action-Observation loops         | Groq API (Llama 3.3-70B, 300+ tok/s)   |
| Evaluator | Judges success/failure of trajectories            | Google Gemini 2.5 (native SDK)         |
| Reflector | Performs root cause analysis on failures          | Structured prompt templates            |
| Memory    | Stores lessons across retry attempts              | Episodic buffer (v0.3: vector DB)      |
| Tools     | External capabilities (search, math, web scraping)| Abstract base class for extensibility  |
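
The Tools row is the main extension point. As a rough sketch of what an abstract tool interface could look like (class and attribute names here are illustrative, not the actual contents of src/tools/base.py):

from abc import ABC, abstractmethod

class BaseTool(ABC):
    """Illustrative sketch of an abstract tool interface."""

    name: str         # identifier the LLM writes in its Action step
    description: str  # one-liner injected into the system prompt

    @abstractmethod
    def execute(self, **kwargs) -> str:
        """Run the tool and return a string observation for the agent."""

class Calculator(BaseTool):
    name = "calculator"
    description = "Evaluate a basic arithmetic expression."

    def execute(self, expression: str) -> str:
        # A real implementation should parse the expression safely rather than use eval.
        return str(eval(expression, {"__builtins__": {}}, {}))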

🛡️ Production Engineering: Lessons from the Trenches

Building an agent is easy. Making it reliable is hard. Here's what we learned shipping this to real workloads.

1. The Model Drift Problem

Challenge: Migrating from local Llama to cloud APIs revealed "chatter"—models would output <think> blocks, markdown fences, or conversational asides that broke JSON parsers.

War Story: On Oct 15, 2025, we hit a 100% crash rate when Llama 3.3 started prefixing responses with "Let me think through this..." instead of valid JSON.

Solution: Built a battle-hardened parser with regex stripping:

import json
import re

# Strips: <think> blocks, ```json fences, markdown, extra whitespace
def extract_action(text: str) -> dict:
    text = re.sub(r'<think>.*?</think>', '', text, flags=re.DOTALL)  # drop reasoning chatter
    text = re.sub(r'```json\s*|\s*```', '', text)  # drop markdown fences
    # ... 6 more sanitization layers
    return json.loads(text.strip())  # parse the cleaned payload into a dict
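
For example, chatter-wrapped model output comes back as a clean dict (the action/input schema shown here is illustrative):

raw = '<think>I should search first.</think>```json\n{"action": "search", "input": "AAPL stock price"}\n```'
print(extract_action(raw))  # -> {'action': 'search', 'input': 'AAPL stock price'}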

Result: Zero parser crashes in 2000+ test runs across 4 model providers.


2. The Context Window Firehose

Challenge: Tools like web_browse would dump 50k+ tokens of raw HTML into context, causing:

  • "Lost in the Middle" syndrome (LLM ignores key facts)
  • Cost explosions ($0.50/query → $4.20/query)
  • Context limit crashes on 8k window models

Solution: Implemented strict truncation at the tool layer:

def execute(self, url: str) -> str:
    html = fetch(url)        # the tool's own HTTP helper
    text = strip_tags(html)  # reduce raw HTML to visible text
    return text[:5000]       # Hard limit: 5k chars

Evolution (v0.3 Roadmap):

  • Use a cheap 8B model to summarize tool output before feeding it to the main agent (sketched below)
  • Adaptive truncation based on context budget
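
As a rough illustration of the summarization idea, reusing the llm.chat interface shown later in this README (entirely hypothetical; nothing like this ships in v0.2):

def compress_observation(raw: str, llm, budget_chars: int = 2000) -> str:
    """Hypothetical v0.3 sketch: summarize verbose tool output with a cheap model."""
    if len(raw) <= budget_chars:
        return raw  # small outputs pass through untouched
    prompt = (
        "Summarize the following tool output, keeping only facts relevant to the task:\n\n"
        + raw[:20000]  # cap what we send to the summarizer itself
    )
    return llm.chat([{"role": "user", "content": prompt}], max_tokens=400)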

Metrics:

  • Cost reduced by 73% on web-heavy tasks
  • Avg context usage: 6.2k → 2.8k tokens

3. API Reliability & Dependency Chaos

Challenge:

  • Hit 429 rate limits on Groq during peak hours
  • Google silently deprecated Gemini 1.5 endpoints (404 errors)
  • Local Ollama server crashes on OOM

Solutions:

  1. Exponential backoff with jitter (3 retries, max 16s wait) - see the sketch below
  2. Provider abstraction layer for hot-swapping:
    llm = LLMFactory.create(provider="groq")  # Or "google", "ollama"
  3. Native SDK migration for Google (decoupled from the OpenAI compatibility shim)

Learning: Never assume external APIs are stable. Always have a fallback path.
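
A minimal sketch of that retry policy, matching the "3 retries, max 16s wait" setting above (function and parameter names are illustrative, not the project's actual ones):

import random
import time

def call_with_backoff(fn, max_retries: int = 3, base_delay: float = 2.0, max_delay: float = 16.0):
    """Retry a flaky API call with exponential backoff plus jitter (illustrative sketch)."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:  # in practice, catch only the provider's rate-limit / transient errors
            if attempt == max_retries:
                raise  # out of retries: surface the error to the caller
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay + random.uniform(0, 1))  # jitter prevents synchronized retries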


4. The Cost-Intelligence Trade-off

Benchmark Discovery: Running full Reflexion on simple tasks is wasteful.

| Task            | ReAct (System 1) | Reflexion (System 2) | Overhead                   |
| --------------- | ---------------- | -------------------- | -------------------------- |
| 2 + 2           | 1.7s, $0.001     | 7.3s, $0.004         | 4.3× slower, 4× costlier   |
| Complex finance | 32.8s, $0.02     | 43.7s, $0.03         | 1.3× slower, 1.5× costlier |

Key Insight: The evaluation step adds ~2-11s of latency. For production, we need adaptive routing:

  • Route simple queries → Fast ReAct path
  • Route ambiguous/complex queries → Reflexion path

Planned for v0.3: Task complexity classifier using embedding similarity.
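
Purely as an illustration of what that router might look like (nothing here exists yet; embed stands in for any sentence-embedding function, and the reference task lists are made up):

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

SIMPLE_EXAMPLES = ["what is 2 + 2", "convert 10 km to miles"]
COMPLEX_EXAMPLES = ["compare the revenue growth of two companies", "who won the latest F1 race"]

def route(task: str, embed) -> str:
    """Pick "react" for simple-looking tasks and "reflexion" for complex ones."""
    task_vec = embed(task)
    simple_score = max(cosine(task_vec, embed(e)) for e in SIMPLE_EXAMPLES)
    complex_score = max(cosine(task_vec, embed(e)) for e in COMPLEX_EXAMPLES)
    return "react" if simple_score >= complex_score else "reflexion"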


📊 Performance Benchmarks (v0.2)

Tested on M1 MacBook Air, averaging 10 runs per task.

Latency Comparison

| Metric          | ReAct (System 1) | Reflexion (System 2) | Notes                            |
| --------------- | ---------------- | -------------------- | -------------------------------- |
| Simple math     | 1.71s            | 3.49s                | Reflexion adds ~2s eval overhead |
| Complex finance | 32.77s           | 43.67s               | Reflexion adds ~11s eval overhead|
| Multi-step web  | 8.2s             | 14.6s                | Network I/O dominates            |

Robustness Comparison

| Error Type      | ReAct Recovery           | Reflexion Recovery      |
| --------------- | ------------------------ | ----------------------- |
| Tool misuse     | ❌ (crashes)             | ✅ (reflects + retries) |
| Malformed JSON  | ❌ (parser error)        | ✅ (hardened parser)    |
| API rate limits | ⚠️ (exponential backoff) | ⚠️ (same)               |

Cost Analysis (Estimated)

| Task Complexity     | ReAct   | Reflexion | Breakdown                      |
| ------------------- | ------- | --------- | ------------------------------ |
| Simple (1-2 steps)  | ~$0.001 | ~$0.003   | Eval costs ~2× input tokens    |
| Medium (3-5 steps)  | ~$0.008 | ~$0.015   | Memory injection adds context  |
| Complex (6+ steps)  | ~$0.025 | ~$0.040   | Multiple retry loops possible  |

Recommendation: Use Reflexion for high-stakes tasks where correctness > speed. Use ReAct for low-latency, high-volume workloads.


🚀 Quick Start

Prerequisites

  • Python 3.9+
  • A Groq API key and a Google AI Studio API key (see Configuration below)

Installation

# Clone repository
git clone https://github.com/ria-19/ai_agent.git
cd ai_agent

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Configuration

Create a .env file in the project root:

GROQ_API_KEY=gsk_your_key_here
GOOGLE_API_KEY=AIza_your_key_here
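
The agents then read these keys from the environment. A minimal sketch, assuming the .env file is loaded with python-dotenv (not listed in requirements.txt, so treat this as illustrative):

import os
from dotenv import load_dotenv  # assumption: python-dotenv is installed

load_dotenv()  # pull .env values into the process environment
groq_key = os.environ["GROQ_API_KEY"]      # fails early if a key is missing
google_key = os.environ["GOOGLE_API_KEY"]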

Usage Examples

1. Fast ReAct Agent (For Simple Tasks)

python run_agent.py --agent react "What is 15% of Apple's current stock price?"

2. Smart Reflexion Agent (For Complex Logic)

python run_agent.py --agent reflexion "Who is older: King Charles III or Donald Trump?"

3. Run A/B Benchmarks

python run_comparison.py

Outputs detailed latency/cost metrics to benchmarks/results.json.


📁 Project Structure

.
├── README.md                # You are here
├── requirements.txt         # Dependencies (requests, groq, google-generativeai)
├── .env.example             # Template for API keys
├── run_agent.py             # CLI entry point
├── run_comparison.py        # A/B testing harness
├── benchmarks/              
│   └── results/             # Performance logs (JSON + charts)
└── src/
    ├── architectures/       # Core agent logic
    │   ├── react.py         # ReAct loop (Actor)
    │   └── reflexion.py     # Reflexion orchestrator
    ├── components/          # Pluggable strategy modules
    │   ├── evaluators/      # Trajectory success/failure judges
    │   ├── reflectors/      # Root cause analysis generators
    │   └── memory/          # Episodic memory (key-value store)
    ├── llm/                 # Unified LLM interface
    │   ├── factory.py       # Provider abstraction
    │   ├── groq_interface.py    # Llama 3.3 / Qwen (OpenAI SDK)
    │   └── google_interface.py  # Gemini 2.5 (Native SDK)
    ├── tools/               # External capabilities
    │   ├── base.py          # Abstract tool interface
    │   ├── general_tools.py # Calculator, search, datetime
    │   ├── web_tools.py     # Web scraper (BeautifulSoup)
    │   └── finance_tools.py # Stock price API (Alpha Vantage)
    └── utils/
        ├── parser.py        # Hardened JSON extractor
        └── logger.py        # Structured logging

🗺️ Roadmap

✅ v0.1: Foundation

  • Sequential ReAct agent implementation
  • Tool abstraction layer (Calculator, Search, Web Browse)
  • LLM interface with provider switching
  • Basic error handling

✅ v0.2: Self-Correction

  • Hybrid architecture (Groq + Google)
  • Reflexion orchestrator with trial loops
  • Battle-hardened parser (markdown/chatter stripping)
  • Benchmarking suite with cost tracking
  • Exponential backoff for API resilience

🚧 v0.3: Intelligence Layer

  • Long-Term Memory: RAG with Pinecone/Weaviate for cross-session learning
  • Adaptive Routing: Complexity classifier (embedding-based) to choose ReAct vs Reflexion
  • Semantic Caching: Redis layer to avoid redundant LLM calls
  • Tool Output Compression: Use Llama 3.2-3B to summarize verbose tool responses
  • Observability: OpenTelemetry tracing for latency profiling

📅 v0.4: Production Hardening

  • Horizontal scaling: Multi-agent task distribution
  • Tool marketplace: Dynamic tool loading (50+ tools)
  • Automated evaluation: DSPy-based scoring (replace manual validation)
  • Deployment: Docker + FastAPI + Kubernetes manifests
  • Security: Input sanitization, rate limiting, audit logs

🎓 Technical Deep Dives

Why No LangChain?

Philosophy: Zero abstraction debt.

LangChain is powerful but introduces:

  • Magic: Hidden context window management can cause silent truncation
  • Versioning Hell: Breaking changes between 0.0.x releases
  • Debugging Opacity: Nested callbacks make error traces unreadable

Our approach:

# Explicit control over every token
messages = [
    {"role": "system", "content": system_prompt},
    *history,  # Full visibility into what's in context
    {"role": "user", "content": user_query}
]
response = llm.chat(messages, max_tokens=1000)

Trade-off: We write more boilerplate but gain:

  • Precise token budgets (critical for cost optimization)
  • Deterministic behavior (no "framework magic" failures)
  • Easy migration between providers (uniform interface)
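
The "uniform interface" amounts to every provider wrapper exposing the same chat method used above. A sketch under that assumption (EchoLLM and create_llm are toy names; the real classes live in src/llm/):

from abc import ABC, abstractmethod

class LLMInterface(ABC):
    """Illustrative sketch of the uniform provider contract."""

    @abstractmethod
    def chat(self, messages: list[dict], max_tokens: int = 1000) -> str:
        """Send a chat request and return the model's text reply."""

class EchoLLM(LLMInterface):
    """Toy provider used only to demonstrate the contract."""

    def chat(self, messages: list[dict], max_tokens: int = 1000) -> str:
        return messages[-1]["content"][:max_tokens]

def create_llm(provider: str) -> LLMInterface:
    # The project's LLMFactory.create(provider=...) plays this role for groq/google/ollama.
    providers = {"echo": EchoLLM}
    return providers[provider]()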

The Reflexion Algorithm Explained

Core Idea: If an agent fails, make it explain why it failed and how to fix it, then retry with that knowledge.

Pseudocode:

memory = []
for trial in range(max_trials):
    trajectory = actor.run(task, memory)  # ReAct loop
    
    if evaluator.is_correct(trajectory):
        return trajectory.final_answer
    
    reflection = reflector.analyze(trajectory)  # Root cause
    memory.append(reflection)  # Inject into next trial
    
return "Failed after all trials"

Example Reflection:

Trial 1 failed because:
- I searched "King Charles age" instead of "birth date"
- Age changes yearly; birth date is static

Next trial: Search for exact birth dates to enable comparison

Why This Works: The reflection acts as a "debugger" that teaches the agent from its own mistakes.


🤝 Contributing

Contributions welcome! Focus areas:

  • New Tools: Add domain-specific tools (SQL, file I/O, API clients)
  • Evaluators: Improve correctness checking (semantic similarity, fact verification)
  • Benchmarks: Add new test cases (especially adversarial ones)

Process:

  1. Fork the repo
  2. Create a feature branch (git checkout -b feature/amazing-tool)
  3. Write tests in tests/ (we use pytest)
  4. Commit with conventional commits (feat: add SQL query tool)
  5. Open a PR with benchmark results

📚 References & Inspiration

Papers:

  • ReAct: Synergizing Reasoning and Acting in Language Models (Yao et al., 2022)
  • Reflexion: Language Agents with Verbal Reinforcement Learning (Shinn et al., 2023)

Projects:


📝 License

MIT License - See LICENSE for full details.

TL;DR: Use this commercially, modify freely, just keep the license notice.

Attribution:
If you use, modify, or redistribute any part of this project, please retain attribution to the original author:

AI Agent Framework – Hybrid Architecture
Author: Riya Sangwan
Repository: https://github.com/ria-19/ai_agent


💬 Contact

Author: Riya Sangwan
Email: riya.sangwandec19@example.com
LinkedIn: linkedin.com/in/riyasangwan/

Looking for Work? I'm currently seeking roles in Applied AI/ML Engineering at startups, unicorns, and top-tier tech companies. This project demonstrates:

  • Production ML system design
  • LLM orchestration & prompt engineering
  • Cost-performance optimization
  • API resilience & error handling
  • Technical writing & documentation

Open to: Full-time roles, contract work, or technical consulting in the AI agent space.


🙏 Acknowledgments

  • Groq team for LPU access and responsive API support
  • Google AI Studio for generous free tier
  • The open-source community for inspiration

Star this repo if you found it useful! ⭐
Watch releases to get notified about v0.3 (Memory + Routing).
