Empirical research framework for comparing iterative reflection vs model capability in AI systems
- Quick Setup Guide - Get running in 5 minutes
- API Usage Examples - Test the system immediately
- 📖 Comprehensive Build Summary - Complete technical documentation (1,240 lines)
- 📖 Use Case Extension Guide - Beginner-friendly development guide (2,000+ lines)
- 📖 Research Design - 4-week empirical study proposal
- 📖 Reflection Pattern Theory - Producer-Critic implementation guide
- 📖 Port Allocation Strategy - Container organization
- 📖 Environment Setup - Configuration template
- 📖 Experiment Config - Research experiment setup
- 📖 Complete Codebase Map - Detailed file-by-file navigation guide
- base_agent.py - Abstract agent class with ADK integration
- base_orchestrator.py - Reflection workflow management
- base_evaluator.py - Multi-dimensional quality assessment
- agents.py - Producer & Critic agents (GCP architect + reviewer)
- config.py - Quality dimensions & test scenarios
- tools/cloud_pricing.py - Multi-cloud pricing tools
- docker/ - Independent container configuration
- main.py - Main orchestrator (Port 8000)
- use_case_server.py - Use case specific server (Port 8001+)
- settings.py - Global framework configuration
- experiments/ - Research experiment definitions
This framework addresses the fundamental research question: Can iterative reflection with lower-capability models match or exceed the performance of single-pass higher-capability models while maintaining cost efficiency?
Default Model: gemini-2.5-flash-lite
- No thinking capability by default - provides clean baseline without internal reasoning
- Lower cost - perfect for testing cost-effectiveness of reflection
- Clear signal - when reflection improves Flash-Lite output, it demonstrates the power of iterative improvement
Research Hypothesis: Flash-Lite + Reflection ≥ Pro (single-pass) in quality while maintaining cost efficiency.
| Approach | Model | Mode | Quality Score | Processing Time | Response Length | Iterations |
|---|---|---|---|---|---|---|
| Flash-Lite + Baseline | gemini-2.5-flash-lite | baseline | 0.5 | 131.4s | 2,239 chars | 0 |
| Flash-Lite + Reflection | gemini-2.5-flash-lite | reflection | 0.5 | 203.0s | 19,674 chars | 3 |
| Pro + Baseline | gemini-2.5-pro | baseline | 0.5 | 211.4s | 16,008 chars | 0 |
| Pro + Reflection | gemini-2.5-pro | reflection | 0.5 | 259.6s | 15,116 chars | 1 |
- ✅ Quality Consistency: All approaches achieve a quality score of 0.5
- ✅ Reflection Efficiency: Pro model converges in 1 iteration vs 3 for Flash-Lite
- ✅ Content Volume: Reflection produces significantly more comprehensive output
- ✅ Model Performance: Pro model generates more detailed responses (16K vs 2K chars)
- ✅ Smart Termination: Critic approval prevents unnecessary iterations
- ✅ System Stability: All models and modes working reliably
- Modular Architecture: Pluggable use cases with consistent research infrastructure
- Producer-Critic Pattern: Implements reflection patterns from AI research (see the sketch after this list)
- Multi-dimensional Evaluation: Comprehensive quality assessment across multiple criteria
- Cost Tracking: Detailed analysis of API costs and token usage
- Expert Integration: Framework for incorporating domain expert evaluations
- Statistical Analysis: Built-in significance testing and comparative analysis
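To make the Producer-Critic pattern concrete, here is a minimal sketch of the reflection loop. The class and method names (`Producer.generate`, `Critic.review`, `reflect`) are illustrative assumptions, not the framework's actual API, which lives in `framework/base_orchestrator.py` and the per-use-case `agents.py`.

```python
# Minimal sketch of the producer-critic reflection loop (names are illustrative).
from dataclasses import dataclass

@dataclass
class Critique:
    approved: bool
    feedback: str
    quality_score: float

class Producer:
    def generate(self, task: str, feedback: str | None = None) -> str:
        # Call the underlying model (e.g. gemini-2.5-flash-lite), optionally
        # conditioning the prompt on the critic's previous feedback.
        raise NotImplementedError

class Critic:
    def review(self, task: str, draft: str) -> Critique:
        # Score the draft against the quality dimensions and decide whether
        # another iteration is worth the extra cost.
        raise NotImplementedError

def reflect(task: str, producer: Producer, critic: Critic,
            max_iterations: int = 3) -> tuple[str, int]:
    """Run the producer-critic loop; return the final draft and iteration count."""
    draft = producer.generate(task)
    for iteration in range(1, max_iterations + 1):
        critique = critic.review(task, draft)
        if critique.approved:  # smart termination: critic approval ends the loop early
            return draft, iteration
        draft = producer.generate(task, feedback=critique.feedback)
    return draft, max_iterations
```

The early exit on critic approval mirrors the "smart termination" behavior noted in the findings above.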
- Python 3.11+
- Docker and Docker Compose
- Google AI API key
- Clone and set up the environment:
git clone <repository-url>
cd agentic-research-framework
cp env.example .env
# Edit .env with your Google API key (REQUIRED)

- Install dependencies with UV:
# Install UV if you don't have it
curl -LsSf https://astral.sh/uv/install.sh | sh
# Install project dependencies
uv sync --all-extras

- Start the research framework:
# Option 1: All use cases with Docker
docker-compose up -d
# Option 2: Specific use case only
docker-compose up -d system_design
# Option 3: With research environments
docker-compose --profile research up -d
# Option 4: Direct Python (single use case)
USE_CASE=system_design API_PORT=8001 uvicorn api.use_case_server:app --reload

- Access the APIs:
- Main Orchestrator: http://localhost:8000/docs
- System Design: http://localhost:8001/docs (dedicated container)
- System Design Research: http://localhost:8891 (Jupyter Lab)
- Health Checks: All endpoints have /health for monitoring
# Test baseline mode (single-pass with quality evaluation)
curl -X POST "http://localhost:8001/chat" \
-H "Content-Type: application/json" \
-d '{
"message": "Design a web application for 10k users",
"mode": "baseline",
"model": "gemini-2.5-flash-lite"
}'
# Test reflection mode (producer-critic iterative improvement)
curl -X POST "http://localhost:8001/chat" \
-H "Content-Type: application/json" \
-d '{
"message": "Design a web application for 10k users",
"mode": "reflection",
"model": "gemini-2.5-flash-lite",
"reflection_iterations": 3
}'
# Test different models for comparison
curl -X POST "http://localhost:8001/chat" \
-H "Content-Type: application/json" \
-d '{
"message": "Design a scalable e-commerce platform",
"mode": "baseline",
"model": "gemini-2.5-pro"
}'
# Compare modes side-by-side (system design container)
curl -X POST "http://localhost:8001/chat/compare" \
-H "Content-Type: application/json" \
-d '{
"message": "Design a scalable e-commerce platform",
"model": "gemini-2.5-flash-lite"
}'
# Health checks and system info
curl "http://localhost:8001/health" # System health
curl "http://localhost:8001/info" # Model and capability info// Chat with system design use case (dedicated container)
const response = await fetch('http://localhost:8001/chat', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
message: "Design a system for Black Friday traffic",
mode: "reflection",
model: "gemini-2.5-flash-lite",
reflection_iterations: 3
})
});
const result = await response.json();
console.log(result.response); // Agent's system design
console.log(result.quality_score); // Quality evaluation
console.log(result.use_case); // "system_design"
console.log(result.resource_usage); // Container and isolation info
// Compare approaches in isolated environment
const comparison = await fetch('http://localhost:8001/chat/compare', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
message: "Design a scalable microservices architecture"
})
});

- framework/base_agent.py: Abstract base class for all agents
- framework/base_orchestrator.py: Workflow orchestration with reflection patterns
- framework/base_evaluator.py: Multi-dimensional quality evaluation
- use_cases/system_design/: Cloud architecture design with cost comparison
- Agents: Producer (generates designs), Critic (evaluates and suggests improvements)
- Tools: GCP/AWS/Azure pricing, security analysis, scaling calculations
- Evaluator: 6-dimensional quality assessment
- research/experiment_orchestrator.py: Systematic experiment execution
- research/results_analyzer.py: Statistical analysis and visualization
- Cost tracking and performance monitoring
The system design use case specializes in:
- GCP-focused architecture with multi-cloud cost comparison
- Scalability analysis for traffic spikes (e.g., Black Friday scenarios)
- Security and compliance assessment (PCI DSS, GDPR, SOX)
- Cost optimization across cloud providers (see the pricing-tool sketch after this list)
- Expert evaluation integration for validation
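For illustration, a domain tool in the spirit of tools/cloud_pricing.py might be shaped like the sketch below. The function name, provider list, and per-vCPU rates are placeholders invented for this example, not the repository's actual implementation or real price quotes.

```python
# Hypothetical shape of a multi-cloud pricing tool the Producer agent could call.
from typing import TypedDict

class PriceEstimate(TypedDict):
    provider: str
    monthly_usd: float
    assumptions: str

# Illustrative on-demand rates in USD per vCPU-hour; NOT real quotes.
_RATES = {"gcp": 0.033, "aws": 0.034, "azure": 0.036}

def estimate_compute_cost(provider: str, vcpus: int,
                          hours_per_month: int = 730) -> PriceEstimate:
    """Rough monthly compute estimate for cross-provider cost comparison."""
    rate = _RATES[provider.lower()]
    return PriceEstimate(
        provider=provider,
        monthly_usd=round(rate * vcpus * hours_per_month, 2),
        assumptions=f"{vcpus} vCPUs, {hours_per_month} h/month, on-demand",
    )

if __name__ == "__main__":
    # Compare a 16-vCPU workload across providers
    for p in ("gcp", "aws", "azure"):
        print(estimate_compute_cost(p, vcpus=16))
```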
- Technical Accuracy (25%): Correctness of technical decisions
- Cost Optimization (20%): Cost efficiency across providers
- Security Posture (20%): Security best practices and compliance
- Scalability Design (15%): Ability to handle growth and peak loads
- Completeness (10%): Coverage of all requirements
- Clarity (10%): Documentation and explanation quality
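Using the weights listed above, the composite score can be computed as a simple weighted sum, as in the sketch below. The dimension keys and aggregation logic are illustrative; the actual evaluator in framework/base_evaluator.py may differ.

```python
# Sketch of aggregating the six-dimensional assessment into one weighted score.
WEIGHTS = {
    "technical_accuracy": 0.25,
    "cost_optimization": 0.20,
    "security_posture": 0.20,
    "scalability_design": 0.15,
    "completeness": 0.10,
    "clarity": 0.10,
}

def overall_quality(dimension_scores: dict[str, float]) -> float:
    """Combine per-dimension scores in [0, 1] into one weighted score."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    return sum(WEIGHTS[dim] * dimension_scores.get(dim, 0.0) for dim in WEIGHTS)

# Example: a technically strong but under-documented design
print(round(overall_quality({
    "technical_accuracy": 0.8, "cost_optimization": 0.6, "security_posture": 0.7,
    "scalability_design": 0.6, "completeness": 0.5, "clarity": 0.3,
}), 2))  # 0.63
```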
- Models: Test across Gemini Flash, Flash-Lite, and Pro
- Reflection Configs: Baseline (0), Light (2-3), Deep (5+) iterations
- Scenarios: Simple web apps to complex enterprise systems
- Evaluation: Multi-dimensional quality + cost analysis
- Repetitions: Multiple runs for statistical significance
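The design above implies a full factorial grid of runs. The sketch below simply enumerates that grid; the scenario labels and run-specification fields are assumptions for illustration, and the real execution logic lives in research/experiment_orchestrator.py.

```python
# Sketch of the experiment grid implied by the design above (labels are placeholders).
from itertools import product

MODELS = ["gemini-2.5-flash", "gemini-2.5-flash-lite", "gemini-2.5-pro"]
REFLECTION_CONFIGS = {"baseline": 0, "light": 2, "deep": 5}
SCENARIOS = ["simple_web_app", "black_friday_ecommerce", "enterprise_platform"]
REPETITIONS = 3

def enumerate_runs():
    """Yield one run spec per model x reflection config x scenario x repetition."""
    for model, (label, iterations), scenario, rep in product(
        MODELS, REFLECTION_CONFIGS.items(), SCENARIOS, range(REPETITIONS)
    ):
        yield {
            "model": model,
            "mode": "baseline" if iterations == 0 else "reflection",
            "reflection_iterations": iterations,
            "scenario": scenario,
            "repetition": rep,
            "config_label": label,
        }

print(sum(1 for _ in enumerate_runs()))  # 3 x 3 x 3 x 3 = 81 runs
```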
- Quality: At what point does reflection-enhanced Flash match Pro-level quality?
- Efficiency: What are the cost and speed trade-offs?
- Task Dependency: Which types of tasks benefit most from reflection?
- Convergence: How many reflection cycles are optimal?
- Create use case directory: use_cases/your_use_case/
- Implement required components:
  - config.py: Use case configuration and test scenarios
  - agents.py: Producer and Critic agents
  - orchestrator.py: Workflow orchestration
  - evaluator.py: Quality evaluation
  - tools/: Domain-specific tools
- Follow naming conventions for automatic discovery
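As a starting point, a new use case's config.py might look roughly like the sketch below. The field names and the CONFIG entry point are guesses modelled on the system design use case (whose config.py holds quality dimensions and test scenarios); check the existing use case for the exact conventions automatic discovery expects.

```python
# use_cases/your_use_case/config.py -- minimal, hypothetical sketch.
from dataclasses import dataclass, field

@dataclass
class UseCaseConfig:
    name: str = "your_use_case"
    description: str = "What the Producer agent is asked to produce."
    # Weights should sum to 1.0, mirroring the system design evaluator.
    quality_dimensions: dict[str, float] = field(default_factory=lambda: {
        "technical_accuracy": 0.4,
        "completeness": 0.3,
        "clarity": 0.3,
    })
    test_scenarios: list[str] = field(default_factory=lambda: [
        "A simple scenario for smoke tests",
        "A harder scenario that stresses the Critic",
    ])

CONFIG = UseCaseConfig()
```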
# Run all tests
uv run pytest
# Run specific test category
uv run pytest tests/framework/
uv run pytest tests/use_cases/

# Format code
uv run black .
uv run ruff check --fix .
# Type checking
uv run mypy .

Key configuration options in .env:
# Google AI
GOOGLE_API_KEY=your_key_here
DEFAULT_MODEL=gemini-2.5-flash
PRO_MODEL=gemini-2.5-pro
# Research
ENABLE_RESEARCH_MODE=true
MAX_REFLECTION_ITERATIONS=5
COST_TRACKING_ENABLED=true
# Database & Cache
RESEARCH_POSTGRES_DB=research_db
REDIS_URL=redis://localhost:6379

Experiments are configured via JSON files in config/experiments/:
{
"experiment_id": "system_design_pilot_001",
"use_case": "system_design",
"models_to_test": ["gemini-2.5-flash", "gemini-2.5-pro"],
"reflection_configs": [0, 2, 3],
"test_scenarios": ["..."],
"repetitions": 2
}

This framework is designed for:
- Academic Research: Empirical studies on AI reflection patterns
- Industry Analysis: Cost-benefit analysis of different AI approaches
- System Optimization: Finding optimal model/reflection combinations
- Expert Validation: Incorporating domain expertise in AI evaluation
- Fork the repository
- Create a feature branch
- Make changes following the established patterns
- Add tests for new functionality
- Submit a pull request
MIT License - see LICENSE file for details.
If you use this framework in your research, please cite:
@software{agentic_research_framework,
title={Agentic Research Framework: Iterative Reflection vs Model Capability},
author={Research Team},
year={2024},
url={https://github.com/your-org/agentic-research-framework}
}