A systematic comparison of three document intelligence paradigms: traditional Retrieval-Augmented Generation (RAG), Google's File Search-based RAG, and Recursive Language Models (RLM).
This repository implements three distinct approaches for question-answering over long documents, evaluated on the Vectara Open RAG Benchmark (3,045 questions across 1,000 research papers).
| Approach | Architecture | Context Strategy | Model |
|---|---|---|---|
| Normal RAG | Chunking + Embedding + Top-K Retrieval | Pre-indexed chunks (384-dim vectors) | Qwen3-30B via OpenRouter |
| Google RAG | File Search + Grounding | Google-managed indexing | Gemini 2.5 Flash |
| RLM | REPL + Recursive Sub-LM Calls | Dynamic context mining | Qwen3-30B via OpenRouter |
Recent work on Recursive Language Models [1] has demonstrated a novel paradigm for processing long-context tasks without requiring massive context windows. However, there remains confusion in the community about the relationship between RLMs and traditional RAG systems. This implementation provides:
- Direct comparison of RLM against established RAG baselines
- Full transparency into each system's reasoning process (chain-of-thought traces)
- Benchmark evaluation on realistic document QA tasks
**Normal RAG**
- Pre-indexes documents into fixed-size chunks
- Retrieves top-K most similar chunks via embedding similarity
- Single LLM call with retrieved context

**Google RAG**
- Managed indexing via Google File Search
- Grounding-based retrieval with source attribution
- Single LLM call with grounded context

**RLM**
- No pre-indexing — processes documents on-demand
- Iterative reasoning via REPL environment (Python sandbox)
- Multiple sub-LM calls to extract and verify information
- Programmatic context navigation (search, filter, chunk dynamically)
```
┌─────────────────────────────────────────────────────────────┐
│                         User Query                          │
└─────────────────────────────────────────────────────────────┘
                            │
        ┌───────────────────┼───────────────────┐
        │                   │                   │
        ▼                   ▼                   ▼
┌───────────────┐   ┌───────────────┐   ┌───────────────┐
│  Normal RAG   │   │  Google RAG   │   │      RLM      │
├───────────────┤   ├───────────────┤   ├───────────────┤
│ 1. Embed Q    │   │ 1. Upload PDF │   │ 1. Load doc   │
│ 2. Top-K      │   │ 2. File Search│   │    as variable│
│    retrieval  │   │    grounding  │   │ 2. LLM writes │
│ 3. LLM call   │   │ 3. LLM call   │   │    Python code│
│    (Qwen3)    │   │    (Gemini)   │   │ 3. Sub-LM     │
│               │   │               │   │    calls      │
│               │   │               │   │ 4. Iterate    │
│               │   │               │   │ 5. SUBMIT()   │
└───────────────┘   └───────────────┘   └───────────────┘
        │                   │                   │
        └───────────────────┼───────────────────┘
                            ▼
                    ┌───────────────┐
                    │ Side-by-Side  │
                    │  Comparison   │
                    └───────────────┘
```
- Python 3.8+
- Deno (required for RLM's sandboxed REPL)
1. **Clone the repository**

   ```bash
   git clone https://github.com/yourusername/RLM-RAG.git
   cd RLM-RAG
   ```

2. **Install Python dependencies**

   ```bash
   pip install -r requirements.txt
   ```

3. **Install Deno** (for the RLM sandbox)

   ```bash
   curl -fsSL https://deno.land/install.sh | sh
   ```

   Restart your shell after installation.

4. **Configure API keys**

   ```bash
   cp .env.example .env
   ```

   Edit `.env` with your credentials:

   ```
   OPENROUTER_API_KEY=your_openrouter_key
   GOOGLE_API_KEY=your_google_key
   REDUCTO_API_KEY=your_reducto_key
   ```

5. **Run the application**

   ```bash
   python app.py
   ```

   Navigate to `http://localhost:5000`.
| Service | Purpose | Obtain Key |
|---|---|---|
| OpenRouter | Normal RAG + RLM (Qwen3) | openrouter.ai |
| Google AI | Google RAG (Gemini) | ai.google.dev |
| Reducto | PDF parsing (OCR) | reducto.ai |
Model selection can be configured via environment variables:

```
QWEN_MODEL=qwen/qwen3-30b-a3b    # OpenRouter model for RAG + RLM
GOOGLE_MODEL=gemini-2.5-flash    # Google model for File Search
```

```
RLM-RAG/
├── app.py                 # Flask backend
├── pipelines/
│   ├── normal_rag.py      # Traditional RAG implementation
│   ├── google_rag.py      # Google File Search RAG
│   ├── rlm_pipeline.py    # DSPy RLM implementation
│   └── reducto_parser.py  # PDF parsing with Reducto OCR
├── benchmark/
│   ├── bench_loader.py    # Vectara Open RAG Bench loader
│   └── data/              # Benchmark questions & answers
├── templates/
│   └── index.html         # Web interface
├── static/
│   ├── style.css          # UI styling
│   └── script.js          # Frontend logic
├── requirements.txt
├── .env.example
├── .gitignore
└── README.md
```
**Normal RAG**

1. **Document Processing**
   - PDF parsed via Reducto OCR (extracts text, tables, figures)
   - Text chunked into ~1000-character segments (200-character overlap)
   - Chunks embedded using `sentence-transformers/all-MiniLM-L6-v2` (384-dim)
2. **Query Processing**
   - Question embedded using the same model
   - Top-5 chunks retrieved via cosine similarity
   - Context + question sent to Qwen3-30B
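The chunking and retrieval steps above can be sketched as follows. This is illustrative, not the code in `pipelines/normal_rag.py`: `chunk_text` and `top_k` are hypothetical helpers, and the embedding model is treated abstractly as any function returning unit-normalized vectors (the real pipeline uses `sentence-transformers/all-MiniLM-L6-v2`).

```python
# Illustrative sketch of the Normal RAG chunk/retrieve steps.
import numpy as np

def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into ~`size`-char segments with `overlap`-char overlap."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def top_k(question_vec: np.ndarray, chunk_vecs: np.ndarray, k: int = 5) -> np.ndarray:
    """Indices of the k chunks most similar to the question.

    With unit-normalized embeddings, the dot product equals cosine
    similarity, so one matrix-vector product scores every chunk at once.
    """
    scores = chunk_vecs @ question_vec
    return np.argsort(scores)[::-1][:k]
```

The top-5 chunks selected this way are concatenated with the question and sent to Qwen3-30B in a single call.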
**Google RAG**

1. **Document Processing**
   - Raw PDF uploaded to a Google File Search Store
   - Google internally indexes and chunks the document
2. **Query Processing**
   - Gemini 2.5 Flash uses the `file_search` tool
   - Retrieves relevant passages with grounding metadata
   - Returns answer with source attribution
**RLM**

1. **Document Processing**
   - Full document text loaded as a Python variable in the REPL
   - No pre-indexing or chunking
2. **Query Processing**
   - LLM receives question + metadata (doc length, structure)
   - LLM writes Python code to explore the document:
     - `peek(start, end)` — view a text slice
     - `search(keyword)` — find occurrences
     - `llm_query(text, question)` — spawn a sub-LM call
   - Iterates until sufficient information is gathered
   - Calls `SUBMIT(answer)` to finalize
3. **Recursive Reasoning**
   - Main LLM never sees the full document (avoids context overflow)
   - Sub-LM calls process specific chunks
   - Full trajectory (code + outputs) displayed for transparency
The application includes the Vectara Open RAG Benchmark [2]:
- 3,045 questions across 1,000 research papers (arXiv)
- Question types: Abstractive (1,793), Extractive (1,252)
- Source types: Text (1,914), Text+Image (763), Text+Table (148), Text+Table+Image (220)
Each pipeline can be evaluated on benchmark questions with:
- Ground truth answers for comparison
- LLM-as-Judge scoring (Claude Sonnet 4 via OpenRouter)
- Metrics: Accuracy, Completeness, Relevance (1-10 scale)
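A minimal sketch of the judge's scoring side, assuming the judge is asked to emit one `Metric: score` line per metric. The prompt wording and reply format are assumptions, not the benchmark code's exact implementation, and the OpenRouter call itself is omitted (it is a standard OpenAI-compatible chat completion).

```python
# Hedged sketch of LLM-as-Judge scoring: build a prompt requesting 1-10
# scores per metric, then parse the judge's reply into a dict.
import re

METRICS = ("Accuracy", "Completeness", "Relevance")

def build_judge_prompt(question: str, ground_truth: str, answer: str) -> str:
    """Assemble an instruction asking for one 'Metric: score' line each."""
    return (
        "Score the candidate answer against the ground truth on a 1-10 "
        f"scale for each of: {', '.join(METRICS)}. Reply with one "
        "'Metric: score' line per metric.\n\n"
        f"Question: {question}\n"
        f"Ground truth: {ground_truth}\n"
        f"Candidate answer: {answer}"
    )

def parse_scores(judge_reply: str) -> dict[str, int]:
    """Extract '<Metric>: <1-10>' lines from the judge's reply."""
    scores = {}
    for metric in METRICS:
        m = re.search(rf"{metric}\s*:\s*(\d+)", judge_reply, re.IGNORECASE)
        if m:
            scores[metric.lower()] = int(m.group(1))
    return scores
```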
[1] Zhang, A., Kraska, T., & Khattab, O. (2025). Recursive Language Models. arXiv preprint arXiv:2512.24601. https://arxiv.org/abs/2512.24601
[2] Vectara. (2024). Open RAG Bench: A Benchmark for Retrieval-Augmented Generation. https://github.com/vectara/open-rag-bench
[3] Khattab, O., et al. (2024). DSPy: Programming—not prompting—Foundation Models. https://dspy.ai
If you use this implementation in your research, please cite the original RLM paper:
```bibtex
@article{zhang2025recursive,
  title={Recursive Language Models},
  author={Zhang, Alex and Kraska, Tim and Khattab, Omar},
  journal={arXiv preprint arXiv:2512.24601},
  year={2025},
  url={https://arxiv.org/abs/2512.24601}
}
```

MIT License
- DSPy team for the RLM implementation
- Vectara for the Open RAG Benchmark
- Reducto for PDF parsing infrastructure