Stop forcing LLMs to answer in one pass. Give them a runtime.
Took a base model. Wrapped it in a tiny recursive loop: generate code - execute - refine - repeat.
Didn't change the model. Didn't add training. Didn't add data.
Just stopped forcing it to answer in one pass.
The performance jump is not subtle:
| | Vanilla (one-shot) | minRLM (recursive) |
|---|---|---|
| AIME 2025 | 0% | 96% |
| Sudoku Extreme | 0% | 80% |
| Overall (GPT-5.2) | 48.2% | 78.2% (+30pp) |
| Tokens used | 20,967 | 8,151 (3.6x less) |
| Cost | $7.92 | $2.86 (2.8x cheaper) |
6,600+ evaluations across 4 models and 13 tasks. Full blog post | Detailed results
pip install minrlm
export OPENAI_API_KEY="sk-..."
# Analyze a file - data never enters the prompt
uvx minrlm "How many ERROR lines in the last hour?" ./server.log
# Pure computation - the REPL writes the algorithm
uvx minrlm "Return all primes up to 1,000,000, reversed."
# -> 78,498 primes in 6,258 tokens. Output: 616K chars. 25x savings.
# Pipe anything
cat huge_dataset.csv | uvx minrlm "Which product had the highest return rate?"
# Chain: solve a Sudoku, then pipe the solution to verify it
uvx minrlm -s "Solve this Sudoku:
..3|.1.|...
.4.|...|8..
...|..6|.2.
---+---+---
.8.|.5.|..1
...|...|...
5..|.8.|.6.
---+---+---
.7.|6..|...
..2|...|.5.
...|.3.|9.." \
| uvx minrlm -s 'Verify this sudoku board, is it valid? return {"board":str, "valid": bool}'

from minrlm import RLM
rlm = RLM(model="gpt-5-mini")
# 50MB CSV? Same cost as 5KB. Data never enters the prompt.
answer = rlm.completion(
task="Which product had the highest return rate in Q3?",
context=open("q3_returns.csv").read()
)

Standard LLM:
[System prompt] + [500K tokens of raw context] + [Question]
= Expensive. Slow. Accuracy degrades with length.
minRLM:
input_0 = "<500K chars in REPL memory>" # never in prompt
LLM writes: errors = [l for l in input_0.splitlines() if "ERROR" in l]
FINAL(len(errors))
= Code runs. Answer returned. ~4K tokens total.
The model writes Python to query the data. Attention runs only on the results. A 7M-character document costs the same as a 7K one.
Not ReAct. One REPL, 1-2 iterations, no growing context. Every step is Python you can read, rerun, and debug.
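A minimal sketch of that loop's execution step, to make the idea concrete. This is not minRLM's actual implementation: `run_in_repl` and `_Final` are hypothetical names, the "model output" is a hard-coded string, and `exec` here is unsandboxed. It only shows how the context can live in a variable (`input_0`) while the code that queries it stays tiny.

```python
import io
from contextlib import redirect_stdout


class _Final(Exception):
    """Raised by FINAL() to stop execution and carry the answer out."""

    def __init__(self, answer):
        super().__init__(answer)
        self.answer = answer


def run_in_repl(code: str, context: str):
    """Execute model-written code against `context` held in memory.

    Returns (answer, stdout). `answer` is None if the code never called FINAL().
    """

    def FINAL(answer):
        raise _Final(answer)

    # The context is bound to a variable; it is never part of any prompt.
    namespace = {"input_0": context, "FINAL": FINAL}
    stdout = io.StringIO()
    try:
        with redirect_stdout(stdout):
            exec(code, namespace)  # illustration only: no sandboxing here
    except _Final as done:
        return done.answer, stdout.getvalue()
    return None, stdout.getvalue()


# Stand-in for what a model might write for "How many ERROR lines?"
generated = '''
errors = [l for l in input_0.splitlines() if "ERROR" in l]
FINAL(len(errors))
'''

log = "INFO boot\nERROR disk full\nINFO retry\nERROR timeout\n" * 1000
answer, _ = run_in_repl(generated, log)
print(answer)  # 2000
```

If `answer` comes back as `None`, a refine step would feed the captured stdout (not the context) back to the model for another attempt.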
- Entropy profiling - zlib compression heatmap of the input. A needle in 7MB shows up as an entropy spike; the model skips straight to it
- Task routing - auto-detects structured data, MCQ, code retrieval, math, search & extract. Each gets a specialized code pattern
- Two-pass search - if the first pass returns "unknown", a second pass runs with keywords from first-pass evidence
- Sub-LLM delegation - outer model gathers evidence via `search()`, passes it to `sub_llm(task, evidence)` for focused reasoning
- Flat token cost - context never enters the conversation. Only the entropy map and a head/mid/tail preview do
- DockerREPL - with `use_docker=True`, every execution runs in a sandboxed container with seccomp. No network, read-only filesystem, stdlib only
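The entropy-profiling idea can be sketched with nothing but `zlib`: compress fixed-size windows and compare compressed size to raw size. Repetitive text compresses well; an unusual span does not, so it stands out. The function name `entropy_profile`, the 1024-byte window, and the synthetic data are assumptions for illustration, not minRLM's actual code.

```python
import random
import string
import zlib


def entropy_profile(text: str, window: int = 1024) -> list[float]:
    """Compressed-size / raw-size ratio for each fixed-size window of `text`."""
    data = text.encode("utf-8")
    ratios = []
    for start in range(0, len(data), window):
        chunk = data[start:start + window]
        ratios.append(len(zlib.compress(chunk)) / len(chunk))
    return ratios


# Repetitive log text with a 400-char high-entropy "needle" hidden at offset 19,500.
rng = random.Random(0)
needle = "".join(rng.choices(string.ascii_letters + string.digits, k=400))
haystack = "INFO ok\n" * 5000
doc = haystack[:19500] + needle + haystack[19500:]

profile = entropy_profile(doc)
hottest = max(range(len(profile)), key=profile.__getitem__)
print(hottest, hottest * 1024)  # window index and its starting byte offset
```

Only the list of ratios would need to reach the model, so the cost of the map depends on the number of windows, not on what they contain.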
The REPL isn't a crutch for weak models - it's a lever that better models pull harder.
| Model | minRLM | Vanilla | Gap | Tasks won |
|---|---|---|---|---|
| GPT-5-nano (small) | 53.7% | 63.2% | -9.5 | 4/12 |
| GPT-5-mini (mid) | 72.7% | 69.5% | +3.2 | 7/12 |
| GPT-5.4-mini (mid, newer) | 69.5% | 47.2% | +22.3 | 8/12 |
| GPT-5.2 (frontier) | 78.2% | 48.2% | +30.0 | 11/12 |
Small model? Recursion adds overhead. Frontier model? Recursion dominates.
The gap isn't model size. It's the execution model.
Use it when:
- Large context (docs, logs, CSV, JSON) - cost stays flat as data grows
- You want debuggable reasoning - every step is readable Python, not hidden attention
- Token efficiency matters - 3.6x fewer tokens than comparable approaches
Skip it when:
- Short context (<8K tokens) - a direct call is simpler
- Code retrieval (RepoQA) - the one task where vanilla wins everywhere
- You need third-party packages - the sandbox is stdlib-only
| Function | What it does |
|---|---|
| `input_0` | Your context data (string, never in the prompt) |
| `search(text, pattern)` | Substring search with context windows |
| `sub_llm(task, context)` | Recursive LLM call on a sub-chunk |
| `FINAL(answer)` | Return answer and stop |
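For intuition, here is one way a "substring search with context windows" helper could behave. This is a guess at the semantics, not the library's implementation: the `window` and `max_hits` parameters are hypothetical additions to the `search(text, pattern)` signature listed above.

```python
def search(text: str, pattern: str, window: int = 40, max_hits: int = 5) -> list[str]:
    """Return up to `max_hits` snippets of `text`, each padded by `window` chars."""
    hits: list[str] = []
    if not pattern:
        return hits
    start = 0
    while len(hits) < max_hits:
        idx = text.find(pattern, start)
        if idx == -1:
            break
        lo = max(0, idx - window)
        hi = min(len(text), idx + len(pattern) + window)
        hits.append(text[lo:hi])
        start = idx + len(pattern)  # continue after this match
    return hits


doc = "a" * 100 + "NEEDLE" + "b" * 100
print(search(doc, "NEEDLE", window=3))  # ['aaaNEEDLEbbb']
```

Returning short snippets instead of the full text is what keeps the evidence handed to `sub_llm` small.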
# Local / self-hosted
rlm = RLM(model="llama-3.1-70b", base_url="http://localhost:8000/v1")
# Hugging Face
from openai import OpenAI
hf = OpenAI(base_url="https://router.huggingface.co/v1", api_key="hf_...")
rlm = RLM(model="openai/gpt-oss-120b", client=hf)

Works with: OpenAI, Hugging Face, Anthropic (via proxy), vLLM, Ollama, LiteLLM, or anything OpenAI-compatible.
Visualizer (Gradio UI)
git clone https://github.com/avilum/minrlm && cd minrlm
uv sync --extra visualizer
uv run python examples/visualizer.py # http://localhost:7860

OpenCode integration
1. Start the proxy:
uv run --with ".[proxy]" examples/proxy.py
# RLM Proxy initialized | model=gpt-5-mini | docker=False
# Uvicorn running on http://0.0.0.0:8000
2. Config (opencode/opencode.json): set provider.minrlm.api to http://localhost:8000/v1. See opencode/opencode.json.
3. Run:
OPENCODE_CONFIG=opencode.json opencode run "First prime after 1 million"
# > 1000003

Docker sandbox
LLM-generated code runs in isolated Docker containers. No network, read-only filesystem, memory-capped, seccomp-filtered.
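Each of those restrictions corresponds to a standard `docker run` flag. The sketch below only assembles such a command; it is not necessarily how minRLM's DockerREPL invokes Docker. The helper name `docker_run_args`, the image, the mount path, and the seccomp profile filename are all placeholders.

```python
def docker_run_args(
    code_path: str,
    image: str = "python:3.12-slim",        # placeholder image
    memory: str = "256m",
    seccomp_profile: str = "seccomp.json",  # placeholder profile path
) -> list[str]:
    """Build a `docker run` command that applies the four restrictions."""
    return [
        "docker", "run", "--rm",
        "--network", "none",                            # no network
        "--read-only",                                  # read-only root filesystem
        "--memory", memory,                             # memory cap
        "--security-opt", f"seccomp={seccomp_profile}",  # syscall filter
        "--volume", f"{code_path}:/sandbox/main.py:ro",  # code mounted read-only
        image, "python", "/sandbox/main.py",
    ]


cmd = docker_run_args("/tmp/generated.py")
print(" ".join(cmd))
```

The list form can be passed directly to `subprocess.run(cmd, capture_output=True, timeout=...)` without a shell.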
rlm = RLM(model="gpt-5-mini", use_docker=True, docker_memory="256m")

Run the benchmarks yourself
git clone https://github.com/avilum/minrlm && cd minrlm
uv sync --extra eval
# Smoke test
uv run python eval/quickstart.py
# Full benchmark (reproduces the tables above)
uv run python eval/run.py \
--tasks all \
--runners minrlm-reasoning,vanilla,official \
--runs 50 --parallel 12 --task-parallel 12 \
--output-dir logs/my_eval

Full results: eval/README.md
Examples
uv run python examples/minimal.py # vanilla vs RLM side-by-side
uv run python examples/advanced_usage.py # search, sub_llm, callbacks
uv run python examples/visualizer.py # Gradio UI
uv run uvicorn examples.proxy:app --port 8000 # OpenAI-compatible proxy

Context window rot is real - model accuracy degrades as input grows, even when the answer is right there. Bigger windows aren't the fix. Less input, better targeted, is.
The same pattern is showing up everywhere: Anthropic's web search tool writes code to filter results, MCP standardizes code execution access, smolagents goes further. They all converge on the same idea: let the model use code to work with data instead of attending to all of it.
Feels less like "prompting" and more like giving the model a runtime.
- More models - Claude Opus 4.6, Gemini 2.5, open-weight models. Does the scaling trend hold across providers?
- Agentic pipelines - using the RLM pattern as a retrieval step inside multi-step agent workflows
- More tasks - stress-testing edge cases and domains where the approach might break
Contributions welcome. Open an issue or PR.
Built by Avi Lumelsky. Independent implementation - not a fork.
The RLM concept comes from Zhang, Kraska, and Khattab (2025). Official implementation: github.com/alexzhang13/rlm.
Citation
@misc{zhang2026recursivelanguagemodels,
title={Recursive Language Models},
author={Alex L. Zhang and Tim Kraska and Omar Khattab},
year={2026},
eprint={2512.24601},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2512.24601},
}
MIT






