minRLM

Stop forcing LLMs to answer in one pass. Give them a runtime.

Took a base model. Wrapped it in a tiny recursive loop: generate code -> execute -> refine -> repeat.

Didn't change the model. Didn't add training. Didn't add data.

Just stopped forcing it to answer in one pass.

The performance jump is not subtle:

Metric	Vanilla (one-shot)	minRLM (recursive)
AIME 2025	0%	96%
Sudoku Extreme	0%	80%
Overall (GPT-5.2)	48.2%	78.2% (+30pp)
Tokens used	20,967	8,151 (3.6x less)
Cost	$7.92	$2.86 (2.8x cheaper)

6,600+ evaluations across 4 models and 13 tasks. Full blog post | Detailed results

Try it in 10 seconds

pip install minrlm
export OPENAI_API_KEY="sk-..."

# Analyze a file - data never enters the prompt
uvx minrlm "How many ERROR lines in the last hour?" ./server.log

# Pure computation - the REPL writes the algorithm
uvx minrlm "Return all primes up to 1,000,000, reversed."
# -> 78,498 primes in 6,258 tokens. Output: 616K chars. 25x savings.

# Pipe anything
cat huge_dataset.csv | uvx minrlm "Which product had the highest return rate?"

# Chain: solve a Sudoku, then pipe the solution to verify it
uvx minrlm -s "Solve this Sudoku:
  ..3|.1.|...
  .4.|...|8..
  ...|..6|.2.
  ---+---+---
  .8.|.5.|..1
  ...|...|...
  5..|.8.|.6.
  ---+---+---
  .7.|6..|...
  ..2|...|.5.
  ...|.3.|9.." \
  | uvx minrlm -s 'Verify this sudoku board, is it valid? return {"board":str, "valid": bool}'

from minrlm import RLM

rlm = RLM(model="gpt-5-mini")

# 50MB CSV? Same cost as 5KB. Data never enters the prompt.
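# Read the file inside a with-block so the handle is closed after use.
with open("q3_returns.csv", encoding="utf-8") as f:
    answer = rlm.completion(
        task="Which product had the highest return rate in Q3?",
        context=f.read(),
    )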

How it works

Standard LLM:
  [System prompt] + [500K tokens of raw context] + [Question]
  = Expensive. Slow. Accuracy degrades with length.

minRLM:
  input_0 = "<500K chars in REPL memory>"     # never in prompt
  LLM writes: errors = [l for l in input_0.splitlines() if "ERROR" in l]
              FINAL(len(errors))
  = Code runs. Answer returned. ~4K tokens total.

The model writes Python to query the data. Attention runs only on the results. A 7M-character document costs the same as a 7K one.

Not ReAct. One REPL, 1-2 iterations, no growing context. Every step is Python you can read, rerun, and debug.
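
To make the loop concrete, here is a minimal sketch of the generate/execute/refine cycle described above. It is not minRLM's implementation: `run_loop`, `FinalAnswer`, and `fake_model` are hypothetical names, the model is replaced by a hard-coded stub, and the code runs through a bare `exec` with no sandbox. Only `input_0` and `FINAL` come from this README.

```python
import contextlib
import io


class FinalAnswer(Exception):
    """Raised by FINAL() to stop execution and carry the answer out."""

    def __init__(self, value):
        super().__init__(value)
        self.value = value


def FINAL(answer):
    raise FinalAnswer(answer)


def run_loop(generate_code, context, max_iters=2):
    """Ask for code, run it, feed printed output back, stop at FINAL()."""
    # The context lives in the REPL namespace, never in the prompt.
    namespace = {"input_0": context, "FINAL": FINAL}
    feedback = ""
    for _ in range(max_iters):
        code = generate_code(feedback)
        stdout = io.StringIO()
        try:
            with contextlib.redirect_stdout(stdout):
                exec(code, namespace)  # unsandboxed: illustration only
        except FinalAnswer as done:
            return done.value
        except Exception as exc:
            # Errors become feedback for the next attempt.
            feedback = f"error: {exc!r}"
            continue
        feedback = stdout.getvalue()
    return None  # no FINAL() within the iteration budget


def fake_model(feedback):
    """Stand-in for the LLM call: inspect the data first, then answer."""
    if not feedback:
        return "print(len(input_0.splitlines()))"
    return 'FINAL(sum("ERROR" in line for line in input_0.splitlines()))'


log = "INFO start\nERROR disk full\nINFO retry\nERROR timeout\n"
answer = run_loop(fake_model, log)
print(answer)
```

The only text that would travel back to the model here is the short printed output of each step, which is why the prompt stays small no matter how large `input_0` is.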

What makes it work

Entropy profiling - zlib compression heatmap of the input. A needle in 7MB shows up as an entropy spike; the model skips straight to it
Task routing - auto-detects structured data, MCQ, code retrieval, math, search & extract. Each gets a specialized code pattern
Two-pass search - if the first pass returns "unknown", a second pass runs with keywords from first-pass evidence
Sub-LLM delegation - outer model gathers evidence via search(), passes it to sub_llm(task, evidence) for focused reasoning
Flat token cost - context never enters the conversation. Only the entropy map and a head/mid/tail preview do
DockerREPL - with use_docker=True, every execution runs in a sandboxed container with seccomp. No network, read-only filesystem, stdlib only
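
The entropy profiling idea can be sketched with the standard library alone. This is an illustration under assumptions, not minRLM's code: `entropy_profile`, the 4096-byte chunk size, and the synthetic log are all hypothetical. Each chunk is compressed with zlib, and the compressed/raw ratio serves as a rough entropy estimate, so repetitive filler scores low and an unusual region scores high.

```python
import hashlib
import zlib


def entropy_profile(text, chunk_size=4096):
    """Return one ratio per chunk: compressed bytes / raw bytes."""
    data = text.encode("utf-8")
    ratios = []
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        ratios.append(len(zlib.compress(chunk)) / len(chunk))
    return ratios


# Synthetic haystack: repetitive log lines around a block of
# deterministic pseudo-random hex (the "needle").
filler = "2024-01-01 INFO request served in 12ms\n" * 2000
needle = "".join(hashlib.sha256(str(i).encode()).hexdigest() for i in range(32))
text = filler + needle + filler

ratios = entropy_profile(text)
spike = max(range(len(ratios)), key=ratios.__getitem__)
print(f"chunk {spike} of {len(ratios)} stands out: ratio {ratios[spike]:.2f}")
```

One caveat with this sketch: a very short final chunk can look like a spike, because zlib adds a fixed header and checksum to every compressed chunk.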

The scaling story

The REPL isn't a crutch for weak models - it's a lever that better models pull harder.

Model	minRLM	Vanilla	Gap	Tasks won
GPT-5-nano (small)	53.7%	63.2%	-9.5	4/12
GPT-5-mini (mid)	72.7%	69.5%	+3.2	7/12
GPT-5.4-mini (mid, newer)	69.5%	47.2%	+22.3	8/12
GPT-5.2 (frontier)	78.2%	48.2%	+30.0	11/12

Small model? Recursion adds overhead. Frontier model? Recursion dominates.

The gap isn't model size. It's the execution model.

When to use it (and when not to)

Use it when:

Large context (docs, logs, CSV, JSON) - cost stays flat as data grows
You want debuggable reasoning - every step is readable Python, not hidden attention
Token efficiency matters - 3.6x fewer tokens than comparable approaches

Skip it when:

Short context (<8K tokens) - a direct call is simpler
Code retrieval (RepoQA) - the one task where vanilla wins everywhere
You need third-party packages - the sandbox is stdlib-only

REPL tools

Function	What it does
`input_0`	Your context data (string, never in the prompt)
`search(text, pattern)`	Substring search with context windows
`sub_llm(task, context)`	Recursive LLM call on a sub-chunk
`FINAL(answer)`	Return answer and stop
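
As a rough picture of what "substring search with context windows" could look like, here is a hypothetical stand-in for `search`. The `window` and `max_hits` parameters are assumptions added for illustration; the table above only documents `search(text, pattern)`, and the real helper may behave differently.

```python
def search(text, pattern, window=40, max_hits=20):
    """Return snippets of text surrounding each occurrence of pattern."""
    hits = []
    if not pattern:
        return hits
    start = text.find(pattern)
    while start != -1 and len(hits) < max_hits:
        # Clamp the window to the bounds of the text.
        lo = max(0, start - window)
        hi = min(len(text), start + len(pattern) + window)
        hits.append(text[lo:hi])
        # Continue after this match so occurrences do not overlap.
        start = text.find(pattern, start + len(pattern))
    return hits


text = "a" * 100 + "NEEDLE" + "b" * 100
hits = search(text, "NEEDLE", window=5)
print(hits)
```

Returning short snippets rather than whole documents keeps the evidence passed on to `sub_llm` small.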

Works with any OpenAI-compatible endpoint

# Local / self-hosted
rlm = RLM(model="llama-3.1-70b", base_url="http://localhost:8000/v1")

# Hugging Face
from openai import OpenAI
hf = OpenAI(base_url="https://router.huggingface.co/v1", api_key="hf_...")
rlm = RLM(model="openai/gpt-oss-120b", client=hf)

Works with: OpenAI, Hugging Face, Anthropic (via proxy), vLLM, Ollama, LiteLLM, or anything OpenAI-compatible.

More ways to run

Visualizer (Gradio UI)

git clone https://github.com/avilum/minrlm && cd minrlm
uv sync --extra visualizer
uv run python examples/visualizer.py   # http://localhost:7860

OpenCode integration

1. Start the proxy:

uv run --with ".[proxy]" examples/proxy.py
# RLM Proxy initialized | model=gpt-5-mini | docker=False
# Uvicorn running on http://0.0.0.0:8000

2. Config: in opencode/opencode.json, set provider.minrlm.api to http://localhost:8000/v1.

3. Run:

OPENCODE_CONFIG=opencode.json opencode run "First prime after 1 million"
# > 1000003

Full tutorial

Docker sandbox

LLM-generated code runs in isolated Docker containers. No network, read-only filesystem, memory-capped, seccomp-filtered.

rlm = RLM(model="gpt-5-mini", use_docker=True, docker_memory="256m")
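
For a sense of what those constraints mean at the Docker level, the sketch below builds a `docker run` command line with standard Docker flags. It is an assumption about how such a sandbox could be launched, not a description of how minRLM starts its containers: `docker_argv`, the `python:3.12-slim` image, and the optional custom seccomp profile are hypothetical.

```python
import shlex


def docker_argv(code, memory="256m", image="python:3.12-slim", seccomp_profile=None):
    """Build a docker run command for a locked-down, one-shot Python process."""
    argv = [
        "docker", "run", "--rm",
        "--network", "none",  # no network access
        "--read-only",        # read-only root filesystem
        "--memory", memory,   # hard memory cap
    ]
    if seccomp_profile:
        # Without this flag Docker applies its default seccomp profile.
        argv += ["--security-opt", f"seccomp={seccomp_profile}"]
    argv += [image, "python", "-c", code]
    return argv


print(shlex.join(docker_argv("print(2 + 2)")))
```

The function only assembles the argument list, so it can be inspected or logged before anything is actually run, for example via `subprocess.run(argv, capture_output=True)`.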

Run the benchmarks yourself

git clone https://github.com/avilum/minrlm && cd minrlm
uv sync --extra eval

# Smoke test
uv run python eval/quickstart.py

# Full benchmark (reproduces the tables above)
uv run python eval/run.py \
    --tasks all \
    --runners minrlm-reasoning,vanilla,official \
    --runs 50 --parallel 12 --task-parallel 12 \
    --output-dir logs/my_eval

Full results: eval/README.md

Examples

uv run python examples/minimal.py            # vanilla vs RLM side-by-side
uv run python examples/advanced_usage.py     # search, sub_llm, callbacks
uv run python examples/visualizer.py         # Gradio UI
uv run uvicorn examples.proxy:app --port 8000  # OpenAI-compatible proxy

Why this matters

Context window rot is real - model accuracy degrades as input grows, even when the answer is right there. Bigger windows aren't the fix. Less input, better targeted, is.

The same pattern is showing up everywhere: Anthropic's web search tool writes code to filter results, MCP standardizes code execution access, smolagents goes further. They all converge on the same idea: let the model use code to work with data instead of attending to all of it.

Feels less like "prompting" and more like giving the model a runtime.

Future work

More models - Claude Opus 4.6, Gemini 2.5, open-weight models. Does the scaling trend hold across providers?
Agentic pipelines - using the RLM pattern as a retrieval step inside multi-step agent workflows
More tasks - stress-testing edge cases and domains where the approach might break

Contributions welcome. Open an issue or PR.

Credits

Built by Avi Lumelsky. Independent implementation - not a fork.

The RLM concept comes from Zhang, Kraska, and Khattab (2025). Official implementation: github.com/alexzhang13/rlm.

Citation

@misc{zhang2026recursivelanguagemodels,
      title={Recursive Language Models},
      author={Alex L. Zhang and Tim Kraska and Omar Khattab},
      year={2026},
      eprint={2512.24601},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2512.24601},
}

License

MIT
