minRLM

Stop forcing LLMs to answer in one pass. Give them a runtime.

[Demo: the LLM writes code, the REPL executes it, and the answer is returned]

Took a base model. Wrapped it in a tiny recursive loop: generate code -> execute -> refine -> repeat.

Didn't change the model. Didn't add training. Didn't add data.

Just stopped forcing it to answer in one pass.

The performance jump is not subtle:

                    Vanilla (one-shot)   minRLM (recursive)
AIME 2025           0%                   96%
Sudoku Extreme      0%                   80%
Overall (GPT-5.2)   48.2%                78.2% (+30pp)
Tokens used         20,967               8,151 (3.6x less)
Cost                $7.92                $2.86 (2.8x cheaper)

6,600+ evaluations across 4 models and 13 tasks. Full blog post | Detailed results


Try it in 10 seconds

pip install minrlm
export OPENAI_API_KEY="sk-..."

# Analyze a file - data never enters the prompt
uvx minrlm "How many ERROR lines in the last hour?" ./server.log

# Pure computation - the REPL writes the algorithm
uvx minrlm "Return all primes up to 1,000,000, reversed."
# -> 78,498 primes in 6,258 tokens. Output: 616K chars. 25x savings.

# Pipe anything
cat huge_dataset.csv | uvx minrlm "Which product had the highest return rate?"

# Chain: solve a Sudoku, then pipe the solution to verify it
uvx minrlm -s "Solve this Sudoku:
  ..3|.1.|...
  .4.|...|8..
  ...|..6|.2.
  ---+---+---
  .8.|.5.|..1
  ...|...|...
  5..|.8.|.6.
  ---+---+---
  .7.|6..|...
  ..2|...|.5.
  ...|.3.|9.." \
  | uvx minrlm -s 'Verify this sudoku board, is it valid? return {"board":str, "valid": bool}'

Or call it from Python:

from minrlm import RLM

rlm = RLM(model="gpt-5-mini")

# 50MB CSV? Same cost as 5KB. Data never enters the prompt.
# Read the file with a context manager so the handle is closed afterwards.
with open("q3_returns.csv") as f:
    answer = rlm.completion(
        task="Which product had the highest return rate in Q3?",
        context=f.read(),
    )

How it works

Standard LLM:
  [System prompt] + [500K tokens of raw context] + [Question]
  = Expensive. Slow. Accuracy degrades with length.

minRLM:
  input_0 = "<500K chars in REPL memory>"     # never in prompt
  LLM writes: errors = [l for l in input_0.splitlines() if "ERROR" in l]
              FINAL(len(errors))
  = Code runs. Answer returned. ~4K tokens total.

The model writes Python to query the data. Attention runs only on the results. A 7M-character document costs the same as a 7K one.

Not ReAct. One REPL, 1-2 iterations, no growing context. Every step is Python you can read, rerun, and debug.
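
The loop described above can be sketched in a few lines. This is a minimal illustration, not minRLM's actual implementation: `fake_llm`, `run`, and the `Final` exception are hypothetical names, the model call is stubbed out, and the code runs with plain in-process `exec` rather than in a sandbox.

```python
import contextlib
import io


class Final(Exception):
    """Raised by FINAL() to stop execution and carry the answer out."""

    def __init__(self, answer):
        super().__init__(answer)
        self.answer = answer


def FINAL(answer):
    raise Final(answer)


def fake_llm(task, feedback):
    # Hypothetical stand-in for a model call: always returns the same snippet.
    return (
        'errors = [l for l in input_0.splitlines() if "ERROR" in l]\n'
        "FINAL(len(errors))"
    )


def run(task, context, llm=fake_llm, max_iterations=2):
    # The context lives in the REPL namespace, never in the prompt.
    namespace = {"input_0": context, "FINAL": FINAL}
    feedback = ""
    for _ in range(max_iterations):
        code = llm(task, feedback)
        stdout = io.StringIO()
        try:
            with contextlib.redirect_stdout(stdout):
                exec(code, namespace)
        except Final as done:
            return done.answer                         # model called FINAL(...)
        except Exception as exc:
            feedback = f"{type(exc).__name__}: {exc}"  # let the model refine
        else:
            feedback = stdout.getvalue()[-2000:]       # only a short tail goes back
    return None


log = "INFO boot\nERROR disk full\nINFO retry\nERROR timeout\n"
result = run("How many ERROR lines?", log)
print(result)
```

Only the generated code and a short slice of its output travel between the model and the REPL, which is why the prompt size does not depend on the size of `input_0`. Untrusted model output should not be passed to an in-process `exec` outside of a toy like this one.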

What makes it work

  • Entropy profiling - zlib compression heatmap of the input. A needle in 7MB shows up as an entropy spike; the model skips straight to it
  • Task routing - auto-detects structured data, MCQ, code retrieval, math, search & extract. Each gets a specialized code pattern
  • Two-pass search - if the first pass returns "unknown", a second pass runs with keywords from first-pass evidence
  • Sub-LLM delegation - outer model gathers evidence via search(), passes it to sub_llm(task, evidence) for focused reasoning
  • Flat token cost - context never enters the conversation. Only the entropy map and a head/mid/tail preview do
  • DockerREPL - every execution in a sandboxed container with seccomp. No network, no filesystem, stdlib only
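
As a rough illustration of entropy profiling, the sketch below compresses fixed-size chunks with zlib and reports the compressed-to-raw size ratio per chunk. The function name `entropy_profile`, the chunk size, and the synthetic input are assumptions for the example; minRLM's own profiling may work differently.

```python
import os
import zlib


def entropy_profile(text, chunk_size=4096):
    """Return (byte offset, compression ratio) per chunk; a higher ratio means less compressible."""
    data = text.encode("utf-8")
    profile = []
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        ratio = len(zlib.compress(chunk)) / len(chunk)
        profile.append((start, ratio))
    return profile


# Repetitive filler with one high-entropy "needle" in the middle.
filler = "INFO heartbeat ok\n" * 2000
needle = os.urandom(2048).hex()  # 4096 random hex characters
haystack = filler + needle + filler

profile = entropy_profile(haystack)
spike_offset, spike_ratio = max(profile, key=lambda item: item[1])
print(f"least compressible chunk starts at byte {spike_offset}, ratio {spike_ratio:.2f}")
```

Repetitive log lines compress to a small fraction of their size, while the random needle barely compresses, so the chunk with the highest ratio points at the region worth reading first.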

The scaling story

The REPL isn't a crutch for weak models - it's a lever that better models pull harder.

Model                       minRLM   Vanilla   Gap     Tasks won
GPT-5-nano (small)          53.7%    63.2%     -9.5    4/12
GPT-5-mini (mid)            72.7%    69.5%     +3.2    7/12
GPT-5.4-mini (mid, newer)   69.5%    47.2%     +22.3   8/12
GPT-5.2 (frontier)          78.2%    48.2%     +30.0   11/12

Small model? Recursion adds overhead. Frontier model? Recursion dominates.

The gap isn't model size. It's the execution model.
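
The Gap column is minRLM accuracy minus vanilla accuracy, in percentage points. A quick check using only the figures from the table (the variable names are illustrative):

```python
# Accuracy figures copied from the table above: (minRLM %, vanilla %).
results = {
    "GPT-5-nano": (53.7, 63.2),
    "GPT-5-mini": (72.7, 69.5),
    "GPT-5.4-mini": (69.5, 47.2),
    "GPT-5.2": (78.2, 48.2),
}

# Round to one decimal to hide floating-point noise in the subtraction.
gaps = {model: round(rlm - vanilla, 1) for model, (rlm, vanilla) in results.items()}
for model, gap in gaps.items():
    print(f"{model}: {gap:+.1f} pp")
```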

[Figures: summary, accuracy, tokens, cost, latency, and per-task results]

When to use it (and when not to)

Use it when:

  • Large context (docs, logs, CSV, JSON) - cost stays flat as data grows
  • You want debuggable reasoning - every step is readable Python, not hidden attention
  • Token efficiency matters - 3.6x fewer tokens than comparable approaches

Skip it when:

  • Short context (<8K tokens) - a direct call is simpler
  • Code retrieval (RepoQA) - the one task where vanilla wins everywhere
  • You need third-party packages - the sandbox is stdlib-only

REPL tools

Function                 What it does
input_0                  Your context data (string, never in the prompt)
search(text, pattern)    Substring search with context windows
sub_llm(task, context)   Recursive LLM call on a sub-chunk
FINAL(answer)            Return answer and stop
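
To make "substring search with context windows" concrete, here is one possible shape for such a helper. It is a sketch only: the `window` and `max_hits` parameters are assumptions added for the example, and the real `search` in minRLM may behave differently.

```python
def search(text, pattern, window=40, max_hits=5):
    """Return up to max_hits snippets of text surrounding each occurrence of pattern."""
    if not pattern:
        return []
    hits = []
    start = 0
    while len(hits) < max_hits:
        index = text.find(pattern, start)
        if index == -1:
            break
        # Clamp the window to the bounds of the text.
        left = max(0, index - window)
        right = min(len(text), index + len(pattern) + window)
        hits.append(text[left:right])
        start = index + len(pattern)
    return hits


doc = "a" * 100 + " refund rate: 12% " + "b" * 100
snippets = search(doc, "refund rate", window=10)
print(snippets)
```

Returning short snippets instead of the whole document keeps the evidence passed on to the model small, whatever the size of the input.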

Works with any OpenAI-compatible endpoint

# Local / self-hosted
rlm = RLM(model="llama-3.1-70b", base_url="http://localhost:8000/v1")

# Hugging Face
from openai import OpenAI
hf = OpenAI(base_url="https://router.huggingface.co/v1", api_key="hf_...")
rlm = RLM(model="openai/gpt-oss-120b", client=hf)

Works with: OpenAI, Hugging Face, Anthropic (via proxy), vLLM, Ollama, LiteLLM, or anything OpenAI-compatible.


More ways to run

Visualizer (Gradio UI)
git clone https://github.com/avilum/minrlm && cd minrlm
uv sync --extra visualizer
uv run python examples/visualizer.py   # http://localhost:7860

OpenCode integration

1. Start the proxy:

uv run --with ".[proxy]" examples/proxy.py
# RLM Proxy initialized | model=gpt-5-mini | docker=False
# Uvicorn running on http://0.0.0.0:8000

2. Configure: in opencode/opencode.json, set provider.minrlm.api to http://localhost:8000/v1.

3. Run:

OPENCODE_CONFIG=opencode.json opencode run "First prime after 1 million"
# > 1000003

Full tutorial

Docker sandbox

LLM-generated code runs in isolated Docker containers. No network, read-only filesystem, memory-capped, seccomp-filtered.

rlm = RLM(model="gpt-5-mini", use_docker=True, docker_memory="256m")
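
For a sense of what those restrictions look like at the Docker CLI level, the sketch below builds a `docker run` command from standard flags (`--network none`, `--read-only`, `--memory`, `--cap-drop`, `--security-opt seccomp=...`). The helper `docker_command`, the image name, and the choice of flags are assumptions for illustration, not minRLM's actual invocation; the command is only printed, never executed.

```python
import shlex


def docker_command(code, memory="256m", image="python:3.12-slim", seccomp_profile=None):
    """Build a `docker run` argv that runs code with no network and a read-only root filesystem."""
    argv = [
        "docker", "run", "--rm",
        "--network", "none",   # no network access
        "--read-only",         # read-only root filesystem
        "--memory", memory,    # hard memory cap
        "--cap-drop", "ALL",   # drop all Linux capabilities
    ]
    if seccomp_profile is not None:
        # Path to a custom seccomp profile (hypothetical file supplied by the caller).
        argv += ["--security-opt", f"seccomp={seccomp_profile}"]
    argv += [image, "python", "-c", code]
    return argv


argv = docker_command("print(sum(range(10)))")
print(shlex.join(argv))
```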

Run the benchmarks yourself

git clone https://github.com/avilum/minrlm && cd minrlm
uv sync --extra eval

# Smoke test
uv run python eval/quickstart.py

# Full benchmark (reproduces the tables above)
uv run python eval/run.py \
    --tasks all \
    --runners minrlm-reasoning,vanilla,official \
    --runs 50 --parallel 12 --task-parallel 12 \
    --output-dir logs/my_eval

Full results: eval/README.md

Examples

uv run python examples/minimal.py            # vanilla vs RLM side-by-side
uv run python examples/advanced_usage.py     # search, sub_llm, callbacks
uv run python examples/visualizer.py         # Gradio UI
uv run uvicorn examples.proxy:app --port 8000  # OpenAI-compatible proxy

Why this matters

Context window rot is real - model accuracy degrades as input grows, even when the answer is right there. Bigger windows aren't the fix. Less input, better targeted, is.

The same pattern is showing up everywhere: Anthropic's web search tool writes code to filter results, MCP standardizes code execution access, smolagents goes further. They all converge on the same idea: let the model use code to work with data instead of attending to all of it.

Feels less like "prompting" and more like giving the model a runtime.


Future work

  • More models - Claude Opus 4.6, Gemini 2.5, open-weight models. Does the scaling trend hold across providers?
  • Agentic pipelines - using the RLM pattern as a retrieval step inside multi-step agent workflows
  • More tasks - stress-testing edge cases and domains where the approach might break

Contributions welcome. Open an issue or PR.


Credits

Built by Avi Lumelsky. Independent implementation - not a fork.

The RLM concept comes from Zhang, Kraska, and Khattab (2025). Official implementation: github.com/alexzhang13/rlm.

Citation

@misc{zhang2026recursivelanguagemodels,
      title={Recursive Language Models},
      author={Alex L. Zhang and Tim Kraska and Omar Khattab},
      year={2026},
      eprint={2512.24601},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2512.24601},
}

License

MIT
