This is an unofficial implementation of the RLM paper.
Original Paper: Recursive Language Models by Alex L. Zhang, Tim Kraska & Omar Khattab
- What is RLM (Layman Language)
- Extra Additions In This Implementation
- Run Locally
- Codebase Overview
- Acknowledgements & Gratitude
NO BS! Let's break it down into simple terms.
- You start with an LLM (let's call it the `root llm`).
- You give it access to a Python `repl` environment (I'll discuss the `repl` in detail later).
- You have a big dataset (say, 1 million+ tokens). You can't load that dataset into the `root llm` directly, so you give the dataset path to the `root llm` instead.
- The `root llm` writes Python code to load the dataset.
- We pass that Python code string to the `REPL` environment, which executes the code and saves its state (think of a Jupyter notebook, where every cell's variables, dicts, and so on are kept around; it's the same here).
- Once the file is loaded, you start asking questions that can be answered from the dataset. The `Root LLM` will again generate Python code to find the answer. For example, it can generate code to (see the sketch after this list):
  - peek into the first x characters
  - use regex to find specific sections
  - split the dataset into chunks and analyze them one by one in a loop
  - delegate a specific chunk to a sub LLM to extract a specific answer from that chunk
- The `Root LLM` recursively generates Python code and runs it until it finds the answer.
- Notice that the `root llm` never loads the entire text into its context; it only uses the `REPL` environment by generating Python code and looking at that code's output. That's it! That's the whole idea of RLM.
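To make this concrete, here is a hypothetical snippet of the kind of code the `root llm` might emit into the REPL. It assumes the dataset is already exposed as a string variable (here called `dataset`) and that a `sub_llm(prompt)` helper is available for delegating chunks; the exact variable and helper names in this repo may differ.

```python
import re

# Peek at the first 2,000 characters to get a feel for the structure.
print(dataset[:2000])

# Use a regex to locate sections that look relevant.
matches = [m.start() for m in re.finditer(r"ERROR|Exception", dataset)]
print(f"Found {len(matches)} candidate locations")

# Split the dataset into chunks and delegate each one to a sub LLM,
# keeping only the short answers instead of the raw text.
chunk_size = 50_000
findings = []
for start in range(0, len(dataset), chunk_size):
    chunk = dataset[start:start + chunk_size]
    findings.append(sub_llm(f"List any error messages in this text:\n{chunk}"))

print(findings[:3])  # inspect a few results without dumping everything
```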
So, is RLM just tool calling with Python in a loop? No.
- RLM treats the entire dataset as a variable.
- RLM uses the `REPL` environment to interact with the dataset.
- RLM uses code (a symbolic language) to call the `sub_llm` and spin up sub-agents; it never calls it as a tool.
- RLM only generates `code` and looks at that code's output (usually truncated to avoid context rot).
- RLM can bypass the base LLM's output token limit by using the `FINAL` function to deliver the final answer directly from the `REPL` environment (normally a tool response goes back to the LLM, but here the final answer is delivered straight from the `REPL`; see the sketch after this list).
- RLM can iteratively update its answer without ever seeing the full answer (this works because, thanks to recursion, the RLM is aware of the changes).
These are the reasons why RLM is not just tool calling with Python in a loop.
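As a rough illustration of the `FINAL` idea, the snippet below sketches how an answer assembled inside the REPL state could be handed off directly to the user. Both `findings` (built up in earlier REPL calls) and the `FINAL()` helper are assumed to already exist in the environment; the actual mechanism in the paper and in this repo may differ.

```python
# Hypothetical sketch of the FINAL mechanism: the answer is assembled inside
# the REPL state and delivered directly, never flowing back through the
# root llm's context (and therefore never hitting its output token limit).
answer_parts = []
for finding in findings:           # `findings` was built in earlier REPL calls
    answer_parts.append(f"- {finding}")

FINAL("\n".join(answer_parts))     # assumed helper: delivers the answer to the user from the REPL
```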
Omar Khattab, one of the main authors of the RLM paper, explains this in detail in this X thread.
REPL is short for Read-Eval-Print Loop. It is a simple environment that lets you execute Python code and see the output of that code.
- You write some Python code
- You run the Python code
- You see the output of the Python code
- ...and the loop continues
So, we give the LLM (the `root llm`) access to a stateful REPL environment as a tool, and the LLM uses that REPL to interact with the dataset without ever reading it directly. In the paper, the entire prompt (dataset) is treated as a variable, so the `root llm` can access the dataset only through the REPL environment.
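For intuition, here is a minimal, self-contained sketch of such a stateful Python REPL; it is an illustration, not the sandbox this repo actually uses. Code strings execute against a persistent namespace, and only the (truncated) captured stdout goes back to the LLM.

```python
import io
import contextlib

# Persistent namespace: variables survive across executions, like Jupyter cells.
namespace: dict = {}

def run_in_repl(code: str, max_output: int = 2000) -> str:
    """Execute a code string against the shared namespace and return its stdout."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, namespace)
    # Truncate the output so the LLM's context doesn't get bloated.
    return buffer.getvalue()[:max_output]

# The dataset becomes just another variable in the REPL state.
run_in_repl("dataset = 'x' * 1_000_000   # stand-in for reading a huge file")
print(run_in_repl("print(len(dataset))"))   # -> 1000000
```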
To learn more, I'd suggest reading the original paper; it discusses the approach in detail and also reports overall performance on various benchmarks.
The paper's implementation runs the Python REPL environment directly on the host machine, which is dangerous: if the LLM generates code that deletes important files, or any other malicious code, it would affect the host machine.
To solve this, I run the REPL environment inside a Docker container, so it is isolated from the host machine. At runtime, the dataset files (alongside the env) are mounted into the Docker container, and the REPL runs inside that container. If malicious code is ever generated, it runs inside the container and can't affect the host machine.
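Conceptually, the isolation works roughly like the sketch below: the dataset directory is mounted read-only into a container and the generated code runs there, so destructive commands stay confined. This is a simplified, stateless illustration (image name, mount point, and flags are my choices here), not the exact code used in this repo.

```python
import subprocess

def run_code_in_container(code: str, data_dir: str) -> str:
    """Run generated Python code inside an isolated container.

    The host's data directory is mounted read-only at /data, so even a
    destructive snippet cannot touch the host filesystem.
    """
    result = subprocess.run(
        [
            "docker", "run", "--rm",
            "--network", "none",              # no network access from inside
            "-v", f"{data_dir}:/data:ro",     # dataset mounted read-only
            "python:3.12-slim",
            "python", "-c", code,
        ],
        capture_output=True,
        text=True,
        timeout=120,
    )
    return result.stdout or result.stderr

# Example (placeholder paths): read the first 200 characters of a mounted file.
print(run_code_in_container("print(open('/data/info.txt').read()[:200])", "/home/user/data"))
```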
In the original paper they didn't mention memory compaction after each interaction. But while testing, I found:

- In general, the `root llm` was calling the `repl` environment multiple times to find the answer, which consumed a lot of tokens once all of those interactions were carried through the session.
- Therefore, I've implemented a simple interaction-based memory compaction. After each final response, the entire interaction is summarized into a single message. This way, even if there were 10 REPL calls, the next message contains only a summary of what happened in the last interaction, without bloating the context with unnecessary details.
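The compaction can be thought of as something like the sketch below, where one interaction's run of REPL messages is collapsed into a single summary message. The function and message shapes here are illustrative assumptions, not the repo's actual API.

```python
def compact_interaction(history: list[dict], summarize) -> list[dict]:
    """Collapse one interaction's REPL back-and-forth into a single summary message.

    `history` is a list of chat messages ({"role": ..., "content": ...}) and
    `summarize` is any callable that turns text into a short summary
    (for example, a call to the sub LLM).
    """
    user_msg = history[0]       # the user's question
    final_msg = history[-1]     # the delivered final answer
    middle = history[1:-1]      # every REPL call/response in between

    transcript = "\n".join(str(m["content"]) for m in middle)
    summary_msg = {
        "role": "assistant",
        "content": f"[Summary of {len(middle)} REPL steps] {summarize(transcript)}",
    }
    # Ten REPL round-trips collapse into: question, one summary, final answer.
    return [user_msg, summary_msg, final_msg]
```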
This project is an unofficial implementation based on my personal understanding of the RLM paper and various research materials available online.
- Not an Exact Replica: While it follows the core principles of the paper, this implementation is not an exact mirror of the authors' original work.
- Custom Additions: As mentioned in the Extra Additions section, I have introduced features like Docker sandboxing and Memory Compaction which are not present in the original paper.
- Potential Mistakes: This is a "from-scratch" exploration; there may be bugs, architectural differences, or misunderstandings of specific paper details.
Use this as a learning resource or a starting point for your own RLM experiments.
- Docker (make sure it can be run without `sudo`)
- `uv` for package management
- Python 3.12+
- A `.env` file with API keys (I've added a `.env.example` file for reference)
Copy the example environment file and fill in your API keys:
```bash
cp .env.example .env
```

Ensure you have the necessary keys (e.g., `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, etc.) depending on the models you choose in `config.json`.
Install the required Python packages using uv:
```bash
uv sync
```

The `config.json` file controls the LLMs and system behavior:
- `root_llm`: The primary model that orchestrates the reasoning and writes Python code.
- `sub_llm`: The model used for semantic analysis of text chunks (map operations).
- `glm_coding`: (Optional) Use Zhipu AI's GLM coding subscription API directly. When enabled, both `root_llm` and `sub_llm` use the GLM models.
- `memory_compaction`: When enabled, summarizes long tool-call histories into a single context-saving summary after each interaction.
Example `config.json`:

```json
{
"root_llm": {
"model": "openrouter/google/gemini-2.0-flash-001"
},
"sub_llm": {
"model": "openrouter/google/gemini-2.0-flash-001"
},
"glm_coding": {
"enabled": false,
"api_base": "https://api.z.ai/api/coding/paas/v4",
"root_model": "openai/glm-4.7",
"sub_model": "openai/glm-4.7"
},
"memory_compaction": {
"enabled": true
}
}
```

The project uses a Click-based CLI for interaction:
- `init <context_file>`: (Optional) Pre-loads a large dataset/file into the REPL state.
- `chat`: Starts an interactive session with the Root LLM.
- `run "<query>"`: Executes a single query and returns the answer.
- `status`: Displays the current state of the REPL (loaded files, buffers, etc.).
You can start by initializing a file, or just jump straight into a chat and let the LLM load files as needed.
Option A: Initialize first (preferable for large files)

```bash
uv run main.py init /absolute/path/to/data/my_huge_dataset.txt
uv run main.py chat
```

Option B: Start chat directly

```bash
uv run main.py chat
```

Once the chat starts, you can simply say: "Load the file at /absolute/path/to/data/info.txt and tell me what's in it." The Root LLM will automatically generate the Python code to load and analyze it.

Option C: Run a single query

```bash
uv run main.py run "Analyze /absolute/path/to/data/logs.txt and find all error messages"
```

During a session, the Root LLM manages the context by writing Python code that utilizes helper functions within the REPL environment. You can simply ask the LLM to perform these actions using natural language, and it will use the following helpers:
Important
Use Absolute Paths: When requesting the LLM to load files, you must provide the absolute path (e.g., /home/user/data.txt). This is required so the system can correctly mount the file into the isolated Docker container.
- `load_file(path, name=None, replace_all=False)`: Used by the LLM to load a new file into memory.
  - Example Query: "Load the file at /home/user/data/logs.txt and call it 'logs'"
  - Example Query (Replace): "Replace your current context with the file at /home/user/new_data.txt"
- `load_files(paths_list)`: Used to batch-load multiple files.
  - Example Query: "Load these three files: /home/user/f1.txt, /home/user/f2.txt, and /home/user/f3.txt"
- `switch_to(name)`: Used to change which loaded file is "active" for analysis helpers like `peek()` or `grep()`.
  - Example Query: "Switch to 'f2' and search for all 'ERROR' entries"
- `list_files()`: Used by the LLM to see what is currently loaded in its sandbox memory.
  - Example Query: "What files do you have loaded right now?"
- `remove_file(name)`: Used to drop a specific file from the REPL state.
  - Example Query: "Remove the file 'logs' from your context"
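Behind those natural-language requests, the root LLM ends up emitting REPL code along these lines. This is a hypothetical transcript: the `peek()` and `grep()` signatures (and the derived name "f2") are assumptions for illustration.

```python
# Load two files and check what is in the sandbox.
load_files(["/home/user/f1.txt", "/home/user/f2.txt"])
list_files()

# Make f2 the active file, then inspect it without reading the whole thing.
switch_to("f2")
print(peek(500))       # assumed signature: first 500 characters of the active file
print(grep("ERROR"))   # assumed signature: lines matching 'ERROR' in the active file

# Drop f2 once it's no longer needed, to keep the REPL state small.
remove_file("f2")
```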
The REPL state (variables, loaded files, buffers) is persisted in the .rlm_state/ directory.
- To start with a fresh state, simply delete the `.rlm_state` directory:

```bash
rm -rf .rlm_state
```

In the current implementation, the chat message history is stored in memory. This means your conversation history will be lost when you close the program. However, since the REPL state is persisted on disk, any files loaded or variables defined in the REPL will remain available when you restart the program.
For a detailed understanding of how the project is structured, how the REPL sandbox works, and the inner workings of the Orchestrator, please refer to the Codebase Overview.
Building this from scratch wouldn't have been possible without the amazing resources and insights shared by the community. A huge thank you to all of these people:
- ABV on X: For the insightful article on Recursive Language Models (RLMs).
- Deep Learning with Yacine: For the live stream session with Alex L. Zhang (the RLM author), which provided invaluable context.
- Brainqub3: For the video on Zero-Setup RLMs with Claude Code. I used his REPL implementation as the foundation for the sandbox in this project.
- Yacine's Motivation: This video specifically motivated me to dive deep into the paper and build this from scratch instead of just following the hype:
- Initial release of RLM from scratch implementation.
- Fixed the issue where the final answer was being routed back to the `root llm`. In the original paper's implementation, the final answer should be delivered directly from the REPL environment using the `FINAL` function.


