Parallel adaptive computation for language models.
Three main steps to get started pre-training your very own Thoughtbubbles model:
- set up an environment with our dependencies
- prepare the dataset, either OpenWebText or peS2o
- train Thoughtbubbles!
We use uv as our main dependency manager. To install it, follow the instructions here.
Once you have uv installed, you can get started on installing the dependencies! It's a tiny bit tricky to do so, because the torch-scatter
package needs to be installed separately once most of the environment is built.
uv sync --no-install-package torch-scatter && uv sync
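Once both sync passes finish, you can sanity-check the environment (a minimal sketch, assuming you run it with uv run python or inside the synced virtualenv; it only confirms that torch and torch-scatter import cleanly):
import torch
import torch_scatter  # should only succeed after the second `uv sync`
print(torch.__version__, torch_scatter.__version__)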
We provide dataset preparation scripts in the scripts
folder, both of which take a single argument: the output data path. To use them, run:
python scripts/prepare_openwebtext.py /path/to/output_openwebtext
python scripts/prepare_pes2o.py /path/to/output_pes2o
Notably, our training infrastructure detects the string openwebtext
or pes2o
in the data path to determine which dataset is being used, so make sure to include those strings in the output path.
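For illustration, the detection amounts to a substring check along these lines (a hypothetical sketch of the behavior described above, not the actual code in this repository):
def detect_dataset(data_dir: str) -> str:
    # infer the dataset from the string baked into its output path
    if "openwebtext" in data_dir:
        return "openwebtext"
    if "pes2o" in data_dir:
        return "pes2o"
    raise ValueError(f"could not infer dataset from path: {data_dir}")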
Lastly, the fun part: training Thoughtbubbles! The main entrypoint is main.py
, which takes a lot of arguments. Generally, the structure of the command is:
python main.py experiment_name --out_dir /path/to/storage --data_dir /path/to/dataset --wandb [other arguments]
We include a series of commands to reproduce our main sweeps in the experiments
folder. You can run them directly to train models on both datasets, scaling from 150M to 772M parameters.
Use python main.py --help
to see all available configuration options, covering model topology, dataset construction, reporting, and the exact design parameters of the forking mechanism, which you can play around with.
You have a checkpoint, now what?! You can load the model and use it programmatically. Start a Python shell at the root of this repository, and then run:
from trainer import Trainer
# note that our script makes "best" and "checkpoint" subfolders;
# pass one of those subfolders here instead of the overall folder,
# depending on whether you want the latest checkpoint or the
# lowest-dev-set-perplexity one
trainer = Trainer.from_checkpoint("/path/to/save_folder/")
model = trainer.model
# we can get a batch of pretokenized data and play around with it!
x, y, _ = trainer.batch()
logits, loss = model(x, y)
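The returned loss is a scalar tensor; assuming it is the mean token-level cross-entropy in nats (the standard language-modeling loss), you can exponentiate it for a quick perplexity check:
import torch
# perplexity of this batch, under the assumption above
print(torch.exp(loss).item())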
We use the gpt2
tokenizer encoding scheme from tiktoken, so if you want to input your own sequences, you can tokenize them with the tiktoken
package:
import tiktoken
import torch
enc = tiktoken.get_encoding("gpt2")
tokens = enc.encode("Hello world!")
trainer = Trainer.from_checkpoint("/path/to/save_folder/")
model = trainer.model
x = torch.tensor(tokens).unsqueeze(0) # add batch dimension
logits, _ = model(x)
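From here you can, for example, read off a greedy next-token prediction (a minimal sketch; it assumes the last position of logits along the sequence dimension holds the next-token distribution, as in standard decoder LMs):
# take the most likely token at the final position and decode it
next_id = logits[0, -1].argmax().item()
print(enc.decode([next_id]))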
If you find this repository useful, please consider citing our paper:
@misc{liu_thoughtbubbles_2025,
title = {Thoughtbubbles: an {Unsupervised} {Method} for {Parallel} {Thinking} in {Latent} {Space}},
shorttitle = {Thoughtbubbles},
url = {http://arxiv.org/abs/2510.00219},
doi = {10.48550/arXiv.2510.00219},
abstract = {Current approaches for scaling inference-time compute in transformers rely on training them to emit explicit chain-of-thought tokens before producing an answer. While these methods are powerful, they are limited because they cannot be applied during pretraining and are limited to only serially-generated, natural-language verbalization to scale inference-time compute. In this work, we propose Thoughtbubbles, a transformer variant that natively performs parallel adaptive computation in latent space by learning to fork or delete residual streams. Thus, tokens that require a large amount of computation can form a "bubble" of cloned residuals in the middle of the network for additional thinking. Crucially, this behavior is learned during pretraining with only language modeling loss. Thoughtbubbles outperforms both standard decoder LMs as well as non-adaptive parallel computation approaches on OpenWebText and peS2o perplexity and in zero-shot evaluations such as HellaSwag and LAMBADA after pretraining across 150M to 772M parameter scales. The implicit nature of our method enables adaptive computation to be learned starting at pretraining time, paving the way to unify train and test-time behavior for reasoning models.},
urldate = {2025-10-02},
publisher = {arXiv},
author = {Liu, Houjun and Murty, Shikhar and Manning, Christopher D. and Csordás, Róbert},
month = sep,
year = {2025},
note = {arXiv:2510.00219 [cs]}
}