Parallel adaptive computation for language models.
Three main steps to get started pre-training your very own Thoughtbubbles model:
- set up an environment with our dependencies
- prepare the dataset, either OpenWebText or peS2o
- train Thoughtbubbles!
We use uv as our main dependency manager. To install it, follow the instructions here.
Once you have uv installed, you can get started on installing the dependencies! It's a tiny bit tricky to do so, because the torch-scatter
package needs to be installed separately once most of the environment is built.
uv sync --no-install-package torch-scatter && uv sync
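Once both sync passes finish, you can sanity-check the environment (a minimal sketch, assuming you run it with uv run python or inside the synced virtualenv; it only confirms that torch and torch-scatter import cleanly):
import torch
import torch_scatter  # should only succeed after the second `uv sync`
print(torch.__version__, torch_scatter.__version__)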
We provide dataset preparation scripts in the scripts
folder, both of which take a single argument: the output data path. To use them, run:
python scripts/prepare_openwebtext.py /path/to/output_openwebtext
python scripts/prepare_pes2o.py /path/to/output_pes2o
Notably, our training infrastructure detects the string openwebtext
or pes2o
in the data path to determine which dataset is being used, so make sure to include those strings in the output path.
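For illustration, the detection amounts to a substring check along these lines (a hypothetical sketch of the behavior described above, not the actual code in this repository):
def detect_dataset(data_dir: str) -> str:
    # infer the dataset from the string baked into its output path
    if "openwebtext" in data_dir:
        return "openwebtext"
    if "pes2o" in data_dir:
        return "pes2o"
    raise ValueError(f"could not infer dataset from path: {data_dir}")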
Lastly, the fun part: training Thoughtbubbles! The main entrypoint is main.py
, which takes a lot of arguments. Generally, the structure of the command is:
python main.py experiment_name --out_dir /path/to/storage --data_dir /path/to/dataset --wandb [other arguments]
We include a series of commands to reproduce our main sweeps in the experiments
folder. You can run them directly to train models on both datasets, scaling from 150M to 772M parameters.
Use python main.py --help
to see all available configuration options, covering model topology, dataset construction, reporting, and the exact design parameters of the forking mechanism, which you can play around with.
You have a checkpoint, now what?! You can load the model and use it programmatically. Start a Python shell at the root of this repository, and then run:
from trainer import Trainer
# note that our script makes "best" and "checkpoint" subfolders;
# pass one of those subfolders here instead of the overall folder,
# depending on whether you want the latest checkpoint or the
# lowest-dev-set-perplexity one
trainer = Trainer.from_checkpoint("/path/to/save_folder/")
model = trainer.model
# we can get a batch of pretokenized data and play around with it!
x, y, _ = trainer.batch()
logits, loss = model(x, y)
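The returned loss is a scalar tensor; assuming it is the mean token-level cross-entropy in nats (the standard language-modeling loss), you can exponentiate it for a quick perplexity check:
import torch
# perplexity of this batch, under the assumption above
print(torch.exp(loss).item())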
We use the gpt2
tokenizer encoding scheme from tiktoken, so if you want to input your own sequences, you can tokenize them with the tiktoken
package:
import tiktoken
import torch
enc = tiktoken.get_encoding("gpt2")
tokens = enc.encode("Hello world!")
trainer = Trainer.from_checkpoint("/path/to/save_folder/")
model = trainer.model
x = torch.tensor(tokens).unsqueeze(0) # add batch dimension
logits, _ = model(x)
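From here you can, for example, read off a greedy next-token prediction (a minimal sketch; it assumes the last position of logits along the sequence dimension holds the next-token distribution, as in standard decoder LMs):
# take the most likely token at the final position and decode it
next_id = logits[0, -1].argmax().item()
print(enc.decode([next_id]))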
If you find this repository useful, please consider citing our paper:
@misc{liu_thoughtbubbles_2025,
title = {Thoughtbubbles: an {Unsupervised} {Method} for {Parallel} {Thinking} in {Latent} {Space}},
shorttitle = {Thoughtbubbles},
url = {http://arxiv.org/abs/2510.00219},
doi = {10.48550/arXiv.2510.00219},
abstract = {Current approaches for scaling inference-time compute in transformers rely on training them to emit explicit chain-of-thought tokens before producing an answer. While these methods are powerful, they are limited because they cannot be applied during pretraining and are limited to only serially-generated, natural-language verbalization to scale inference-time compute. In this work, we propose Thoughtbubbles, a transformer variant that natively performs parallel adaptive computation in latent space by learning to fork or delete residual streams. Thus, tokens that require a large amount of computation can form a "bubble" of cloned residuals in the middle of the network for additional thinking. Crucially, this behavior is learned during pretraining with only language modeling loss. Thoughtbubbles outperforms both standard decoder LMs as well as non-adaptive parallel computation approaches on OpenWebText and peS2o perplexity and in zero-shot evaluations such as HellaSwag and LAMBADA after pretraining across 150M to 772M parameter scales. The implicit nature of our method enables adaptive computation to be learned starting at pretraining time, paving the way to unify train and test-time behavior for reasoning models.},
urldate = {2025-10-02},
publisher = {arXiv},
author = {Liu, Houjun and Murty, Shikhar and Manning, Christopher D. and Csordás, Róbert},
month = sep,
year = {2025},
note = {arXiv:2510.00219 [cs]}
}