Welcome to OpenThoughts-Agent (OT-Agent for short), a large-scale research project dedicated to creating the best tooling and finding the best data for training small agentic models.
OT-Agent is a research codebase! Conventions will change, files will move and workflows will break as we continue to grow. Please bear with us and open an issue if you discover a bug.
If you are new to the project, start here to get up and running.
Start by creating a clean Python 3.12 virtual environment using your favorite environment manager (we use conda + mamba).
From the root directory, you can then install the core HPC + data infrastructure dependencies with:
```bash
pip install .
```
Because this project has many dependencies, we recommend using a package and project manager such as uv inside your virtual environment.
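For example (a sketch only; the environment name and the choice of uv are illustrative):

```bash
# Minimal sketch: fresh conda/mamba environment, then install with uv (or plain pip)
mamba create -n ot-agent python=3.12 -y
conda activate ot-agent
pip install uv        # optional; uv speeds up resolution and installs
uv pip install .      # equivalent to `pip install .`
```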
Optional extras:

- HPC datagen runtime (Ray clusters + vLLM serving):

  ```bash
  pip install .[datagen]
  ```

- SweSmith-specific datagen helpers (extends the above with bespoke tools):

  ```bash
  pip install .[datagen,datagen-swesmith]
  ```

  (or `pip install .[datagen-swesmith]` if you already pulled the base datagen extra)

- SFT stack:
  - Ensure the git submodule is initialized:

    ```bash
    git submodule update --init --recursive sft/llamafactory
    ```

  - Install LLaMA Factory directly from the submodule, since OT-Agent does not ship an `[sft]` extra:

    ```bash
    cd sft/llamafactory
    pip install -e .[train,liger-kernel,deepspeed]  # select the extras you need
    cd -
    ```

  - Training configs that pair with OT-Agent live under `sft/lf_configs/**`; refer to `sft/llamafactory/README.md` for detailed flags and dependency notes.

- Data stack:
  - Dataset tooling docs live under `data/README.md`; install per-generator requirements in addition to the `datagen` extras above when needed.
Many OT-Agent launch modes JIT-compile CUDA/C++ extensions (e.g., flash-infer, flash-attn, triton). Those builds are sensitive to compiler and CUDA versions, so verify that the toolchain you expose to Python matches the CUDA version your PyTorch build expects (`python -c 'import torch; print(torch.version.cuda)'`). We primarily test on CUDA 12.8/12.9 with GCC ≥ 12.
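A quick check of the compilers you are actually exposing (assuming `nvcc` and `gcc` are already on your PATH) can save a failed JIT build later:

```bash
# The CUDA toolkit visible to nvcc should match the torch.version.cuda reported above
nvcc --version | grep release
gcc --version | head -n 1
```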
**Cluster modules.** If your HPC environment exposes the right stack, loading modules is the path of least resistance:

```bash
module load gcc/14.2.0
module load cuda/12.8
```

**Container shells.** Some centers publish pre-baked CUDA images. Binding your workspace into one of those containers often guarantees a clean toolchain:
```bash
singularity shell --nv \
  --bind $SCRATCH/ot-agent \
  $SCRATCH/cuda-img/cuda-cudnn-12.8-ubuntu22.sif
```

**Conda-provisioned toolchains.** When neither modules nor containers provide what you need, install the compilers and sysroot via mamba. Keep the packages pinned so minor upgrades don’t silently change ABI compatibility:
```bash
mamba install -c conda-forge c-compiler cxx-compiler -y
mamba install -c conda-forge gcc_linux-64 gxx_linux-64 sysroot_linux-64 -y
mamba install -c conda-forge libstdcxx-ng=12 libgcc-ng=12 gcc_impl_linux-64 \
  gxx_impl_linux-64 sysroot_linux-64 -y
```

**Environment variables.** Point CUDA- and GCC-aware tools at the locations you provisioned. Adjust the paths below if your install lives somewhere else:
```bash
GCC_ROOT="$(dirname "$(dirname "$(which gcc)")")"
export CUDA_HOME=/usr/local/cuda
export CPATH="$CUDA_HOME/include${CPATH:+:$CPATH}"
export LIBRARY_PATH="$CUDA_HOME/lib64${LIBRARY_PATH:+:$LIBRARY_PATH}"
export LD_LIBRARY_PATH="$GCC_ROOT/lib64:$GCC_ROOT/lib:$CUDA_HOME/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
export PATH="$CUDA_HOME/bin${PATH:+:$PATH}"
```

**Heavyweight builds.** Once the toolchain is stable, the JIT pieces compile automatically on import. Some packages (like flash-attn) still require manual builds; install those last so you know the rest of the environment is steady, and make sure `TORCH_CUDA_ARCH_LIST`, `NVCC_THREADS`, etc. match your hardware:
```bash
UV_COMPILE_THREADS=4 MAX_JOBS=4 NVCC_THREADS=4 TORCH_CUDA_ARCH_LIST="9.0" \
  pip install -v --no-build-isolation "flash-attn==2.8.1"
```

Most scripts expect credentials (HF tokens, Daytona keys, W&B API keys, Supabase creds, etc.) to live in a private env file that is not committed to this repo. Point OT-Agent at your private file by exporting:

```bash
export DC_AGENT_SECRET_ENV=/secure/path/to/my_dc_agent_secrets.env
```

That file should contain lines such as `export DAYTONA_API_KEY=...`, `export HF_TOKEN=...`, `export WANDB_API_KEY=...`, `export SUPABASE_*=...`, etc. The launcher and auxiliary scripts now read `DC_AGENT_SECRET_ENV`; the legacy `KEYS`/`SECRET_ENV_PATH` variables are still accepted for backward compatibility but will be removed once everyone migrates.
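As a sketch, the private file might contain nothing more than a handful of exports; the Supabase variable names below are illustrative, so populate whichever services you actually use:

```bash
# Example contents of /secure/path/to/my_dc_agent_secrets.env -- keep it outside the repo
export HF_TOKEN=...
export DAYTONA_API_KEY=...
export WANDB_API_KEY=...
# Supabase variable names shown here are placeholders; export whichever SUPABASE_* values you need
export SUPABASE_URL=...
export SUPABASE_KEY=...
```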
OT-Agent's job launchers are designed to work with HPC (high-performance computing) clusters. Different launchers exist for different job types. OT-Agent's launchers are modular, making it relatively straightforward to add your own preferred cluster. Every invocation of `python -m hpc.launch` must explicitly set `--job_type`; use `sft` for standard finetuning jobs, `sft_mca` when you need the Megatron Core Adapter sbatch templates, `pretokenize` for tokenization-only runs, `datagen` for generator/trace work, and `consolidate` for ZeRO merges.
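For orientation, every workflow goes through the same entry point and only the type-specific flags differ; concrete datagen and SFT invocations follow below (the trailing `...` stands in for those flags):

```bash
# One entry point; --job_type selects the workflow
python -m hpc.launch --job_type datagen ...      # data generation / trace collection
python -m hpc.launch --job_type sft ...          # standard finetuning
python -m hpc.launch --job_type sft_mca ...      # Megatron Core Adapter sbatch templates
python -m hpc.launch --job_type pretokenize ...  # tokenization-only runs
python -m hpc.launch --job_type consolidate ...  # ZeRO checkpoint merges
```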
Datagen jobs are launched via the generic HPC launcher and use `--job_type datagen` plus a generator script.
1. Ensure your cluster environment is set up (dotenv, conda env, etc.). For TACC/Vista-style machines, follow the checklist in `hpc/README.md` and use `hpc/dotenv/tacc.env` as a starting point for your environment variables.
2. Activate your environment and source the dotenv:

   ```bash
   source hpc/dotenv/<your-cluster>.env
   eval "$DCFT_ACTIVATE_ENV"
   cd "$DCFT"
   ```

   The dotenvs now export `PYTHONPATH="${DCFT_PRIVATE:-$DCFT}:$PYTHONPATH"` so `python -m hpc.launch` resolves even on clusters that strip the working directory from `sys.path`. If you maintain a custom dotenv, mirror this line to keep the launcher importable.
3. Choose or write a datagen script under `data/...` implementing `BaseDataGenerator` (see `data/generation/base.py` and existing generators for examples).
4. Run the launcher from a login node:

   ```bash
   python -m hpc.launch \
     --job_type datagen \
     --datagen_script data/<dataset>/generate.py \
     --datagen_target_repo <org/dataset-tasks> \
     --datagen_engine vllm_local \
     --datagen_extra_args "--stage both --limit 200" \
     --experiments_dir "$DCFT/experiments" \
     --time_limit 12:00:00
   ```

5. To also generate traces, add `--enable_trace_gen`, `--trace_target_repo <org/dataset-traces>`, and `--trace_harbor_config path/to/harbor_job.yaml`, plus any of the `trace_*` overrides documented in `hpc/README.md`.

The launcher will synthesize and submit one or more sbatch scripts under `"$experiments_dir/sbatch_scripts"` and write configs to `"$experiments_dir/configs"`. Use `--dry_run` to inspect scripts without actually calling `sbatch`.
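For instance, assuming you kept `--experiments_dir "$DCFT/experiments"` as in the example above, you can inspect what the launcher produced before (with `--dry_run`) or after submission:

```bash
ls "$DCFT/experiments/sbatch_scripts"   # synthesized sbatch scripts
ls "$DCFT/experiments/configs"          # per-run configs written by the launcher
```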
SFT jobs are also launched via `hpc.launch` with `--job_type sft` and a LLaMA Factory config.
1. Pull the SFT submodule (once per checkout) and install its dependencies in-place:

   ```bash
   git submodule update --init --remote sft/llamafactory
   cd sft/llamafactory
   pip install -e .[train,liger-kernel,deepspeed]  # pick the extras you need
   cd -
   ```

2. Configure your cluster dotenv and environment as in the Datagen section.
3. Pick a training config under `sft/lf_configs` or create your own YAML alongside the existing presets.
4. From a login node, run:

   ```bash
   python -m hpc.launch \
     --job_type sft \
     --train_config_path sft/lf_configs/<config>.yaml \
     --dataset <org/dataset> \
     --num_nodes 8 \
     --time_limit 24:00:00 \
     --experiments_dir "$DCFT/experiments"
   ```
5. Optionally override LLaMA Factory flags via `--train_extra_args "..."` (see `hpc/README.md` and `sft/llamafactory/README.md` for full argument lists).
The launcher will construct a per-run YAML in `"$experiments_dir/configs"`, generate an sbatch script, and then submit the job. Training metadata and summaries are written into the run’s `output_dir`.
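As a purely illustrative example of step 5, the override values below are hypothetical and simply mirror the flag-string pattern used by `--datagen_extra_args` above; consult `sft/llamafactory/README.md` for the flags LLaMA Factory actually accepts:

```bash
# Hypothetical override: pass extra LLaMA Factory flags without editing the YAML preset
python -m hpc.launch \
  --job_type sft \
  --train_config_path sft/lf_configs/<config>.yaml \
  --dataset <org/dataset> \
  --train_extra_args "--learning_rate 1e-5 --num_train_epochs 2" \
  --experiments_dir "$DCFT/experiments"
```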
#### How to Launch an RL Job
Please check `rl/README.md`.
#### How to add your cluster to OT-Agent
Adding a new cluster involves defining its resources, sbatch templates, and a dotenv file so `hpc.launch` can target it.
1. **Create a dotenv for your cluster** under `hpc/dotenv/`, following `tacc.env` as a template. At a minimum, define:
- `DCFT` (path to your OpenThoughts-Agent checkout on the cluster)
- `DCFT_ACTIVATE_ENV` (command to activate the Python env)
- paths for `EXPERIMENTS_DIR`, `DATASETS_DIR`, `MODELS_DIR`, and any cluster-specific SIF/Apptainer images (a sketch dotenv appears after this list).
2. **Register basic cluster metadata** by exporting `HPC_NAME` and related fields in your dotenv or by passing them on the CLI:
- `--name`, `--account`, `--partition`, `--gpus_per_node`, `--cpus_per_node`, etc. (see `hpc/README.md` and `hpc/hpc.py`).
3. **Create sbatch templates** in `hpc/sbatch_data/` for your cluster:
- Copy an existing template for a similar machine (GPU type / internet access) and adjust `#SBATCH` headers and module loads.
- Keep placeholders like `{time_limit}`, `{job_name}`, `{experiments_dir}` etc. intact; they will be filled by `hpc.launch`.
4. **Declare required templates** in `hpc/sbatch_data_requirements.json` so `_validate_sbatch_templates` can verify your cluster has all needed sbatch files for datagen and training.
5. **Test with a dry run**:
```bash
source hpc/dotenv/<your-cluster>.env
eval "$DCFT_ACTIVATE_ENV"
cd "$DCFT"
python -m hpc.launch \
--job_type datagen \
--datagen_script data/<dataset>/generate.py \
--datagen_target_repo test-org/test-dataset \
--experiments_dir "$DCFT/experiments" \
--dry_run
```

- Once sbatch scripts look correct, drop `--dry_run` to submit real jobs. If your cluster needs special handling (login vs compute nodes, proxies, etc.), add it to `hpc/hpc.py` and, if necessary, `hpc/launch.py` (for example, see the existing logic for JURECA/JUWELS internet nodes).
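As referenced in step 1, here is a sketch of what a minimal dotenv might look like; the file name, paths, and activation command are placeholders, so adapt them to your site and compare against `hpc/dotenv/tacc.env`:

```bash
# Hypothetical hpc/dotenv/mycluster.env -- all values are placeholders
export HPC_NAME=mycluster
export DCFT=/scratch/$USER/ot-agent                     # path to your OT-Agent checkout
export DCFT_ACTIVATE_ENV="conda activate ot-agent"      # command that activates your Python env
export EXPERIMENTS_DIR="$DCFT/experiments"
export DATASETS_DIR=/scratch/$USER/datasets
export MODELS_DIR=/scratch/$USER/models
export PYTHONPATH="${DCFT_PRIVATE:-$DCFT}:$PYTHONPATH"  # keeps `python -m hpc.launch` importable
```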
For more detail on how HPC Launch works, refer to `hpc/README.md`.
For non-HPC users, we provide a tutorial notebook under `notebook/datagen_sft_tutorial.ipynb` that walks through generating data from the inferredbugs dataset and performing SFT.
OT-Agent relies on Harbor to launch containerized tools for datagen and eval. Harbor supports multiple backends (Docker, Daytona, Modal, e2b, etc.), but most HPC centers either forbid Docker outright or only allow Apptainer/Singularity. In practice this means:
- Remote container providers are the default. We run large-scale datagen on managed platforms (Daytona, Modal, e2b) where Docker is available and network egress is unrestricted. Daytona is our primary provider; their infrastructure has handled millions of launches/day for OT-Agent workloads.
- HPC clusters stay lean. Login/compute nodes focus on scheduling, storage, and GPU time. When those jobs need containerized helpers (e.g., Harbor trace agents), they call out to the remote provider rather than trying to build Docker images locally.
- Configuration lives in Harbor YAMLs. Pick a template under `hpc/harbor_yaml/`, set the `type` field to your provider (e.g., `daytona`, `modal`, `modal-ray`), and make sure any required secrets/API keys are present in your runtime env (`DC_AGENT_SECRET_ENV` is sourced automatically).
- Bring-your-own backend is fine. Any Harbor-compatible provider works as long as its CLI/API is reachable from the cluster you launch jobs on. If you need Podman or another backend we don’t support yet, open an issue/PR; Harbor makes it straightforward to add.
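As a sketch (assuming Daytona as the provider; the template and output file names are arbitrary), the setup amounts to copying a template YAML, pointing `type` at your backend, and confirming the matching credentials are in your environment:

```bash
# Copy a template and point its `type` field at your provider
cp hpc/harbor_yaml/<template>.yaml my_harbor_job.yaml
grep -n "type:" my_harbor_job.yaml                              # should read daytona / modal / modal-ray
test -n "$DAYTONA_API_KEY" && echo "Daytona credentials found"  # sourced via DC_AGENT_SECRET_ENV
```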
Once the Harbor YAML points at the right backend and credentials, OT-Agent’s launch scripts will provision containers, stream logs, and tear everything down automatically.
We are a collaboration led by researchers and engineers from Stanford, UC Berkeley, UT Austin, NYU, University of Washington, UCSD, ASU, CMU, UCLA, UNC Chapel Hill, TUM, LAION, and other partners focused on building the best datasets (and therefore the best models). See our previous work at datacomp.ai and mlfoundations.
We currently organize via the terminal-bench Discord; go there if you need help.
- For RL: Please contact Charlie Ruan
- For SFT: Please contact Benjamin Feuer
- For Data: Please contact Etash Guha
- For Eval: Please contact Negin Raoof
- For Project Management (includes cluster and account access): Please contact Ryan Marten
```bibtex
@misc{openthoughts-agent,
  author       = {Team, OpenThoughts-Agent},
  month        = dec,
  title        = {{OpenThoughts-Agent}},
  howpublished = {https://open-thoughts.ai/agent},
  year         = {2025}
}
```
