This repo is a training framework that drives my sample-efficiency experiments for pre-training LLMs.
The goal is to explore the frontier of sample efficiency for small language models.
Main features that make this framework stand out among others:
- Gated Attention (https://arxiv.org/abs/2505.06708)
- Value Residual Learning (https://arxiv.org/abs/2410.17897)
- Muon optimizer, with the Triton implementation from the Dion repo
- LayerNorm Scaling (https://arxiv.org/abs/2502.05795)
- QK-Norm, plus a Flash Attention QK-norm kernel I wrote for maximum efficiency (see the sketch after this list)
- Z-loss (https://arxiv.org/pdf/2204.02311); see the sketch after this list
- muP parametrization (reference from https://arxiv.org/abs/2505.02222)
- SuperBPE tokenizer with the conversion to HF tokenizers (https://arxiv.org/abs/2503.13423)
- Individual WD for muP transfer + Cautious Weight Decay (https://arxiv.org/abs/2510.12402)
- Various optimization tricks, such as momentum warmup and a WD schedule
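
As a rough illustration of the QK-Norm item above (not the fused Flash-attention kernel this repo actually uses), queries and keys can be RMS-normalized per head before the attention score computation. The module, shapes, and layout below are illustrative assumptions, not this repo's code:

```python
# Minimal QK-Norm sketch in plain PyTorch (illustrative only; the repo uses a
# fused Flash-attention QK-norm kernel instead of this eager-mode version).
import torch
import torch.nn.functional as F
from torch import nn


class QKNormAttention(nn.Module):
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)
        # One learnable RMSNorm over the head dimension for Q and one for K
        # (nn.RMSNorm requires PyTorch >= 2.4).
        self.q_norm = nn.RMSNorm(self.head_dim)
        self.k_norm = nn.RMSNorm(self.head_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # (b, t, dim) -> (b, n_heads, t, head_dim)
        q = q.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = k.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        # QK-Norm: normalize queries and keys before the dot product,
        # which keeps attention logits bounded and stabilizes training.
        q, k = self.q_norm(q), self.k_norm(k)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(y.transpose(1, 2).reshape(b, t, -1))
```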
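
Similarly, here is a minimal sketch of the z-loss term from the PaLM paper, added on top of the usual cross-entropy. The 1e-4 coefficient is the commonly cited PaLM value and is an assumption about this repo's setting:

```python
# Minimal z-loss sketch (PaLM-style): penalize the squared log-partition
# function of the logits so the softmax normalizer stays close to 1.
import torch
import torch.nn.functional as F


def loss_with_z_loss(logits: torch.Tensor, targets: torch.Tensor,
                     z_coeff: float = 1e-4) -> torch.Tensor:
    # logits: (batch * seq, vocab_size), targets: (batch * seq,)
    ce = F.cross_entropy(logits, targets)
    log_z = torch.logsumexp(logits.float(), dim=-1)  # log of the softmax normalizer Z
    z_loss = z_coeff * (log_z ** 2).mean()           # discourages logit drift
    return ce + z_loss
```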
How to run:
- Set up the environment:

  ```bash
  uv sync
  ```

- Train a tokenizer and tokenize data (skip if you already have a tokenizer)
  - Train a tokenizer:
    (look into tokenizer/README.md -- TBD)
  - Tokenize data:

    ```bash
    uv run tokenize_with_fast_tokenizer.py --data-path DATA_PATH --tokenized-data-path TOKENIZED_DATA_PATH --include-val-data=1 --tokenizer-name HF_TOKENIZER_NAME
    ```
    - `DATA_PATH` is expected to be a folder with either parquet or txt files.
    - In the case of txt files, `TOKENIZED_DATA_PATH` will contain `.npy` files with matching names holding the tokenized data.
    - In the case of parquet, `TOKENIZED_DATA_PATH` will contain `.npz` files for every chunk in your dataset.
    - `HF_TOKENIZER_NAME` is in Hugging Face transformers format; it accepts an HF model name or a local folder.
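
    For a quick sanity check of the tokenized output you can inspect the arrays with numpy. The file names in this sketch are hypothetical; the actual names in `TOKENIZED_DATA_PATH` depend on your input files:

    ```python
    # Quick sanity check of tokenized output (file names here are made up).
    import numpy as np

    tokens = np.load("TOKENIZED_DATA_PATH/my_text_file.npy")   # txt input -> .npy per file
    print(tokens.dtype, tokens.shape)                          # token ids as an integer array

    with np.load("TOKENIZED_DATA_PATH/chunk_0.npz") as chunk:  # parquet input -> .npz per chunk
        print(chunk.files)                                     # list the stored array names
    ```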
- Run training
  A sample script to run 70M LLM training on an RTX 5090 (note that I specify device "cuda:2" in the script):

  ```bash
  bash baseline.sh
  ```

  You can modify key/value pairs in the config with the `--override` argument to `train.py` (see the example below).
  Note that you have to provide both train and validation datasets in tokenized format as `.npy` files.
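
  The override invocation below is only an illustration of the key=value pattern; the config key names are assumptions (check baseline.sh and the config file for the actual names):

  ```bash
  # Hypothetical overrides; the config key names are made up for illustration.
  uv run train.py --override optimizer.lr=0.02 --override model.n_layers=12
  ```
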
Roadmap:
- DDP (X)
- FSDP
- Evals (X)
- FP8 training (+-)
- MoE
Not fully verified:
- Seesaw schedule
- Prodigy optimizer