This repo is a training framework that drives my sample-efficiency experiments for pre-training LLMs.
The goal is to explore the frontier of sample efficiency for small language models.
Main features that make this framework stand out among others:
- Gated Attention (https://arxiv.org/abs/2505.06708)
- Value Residual Learning (https://arxiv.org/abs/2410.17897)
- Muon optimizer, with the Triton implementation from the Dion repo
- LayerNorm Scaling (https://arxiv.org/abs/2502.05795)
- QK-Norm, plus a Flash Attention QK-norm kernel I wrote for maximum efficiency (see the sketch after this list)
- Z-loss (https://arxiv.org/pdf/2204.02311); see the sketch after this list
- muP parametrization (reference from https://arxiv.org/abs/2505.02222)
- SuperBPE tokenizer with the conversion to HF tokenizers (https://arxiv.org/abs/2503.13423)
- Individual WD for muP transfer + Cautious Weight Decay (https://arxiv.org/abs/2510.12402)
- Various optimization tricks, such as momentum warmup and a WD schedule
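
As a rough illustration of the QK-Norm item above (not the fused Flash-attention kernel this repo actually uses), queries and keys can be RMS-normalized per head before the attention score computation. The module, shapes, and layout below are illustrative assumptions, not this repo's code:

```python
# Minimal QK-Norm sketch in plain PyTorch (illustrative only; the repo uses a
# fused Flash-attention QK-norm kernel instead of this eager-mode version).
import torch
import torch.nn.functional as F
from torch import nn


class QKNormAttention(nn.Module):
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)
        # One learnable RMSNorm over the head dimension for Q and one for K
        # (nn.RMSNorm requires PyTorch >= 2.4).
        self.q_norm = nn.RMSNorm(self.head_dim)
        self.k_norm = nn.RMSNorm(self.head_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # (b, t, dim) -> (b, n_heads, t, head_dim)
        q = q.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = k.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        # QK-Norm: normalize queries and keys before the dot product,
        # which keeps attention logits bounded and stabilizes training.
        q, k = self.q_norm(q), self.k_norm(k)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(y.transpose(1, 2).reshape(b, t, -1))
```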
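
Similarly, here is a minimal sketch of the z-loss term from the PaLM paper, added on top of the usual cross-entropy. The 1e-4 coefficient is the commonly cited PaLM value and is an assumption about this repo's setting:

```python
# Minimal z-loss sketch (PaLM-style): penalize the squared log-partition
# function of the logits so the softmax normalizer stays close to 1.
import torch
import torch.nn.functional as F


def loss_with_z_loss(logits: torch.Tensor, targets: torch.Tensor,
                     z_coeff: float = 1e-4) -> torch.Tensor:
    # logits: (batch * seq, vocab_size), targets: (batch * seq,)
    ce = F.cross_entropy(logits, targets)
    log_z = torch.logsumexp(logits.float(), dim=-1)  # log of the softmax normalizer Z
    z_loss = z_coeff * (log_z ** 2).mean()           # discourages logit drift
    return ce + z_loss
```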
How to run:
- Set up the environment:

  ```bash
  uv sync
  ```

- Train a tokenizer and tokenize data (skip if you already have a tokenizer)
  - Train a tokenizer:
    (look into tokenizer/README.md -- TBD)
  - Tokenize data:

    ```bash
    uv run tokenize_with_fast_tokenizer.py --data-path DATA_PATH --tokenized-data-path TOKENIZED_DATA_PATH --include-val-data=1 --tokenizer-name HF_TOKENIZER_NAME
    ```
    - `DATA_PATH` is expected to be a folder with either parquet or txt files.
    - In the case of txt files, `TOKENIZED_DATA_PATH` will contain `.npy` files with matching names holding the tokenized data.
    - In the case of parquet, `TOKENIZED_DATA_PATH` will contain `.npz` files for every chunk in your dataset.
    - `HF_TOKENIZER_NAME` is in Hugging Face transformers format; it accepts an HF model name or a local folder.
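
    For a quick sanity check of the tokenized output you can inspect the arrays with numpy. The file names in this sketch are hypothetical; the actual names in `TOKENIZED_DATA_PATH` depend on your input files:

    ```python
    # Quick sanity check of tokenized output (file names here are made up).
    import numpy as np

    tokens = np.load("TOKENIZED_DATA_PATH/my_text_file.npy")   # txt input -> .npy per file
    print(tokens.dtype, tokens.shape)                          # token ids as an integer array

    with np.load("TOKENIZED_DATA_PATH/chunk_0.npz") as chunk:  # parquet input -> .npz per chunk
        print(chunk.files)                                     # list the stored array names
    ```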
- Run training
  A sample script to run 70M LLM training on an RTX 5090 (note that I specify device "cuda:2" in the script):

  ```bash
  bash baseline.sh
  ```

  You can modify key/value pairs in the config with the `--override` argument to `train.py` (see the example below).
  Note that you have to provide both train and validation datasets in tokenized format as `.npy` files.
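
  The override invocation below is only an illustration of the key=value pattern; the config key names are assumptions (check baseline.sh and the config file for the actual names):

  ```bash
  # Hypothetical overrides; the config key names are made up for illustration.
  uv run train.py --override optimizer.lr=0.02 --override model.n_layers=12
  ```
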
Roadmap:
- DDP (X)
- FSDP
- Evals (X)
- FP8 training (+-)
- MoE
Not fully verified:
- Seesaw schedule
- Prodigy optimizer