A from-scratch GPT-2 implementation, modeled after Karpathy's nanoGPT, pretrained on a 5 billion token subset of the FineWeb-Edu 10B sample. It reaches 28.65% accuracy on the HellaSwag validation set, within 0.3 percentage points of the original GPT-2 small's 28.92%, despite training on roughly half the data (GPT-2 was trained on WebText, ~10B tokens). I assume this is possible because of the difference in dataset quality.
| Model | Accuracy | Multiple Choice Accuracy | Training Tokens | Dataset |
|---|---|---|---|---|
| GPT-2 small (124M) | 28.92% | 31.14% | ~10B | WebText |
| This implementation (124M) | 28.65% | 29.55% | 5B | FineWeb-Edu (subset*) |
*A 5B-token subset of FineWeb-Edu's 10B-token sample.
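Both models use the 124M-parameter GPT-2 small architecture. A minimal sketch of that configuration, written as a nanoGPT-style dataclass (the `GPTConfig` name and field names are illustrative, not necessarily what `model.py` defines):

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    # GPT-2 small (~124M parameters)
    block_size: int = 1024    # maximum context length
    vocab_size: int = 50257   # GPT-2 BPE vocabulary size
    n_layer: int = 12         # number of transformer blocks
    n_head: int = 12          # attention heads per block
    n_embd: int = 768         # embedding / hidden dimension
```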
Input: `Hello, I'm a language model,`
> Hello, I'm a language model, you're asking me if I want a simple way to explain what a simple language will do but you're wondering
> Hello, I'm a language model, so what I'd like to do is to change the parameters and let me try it for my future job.
> Hello, I'm a language model, so I know that when you don't know where to start, I can think of no one. When I
> Hello, I'm a language model, so if you get lost in a language, try to learn English. I used to say English is a language
> Hello, I'm a language model, so here we go here:
> I think that there is a need for a language-centered framework that addresses
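These completions come from ordinary autoregressive sampling. A minimal sketch of how such samples can be generated, assuming the GPT module from `model.py` returns next-token logits of shape `(B, T, vocab_size)` (the top-k value and this helper's name are illustrative assumptions):

```python
import torch
import tiktoken

def sample(model, prompt, num_samples=5, max_new_tokens=24, top_k=50, device="cuda"):
    """Generate completions with top-k sampling.
    Assumes `model(x)` returns logits of shape (B, T, vocab_size)."""
    enc = tiktoken.get_encoding("gpt2")
    tokens = torch.tensor(enc.encode(prompt), dtype=torch.long, device=device)
    x = tokens.unsqueeze(0).repeat(num_samples, 1)   # (num_samples, T)
    model.eval()
    with torch.no_grad():
        for _ in range(max_new_tokens):
            logits = model(x)[:, -1, :]              # logits at the last position
            probs = torch.softmax(logits, dim=-1)
            topk_probs, topk_idx = torch.topk(probs, top_k, dim=-1)
            next_tok = topk_idx.gather(-1, torch.multinomial(topk_probs, 1))
            x = torch.cat([x, next_tok], dim=1)
    return [enc.decode(row.tolist()) for row in x]
```

If the actual forward pass also returns a loss (as in nanoGPT), the logits extraction would need a small adjustment.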
- `model.py`: Minimal GPT-2 implementation.
- `train.py`: Training script with multi-GPU (DDP) support.
- `utils.py`: Mainly contains the data loader (see the sketch after this list).
- `download-fineweb-edu.py`: Downloads the FineWeb-Edu (10B) dataset from Hugging Face, tokenizes it, and saves it in shards.
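A minimal sketch of a sharded data loader along these lines, assuming the download script writes shards as `.npy` files of GPT-2 token IDs (the shard directory, file pattern, and class name are assumptions, not necessarily what `utils.py` uses):

```python
import glob
import numpy as np
import torch

class ShardedDataLoader:
    """Streams fixed-size (B, T) batches from tokenized .npy shards."""

    def __init__(self, shard_dir, B, T, split="train"):
        self.B, self.T = B, T
        self.shards = sorted(glob.glob(f"{shard_dir}/*{split}*.npy"))
        assert self.shards, f"no shards found for split {split!r} in {shard_dir}"
        self.shard_idx = 0
        self.pos = 0
        self.tokens = self._load(self.shards[0])

    def _load(self, path):
        # shards are assumed to hold uint16 token IDs; cast for embedding lookup
        return torch.tensor(np.load(path).astype(np.int64))

    def next_batch(self):
        B, T = self.B, self.T
        buf = self.tokens[self.pos : self.pos + B * T + 1]
        x = buf[:-1].view(B, T)   # inputs
        y = buf[1:].view(B, T)    # targets, shifted by one token
        self.pos += B * T
        # advance to the next shard once the current one is nearly exhausted
        if self.pos + B * T + 1 > len(self.tokens):
            self.shard_idx = (self.shard_idx + 1) % len(self.shards)
            self.tokens = self._load(self.shards[self.shard_idx])
            self.pos = 0
        return x, y
```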
- Download the dataset: `python download-fineweb-edu.py`
- Start training: `./train.sh`
- The script automatically detects available GPUs and uses DDP for multi-GPU training; a sketch of the corresponding setup in `train.py` follows below.
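A minimal sketch of the kind of DDP initialization `train.py` would need when launched with `torchrun` (one process per GPU). The environment-variable handling follows the standard `torch.distributed` convention; the exact structure of the author's script may differ:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_distributed():
    """Initialize DDP when launched with torchrun; fall back to a single device."""
    ddp = int(os.environ.get("RANK", -1)) != -1   # torchrun sets RANK/LOCAL_RANK/WORLD_SIZE
    if ddp:
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        device = f"cuda:{local_rank}"
        torch.cuda.set_device(device)
    else:
        local_rank = 0
        device = "cuda" if torch.cuda.is_available() else "cpu"
    return ddp, local_rank, device

ddp, local_rank, device = setup_distributed()
model = torch.nn.Linear(8, 8).to(device)   # placeholder; train.py would build the GPT here
if ddp:
    model = DDP(model, device_ids=[local_rank])
```

Presumably `train.sh` only needs to count the visible GPUs (e.g. via `torch.cuda.device_count()`) and pass that number as `--nproc_per_node` to `torchrun`.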