Implement Andrej Karpathy's nanoGPT
This repo is an attempt to train nanoGPT from scratch in under 3 minutes.
We apply the following changes to nanoGPT:
- Rotary embeddings (RoPE)
- Q/K normalization
- ReLU² activation
- Uniform and zero weight initialization
- Skip connections (encoder/decoder layers)
- Muon optimizer
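Two of the changes above are easy to illustrate in isolation. The sketch below is a minimal NumPy version of rotary position embeddings and the ReLU² activation; the function names and shapes are illustrative and do not mirror the repo's actual PyTorch code.

```python
import numpy as np

def rotary_embed(x, base=10000.0):
    """Apply rotary position embeddings (RoPE) to x of shape (seq_len, dim).

    Channel pairs (x1_i, x2_i) are rotated by a position-dependent angle,
    so relative positions are encoded while per-position norms are preserved.
    dim must be even.
    """
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)           # (half,)
    angles = np.arange(seq_len)[:, None] * freqs[None]  # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

def relu_squared(x):
    """ReLU² activation: max(0, x)^2."""
    return np.maximum(x, 0.0) ** 2
```

Because RoPE is a pure rotation of channel pairs, it leaves the norm of each position's vector unchanged, which is one reason it composes well with Q/K normalization.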
Now you can train a GPT on a cheap NVIDIA GPU in under 24 hours.