This project provides a minimal, modular, and extensible framework for training and generating text with a small, transformer-based GPT-like language model, optimized with KV caching for fast inference.
- Batched Attention: Optimized multi-head attention that projects and computes all heads in parallel, doubling raw throughput.
- KV Cache: Accelerates autoregressive generation by caching and reusing Keys and Values from previous tokens, providing up to 4.5x speedup on larger model configurations.
- Detailed Documentation: Comprehensive guides on architecture and KV cache.
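To make the "batched attention" idea concrete, here is a minimal NumPy sketch (not this repo's actual code; the function name and the fused `(C, 3C)` QKV weight layout are illustrative assumptions): instead of looping over heads, it projects Q, K, and V for all heads with one matmul and computes every head's attention as a single batched matrix product.

```python
import numpy as np

def batched_mha(x, w_qkv, n_head):
    """Causal multi-head self-attention with all heads computed in one batch.
    x: (T, C) token embeddings; w_qkv: (C, 3*C) fused QKV projection.
    Hypothetical sketch, not this repository's API."""
    T, C = x.shape
    hd = C // n_head                        # per-head dimension
    qkv = x @ w_qkv                         # one matmul projects Q, K, V together
    q, k, v = np.split(qkv, 3, axis=-1)     # each (T, C)
    # reshape so heads become a leading batch dimension: (n_head, T, hd)
    q, k, v = (a.reshape(T, n_head, hd).transpose(1, 0, 2) for a in (q, k, v))
    att = q @ k.transpose(0, 2, 1) / np.sqrt(hd)      # (n_head, T, T), all heads at once
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # causal mask: no peeking ahead
    att = np.where(mask, -np.inf, att)
    att = np.exp(att - att.max(-1, keepdims=True))    # numerically stable softmax
    att /= att.sum(-1, keepdims=True)
    out = att @ v                                     # (n_head, T, hd)
    return out.transpose(1, 0, 2).reshape(T, C)       # concatenate heads back to (T, C)
```

Because every head lives in a leading batch dimension, one batched matmul replaces a Python loop over heads, which is where the throughput gain comes from.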
Clone the repository:

```bash
git clone https://github.com/yourusername/smallLM.git
cd smallLM
```

Install dependencies:

```bash
pip install -r requirements.txt
```
Train the model:

```bash
python main.py train
```

The best model checkpoint will be saved to `checkpoints/best_model.pt`.
Generate text:

```bash
python main.py generate --query "Once upon a time" --max_new_tokens 100 --use_kv_cache
```

- `--use_kv_cache`: Enables the key-value cache for faster inference.
- `--max_new_tokens`: Number of tokens to generate.
- `--temperature`: Sampling temperature.
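The speedup behind `--use_kv_cache` comes from a simple idea: during autoregressive generation, the keys and values of already-generated tokens never change, so they can be cached instead of recomputed every step. A minimal NumPy sketch of one cached decode step (hypothetical names and cache layout, not this repo's API):

```python
import numpy as np

def attend_with_cache(x_new, w_qkv, cache, n_head):
    """One autoregressive decode step using a KV cache.
    x_new: (1, C) embedding of the newest token only.
    cache: dict with 'k' and 'v' arrays of shape (n_head, T_past, hd).
    Hypothetical sketch, not this repository's API."""
    C = x_new.shape[-1]
    hd = C // n_head
    # project Q, K, V for the new token only -- O(1) work per step
    q, k, v = np.split(x_new @ w_qkv, 3, axis=-1)            # each (1, C)
    q, k, v = (a.reshape(1, n_head, hd).transpose(1, 0, 2) for a in (q, k, v))
    # append this step's K/V to the cache instead of recomputing history
    cache['k'] = np.concatenate([cache['k'], k], axis=1)
    cache['v'] = np.concatenate([cache['v'], v], axis=1)
    # the new token attends over all cached positions; no causal mask needed,
    # because the cache contains only past and current tokens
    att = q @ cache['k'].transpose(0, 2, 1) / np.sqrt(hd)    # (n_head, 1, T)
    att = np.exp(att - att.max(-1, keepdims=True))
    att /= att.sum(-1, keepdims=True)
    out = att @ cache['v']                                   # (n_head, 1, hd)
    return out.transpose(1, 0, 2).reshape(1, C)
```

Each step now does attention work proportional to the sequence length for one query token, rather than recomputing the full T×T attention, which is why the gap widens as more tokens are generated.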
Compare performance with and without KV cache:
```bash
# Benchmark default model
python benchmark_kv_cache.py --max_new_tokens 200

# Benchmark a larger model to see scaling benefits
python benchmark_kv_cache.py --max_new_tokens 500 --n_embd 768 --n_layer 12 --n_head 12
```

Example results (GPU: Nvidia GTX 1650):

| Metric           | Without Cache | With Cache |
|------------------|---------------|------------|
| Tokens generated | 500           | 500        |
| Time (seconds)   | 28.2134       | 6.1797     |
| Tokens/sec       | 17.72         | 80.91      |

Speedup: 4.57x
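The numbers above can be reproduced with a simple wall-clock harness along these lines (a sketch; `generate_fn` is a hypothetical stand-in for the model's generation call, not this repo's exact benchmark script):

```python
import time

def benchmark(generate_fn, n_tokens):
    """Time a generation call and report (elapsed seconds, tokens/sec).
    generate_fn(n_tokens) is a stand-in for model generation (hypothetical)."""
    start = time.perf_counter()
    generate_fn(n_tokens)
    elapsed = time.perf_counter() - start
    return elapsed, n_tokens / elapsed
```

Speedup is then just the ratio of elapsed times: 28.2134 / 6.1797 ≈ 4.57x for the run shown in the table.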
MIT License
Inspired by the GPT and nanoGPT projects.

