SmallLM

A small, GPT-like Language Model with KV Cache and Batched Attention


This project is a minimal, modular, and extensible framework for training a small transformer-based, GPT-like language model and generating text with it. The model is optimized with KV caching for fast inference.


Key Features

  • Batched Attention: Optimized multi-head attention that projects and attends over all heads in a single batched operation, doubling raw throughput.
  • KV Cache: Accelerates autoregressive generation by caching the keys and values of previous tokens and reusing them at each decoding step, providing up to a 4.5x speedup on larger model configurations.
  • Detailed Documentation: Comprehensive guides on the architecture and the KV cache.
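The two optimizations above can be sketched in a single attention module. This is an illustrative sketch, not the repository's actual code: all class, method, and variable names here are assumptions, and it assumes PyTorch 2.x for `scaled_dot_product_attention`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BatchedCausalAttention(nn.Module):
    """Multi-head attention: one projection for all heads, with an optional KV cache."""

    def __init__(self, n_embd: int, n_head: int):
        super().__init__()
        self.n_head = n_head
        self.head_dim = n_embd // n_head
        self.qkv = nn.Linear(n_embd, 3 * n_embd)  # Q, K, V for every head in one matmul
        self.proj = nn.Linear(n_embd, n_embd)

    def forward(self, x, kv_cache=None):
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        # Reshape to (B, n_head, T, head_dim) so all heads attend in parallel.
        q = q.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        if kv_cache is not None:
            # Reuse keys/values computed for earlier tokens instead of recomputing them.
            past_k, past_v = kv_cache
            k = torch.cat([past_k, k], dim=2)
            v = torch.cat([past_v, v], dim=2)
        # During cached decoding (T == 1) the new token may attend to everything,
        # so the causal mask is only needed on the initial (prefill) pass.
        y = F.scaled_dot_product_attention(q, k, v, is_causal=kv_cache is None)
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(y), (k, v)
```

The returned `(k, v)` pair is fed back in on the next step, so each decoding step only projects the single new token.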

Installation

  1. Clone the repository:

    git clone https://github.com/d1pankarmedhi/smallLM.git
    cd smallLM
  2. Install dependencies:

    pip install -r requirements.txt

Usage

1. Training

python main.py train

The best model checkpoint will be saved to checkpoints/best_model.pt.

2. Fast Generation (with KV Cache)

python main.py generate --query "Once upon a time" --max_new_tokens 100 --use_kv_cache

  • --use_kv_cache: Enables the Key-Value cache for faster inference.
  • --max_new_tokens: Number of tokens to generate.
  • --temperature: Sampling temperature.
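The temperature flag controls how the next token is drawn from the model's output distribution. A minimal sketch of temperature sampling, assuming the standard logits-scaling approach (the helper name is illustrative, not part of the repository):

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 1.0) -> int:
    """Sample one token id from the final-position logits.

    Lower temperature sharpens the distribution (more deterministic);
    higher temperature flattens it (more diverse). Temperature <= 0 is
    treated here as greedy decoding.
    """
    if temperature <= 0:
        return int(torch.argmax(logits))
    probs = torch.softmax(logits / temperature, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))
```

With `temperature` near 0 generation approaches greedy decoding; values above 1 make the output noticeably more random.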

3. Benchmarking

Compare performance with and without KV cache:

# Benchmark default model
python benchmark_kv_cache.py --max_new_tokens 200

# Benchmark a larger model to see scaling benefits
python benchmark_kv_cache.py --max_new_tokens 500 --n_embd 768 --n_layer 12 --n_head 12
| Results          | Without Cache | With Cache |
| ---------------- | ------------: | ---------: |
| Tokens generated |           500 |        500 |
| Time (seconds)   |       28.2134 |     6.1797 |
| Tokens/sec       |         17.72 |      80.91 |

Speedup: 4.57x (GPU: Nvidia GTX 1650)
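The reported speedup follows directly from the two wall-clock timings, since both runs generate the same number of tokens:

```python
# Speedup = time without cache / time with cache (values from the table above).
time_no_cache = 28.2134
time_with_cache = 6.1797
speedup = time_no_cache / time_with_cache
print(f"Speedup: {speedup:.2f}x")  # matches the 4.57x reported above
```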

License

MIT License

Acknowledgements

Inspired by GPT and nanoGPT projects.

About

🧱 A small Language Model with BPE Tokenizer and KV Caching.