Transformer Implementation [code]
- KV Cache [code]
Stateless transformer design with external cache management
- Rotary Positional Embeddings [blog] [code]
Both interleaved and half-flipped rotations; direct application and factory patterns for different use cases
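As a rough illustration of the two rotation layouts, here is a minimal pure-Python sketch. It is not the code behind the links above. It assumes "half-flipped" means the rotate-half convention, which pairs element i with element i + dim/2, and the names `rope_interleaved`, `rope_half`, and `make_rope` are hypothetical.

```python
import math

def rope_angles(pos, dim, base=10000.0):
    # One angle per 2-D pair: theta_i = pos / base^(2i/dim).
    return [pos / base ** (2 * i / dim) for i in range(dim // 2)]

def _rotate(a, b, theta):
    # Standard 2-D rotation of the pair (a, b) by theta.
    c, s = math.cos(theta), math.sin(theta)
    return a * c - b * s, a * s + b * c

def rope_interleaved(x, pos, base=10000.0):
    # Direct application, pairing adjacent elements: (x[0], x[1]), (x[2], x[3]), ...
    out = list(x)
    for i, theta in enumerate(rope_angles(pos, len(x), base)):
        out[2 * i], out[2 * i + 1] = _rotate(x[2 * i], x[2 * i + 1], theta)
    return out

def rope_half(x, pos, base=10000.0):
    # Direct application, pairing across halves: (x[i], x[i + dim/2]).
    # Assumed meaning of "half-flipped"; not confirmed by the source.
    half = len(x) // 2
    out = list(x)
    for i, theta in enumerate(rope_angles(pos, len(x), base)):
        out[i], out[i + half] = _rotate(x[i], x[i + half], theta)
    return out

def make_rope(dim, max_pos, base=10000.0, interleaved=True):
    # Factory variant: compute the angle table once, return a closure that reuses it.
    table = [rope_angles(p, dim, base) for p in range(max_pos)]
    half = dim // 2

    def apply(x, pos):
        out = list(x)
        for i, theta in enumerate(table[pos]):
            j, k = (2 * i, 2 * i + 1) if interleaved else (i, i + half)
            out[j], out[k] = _rotate(x[j], x[k], theta)
        return out

    return apply
```

The two layouts differ only by a fixed permutation of the dimensions, so either one gives attention scores that depend only on the relative offset between query and key positions. The factory form suits repeated calls with a fixed head dimension, while the direct functions are convenient for one-off use.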
Mech Interp [code]
- Toy Models of Superposition
Reproduce 5→2→5 experiments, plotting feature directions in the compressed activation space
- Sparse Autoencoders
ReLU, TopK, and BatchTopK implementations, trained to recover features from the toy models
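To show how the three activation choices differ, here is a minimal pure-Python sketch that operates on encoder pre-activations. It is not the linked implementation. The function names are hypothetical, and selecting on raw pre-activations without a final clamp is an assumption of this sketch.

```python
def relu(acts):
    # Keep every positive pre-activation; sparsity usually comes from an added penalty.
    return [max(a, 0.0) for a in acts]

def topk(acts, k):
    # Keep the k largest pre-activations of one sample and zero the rest.
    order = sorted(range(len(acts)), key=lambda i: acts[i], reverse=True)
    keep = set(order[:k])
    return [a if i in keep else 0.0 for i, a in enumerate(acts)]

def batch_topk(batch, k):
    # Keep the k * batch_size largest pre-activations across the whole batch,
    # so individual samples may end up with more or fewer than k active latents.
    flat = [(a, r, c) for r, row in enumerate(batch) for c, a in enumerate(row)]
    flat.sort(key=lambda t: t[0], reverse=True)
    keep = {(r, c) for _, r, c in flat[: k * len(batch)]}
    return [
        [a if (r, c) in keep else 0.0 for c, a in enumerate(row)]
        for r, row in enumerate(batch)
    ]

batch = [[0.9, 0.1, 0.5], [0.2, 0.3, 0.05]]
per_sample = [topk(row, 1) for row in batch]  # [[0.9, 0.0, 0.0], [0.0, 0.3, 0.0]]
across_batch = batch_topk(batch, 1)           # [[0.9, 0.0, 0.5], [0.0, 0.0, 0.0]]
```

In the example, per-sample TopK forces exactly one active latent per row, while BatchTopK spends the same total budget of two on the first row because both of its largest values exceed anything in the second.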