This is a (biased) view of great work studying the building blocks of efficient and performant foundation models. This GitHub repo was originally put together as a place to aggregate materials for a NeurIPS keynote, but we're also hoping to highlight great work across AI Systems. If you think we're missing something, please open an issue or PR!
Slides from Chris Ré's NeurIPS Keynote: https://cs.stanford.edu/~chrismre/papers/NeurIPS23_Chris_Re_Keynote_DELIVERED.pptx
Courses are a great resource for getting started in this space, and it's great that so many of them have open materials! Here's a partial list -- it's biased toward Stanford courses, so please reach out if you know of other resources that are helpful!
- Stanford CS 324 LLMs
- Stanford CS 324 Advances in Foundation Models
- Sasha's talk on do we need attention?
- Stanford CS 229S Systems for Machine Learning
- MLSys Seminar
- Berkeley AI-Sys
- MIT CS 6.5940
If you just want to follow along on the major pieces from the talk, check out these blog posts:
- Data Wrangling with Foundation Models
- FlashAttention and FlashAttention-2
- Simplifying S4
- Long Convolutions for GPT-style Models
- Zoology Synthetics Analysis
- Zoology Based
- Truly Sub-Quadratic Models
An older set of resources on Data-Centric AI.
The rest of this README is split up into resources by topic.
Table of contents:
- Foundation Models for Systems
- Hardware-Aware Algorithms
- Can We Replace Attention?
- Synthetics for Language Modeling
- Truly Sub-Quadratic Models
- Quantization, Pruning, and Distillation
- Systems for Inference
- High-Throughput
- New Data Types
Foundation models are changing the way we build systems for classical problems like data cleaning. See the SIGMOD keynote on this topic, as well as Ihab Ilyas and Xu Chen's textbook on data cleaning, Data Cleaning. The ML for Systems workshops and community are also great. A minimal prompting sketch follows the list below.
- Bad Data Costs the U.S. $3 Trillion Per Year
- Data Wrangling with Foundation Models
- Ask Me Anything: Leveraging Foundation Models for Private & Personalized Systems
- Holoclean: Holistic Data Repairs with Probabilistic Inference
- Can Foundation Models Wrangle Your Data?
- Can Foundation Models Help Us Achieve Perfect Secrecy? and ConcurrentQA
- Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes
- Symphony: Towards Natural Language Query Answering Over Multi-Modal Data Lakes
- CodexDB: Synthesizing Code for Query Processing from Natural Language Instructions using GPT-3 Codex
- CHORUS: Foundation Models for Unified Data Discovery and Exploration
- How Large Language Models Will Disrupt Data Management
- GPTuner: A Manual-Reading Database Tuning System via GPT-Guided Bayesian Optimization
- Jellyfish: A Large Language Model for Data Preprocessing
- Can Large Language Models Predict Data Correlations from Column Names?
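As a concrete flavor of how foundation models get applied to wrangling tasks, here's a hypothetical Python sketch of zero-shot entity matching via prompting. The prompt format and the `complete` stand-in for an LLM API are illustrative assumptions, not the setup from the papers above.

```python
def entity_match_prompt(row_a: dict, row_b: dict) -> str:
    """Serialize two records into a natural-language prompt asking whether
    they refer to the same real-world entity (a classic data-cleaning task)."""
    serialize = lambda row: "; ".join(f"{k}: {v}" for k, v in row.items())
    return (
        "Product A is: " + serialize(row_a) + "\n"
        "Product B is: " + serialize(row_b) + "\n"
        "Are Product A and Product B the same? Answer Yes or No: "
    )

prompt = entity_match_prompt(
    {"title": "Apple MacBook Pro 14in M3", "price": "$1,599"},
    {"title": "MacBook Pro 14-inch (M3, 2023)", "price": "1599 USD"},
)
# answer = complete(prompt)  # `complete` is a hypothetical stand-in for your LLM API
print(prompt)
```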
Hardware-aware algorithms for today's ML primitives. A toy sketch of the blockwise idea follows the list below. Canonical resources:
- A classic look at I/O complexity, from the database folks: The input/output complexity of sorting and related problems.
- The canonical book on computer architectures: Computer Architecture: A Quantitative Approach.
- The canonical textbook for all things FFT: Computational Frameworks for the Fast Fourier Transform.
- Jim Gray's Turing Award Profile.
- Horace He's Making Deep Learning Go Brrrr from First Principles
- Aleksa Gordic's ELI5 for FlashAttention
- FlashAttention
- FlashFFTConv
- Sasha's GPU Puzzles
- FlashAttention and FlashAttention-2
- Self-Attention Does Not Need O(N^2) Memory
- FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores
- tcFFT: Accelerating Half-Precision FFT through Tensor Cores
- Cooley-Tukey FFT Algorithm
- Efficient Memory Management for Large Language Model Serving with PagedAttention
- Ring Attention with Blockwise Transformers for Near-Infinite Context
- Faster Causal Attention Over Large Sequences Through Sparse Flash Attention
- FLAT: An Optimized Dataflow for Mitigating Attention Bottlenecks
- Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
- HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent
- Efficiently Scaling Transformer Inference
- Microsoft DeepSpeed
- Eleuther's GPT-NeoX Repo
- A Systematic Approach to Blocking Convolutional Neural Networks
- TVM: An Automated End-to-End Optimizing Compiler for Deep Learning
- MegaBlocks: Efficient Sparse Training with Mixture-of-Experts
- Blockwise Self-Attention for Long Document Understanding
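To make the I/O-aware idea concrete, here's a toy numpy sketch of attention computed one key/value block at a time with a running (online) softmax, so the full N x N score matrix is never materialized -- in the spirit of FlashAttention, though the real kernels tile over SRAM and fuse much more. Block size and shapes here are arbitrary illustration.

```python
import numpy as np

def blockwise_attention(q, k, v, block_size=128):
    """Softmax attention with K/V processed in blocks and a running softmax,
    so the (Nq x Nk) score matrix is never materialized.
    Shapes: q (Nq, d), k (Nk, d), v (Nk, d_v)."""
    n_q, d = q.shape
    out = np.zeros((n_q, v.shape[1]))
    row_max = np.full(n_q, -np.inf)   # running max of scores per query
    row_sum = np.zeros(n_q)           # running softmax denominator per query
    scale = 1.0 / np.sqrt(d)

    for start in range(0, k.shape[0], block_size):
        kb, vb = k[start:start + block_size], v[start:start + block_size]
        s = (q @ kb.T) * scale                      # scores for this block only
        new_max = np.maximum(row_max, s.max(axis=1))
        p = np.exp(s - new_max[:, None])            # unnormalized block probabilities
        correction = np.exp(row_max - new_max)      # rescale the old accumulators
        row_sum = row_sum * correction + p.sum(axis=1)
        out = out * correction[:, None] + p @ vb
        row_max = new_max
    return out / row_sum[:, None]

# Sanity check against the naive quadratic implementation.
rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(64, 16)), rng.normal(size=(256, 16)), rng.normal(size=(256, 16))
scores = (q @ k.T) / np.sqrt(16)
ref = np.exp(scores - scores.max(axis=1, keepdims=True))
ref = (ref / ref.sum(axis=1, keepdims=True)) @ v
assert np.allclose(blockwise_attention(q, k, v, block_size=64), ref)
```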
Alternatives to attention that scale sub-quadratically in sequence length. The canonical text on signal processing is Discrete-Time Signal Processing, and From Deep to Long Learning gives a high-level overview of this space. A small sketch of an FFT-based long convolution follows the list below.
- What is a long convolution?
- Can Longer Sequences Help Take the Next Leap in AI?
- Simplifying S4
- Sasha's Great Annotated S4
- H3: Language Modeling with State Space Models and (Almost) No Attention
- Hyena Blog
- Mamba tweet threads by Albert and Tri
- StripedHyena-7B
- Zoology
- Zoology Analysis
- Based Architecture
- Long Range Arena
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces and code
- Zoology: Measuring and improving recall in efficient language models
- RWKV and code
- Efficiently Modeling Long Sequences with Structured State Spaces
- Long Range Language Modeling via Gated State Spaces
- Hungry Hungry Hippos: Towards Language Modeling with State Space Models
- Hyena Hierarchy: Towards Larger Convolutional Language Models
- Simplified State Space Layers for Sequence Modeling
- On the Parameterization and Initialization of Diagonal State Space Models
- Mega: Moving Average Equipped Gated Attention
- Simple Hardware-Efficient Long Convolutions for Sequence Modeling
- Diagonal State Spaces are as Effective as Structured State Spaces
- Retentive Network: A Successor to Transformer for Large Language Models
- Resurrecting Recurrent Neural Networks for Long Sequences
- MultiResFormer: Transformer with Adaptive Multi-Resolution Modeling for General Time Series Forecasting
- CKConv: Continuous Kernel Convolution For Sequential Data
- Pretraining Without Attention
- Diffusion Models Without Attention
- Liquid Structural State-Space Models
- Fourier Neural Operator for Parametric Partial Differential Equations
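As a concrete picture of a "long convolution", here's a small numpy sketch of a causal convolution whose filter is as long as the sequence, computed in O(N log N) with the FFT -- the basic trick behind the FFT-based convolution layers above. Names and shapes are illustrative.

```python
import numpy as np

def long_causal_conv(u, kernel):
    """Causal convolution of a length-N signal with a length-N filter via FFT.
    Zero-padding to 2N avoids circular wrap-around, giving O(N log N) cost
    instead of the O(N^2) of a direct long convolution."""
    n = u.shape[-1]
    fft_size = 2 * n
    u_f = np.fft.rfft(u, n=fft_size)
    k_f = np.fft.rfft(kernel, n=fft_size)
    return np.fft.irfft(u_f * k_f, n=fft_size)[..., :n]  # keep the causal part

# Matches the direct (quadratic) causal convolution.
rng = np.random.default_rng(0)
u, k = rng.normal(size=512), rng.normal(size=512)
direct = np.array([np.dot(u[:t + 1][::-1], k[:t + 1]) for t in range(len(u))])
assert np.allclose(long_causal_conv(u, k), direct)
```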
There's also a rich literature on approximating attention (sparse, low-rank, etc.) -- just as exciting! Here's a partial list of great ideas in this area (a toy linear-attention sketch follows the list):
- Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention
- Reformer: The Efficient Transformer
- Rethinking Attention with Performers
- Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention
- Linformer: Self-Attention with Linear Complexity
- Skyformer: Remodel Self-Attention with Gaussian Kernel and Nyström Method
- Scatterbrain: Unifying Sparse and Low-rank Attention Approximation
- Big Bird: Transformers for Longer Sequences
- Luna: Linear Unified Nested Attention
- FNet: Mixing Tokens with Fourier Transforms
- The Devil in Linear Transformer
- cosFormer: Rethinking Softmax in Attention
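As one concrete example from this family, here's a toy numpy sketch of (non-causal) linear attention with an elu(x)+1 feature map, in the spirit of "Transformers are RNNs"; it's a sketch of the idea, not that paper's implementation.

```python
import numpy as np

def feature_map(x):
    """elu(x) + 1, a positive feature map commonly used in linear attention."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(q, k, v):
    """Non-causal linear attention: softmax(QK^T)V is replaced by
    phi(Q) (phi(K)^T V) / (phi(Q) sum_j phi(K_j)), which costs O(N d^2)
    instead of O(N^2 d). Shapes: q, k (N, d); v (N, d_v)."""
    qf, kf = feature_map(q), feature_map(k)
    kv = kf.T @ v                       # (d, d_v) summary of keys and values
    z = qf @ kf.sum(axis=0)             # per-query normalizer
    return (qf @ kv) / z[:, None]

rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(256, 32)), rng.normal(size=(256, 32)), rng.normal(size=(256, 64))
print(linear_attention(q, k, v).shape)  # (256, 64)
```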
In research on efficient language models, synthetic tasks (e.g. associative recall) are crucial for understanding and debugging issues before scaling up to expensive pretraining runs.
We've created a simple GitHub repo with a playground for understanding and testing language model architectures on synthetic tasks: HazyResearch/zoology. A tiny data-generation sketch for associative recall follows the list below.
- Zoology blog post on synthetics
- H3 blog post section on associative recall
- Anthropic's great explainer of associative recall in induction heads
- Zoology section 3-4
- H3 section 3.1
- In-context Learning and Induction Heads
- Associative Long Short-Term Memory
- Using Fast Weights to Attend to the Recent Past
- Learning to update Auto-associative Memory in Recurrent Neural Networks for Improving Sequence Memorization
- Self-Attentive Associative Memory
- Neural Turing Machines
- Legendre Memory Units: Continuous-Time Representation in Recurrent Neural Networks
- Synthetic tasks go all the way back to LSTMs: Long Short-Term Memory
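To show what a synthetic like associative recall actually looks like, here's a tiny numpy sketch that generates key-value sequences followed by a query key; the vocabulary split, sequence layout, and sizes are arbitrary choices for illustration (the exact formats in Zoology differ).

```python
import numpy as np

def make_associative_recall(n_examples, n_pairs=8, vocab=64, seed=0):
    """Each sequence lists (key, value) pairs, then repeats one key as a query.
    The target is the value paired with that key earlier in the sequence --
    a model must 'recall' the association to answer correctly."""
    rng = np.random.default_rng(seed)
    inputs, targets = [], []
    for _ in range(n_examples):
        keys = rng.choice(vocab // 2, size=n_pairs, replace=False)       # key tokens
        values = rng.choice(np.arange(vocab // 2, vocab), size=n_pairs)  # value tokens
        seq = np.stack([keys, values], axis=1).reshape(-1)               # k1 v1 k2 v2 ...
        query_idx = rng.integers(n_pairs)
        inputs.append(np.concatenate([seq, [keys[query_idx]]]))
        targets.append(values[query_idx])
    return np.stack(inputs), np.array(targets)

x, y = make_associative_recall(4)
print(x[0], "->", y[0])  # e.g. "k1 v1 ... k8 v8 k3 -> v3"
```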
ML models are also quadratic along another dimension -- model width: dense layers cost O(d^2) compute and parameters in the width d. Can we develop models that grow sub-quadratically with model width? A toy structured-matrix sketch follows the list below.
The canonical textbook for a lot of this stuff: Structured Matrices and Polynomials.
- Towards Truly Subquadratic Models
- M2-BERT: Revisiting BERT, Without Attention or MLPs
- Pixelated Butterfly: Simple and Efficient Sparse Training for Neural Network Models
- Butterflies Are All You Need: A Universal Building Block for Structured Linear Maps
- Monarch Mixer
- Monarch
- Learning Fast Algorithms for Linear Transforms Using Butterfly Factorizations
- Kaleidoscope: An Efficient, Learnable Representation For All Structured Linear Maps
- Fast Algorithms for Spherical Harmonic Expansions
- Butterfly Factorization
- GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
- A Two Pronged Progress in Structured Dense Matrix Multiplication
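To give a flavor of these structured layers, here's a toy numpy sketch of a Monarch-style matrix-vector product: two block-diagonal factors with a fixed permutation in between, which uses O(d^1.5) parameters instead of the O(d^2) of a dense layer. The exact factorization and block sizes here are illustrative assumptions, not the Monarch implementation.

```python
import numpy as np

def monarch_like_matvec(x, blocks1, blocks2):
    """y = P B2 P B1 x, where B1, B2 are block-diagonal and P is the
    'transpose the b x b grid' permutation (its own inverse). With b blocks
    of size b x b (so d = b^2), this uses 2*b^3 = O(d^1.5) parameters,
    vs. d^2 = b^4 for a dense matrix."""
    b = blocks1.shape[0]                                    # number of blocks == block size
    h = np.einsum('kij,kj->ki', blocks1, x.reshape(b, b))   # apply B1 blockwise
    h = h.T                                                 # permutation P
    h = np.einsum('kij,kj->ki', blocks2, h)                 # apply B2 blockwise
    return h.T.reshape(-1)                                  # P again, back to a flat vector

d, b = 64, 8                                                # d = b * b
rng = np.random.default_rng(0)
blocks1 = rng.normal(size=(b, b, b))                        # b blocks, each b x b
blocks2 = rng.normal(size=(b, b, b))
x = rng.normal(size=d)
print(monarch_like_matvec(x, blocks1, blocks2).shape)       # (64,)
```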
Quantization, pruning, and distillation are great techniques for improving efficiency. Here's a short list of some of the ideas in this area (a minimal quantization sketch follows the list):
- QLoRA: Efficient Finetuning of Quantized LLMs
- The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks
- Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding
- Learning to Prune Deep Neural Networks via Layer-wise Optimal Brain Surgeon
- QuIP#: QuIP with Lattice Codebooks
- SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
- SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning
- Accelerating Inference with Sparsity Using the NVIDIA Ampere Architecture and NVIDIA TensorRT
- SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression
- LoRA: Low-Rank Adaptation of Large Language Models
- MCUNet: Tiny Deep Learning on IoT Devices
- MONGOOSE: A Learnable LSH Framework for Efficient Neural Network Training
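As a minimal illustration of the quantization idea, here's a numpy sketch of symmetric per-tensor int8 weight quantization and dequantization; real systems (like the ones above) add per-channel or per-group scales, calibration data, and outlier handling.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: store int8 codes plus one
    float scale, cutting memory ~4x vs. float32 at the cost of rounding error."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)
print("max abs error:", np.abs(w - w_hat).max())  # small relative to |w|.max()
```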
Inference is an increasingly important cost for LLMs: a model will be served many more times than it is trained, so systems for inference are an increasingly important problem. Here are some papers and posts on the topic -- there's a lot to do! A toy KV-cache decoding sketch follows the list below.
- Fast Transformer Decoding: One Write-Head is All You Need
- GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
- Flash-Decoding for long-context inference
- vLLM
- Fast Inference from Transformers via Speculative Decoding
- MatFormer: Nested Transformer for Elastic Inference
- Efficient Streaming Language Models with Attention Sinks
- Hugging Face TGI
- NVIDIA TensorRT
- Together Inference Engine
- Laughing Hyena Distillery: Extracting Compact Recurrences From Convolutions
- H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
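To illustrate why so much inference work revolves around the KV cache, here's a toy numpy sketch of greedy decoding for a single attention head where keys and values are cached, so each step only computes projections for the newest token; the tiny random "model" here is made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab = 16, 100
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))
w_out = rng.normal(size=(d, vocab))
embed = rng.normal(size=(vocab, d))

def decode(prompt_ids, steps=8):
    """Greedy decoding with a KV cache: each step computes q/k/v only for the
    newest token and attends over the cached keys/values, so a decode step
    costs O(T d) rather than recomputing all-pairs attention from scratch."""
    ids = list(prompt_ids)
    k_cache, v_cache = np.empty((0, d)), np.empty((0, d))
    logits = None
    for t in range(len(prompt_ids) + steps):
        if t < len(prompt_ids):
            tok = ids[t]                      # still consuming the prompt
        else:
            tok = int(np.argmax(logits))      # greedy pick from the last step
            ids.append(tok)
        x = embed[tok]
        q, k, v = x @ wq, x @ wk, x @ wv
        k_cache = np.vstack([k_cache, k])     # the cache grows by one row per token
        v_cache = np.vstack([v_cache, v])
        s = k_cache @ q / np.sqrt(d)
        att = np.exp(s - s.max())             # numerically stable softmax weights
        ctx = (att / att.sum()) @ v_cache     # attention over all cached tokens
        logits = ctx @ w_out
    return ids

print(decode([1, 2, 3]))
```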
Foundation models will increasingly be used to serve back-of-house tasks like document processing (not just chat interfaces). These will require different systems than our current inference solutions. This work is still very new, but hopefully there's a lot more to come soon!
- Batch computing and the coming age of AI systems.
- FlexGen: High-throughput Generative Inference of Large Language Models with a Single GPU
- Evaporate: Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes
Most ML models focus on text or images, but there's a wide variety of other modalities that present unique challenges (e.g., long context). New modalities will drive advances in model architectures and systems. A few are compiled below:
- DNA: HyenaDNA paper and blog
- SSMs for Video
- SpaceTime: Effectively Modeling Time Series with Simple Discrete State Spaces [paper], [code], [demo]
- Recurrent Distance-Encoding Neural Networks for Graph Representation Learning
- Modeling Multivariate Biosignals With Graph Neural Networks and Structured State Space Models
- Self-Supervised Graph Neural Networks for Improved Electroencephalographic Seizure Analysis
- Self-Supervised Learning of Brain Dynamics from Broad Neuroimaging Data
- scHyena: Foundation Model for Full-Length Single-Cell RNA-Seq Analysis in Brain