A subjective learning guide for generative AI research including curated list of articles and projects
Generative AI is a hot topic today 🔥 and this roadmap is designed to help beginners quickly gain basic knowledge of generative AI and find useful resources. Experts are also welcome to use this roadmap to refresh old knowledge and develop new ideas.
- Background Knowledge
- Large Language Models (LLMs)
- Diffusion Models
- Large Multimodal Models (LMMs)
- Beyond Transformers
This section should help you learn or refresh the basic knowledge of neural networks (e.g., backpropagation), get you familiar with the transformer architecture, and describe some common transformer-based models.
Are you very familiar with the following classic neural network structures?
📝 If so, you should be able to answer these questions:
- Why do CNNs work better than MLPs on images?
- Why do RNNs work better than MLPs on time-series data?
- What's the difference between GRU and LSTM?
Backpropagation (BP) is the basis of NN training. You will not become an AI expert without understanding BP. There are many textbooks and online tutorials on BP, but unfortunately most of them do not present the formulas in vectorized/tensorized form. The BP formula of an NN layer is in fact as neat as its forward-pass formula, and this is exactly how BP is (and should be) implemented. To understand BP, please read the following materials:
- Neural Networks and Deep Learning [Chapter 3.2 especially 3.2.6]
- meProp: Sparsified Back Propagation for Accelerated Deep Learning with Reduced Overfitting (ICML 2017) [Section 2.1]
- Resprop: Reuse sparsified backpropagation (CVPR 2020) [Section 3.1]
📝 If you understand BP, you should be able to answer these questions:
- How will you describe the BP of a convolutional layer?
- What is the ratio of the computing cost (i.e., number of floating point operations) between forward pass and backward pass of a dense layer?
- How will you describe the BP of an MLP with two dense layers sharing the same weight matrix?
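To make the vectorized view concrete, here is a minimal NumPy sketch of the forward and backward pass of a dense layer; the shapes and names are illustrative, not tied to any particular framework.

```python
import numpy as np

# Vectorized forward/backward pass of a dense layer y = x @ W + b,
# written the way BP is actually implemented (whole batches, no per-element loops).
def dense_forward(x, W, b):
    return x @ W + b                      # x: (batch, d_in), W: (d_in, d_out)

def dense_backward(x, W, grad_y):
    grad_x = grad_y @ W.T                 # (batch, d_in), passed to the previous layer
    grad_W = x.T @ grad_y                 # (d_in, d_out), used to update W
    grad_b = grad_y.sum(axis=0)           # (d_out,)
    return grad_x, grad_W, grad_b         # note: two matmuls here vs. one in the forward pass

x = np.random.randn(32, 128)
W = np.random.randn(128, 64)
b = np.zeros(64)
y = dense_forward(x, W, b)
grad_x, grad_W, grad_b = dense_backward(x, W, np.random.randn(*y.shape))
```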
The transformer is the base architecture of today's large generative models. It is necessary to understand every component of the transformer. Please read the following materials:
- Attention Is All You Need (NeurIPS 2017) [Original Paper]
- Transformer Explainer: Interactive Learning of Text-Generative Models [An Interactive Tutorial]
- An image is worth 16x16 words: Transformers for image recognition at scale (ICLR 2021) [Vision Transformer]
- Neural machine translation with a Transformer and Keras [Great Explanation for MultiHead Attention (MHA)]
- FLOPs of a Transformer Block [Let's practice calculating FLOPs]
- Fast Transformer Decoding: One Write-Head is All You Need [Multi-Query Attention (MQA)]
- GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints [Grouped-Query Attention (GQA)]
- Enhanced Transformer with Rotary Position Embedding [Understand Positional Embedding]
- Rotary Embeddings: A Relative Revolution [Understand Positional Embedding]
- Teacher Forcing vs Scheduled Sampling vs Normal Mode [Teacher Forcing in Transformer Training]
- FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU [See Section 3 (generative inference) to learn how LLMs perform generation with a KV cache]
- Contextual Position Encoding: Learning to Count What’s Important [Context-dependent positional encoding]
📝 If you understand transformers, you should be able to answer these questions:
- What are the pros and cons of transformers compared to RNNs? (hints: simultaneous attention over all positions, training parallelism, complexity)
- Can you calculate the FLOPs of GQA? When does it degrade to MHA or MQA? (See the GQA sketch after these questions.)
- What is the motivation behind MQA and GQA?
- What does the causal attention mask look like and why?
- How will you describe the training of decoder-only transformers step by step?
- Why is RoPE better than sinusoidal positional encoding?
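As a concrete reference for the MHA/MQA/GQA questions above, here is a minimal, self-contained PyTorch sketch of grouped-query attention with a causal mask; the toy sizes are illustrative and the output projection is omitted.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(x, wq, wk, wv, n_heads, n_kv_heads):
    """Minimal grouped-query attention (GQA) with a causal mask.
    n_kv_heads == n_heads -> standard multi-head attention (MHA);
    n_kv_heads == 1       -> multi-query attention (MQA)."""
    B, T, d = x.shape
    hd = d // n_heads                                          # per-head dimension
    q = (x @ wq).view(B, T, n_heads, hd).transpose(1, 2)       # (B, H,   T, hd)
    k = (x @ wk).view(B, T, n_kv_heads, hd).transpose(1, 2)    # (B, Hkv, T, hd)
    v = (x @ wv).view(B, T, n_kv_heads, hd).transpose(1, 2)
    # Each group of n_heads // n_kv_heads query heads shares one K/V head.
    k = k.repeat_interleave(n_heads // n_kv_heads, dim=1)
    v = v.repeat_interleave(n_heads // n_kv_heads, dim=1)
    att = (q @ k.transpose(-2, -1)) / hd**0.5                  # (B, H, T, T)
    causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    att = att.masked_fill(causal, float("-inf"))               # no attending to future positions
    return (F.softmax(att, dim=-1) @ v).transpose(1, 2).reshape(B, T, d)

B, T, d, H, Hkv = 2, 16, 64, 8, 2                              # toy sizes
x = torch.randn(B, T, d)
wq = torch.randn(d, d)
wk = torch.randn(d, d // H * Hkv)                              # K/V projections are smaller
wv = torch.randn(d, d // H * Hkv)
out = grouped_query_attention(x, wq, wk, wv, H, Hkv)           # (B, T, d)
```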
- Learning transferable visual models from natural language supervision [CLIP]
- Emerging Properties in Self-Supervised Vision Transformers (ICCV 2021) [DINO]
- Masked autoencoders are scalable vision learners (CVPR 2022) [MAE]
- Scaling Vision with Sparse Mixture of Experts (NeurIPS 2021) [MoE]
- Mixture-of-Depths: Dynamically allocating compute in transformer-based language models [MoD]
- Einsum is easy and useful [A great tutorial for using einsum/einops; a short einsum example follows this list]
- Open-Endedness is Essential for Artificial Superhuman Intelligence (ICML 2024) [Thoughts on achieving superhuman AI]
- Levels of AGI for Operationalizing Progress on the Path to AGI
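A tiny example of what the einsum/einops tutorial above covers, using hypothetical attention-shaped tensors:

```python
import torch
from einops import rearrange

# Batched attention scores without manual transposes; q, k, v: (batch, heads, seq, head_dim).
q, k, v = (torch.randn(2, 8, 16, 64) for _ in range(3))
scores = torch.einsum("bhqd,bhkd->bhqk", q, k) / 64**0.5
# einops makes reshapes self-documenting, e.g. merging heads back into the model dimension.
out = rearrange(torch.softmax(scores, dim=-1) @ v, "b h s d -> b s (h d)")   # (2, 16, 512)
```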
LLMs are transformers. They can be categorized into encoder-only, encoder-decoder, and decoder-only architectures, as shown in the LLM evolutionary tree below [image source]. Check milestone papers of LLMs.
Encoder-only models can be used to extract sentence features but lack generative power. Encoder-decoder and decoder-only models are used for text generation. In particular, most existing LLMs adopt decoder-only structures due to their stronger representational power. Intuitively, an encoder-decoder model can be considered a sparse version of a decoder-only model, and more information decays on the way from the encoder to the decoder. Check this paper for more details.
LLMs are typically pretrained by model publishers on trillions of text tokens to internalize the structure of natural language. Today's model developers also conduct instruction fine-tuning and Reinforcement Learning from Human Feedback (RLHF) to teach the model to follow human instructions and generate answers aligned with human preferences. Users can then download the published model and fine-tune it on small personal datasets (e.g., movie dialogs). Due to the huge amount of data, pretraining requires massive computing resources (e.g., thousands of GPUs), which is unaffordable for individuals. Fine-tuning, on the other hand, is less resource-hungry and can be done with a few GPUs.
The following materials can help you understand the pretraining and fine-tuning process:
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding [Pretraining and Finetuning of Encoder-only LLMs]
- Scaling Instruction-Finetuned Language Models [Pretraining and Instructional Finetuning]
- Illustrating Reinforcement Learning from Human Feedback (RLHF)
- Language Models are Few-Shot Learners [Decoder-only LLMs] [Chinese-language walkthrough by Mu Li]
More tutorials can be found here.
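To make the fine-tuning step concrete, here is a minimal (and deliberately unoptimized) sketch of adapting a small pretrained causal LM to a toy personal dataset with the standard language-modeling loss; gpt2 is used only because it is small, and the dialog strings are made up.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
opt = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Toy "personal dataset" of movie dialogs (made up for illustration).
dialogs = ["A: Have you seen the movie? B: Yes, I loved it.",
           "A: Who directed it? B: I don't remember."]
model.train()
for text in dialogs:
    batch = tok(text, return_tensors="pt")
    # For causal LMs, passing labels=input_ids makes the model compute the
    # shifted next-token cross-entropy loss internally.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    opt.step()
    opt.zero_grad()
```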
Prompting techniques for LLMs involve crafting input text in a way that guides the model to generate the desired responses or outputs. Here are some useful resources to help you write better prompts:
- [DAIR.AI] Prompt Engineering Guide
- Awesome ChatGPT Prompts - A collection of prompt examples to be used with the ChatGPT model
- Awesome Deliberative Prompting - How to ask LLMs to produce reliable reasoning and make reason-responsive decisions
- AutoPrompt - An automated method based on gradient-guided search to create prompts for a diverse set of NLP tasks.
Evaluation tools for large language models help assess their performance, capabilities, and limitations across different tasks and datasets. Here are some common evaluation strategies:
- Automatic Evaluation Metrics: These metrics assess model performance automatically without human intervention. Common metrics include:
- BLEU: Measures the similarity between generated text and reference text based on n-gram overlap.
- ROUGE: Evaluates text summarization by comparing overlapping n-grams between generated and reference summaries.
- Perplexity: Measures how well a language model predicts a sample of text; lower perplexity indicates better performance. It is equivalent to the exponentiation of the cross-entropy between the data and the model predictions (see the sketch after this list).
- F1 Score: Measures the balance between precision and recall in tasks like text classification or named entity recognition.
- Human Evaluation: Human judgment is essential for assessing the quality of generated text comprehensively. Common human evaluation methods include:
- Human Ratings: Human annotators rate generated text based on criteria such as fluency, coherence, relevance, and grammaticality.
- Crowdsourcing Platforms: Platforms like Amazon Mechanical Turk or Figure Eight facilitate large-scale human evaluation by crowdsourcing annotations.
- Expert Evaluation: Domain experts assess model outputs to gauge their suitability for specific applications or tasks.
- Benchmark Datasets: Standardized datasets enable fair comparison of models across different tasks and domains. Examples include:
- Model Analysis Tools: Tools for analyzing model behavior and performance include:
- Automated Interpretability - Code for automatically generating, simulating, and scoring explanations of neuron behavior
- LLM Visualization - Visualizing LLM internals at a low level.
- Attention Analysis - Analyzing attention maps from the BERT transformer.
- Neuron Viewer - Tool for viewing neuron activations and explanations.
A complete list can be found here.
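As an illustration of the perplexity metric listed above, here is a small sketch computing perplexity as the exponentiated token-level cross-entropy on dummy logits; with a real LM you would use its output logits instead.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(1, 64, 50257)             # (batch, seq_len, vocab), dummy model output
tokens = torch.randint(0, 50257, (1, 64))      # token ids of the evaluated text
# Position i of the logits predicts token i+1, hence the shift by one.
nll = F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                      tokens[:, 1:].reshape(-1))
perplexity = torch.exp(nll)                    # lower is better
```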
Standard evaluation frameworks for existing LLMs include:
- lm-evaluation-harness - A framework for few-shot evaluation of language models.
- lighteval - A lightweight LLM evaluation suite that Hugging Face has been using internally.
- OLMO-eval - A repository for evaluating open language models.
- instruct-eval - Code to quantitatively evaluate instruction-tuned models such as Alpaca and Flan-T5 on held-out tasks.
Dealing with long contexts poses a challenge for large language models due to limitations in memory and processing capacity. Existing techniques include:
- Efficient Transformers
- State Space Models
- Length Extrapolation
- Long Term Memory
A complete list can be found here.
Parameter-Efficient Fine-Tuning (PEFT) methods enable efficient adaptation of large pretrained models to various downstream applications by only fine-tuning a small number of (extra) model parameters instead of all the model's parameters:
- Prompt Tuning: The power of scale for parameter-efficient prompt tuning
- Prefix Tuning: Prefix-tuning: Optimizing continuous prompts for generation
- LoRA: Lora: Low-rank adaptation of large language models
- Towards a Unified View of Parameter-Efficient Transfer Learning
- LoRA Learns Less and Forgets Less
More work can be found in the Hugging Face PEFT paper collection, and it is highly recommended to practice with the Hugging Face PEFT API (a minimal LoRA fine-tuning sketch follows).
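A minimal sketch of LoRA fine-tuning with the Hugging Face PEFT API; the base model name and LoRA hyperparameters are illustrative choices, not recommendations.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType

model = AutoModelForCausalLM.from_pretrained("gpt2")   # any causal LM works here
config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                          # rank of the low-rank update
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["c_attn"],    # GPT-2's fused attention projection
)
model = get_peft_model(model, config)
model.print_trainable_parameters()   # only the small LoRA adapters are trainable
# ...train as usual, then save just the adapter weights:
model.save_pretrained("gpt2-lora-adapter")
```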
Model merging refers to combining two or more LLMs trained on different tasks into a single LLM. This technique aims to leverage the strengths and knowledge of different models to create a more robust and capable model. For example, an LLM for code generation and another LLM for math problem solving can be merged so that the resulting model is capable of both code generation and math problem solving.
Model merging is intriguing because it can be achieved effectively with very simple and cheap algorithms (e.g., linear combinations of model weights; see the weight-averaging sketch below). Here are some representative papers and reading materials:
- Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time
- Editing Models with Task Arithmetic
- Merge Large Language Models with mergekit
More papers about model merging can be found here
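A minimal "model soup" / weight-averaging style sketch, assuming two checkpoints fine-tuned from the same base architecture; the file names are hypothetical.

```python
import torch

def merge_state_dicts(sd_a, sd_b, alpha=0.5):
    # Linear interpolation of weights; only valid when both models share
    # the same architecture (and ideally the same pretrained initialization).
    return {k: alpha * sd_a[k] + (1 - alpha) * sd_b[k] for k in sd_a}

sd_code = torch.load("llm_finetuned_on_code.pt")   # hypothetical checkpoint paths
sd_math = torch.load("llm_finetuned_on_math.pt")
torch.save(merge_state_dicts(sd_code, sd_math, alpha=0.5), "llm_merged.pt")
```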
Accelerating the decoding of LLMs is crucial for improving inference speed and efficiency, especially in real-time or latency-sensitive applications. Here are some representative works on speeding up the decoding process of LLMs (see the toy speculative-decoding sketch below):
- Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time (ICML 2023 Oral)
- LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models (EMNLP 2023)
- Efficient Streaming Language Models with Attention Sinks
- SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification
- Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
- Better & Faster Large Language Models via Multi-token Prediction
- Layer Skip: Enabling Early Exit Inference and Self-Speculative Decoding
More work about accelerating LLM decoding can be found via Link 1 and Link 2.
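A toy sketch of greedy speculative decoding in the spirit of the papers above, assuming distilgpt2 as the draft model and gpt2 as the target (they share a tokenizer); real implementations also handle sampling, batching, and KV caching.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
draft = AutoModelForCausalLM.from_pretrained("distilgpt2").eval()   # small, fast drafter
target = AutoModelForCausalLM.from_pretrained("gpt2").eval()        # larger verifier

@torch.no_grad()
def speculative_step(ids, k=4):
    """Draft k tokens greedily, then verify them with a single target forward pass."""
    draft_ids = ids
    for _ in range(k):                                   # cheap autoregressive drafting
        next_tok = draft(draft_ids).logits[:, -1].argmax(-1, keepdim=True)
        draft_ids = torch.cat([draft_ids, next_tok], dim=-1)
    tgt_logits = target(draft_ids).logits                # scores all drafted tokens at once
    n = ids.shape[1]
    # Position i of the logits predicts token i+1, so positions n-1 .. n+k-2
    # give the target's greedy choices for the k drafted positions.
    tgt_choices = tgt_logits[:, n - 1 : n - 1 + k].argmax(-1)
    drafted = draft_ids[:, n : n + k]
    n_accept = int((tgt_choices == drafted)[0].long().cumprod(0).sum())   # longest agreeing prefix
    # The target also yields one extra token: its own prediction at the first
    # disagreement (or a bonus token if every drafted token was accepted).
    bonus = tgt_logits[:, n - 1 + n_accept].argmax(-1, keepdim=True)
    return torch.cat([ids, drafted[:, :n_accept], bonus], dim=-1)

ids = tok("The quick brown fox", return_tensors="pt").input_ids
for _ in range(8):
    ids = speculative_step(ids)
print(tok.decode(ids[0]))
```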
Knowledge editing aims to efficiently modify LLM behaviors, such as reducing bias and revising learned correlations. It covers many topics, such as knowledge localization and unlearning. Representative work includes:
- Memory-Based Model Editing at Scale (ICML 2022)
- Transformer-Patcher: One Mistake worth One Neuron (ICLR 2023)
- Massive Editing for Large Language Model via Meta Learning (ICLR 2024)
- A Unified Framework for Model Editing
- Transformer Feed-Forward Layers Are Key-Value Memories (EMNLP 2021)
- Mass-Editing Memory in a Transformer
More papers can be found here.
Through massive training, LLMs digest world knowledge and learn to follow input instructions precisely. With these capabilities, LLMs can act as agents that autonomously (and collaboratively) solve complex tasks or simulate human interactions. Here are some representative papers on LLM agents:
- Generative Agents: Interactive Simulacra of Human Behavior (UIST 2023) [LLMs simulate human society in video games]
- SOTOPIA: Interactive Evaluation for Social Intelligence in Language Agents (ICLR 2024) [LLMs simulate social interactions]
- Voyager: An Open-Ended Embodied Agent with Large Language Models [LLMs live in the Minecraft world]
- Large Language Models as Tool Makers (ICLR 2024) [LLMs create their own reusable tools (e.g., in python functions) for problem-solving]
- MetaGPT: Meta Programming for Multi-Agent Collaborative Framework [LLMs as a team for automated software development]
- WebArena: A Realistic Web Environment for Building Autonomous Agents (ICLR 2024) [LLMs use web applications]
- Mobile-Env: An Evaluation Platform and Benchmark for LLM-GUI Interaction [LLMs use mobile applications]
- HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face (NeurIPS 2023) [LLMs seek models in huggingface for problem-solving]
- AGENTGYM: Evolving Large Language Model-based Agents across Diverse Environments [Diverse interactive environments and tasks for LLM-based agents]
A complete list of papers, platforms, and evaluation tools can be found here.
- Your Transformer is Secretly Linear
- Not All Language Model Features Are Linear
- KAN or MLP: A Fairer Comparison
- Transformer Layers as Painters
- Vision language models are blind
LLMs face several open challenges that researchers and developers are actively working to address. These challenges include:
- Hallucination
- Model Compression
- Evaluation
- Reasoning
- Explainability
- Fairness
- Factuality
- Knowledge Integration
A complete list can be found here.
Diffusion models aim to approximate the probability distribution of a given data domain and provide a way to generate samples from the approximated distribution. Their goals are similar to those of other popular generative models, such as VAEs, GANs, and normalizing flows.
Diffusion models are characterized by two processes:
- Forward process (diffusion process): noise is progressively applied to the original input data, step by step, until the data becomes pure noise (see the sketch after this list).
- Reverse process (denoising process): an NN model (e.g., a CNN or transformer) is trained to estimate the noise applied at each step of the forward process. This trained NN model can then be used to generate data from noise. Existing diffusion models can also accept other signals (e.g., text prompts from users) to condition the data generation.
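A minimal sketch of the forward (noising) process under a standard DDPM-style linear beta schedule; `x0` stands in for a batch of images scaled to [-1, 1], and the reverse-process network would be trained to predict `eps` from (x_t, t).

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative product \bar{alpha}_t

def q_sample(x0, t, noise=None):
    """Sample x_t ~ q(x_t | x_0) in closed form:
    x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps."""
    noise = torch.randn_like(x0) if noise is None else noise
    abar = alphas_bar[t].view(-1, *[1] * (x0.dim() - 1))
    return abar.sqrt() * x0 + (1.0 - abar).sqrt() * noise, noise

x0 = torch.randn(8, 3, 32, 32)                   # fake image batch for illustration
t = torch.randint(0, T, (8,))                    # a random timestep per sample
x_t, eps = q_sample(x0, t)                       # training target: predict eps from (x_t, t)
```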
Check this awesome blog; more introductory tutorials can be found here. Diffusion models can be used to generate images, audio, videos, and more, and there are many subfields related to diffusion models, as shown below [image source]:
Here are some representative papers of diffusion models for image generation:
- High-Resolution Image Synthesis with Latent Diffusion Models (CVPR 2022)
- Palette: Image-to-image diffusion models (SIGGRAPH 2022)
- Image Super-Resolution via Iterative Refinement
- Inpainting using Denoising Diffusion Probabilistic Models (CVPR 2022)
- Adding Conditional Control to Text-to-Image Diffusion Models (ICCV 2023)
More papers can be found here.
Here are some representative papers of diffusion models for video generation:
- Video Diffusion Models
- Flexible Diffusion Modeling of Long Videos (NeurIPS 2022)
- Scaling Latent Video Diffusion Models to Large Datasets
- I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models
More papers can be found here.
Here are some representative papers of diffusion models for audio generation:
- Grad-TTS: A Diffusion Probabilistic Model for Text-to-Speech
- Text-to-Audio Generation using Instruction-Tuned LLM and Latent Diffusion Model
- Zero-Shot Voice Conditioning for Denoising Diffusion TTS Models
- EdiTTS: Score-based Editing for Controllable Text-to-Speech
- ProDiff: Progressive Fast Diffusion Model For High-Quality Text-to-Speech
More papers can be found here.
Similar to other large generative models, diffusion models are pretrained on large amounts of web data (e.g., the LAION-5B dataset) and consume massive computing resources. Users can download the released weights and further fine-tune the model on personal datasets.
Here are some representative papers of efficient fine-tuning of diffusion models:
- DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation (CVPR 2023)
- An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion (ICLR 2023)
- Custom Diffusion: Multi-Concept Customization of Text-to-Image Diffusion (CVPR 2023)
- Controlling Text-to-Image Diffusion by Orthogonal Finetuning (NeurIPS 2023)
More papers can be found here.
It's highly recommended to do some practice with the Hugging Face Diffusers API; a minimal text-to-image sketch follows.
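A minimal text-to-image sketch with the Diffusers API; the checkpoint name is just one example of a publicly released model, and the prompt and settings are arbitrary.

```python
import torch
from diffusers import DiffusionPipeline

# Load a released text-to-image checkpoint and move it to the GPU.
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    "a watercolor painting of a robot reading a book",
    num_inference_steps=50,
    guidance_scale=7.5,          # strength of classifier-free guidance
).images[0]
image.save("robot.png")
```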
Here we discuss the evaluation of diffusion models for image generation. Many existing image quality metrics can be applied (a sketch using off-the-shelf metric implementations follows below):
- CLIP score: CLIP score measures the compatibility of image-caption pairs. Higher CLIP scores imply higher compatibility. CLIP score was found to have high correlation with human judgement.
- Fréchet Inception Distance (FID): FID measures how similar two datasets of images are. It is calculated by computing the Fréchet distance between two Gaussians fitted to feature representations of the Inception network.
- CLIP directional similarity: It measures the consistency of the change between the two images (in CLIP space) with the change between the two image captions.
More image quality metrics and calculation tools can be found here.
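A small sketch of computing FID and CLIP score with torchmetrics (one off-the-shelf option among several); the random tensors stand in for real and generated image batches, and in practice you would use far more images.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.multimodal.clip_score import CLIPScore

# Images are expected as uint8 tensors of shape (N, 3, H, W).
real = torch.randint(0, 256, (8, 3, 299, 299), dtype=torch.uint8)
fake = torch.randint(0, 256, (8, 3, 299, 299), dtype=torch.uint8)

fid = FrechetInceptionDistance(feature=2048)
fid.update(real, real=True)
fid.update(fake, real=False)
print("FID:", fid.compute().item())            # lower is better

clip_score = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
prompts = ["a photo of a cat"] * 8
print("CLIP score:", clip_score(fake, prompts).item())   # higher is better
```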
Diffusion models require many forward passes (denoising steps) to generate data, which is expensive. Here are some representative papers on efficient generation with diffusion models (see the fast-sampler sketch below):
- Gotta Go Fast When Generating Data with Score-Based Models
- Fast Sampling of Diffusion Models with Exponential Integrator
- Learning fast samplers for diffusion models by differentiating through sample quality
- Accelerating Diffusion Models via Early Stop of the Diffusion Process
More papers can be found here.
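One simple, practical way to reduce the number of denoising steps is to swap in a faster solver as the sampler. Below is a minimal Diffusers sketch using DPM-Solver, which is related in spirit to the exponential-integrator work above; the checkpoint name is just an example.

```python
import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")
# Replace the default scheduler with a faster multistep solver.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

# Around 20 steps instead of the usual ~50 often gives comparable quality.
image = pipe("an astronaut riding a horse", num_inference_steps=20).images[0]
image.save("astronaut.png")
```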
Here are some representative papers of knowledge editing for diffusion models:
- Erasing Concepts from Diffusion Models (ICCV 2023)
- Editing Massive Concepts in Text-to-Image Diffusion Models
- Forget-Me-Not: Learning to Forget in Text-to-Image Diffusion Models
More papers can be found here.
Here are some survey papers discussing the challenges faced by diffusion models.
- A Survey of Diffusion Based Image Generation Models
- A Survey on Video Diffusion Models
- State of the Art on Diffusion Models for Visual Computing
- Diffusion Models in NLP: A Survey
Typical LMMs are constructed by connecting and fine-tuning existing pretrained unimodal models. Some are also pretrained from scratch. Check how LMMs evolve in the image below [image source].
There are many different ways of constructing LMMs. Representative architectures include (a minimal LMM usage sketch follows below):
- Language Models are General-Purpose Interfaces
- Flamingo: A Visual Language Model for Few-Shot Learning (NeurIPS 2022)
- BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation (ICML 2022)
- BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models (ICML 2023)
- mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration
- Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks
- Dense Connector for MLLMs
More papers can be found via Link 1 and Link 2.
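To get a feel for how such LMMs are used, here is a minimal visual question answering sketch with BLIP-2 via the Transformers library; the checkpoint name, image path, and prompt are illustrative.

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")

image = Image.open("photo.jpg")                      # any local image
prompt = "Question: what is shown in the picture? Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```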
By combining LMMs with robots, researchers aim to develop AI systems that can perceive, reason about, and act upon the world in a more natural and intuitive way, with potential applications spanning robotics, virtual assistants, autonomous vehicles, and beyond. Here are some representative works on realizing embodied AI with LMMs:
- RT-1: Robotics Transformer for Real-World Control at Scale
- RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
- RT-H: Action Hierarchies Using Language
- PaLM-E: An Embodied Multimodal Language Model
- TRANSIC: Sim-to-Real Policy Transfer by Learning from Online Correction
More papers can be found via Link 1 and Link 2.
Here are some popular simulators and datasets for evaluating LMM performance in embodied AI:
- Habitat 3.0: An Embodied AI simulation platform for studying collaborative human-robot interaction tasks in home environments
- ProcTHOR-10K: 10K Interactive Household Environments for Embodied AI
- ARNOLD: A Benchmark for Language-Grounded Task Learning With Continuous States in Realistic 3D Scenes
- LEGENT: Open Platform for Embodied Agents
- RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots
More resources can be found here.
Here are some survey papers discussing open challenges for LMM-enabled embodied AI:
- The Rise and Potential of Large Language Model Based Agents: A Survey
- Vision-Language Navigation with Embodied Intelligence: A Survey
- A Survey of Embodied AI: From Simulators to Research Tasks
- A Survey on LLM-based Autonomous Agents
- Mindstorms in Natural Language-Based Societies of Mind
Researchers are exploring new models beyond transformers. These efforts include implicitly structuring model parameters and defining new model architectures.
- Monarch Mixer: Revisiting BERT, Without Attention or MLPs
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces
- Hyena Hierarchy: Towards Larger Convolutional Language Models
- RWKV: Reinventing RNNs for the Transformer Era
- Retentive Network: A Successor to Transformer for Large Language Models
- KAN: Kolmogorov–Arnold Networks
- Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
Here is an awesome tutorial for state space models; a minimal SSM recurrence sketch follows.
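To give a feel for what these state space models compute, here is a minimal sketch of the discrete-time SSM recurrence h_t = A h_{t-1} + B x_t, y_t = C h_t + D x_t with fixed matrices; real models such as S4/Mamba parameterize and discretize these matrices carefully (Mamba additionally makes them input-dependent) and compute the scan far more efficiently.

```python
import torch

def ssm_scan(x, A, B, C, D):
    """Linear-time recurrent scan of a discrete SSM.
    x: (T, d_in); A: (N, N); B: (N, d_in); C: (d_out, N); D: (d_out, d_in)."""
    h = torch.zeros(A.shape[0])
    ys = []
    for t in range(x.shape[0]):
        h = A @ h + B @ x[t]          # update the hidden state
        ys.append(C @ h + D @ x[t])   # read out the output
    return torch.stack(ys)            # (T, d_out)

N, d_in, d_out, T = 8, 4, 4, 16       # toy sizes
x = torch.randn(T, d_in)
y = ssm_scan(x, 0.9 * torch.eye(N),   # a stable (contractive) A for illustration
             torch.randn(N, d_in), torch.randn(d_out, N), torch.randn(d_out, d_in))
```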