Course Goal: To provide learners with the advanced knowledge and practical skills needed to develop, optimize, and deploy state-of-the-art large language models (LLMs) based on cutting-edge research, with a focus on efficiency, scalability, and real-world applicability.
Prerequisites:
- Successful completion of "Modern AI Development: From Transformers to Generative Models" or equivalent knowledge.
- Strong proficiency in Python and PyTorch.
- Solid understanding of Transformer architectures, generative models (including diffusion models and flow matching), and fine-tuning techniques.
- Experience with distributed training and GPU optimization.
- Familiarity with model evaluation metrics and methodologies.
Course Duration: Approximately 14 weeks, with each module taking one to two weeks.
Tools:
- Python (>= 3.9)
- PyTorch (latest stable version, with a focus on new features relevant to the papers)
- Hugging Face Transformers, Datasets, Accelerate, and Diffusers libraries (latest versions)
- Additional libraries for specific techniques as needed (e.g., mergekit for model merging).
- Weights & Biases, Tensorboard, or other experiment tracking and visualization tools
- Jupyter Notebooks, Google Colab, or VS Code with the Remote - SSH extension
- Cloud computing platforms (e.g., AWS, GCP, Azure) for large-scale experiments
Curriculum Draft:
Module 1: Scaling Laws and Model Architecture (Week 1-2)
- Topic 1.1: Scaling Laws for LLMs:
- Deep dive into scaling laws for model size, data size, and compute budget. (Inspired by Llama 3, Gemma 2, and Qwen 2.5's approach to data scaling).
- Analyzing the relationship between model size, training data, and performance.
- Understanding compute-optimal training and why recent models (e.g., Llama 3) deliberately train far beyond compute-optimal token budgets.
- Exploring non-linear relationships in model scaling and implications for training.
- Practical considerations of scaling laws, including challenges and limitations.
- Hands-on: Implementing scaling law calculations in PyTorch and Hugging Face.
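To make this hands-on concrete, below is a minimal sketch of a Chinchilla-style compute-optimal calculation. It assumes the common C ≈ 6ND approximation (C: training FLOPs, N: parameters, D: tokens) and the roughly 20-tokens-per-parameter rule of thumb from Hoffmann et al. (2022); the ratio is a tunable assumption, and models like Llama 3 deliberately train far past it.

```python
# Chinchilla-style compute-optimal sizing (sketch, not a fitted law).
# Assumes C ~= 6 * N * D and a fixed tokens-per-parameter ratio r, so that
# N_opt = sqrt(C / (6 r)) and D_opt = r * N_opt.

def compute_optimal(flops: float, tokens_per_param: float = 20.0):
    """Return (n_params, n_tokens) that roughly exhaust a FLOP budget."""
    n_params = (flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

for budget in (1e21, 1e22, 1e23):
    n, d = compute_optimal(budget)
    print(f"C={budget:.0e}: ~{n / 1e9:.1f}B params, ~{d / 1e9:.0f}B tokens")
```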
- Topic 1.2: Advanced Transformer Architectures:
- Recap of standard Transformer components (attention, FFN, etc.).
- Grouped-Query Attention (GQA): In-depth study and implementation (Llama 3, Gemma 2).
- Related attention variants, e.g., the interleaved sliding-window and global attention used in Gemma 2.
- Rotary Position Embeddings (RoPE): Analysis and implementation, including modifications like dynamic scaling or increased base frequency (Llama 3, Qwen 2.5).
- RMSNorm: Understanding the role of normalization and its implementation.
- SwiGLU/GeGLU: Exploring different activation functions and their impact (Gemma 2).
- Hands-on: Building custom Transformer blocks with GQA and RoPE in PyTorch.
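As a reference for this exercise, here is a compact PyTorch sketch of the kind of block students will build: grouped-query attention with rotary position embeddings. Dimensions, head counts, and the RoPE base are illustrative defaults, not the settings of any particular model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def rope_cache(seq_len, head_dim, base=10000.0, device=None):
    """Cos/sin tables for rotary embeddings, one frequency per channel pair."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, device=device).float() / head_dim))
    angles = torch.outer(torch.arange(seq_len, device=device).float(), inv_freq)
    return angles.cos(), angles.sin()            # each (seq_len, head_dim // 2)

def apply_rope(x, cos, sin):
    """Rotate channel pairs of x: (batch, heads, seq, head_dim)."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = cos[None, None], sin[None, None]  # broadcast over batch/heads
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

class GQAttention(nn.Module):
    """Grouped-query attention: n_kv_heads < n_heads, each KV head serving a
    group of query heads, shrinking the KV cache by n_heads / n_kv_heads."""

    def __init__(self, d_model=512, n_heads=8, n_kv_heads=2):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = d_model // n_heads
        self.q_proj = nn.Linear(d_model, n_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * self.head_dim, d_model, bias=False)

    def forward(self, x):
        b, s, _ = x.shape
        q = self.q_proj(x).view(b, s, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, s, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, s, self.n_kv_heads, self.head_dim).transpose(1, 2)
        cos, sin = rope_cache(s, self.head_dim, device=x.device)
        q, k = apply_rope(q, cos, sin), apply_rope(k, cos, sin)
        rep = self.n_heads // self.n_kv_heads    # query heads per KV head
        k, v = k.repeat_interleave(rep, dim=1), v.repeat_interleave(rep, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, s, -1))

x = torch.randn(2, 16, 512)
print(GQAttention()(x).shape)    # torch.Size([2, 16, 512])
```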
Module 2: Mixture-of-Experts (MoE) Models (Week 3-4)
- Topic 2.1: Introduction to MoE:
- The MoE paradigm and its motivation.
- Key concepts: sparse gating, expert routing, load balancing.
- Deep dive into DeepSeekMoE and Qwen2.5-Turbo/Plus as exemplars of different MoE approaches.
- Benefits and challenges of MoE models (efficiency, scalability, stability).
- Topic 2.2: MoE Architectures and Training:
- Expert Design: Granularity of experts (e.g., FFN, attention blocks), number of experts, expert specialization (inspired by DeepSeek-V3).
- Gating Mechanisms: Softmax gating, top-k routing, noisy top-k gating, and their analysis.
- Load Balancing: Techniques for preventing expert collapse, including auxiliary loss functions and DeepSeek-V3's auxiliary-loss-free load balancing strategy.
- Training Dynamics: Challenges in training MoE models (stability, convergence).
- Hands-on: Implementing a basic MoE layer in PyTorch (see the sketch after this list).
- Hands-on: Implementing and evaluating different routing strategies, including the method used in Qwen 2.5.
- Hands-on: Implementing and comparing auxiliary loss and auxiliary-loss-free balancing strategies.
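As a starting point for these exercises, a minimal top-k routed MoE layer with a Switch/GShard-style auxiliary load-balancing loss is sketched below; dimensions are illustrative, and the loop-over-experts dispatch is for clarity, not efficiency. DeepSeek-V3's auxiliary-loss-free strategy instead adjusts per-expert bias terms on the routing scores based on observed load, and is left as an exercise.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Top-k routed mixture of FFN experts with a Switch/GShard-style
    auxiliary load-balancing loss."""

    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.n_experts, self.top_k = n_experts, top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):
        b, s, d = x.shape
        tokens = x.reshape(-1, d)                        # (T, d)
        probs = self.router(tokens).softmax(dim=-1)      # (T, E)
        top_p, top_idx = probs.topk(self.top_k, dim=-1)  # (T, k)
        top_p = top_p / top_p.sum(-1, keepdim=True)      # renormalize over chosen experts

        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            tok_ids, slot = (top_idx == e).nonzero(as_tuple=True)
            if tok_ids.numel():
                out[tok_ids] += top_p[tok_ids, slot, None] * expert(tokens[tok_ids])

        # Auxiliary loss: (fraction of routed tokens per expert) x (mean router
        # probability per expert), summed; smallest when load is uniform.
        frac = F.one_hot(top_idx, self.n_experts).float().sum((0, 1)) / top_idx.numel()
        aux_loss = self.n_experts * (frac * probs.mean(0)).sum()
        return out.reshape(b, s, d), aux_loss

y, aux = MoELayer()(torch.randn(2, 16, 512))
# total_loss = task_loss + 0.01 * aux   # small coefficient keeps routing balanced
```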
- Topic 2.3: Sparse Upcycling:
- Concept of sparse upcycling: initializing a sparse MoE model from a pretrained dense checkpoint.
- Converting dense models to MoE models.
- Strategies for initialization and training.
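A sketch of the upcycling initialization, assuming an MoE layer shaped like the one above whose experts match the dense FFN's architecture:

```python
import torch.nn as nn

def upcycle_from_dense(dense_ffn: nn.Module, moe_layer) -> None:
    """Sparse-upcycling sketch: copy a pretrained dense FFN into every expert
    (assumes each expert shares the dense FFN's module structure)."""
    for expert in moe_layer.experts:
        expert.load_state_dict(dense_ffn.state_dict())
    # Start the router near zero so initial routing is close to uniform and
    # the upcycled model begins as (roughly) the dense model.
    nn.init.normal_(moe_layer.router.weight, std=1e-3)
```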
Module 3: Data: Curation, Quality, and Efficiency (Week 5-6)
- Topic 3.1: Data Curation for LLMs:
- Data sources and collection strategies (web data, books, code, etc.).
- Data filtering and cleaning techniques (deduplication, quality filtering, safety filtering).
- Analyzing data distributions and identifying biases.
- Data security and privacy considerations (PII removal, de-identification).
- Techniques used in Llama 3, Gemma 2, and Qwen 2.5 for data quality assurance.
- Hands-on: Implementing data filtering pipelines using PyTorch and standard Python libraries.
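A skeletal filtering pipeline with Hugging Face datasets is sketched below. The heuristics and file paths are placeholders; real pipelines layer many more filters (language ID, perplexity, safety classifiers) and use MinHash/LSH for near-duplicate detection.

```python
import hashlib
from datasets import load_dataset

def quality_ok(example):
    """Toy quality heuristics; production filters are far richer."""
    words = example["text"].split()
    if not (50 <= len(words) <= 20_000):       # drop very short/long docs
        return False
    if len(set(words)) / len(words) < 0.3:     # drop heavily repetitive docs
        return False
    return True

seen = set()
def not_duplicate(example):
    """Exact dedup by content hash (single-process only, since `seen` is shared)."""
    h = hashlib.sha256(example["text"].encode("utf-8")).hexdigest()
    if h in seen:
        return False
    seen.add(h)
    return True

ds = load_dataset("text", data_files={"train": "corpus/*.txt"})["train"]  # placeholder path
ds = ds.filter(quality_ok).filter(not_duplicate)
print(f"kept {len(ds)} documents")
```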
- Topic 3.2: Data Mixing and Sampling:
- Strategies for creating effective data mixtures (e.g., balancing domains, languages).
- Techniques for upsampling high-quality data (inspired by Llama 3 and Gemma 2).
- Dynamic data mixing and its impact on training.
- Data annealing and its role in improving performance.
- Hands-on: Experimenting with different data mixing strategies using Hugging Face datasets.
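A minimal sketch of static domain mixing with datasets; the corpora and sampling weights are placeholders. Dynamic mixing and annealing amount to rebuilding this mixture with different probabilities as training progresses.

```python
from datasets import load_dataset, interleave_datasets

# Hypothetical domain corpora; weights are illustrative, not tuned.
web  = load_dataset("text", data_files="data/web.txt")["train"]
code = load_dataset("text", data_files="data/code.txt")["train"]
math = load_dataset("text", data_files="data/math.txt")["train"]

mixture = interleave_datasets(
    [web, code, math],
    probabilities=[0.7, 0.2, 0.1],       # sampling weight per domain
    seed=42,
    stopping_strategy="all_exhausted",   # oversample smaller sets as needed
)
```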
- Topic 3.3: Tokenization and Vocabulary:
- Advanced tokenization methods (Byte-level BPE, SentencePiece, WordPiece).
- Vocabulary size considerations and their impact on performance.
- Handling multilingual data and code.
- Customizing tokenizers for specific domains or tasks.
- Hands-on: Training and evaluating custom tokenizers using Hugging Face tokenizers.
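A minimal byte-level BPE training sketch with the tokenizers library; the vocabulary size, special tokens, and corpus path are illustrative choices.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Byte-level BPE, roughly the family used by the models discussed here.
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=32_000,                       # illustrative; vocab size is a design choice
    special_tokens=["<|bos|>", "<|eos|>"],   # hypothetical special tokens
)
tokenizer.train(files=["corpus/train.txt"], trainer=trainer)  # placeholder path

enc = tokenizer.encode("def rope(theta): ...")
print(enc.tokens, len(enc.ids))
tokenizer.save("custom_tokenizer.json")
```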
- Topic 3.4: Synthetic Data Generation:
- Techniques for generating synthetic data to augment training (e.g., back-translation, paraphrasing, prompt engineering).
- Using LLMs for data generation (inspired by Llama 3, Qwen 2.5, and DeepSeek-V3).
- Evaluating the quality and diversity of synthetic data.
- Ensuring safety and alignment in synthetic data.
Module 4: Optimization and Regularization (Week 7)
- Topic 4.1: Advanced Optimization Techniques:
- Review of standard optimizers (Adam, AdamW).
- Revisiting architectural choices such as GQA (Module 1) for their impact on memory and computation during training.
- Memory-efficient optimizers (e.g., 8-bit optimizers).
- Gradient checkpointing and other memory-saving techniques.
- Optimizer state sharding and its benefits.
- Learning rate schedules and their impact on convergence.
- Hands-on: Implementing and comparing different optimizers and learning rate schedules in PyTorch.
- Hands-on: Exploring memory optimization techniques with Hugging Face accelerate.
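A short sketch combining several of these techniques: gradient checkpointing, an 8-bit optimizer (assuming bitsandbytes is installed), and a warmup-plus-cosine schedule. gpt2 stands in for a larger model; accelerate wraps the same ideas behind a unified API.

```python
import bitsandbytes as bnb                      # 8-bit optimizer states
from transformers import AutoModelForCausalLM, get_cosine_schedule_with_warmup

model = AutoModelForCausalLM.from_pretrained("gpt2")   # small stand-in model
model.gradient_checkpointing_enable()                  # trade recompute for memory

# AdamW with 8-bit optimizer states; torch.optim.AdamW is the full-precision
# drop-in if bitsandbytes is unavailable.
optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=3e-4, weight_decay=0.1)

scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=2_000, num_training_steps=100_000
)
# Training loop: loss.backward(); optimizer.step(); scheduler.step(); zero_grad().
```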
- Topic 4.2: Numerical Stability and Precision:
- Challenges of training with mixed precision (FP32, BF16, FP8).
- Strategies for maintaining numerical stability (e.g., loss scaling, gradient clipping).
- The use of FP8 in training (DeepSeek-V3) and its implementation details.
- Techniques for mitigating the impact of quantization errors.
- Hands-on: Implementing and evaluating FP8 training in PyTorch.
- Hands-on: Implementing and analyzing different quantization strategies.
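FP8 training as in DeepSeek-V3 requires specialized kernels (e.g., NVIDIA's Transformer Engine) and is beyond a short sketch, but the core stability machinery — dynamic loss scaling plus gradient clipping for FP16 mixed precision — looks like this in plain PyTorch (assumes a CUDA device):

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()            # stand-in for an LLM block
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()                   # dynamic loss scaling for FP16

for step in range(10):
    x = torch.randn(8, 1024, device="cuda")
    with torch.autocast("cuda", dtype=torch.float16):  # bfloat16 would skip scaling
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)                         # so clipping sees true grads
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad(set_to_none=True)
```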
- Topic 4.3: Regularization and Generalization:
- Advanced regularization techniques (e.g., dropout, weight decay).
- Data augmentation for LLMs.
- Techniques for improving generalization and robustness (e.g., adversarial training).
Module 5: Fine-tuning and Alignment (Week 8-9)
- Topic 5.1: Instruction Fine-Tuning:
- Creating high-quality instruction datasets.
- Techniques for instruction fine-tuning (supervised fine-tuning, rejection sampling).
- Strategies for multi-task and multi-lingual fine-tuning.
- Evaluating instruction following capabilities (e.g., using IFEval as in Llama 3).
- Hands-on: Implementing instruction fine-tuning pipelines using PyTorch and Hugging Face.
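The central detail of SFT pipelines is loss masking: cross-entropy is computed only on response tokens. A minimal sketch (chat templates and padding omitted; assumes a tokenizer with an eos_token_id):

```python
import torch

def build_sft_example(tokenizer, prompt: str, response: str, ignore_index=-100):
    """Tokenize a (prompt, response) pair and mask the prompt positions so the
    loss is computed only on response tokens -- standard SFT practice."""
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(response, add_special_tokens=False)["input_ids"]
    input_ids = prompt_ids + response_ids + [tokenizer.eos_token_id]
    labels = [ignore_index] * len(prompt_ids) + response_ids + [tokenizer.eos_token_id]
    return {"input_ids": torch.tensor(input_ids), "labels": torch.tensor(labels)}
```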
- Topic 5.2: Alignment with Human Preferences:
- Reinforcement Learning from Human Feedback (RLHF) and its limitations.
- Direct Preference Optimization (DPO) and its implementation (Llama 3).
- Group Relative Policy Optimization (GRPO) as used in DeepSeek-V3.
- Other alignment techniques (e.g., Constitutional AI).
- Collecting and using preference data for alignment.
- Hands-on: Implementing DPO in PyTorch.
- Hands-on: Implementing and evaluating GRPO.
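For reference, the DPO objective itself is only a few lines once per-response log-probabilities are in hand; GRPO differs by computing advantages relative to a group of sampled responses rather than using a learned value function.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss (Rafailov et al., 2023). Inputs are summed log-probs of whole
    responses under the trainable policy and the frozen reference model."""
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```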
- Topic 5.3: Safety and Responsibility:
- Techniques for mitigating harmful, biased, or unsafe outputs.
- Safety fine-tuning and alignment (Llama 3).
- Implementing and evaluating safety filters.
- Responsible use of LLMs and ethical considerations.
- Developing and using tools like Llama Guard.
- Hands-on: Experimenting with safety fine-tuning strategies.
- Topic 5.4: Model Merging and Ensembling:
- Techniques for merging multiple models or checkpoints.
- Strategies for ensembling models to improve performance and robustness.
- Exploring different model merging techniques (e.g., linear interpolation, task arithmetic).
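Both techniques reduce to simple arithmetic on state dicts, as sketched below for checkpoints with identical architectures; libraries like mergekit implement more elaborate variants (TIES, DARE, etc.).

```python
import torch

def merge_state_dicts(sd_a, sd_b, alpha: float = 0.5):
    """Linear interpolation ('model soup' style) of two compatible checkpoints."""
    return {k: alpha * sd_a[k] + (1 - alpha) * sd_b[k] for k in sd_a}

def task_arithmetic(base_sd, finetuned_sd, scale: float = 1.0):
    """Task vector = finetuned - base; add a scaled task vector to the base."""
    return {k: base_sd[k] + scale * (finetuned_sd[k] - base_sd[k]) for k in base_sd}
```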
Module 6: Specialized Capabilities and Applications (Week 10-11)
- Topic 6.1: Long Context Modeling:
- Techniques for extending the context window of LLMs (e.g., context length extension in Llama 3).
- Analyzing the impact of context length on performance.
- Strategies for efficient inference with long contexts.
- Evaluating long context capabilities (e.g., using benchmarks like LongBench and ZeroSCROLLS).
- Hands-on: Implementing and evaluating context length extension techniques.
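The two most common RoPE-based extension knobs fit in a few lines: raising the base frequency (Llama 3 moves it from 10,000 to 500,000) or position interpolation, which compresses positions by a scale factor. A sketch, reusing the rope_cache convention from Module 1:

```python
import torch

def rope_inv_freq(head_dim: int, base: float = 10000.0, scale: float = 1.0):
    """RoPE frequencies with two common context-extension knobs:
    - raise `base` to slow the rotation (NTK-style / Llama 3's 500k base), or
    - set `scale` > 1 for position interpolation, which divides positions by
      `scale` (equivalently, divides the frequencies, as done here)."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    return inv_freq / scale

orig = rope_inv_freq(128, base=10_000.0)               # original context window
extended = rope_inv_freq(128, base=10_000.0, scale=4.0)  # 4x position interpolation
```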
- Topic 6.2: Multilingual LLMs:
- Training and fine-tuning LLMs for multiple languages (inspired by Llama 3 and Qwen 2.5).
- Strategies for improving cross-lingual transfer.
- Evaluating multilingual capabilities (e.g., using multilingual benchmarks).
- Hands-on: Fine-tuning an LLM on multilingual data.
- Topic 6.3: Code Generation and Mathematical Reasoning:
- Specialized techniques for code generation and debugging (e.g., using execution feedback as in Llama 3).
- Enhancing mathematical reasoning capabilities (e.g., using techniques from DeepSeek-V3, Qwen 2.5, and Gemma 2).
- Evaluating code generation and math reasoning abilities.
- Hands-on: Fine-tuning an LLM for code generation and math reasoning.
- Topic 6.4: Tool Use and Augmentation:
- Integrating LLMs with external tools (e.g., search engines, code interpreters).
- Training LLMs to use tools effectively (Llama 3).
- Evaluating tool use capabilities.
- Implementing and evaluating retrieval-augmented generation (RAG).
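A bare-bones dense-retrieval sketch using sentence-transformers (the embedding model and documents are placeholders); production RAG adds chunking, vector indexes, and reranking.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # assumes library installed

docs = [
    "RoPE rotates query/key channel pairs by position-dependent angles.",
    "GQA shares key/value heads across groups of query heads.",
    "Speculative decoding drafts tokens with a small model.",
]
embedder = SentenceTransformer("all-MiniLM-L6-v2")      # illustrative model choice
doc_vecs = embedder.encode(docs, normalize_embeddings=True)

def retrieve(query: str, k: int = 2):
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q                               # cosine similarity (normalized)
    return [docs[i] for i in np.argsort(-scores)[:k]]

context = "\n".join(retrieve("How does grouped-query attention save memory?"))
prompt = f"Answer using the context below.\n\nContext:\n{context}\n\nQuestion: ..."
```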
Module 7: Deployment and Inference (Week 12)
- Topic 7.1: Model Quantization and Optimization:
- Techniques for quantizing LLMs (e.g., INT8, FP8 quantization).
- Model pruning and other optimization strategies.
- Optimizing inference for different hardware platforms (e.g., GPUs, TPUs).
- Hands-on: Implementing and evaluating model quantization techniques.
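A minimal weight-only quantization sketch via transformers and bitsandbytes (assumes a CUDA GPU; the checkpoint name is illustrative and gated):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 8-bit weight loading via bitsandbytes; 4-bit (nf4) follows the same pattern
# with load_in_4bit=True.
quant_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",       # illustrative checkpoint; any causal LM works
    quantization_config=quant_config,
    device_map="auto",
)
print(model.get_memory_footprint() / 1e9, "GB")   # compare against the fp16 footprint
```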
- Topic 7.2: Efficient Inference Strategies:
- Techniques for improving inference speed (e.g., caching, speculative decoding).
- Strategies for handling long sequences during inference.
- Optimizing batch size and sequence length for different applications.
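Two of these techniques are exposed directly by transformers: KV caching (use_cache) and speculative ("assisted") decoding, where a small draft model proposes tokens that the target model verifies in one forward pass. A sketch with GPT-2 checkpoints standing in for a real target/draft pair:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2-xl")
target = AutoModelForCausalLM.from_pretrained("gpt2-xl")
draft = AutoModelForCausalLM.from_pretrained("gpt2")   # must share the tokenizer

inputs = tok("Speculative decoding works by", return_tensors="pt")
out = target.generate(
    **inputs,
    assistant_model=draft,    # enables draft-and-verify (assisted) decoding
    max_new_tokens=50,
    use_cache=True,           # KV caching is on by default; shown for emphasis
)
print(tok.decode(out[0], skip_special_tokens=True))
```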
- Topic 7.3: Serving and Deployment:
- Deploying LLMs using serving frameworks (e.g., vLLM, Hugging Face Text Generation Inference, TorchServe).
- Implementing system-level safety filters (e.g., Llama Guard).
- Monitoring and logging deployed models.
- Scaling LLM deployment to handle large numbers of requests.
Module 8: Emerging Trends and Future Directions (Week 13-14)
- Topic 8.1: Multimodal LLMs:
- Integrating vision and speech capabilities into LLMs (inspired by Llama 3's multimodal experiments).
- Training and evaluating multimodal models.
- Exploring different architectures for multimodal learning.
- Topic 8.2: Advanced Reasoning and Planning:
- Techniques for improving the reasoning and planning abilities of LLMs.
- Integrating LLMs with symbolic reasoning systems.
- Exploring methods for long-horizon planning and decision-making.
- Topic 8.3: Research Frontiers:
- Discussion of current research trends and open challenges in LLM development.
- Exploring new architectures, training methods, and evaluation techniques.
- Considering the ethical implications of advanced LLMs and their societal impact.
- Topic 8.4: Final Project Presentations and Review
- Students present their final projects.
- Peer review and feedback.
- Course wrap-up and discussion of future learning paths.
Assessment:
- Weekly quizzes to test comprehension of key concepts.
- Programming assignments involving implementation of core techniques.
- Mid-term project: Fine-tuning an LLM for a specific task or domain, with a focus on optimization and efficiency.
- Final project: Developing, training, evaluating, and potentially deploying an advanced LLM, incorporating concepts learned throughout the course. This could involve:
- Implementing and evaluating a specific technique from one of the research papers.
- Developing a novel approach to scaling, optimization, or alignment.
- Building an application that leverages the capabilities of LLMs.
- Creating a custom MoE model and exploring different training strategies.
- Fine-tuning an LLM with a focus on safety and responsibility, including the implementation of safety filters and evaluation of the model's behavior.
Pedagogical Considerations:
- Hands-on, Project-Based Learning: The course will emphasize practical implementation and experimentation, with a strong focus on building and evaluating models.
- Research-Driven: The curriculum will be closely aligned with cutting-edge research, exposing learners to the latest techniques and challenges in LLM development.
- Collaborative Learning: Encourage students to collaborate on projects and share their findings.
- Ethical Considerations: Integrate discussions on the ethical implications of LLMs throughout the course.
- Focus on Efficiency and Scalability: Emphasize techniques for optimizing model training and inference, and for scaling models to handle large datasets and complex tasks.
- In-depth analysis: Encourage students to critically analyze research papers and understand the nuances of different approaches.