Advanced LLM Development: Scaling, Optimization, and Deployment

Course Goal: To provide learners with the advanced knowledge and practical skills needed to develop, optimize, and deploy state-of-the-art large language models (LLMs) based on cutting-edge research, with a focus on efficiency, scalability, and real-world applicability.

Prerequisites:

  • Successful completion of "Modern AI Development: From Transformers to Generative Models" or equivalent knowledge.
  • Strong proficiency in Python and PyTorch.
  • Solid understanding of Transformer architectures, generative models (including diffusion models and flow matching), and fine-tuning techniques.
  • Experience with distributed training and GPU optimization.
  • Familiarity with model evaluation metrics and methodologies.

Course Duration: Approximately 10-14 weeks, with each module taking roughly 1-2 weeks.

Tools:

  • Python (>= 3.9)
  • PyTorch (latest stable version, with a focus on new features relevant to the papers)
  • Hugging Face Transformers, Datasets, Accelerate, and Diffusers libraries (latest versions)
  • Additional libraries for specific techniques (e.g., model merging), as needed.
  • Weights & Biases, TensorBoard, or other experiment tracking and visualization tools
  • Jupyter Notebooks/Google Colab/VS Code with Remote - SSH
  • Cloud computing platforms (e.g., AWS, GCP, Azure) for large-scale experiments

Curriculum Draft:

Module 1: Scaling Laws and Model Architecture (Week 1-2)

  • Topic 1.1: Scaling Laws for LLMs:
    • Deep dive into scaling laws for model size, data size, and compute budget. (Inspired by Llama 3, Gemma 2, and Qwen 2.5's approach to data scaling).
    • Analyzing the relationship between model size, training data, and performance.
    • Understanding the concept of compute-optimal models and deviations from it.
    • Exploring non-linear relationships in model scaling and implications for training.
    • Practical considerations of scaling laws, including challenges and limitations.
    • Hands-on: Implementing scaling law calculations in PyTorch and Hugging Face.
  • Topic 1.2: Advanced Transformer Architectures:
    • Recap of standard Transformer components (attention, FFN, etc.).
    • Grouped-Query Attention (GQA): In-depth study and implementation (Llama 3, Gemma 2).
      • Explore related efficient-attention variants (e.g., the sliding-window attention used in Gemma 2).
    • Rotary Position Embeddings (RoPE): Analysis and implementation, including modifications like dynamic scaling or increased base frequency (Llama 3, Qwen 2.5).
    • RMSNorm: Understanding the role of normalization and its implementation.
    • SwiGLU/GeGLU: Exploring different activation functions and their impact (Gemma 2).
    • Hands-on: Building custom Transformer blocks with GQA and RoPE in PyTorch.
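
To make Topic 1.2 concrete, here is a minimal sketch of a Grouped-Query Attention block in PyTorch. It is a simplified, hypothetical implementation for illustration only: RoPE, padding masks, dropout, and KV caching are omitted, and the dimensions and head counts are arbitrary.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GroupedQueryAttention(nn.Module):
        def __init__(self, d_model: int, n_heads: int, n_kv_heads: int):
            super().__init__()
            assert n_heads % n_kv_heads == 0, "query heads must divide evenly into KV groups"
            self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
            self.head_dim = d_model // n_heads
            self.q_proj = nn.Linear(d_model, n_heads * self.head_dim, bias=False)
            self.k_proj = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)
            self.v_proj = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)
            self.o_proj = nn.Linear(n_heads * self.head_dim, d_model, bias=False)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            b, t, _ = x.shape
            q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
            k = self.k_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
            v = self.v_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
            # Each group of query heads shares one KV head: repeat KV heads to match.
            repeat = self.n_heads // self.n_kv_heads
            k = k.repeat_interleave(repeat, dim=1)
            v = v.repeat_interleave(repeat, dim=1)
            out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
            return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))

    x = torch.randn(2, 16, 512)
    attn = GroupedQueryAttention(d_model=512, n_heads=8, n_kv_heads=2)
    print(attn(x).shape)  # torch.Size([2, 16, 512])

The saving over standard multi-head attention comes from the smaller K/V projections (and, at inference time, a proportionally smaller KV cache); production kernels avoid the explicit repeat_interleave.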

Module 2: Mixture-of-Experts (MoE) Models (Week 3-4)

  • Topic 2.1: Introduction to MoE:
    • The MoE paradigm and its motivation.
    • Key concepts: sparse gating, expert routing, load balancing.
    • Deep Dive into DeepSeekMoE and Qwen2.5-Turbo/Plus as exemplars of different MoE approaches.
    • Benefits and challenges of MoE models (efficiency, scalability, stability).
  • Topic 2.2: MoE Architectures and Training:
    • Expert Design: Granularity of experts (e.g., FFN, attention blocks), number of experts, expert specialization (inspired by DeepSeek-V3).
    • Gating Mechanisms: Softmax gating, top-k routing, noisy top-k gating, and their analysis.
    • Load Balancing: Techniques for preventing expert collapse, including auxiliary loss functions and DeepSeek-V3's auxiliary-loss-free load balancing strategy.
    • Training Dynamics: Challenges in training MoE models (stability, convergence).
    • Hands-on: Implementing a basic MoE layer in PyTorch (see the sketch at the end of this module).
    • Hands-on: Implementing and evaluating different routing strategies, including the method used in Qwen 2.5.
    • Hands-on: Implementing and comparing auxiliary loss and auxiliary-loss-free balancing strategies.
  • Topic 2.3: Sparse Upcycling:
    • Concept of sparse upcycling as discussed in the DeepSeek-V3 paper.
    • Converting dense models to MoE models.
    • Strategies for initialization and training.
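
The following is a minimal sketch of the top-k-routed MoE layer targeted by the Topic 2.2 hands-on exercises. The expert sizes, expert count, and the dense Python loop over experts are illustrative only; load-balancing losses and the batched dispatch used in efficient implementations are omitted.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TopKMoELayer(nn.Module):
        """Sparse MoE feed-forward layer: a softmax router sends each token to k experts."""
        def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, k: int = 2):
            super().__init__()
            self.k = k
            self.router = nn.Linear(d_model, n_experts, bias=False)
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
                for _ in range(n_experts)
            ])

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            b, t, d = x.shape
            flat = x.reshape(-1, d)                        # (tokens, d_model)
            probs = F.softmax(self.router(flat), dim=-1)   # routing probabilities
            weights, idx = probs.topk(self.k, dim=-1)      # top-k experts per token
            weights = weights / weights.sum(dim=-1, keepdim=True)
            out = torch.zeros_like(flat)
            for e, expert in enumerate(self.experts):
                mask = idx == e                            # tokens routed to expert e
                if mask.any():
                    token_ids, slot = mask.nonzero(as_tuple=True)
                    out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert(flat[token_ids])
            return out.reshape(b, t, d)

    layer = TopKMoELayer(d_model=256, d_ff=1024)
    print(layer(torch.randn(2, 8, 256)).shape)  # torch.Size([2, 8, 256])

In the assignments this skeleton can be extended with an auxiliary load-balancing loss and compared against an auxiliary-loss-free balancing strategy in the spirit of DeepSeek-V3.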

Module 3: Data: Curation, Quality, and Efficiency (Week 5-6)

  • Topic 3.1: Data Curation for LLMs:
    • Data sources and collection strategies (web data, books, code, etc.).
    • Data filtering and cleaning techniques (deduplication, quality filtering, safety filtering).
    • Analyzing data distributions and identifying biases.
    • Data security and privacy considerations (PII removal, de-identification).
    • Techniques used in Llama 3, Gemma 2, and Qwen 2.5 for data quality assurance.
    • Hands-on: Implementing data filtering pipelines using PyTorch and standard Python libraries.
  • Topic 3.2: Data Mixing and Sampling:
    • Strategies for creating effective data mixtures (e.g., balancing domains, languages).
    • Techniques for upsampling high-quality data (inspired by Llama 3 and Gemma 2).
    • Dynamic data mixing and its impact on training.
    • Data annealing and its role in improving performance.
    • Hands-on: Experimenting with different data mixing strategies using Hugging Face Datasets.
  • Topic 3.3: Tokenization and Vocabulary:
    • Advanced tokenization methods (Byte-level BPE, SentencePiece, WordPiece).
    • Vocabulary size considerations and their impact on performance.
    • Handling multilingual data and code.
    • Customizing tokenizers for specific domains or tasks.
    • Hands-on: Training and evaluating custom tokenizers with the Hugging Face Tokenizers library (see the sketch at the end of this module).
  • Topic 3.4: Synthetic Data Generation:
    • Techniques for generating synthetic data to augment training (e.g., back-translation, paraphrasing, prompt engineering).
    • Using LLMs for data generation (inspired by Llama 3, Qwen 2.5, and DeepSeek-V3).
    • Evaluating the quality and diversity of synthetic data.
    • Ensuring safety and alignment in synthetic data.
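
As a reference point for the Topic 3.3 hands-on exercise, here is a minimal sketch of training a byte-level BPE tokenizer with the Hugging Face Tokenizers library. The corpus path, vocabulary size, and special token are placeholders to be chosen for your own data.

    from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

    tokenizer = Tokenizer(models.BPE())
    tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
    tokenizer.decoder = decoders.ByteLevel()

    trainer = trainers.BpeTrainer(
        vocab_size=32_000,                                    # placeholder; see Topic 3.3
        special_tokens=["<|endoftext|>"],                     # placeholder special token
        initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
    )
    tokenizer.train(files=["corpus.txt"], trainer=trainer)    # "corpus.txt" is a placeholder path
    print(tokenizer.encode("def hello_world():").tokens)

Evaluating the result typically means measuring fertility (tokens per word or per byte) on held-out text from each target domain and language, and checking how code and non-Latin scripts are segmented.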

Module 4: Optimization and Regularization (Week 7)

  • Topic 4.1: Advanced Optimization Techniques:
    • Review of standard optimizers (Adam, AdamW).
    • Revisiting Grouped-Query Attention (GQA) from an efficiency standpoint: its impact on memory and computation.
    • Memory-efficient optimizers (e.g., 8-bit optimizers).
    • Gradient checkpointing and other memory-saving techniques.
    • Optimizer state sharding and its benefits.
    • Learning rate schedules and their impact on convergence.
    • Hands-on: Implementing and comparing different optimizers and learning rate schedules in PyTorch.
    • Hands-on: Exploring memory optimization techniques with Hugging Face Accelerate.
  • Topic 4.2: Numerical Stability and Precision:
    • Challenges of training with mixed precision (FP32, BF16, FP8).
    • Strategies for maintaining numerical stability (e.g., loss scaling, gradient clipping); a minimal mixed-precision sketch appears at the end of this module.
    • The use of FP8 in training (DeepSeek-V3) and its implementation details.
    • Techniques for mitigating the impact of quantization errors.
    • Hands-on: Implementing and evaluating FP8 training in PyTorch.
    • Hands-on: Implementing and analyzing different quantization strategies.
  • Topic 4.3: Regularization and Generalization:
    • Advanced regularization techniques (e.g., dropout, weight decay).
    • Data augmentation for LLMs.
    • Techniques for improving generalization and robustness (e.g., adversarial training).
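
As a starting point for Topics 4.1 and 4.2, here is a minimal FP16 mixed-precision training step with dynamic loss scaling and gradient clipping. The tiny model, synthetic data, and hyperparameters are placeholders, and a CUDA device is assumed; the same pattern carries over to a real LLM training loop.

    import torch
    import torch.nn as nn

    # Toy model and random data stand in for a real LLM and dataloader.
    model = nn.Sequential(nn.Linear(256, 1024), nn.GELU(), nn.Linear(1024, 256)).cuda()
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
    scaler = torch.cuda.amp.GradScaler()          # dynamic loss scaling for FP16

    for step in range(10):
        x = torch.randn(8, 256, device="cuda")
        optimizer.zero_grad(set_to_none=True)
        with torch.cuda.amp.autocast(dtype=torch.float16):
            loss = (model(x) - x).pow(2).mean()   # forward pass in reduced precision
        scaler.scale(loss).backward()             # scale loss to avoid FP16 gradient underflow
        scaler.unscale_(optimizer)                # unscale so clipping sees true gradient norms
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        scaler.step(optimizer)                    # the step is skipped if gradients overflowed
        scaler.update()

BF16 training usually drops the GradScaler because of its wider exponent range, while FP8 training (as in DeepSeek-V3) additionally requires per-tensor scaling of weights and activations.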

Module 5: Fine-tuning and Alignment (Week 8-9)

  • Topic 5.1: Instruction Fine-Tuning:
    • Creating high-quality instruction datasets.
    • Techniques for instruction fine-tuning (supervised fine-tuning, rejection sampling).
    • Strategies for multi-task and multi-lingual fine-tuning.
    • Evaluating instruction following capabilities (e.g., using IFEval as in Llama 3).
    • Hands-on: Implementing instruction fine-tuning pipelines using PyTorch and Hugging Face.
  • Topic 5.2: Alignment with Human Preferences:
    • Reinforcement Learning from Human Feedback (RLHF) and its limitations.
    • Direct Preference Optimization (DPO) and its implementation (Llama 3).
    • Group Relative Policy Optimization (GRPO) as used in DeepSeek-V3.
    • Other alignment techniques (e.g., Constitutional AI).
    • Collecting and using preference data for alignment.
    • Hands-on: Implementing DPO in PyTorch (see the loss-function sketch at the end of this module).
    • Hands-on: Implementing and evaluating GRPO.
  • Topic 5.3: Safety and Responsibility:
    • Techniques for mitigating harmful, biased, or unsafe outputs.
    • Safety fine-tuning and alignment (Llama 3).
    • Implementing and evaluating safety filters.
    • Responsible use of LLMs and ethical considerations.
    • Developing and using tools like Llama Guard.
    • Hands-on: Experimenting with safety fine-tuning strategies.
  • Topic 5.4: Model Merging and Ensembling:
    • Techniques for merging multiple models or checkpoints.
    • Strategies for ensembling models to improve performance and robustness.
    • Exploring different model merging techniques (e.g., linear interpolation, task arithmetic).
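
For the Topic 5.2 hands-on work, the core of DPO reduces to a loss over sequence-level log-probabilities; here is a minimal sketch. The inputs are assumed to be the summed token log-likelihoods of the chosen and rejected responses under the trainable policy and under a frozen reference model, and beta is a tunable temperature.

    import torch
    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logps: torch.Tensor,
                 policy_rejected_logps: torch.Tensor,
                 ref_chosen_logps: torch.Tensor,
                 ref_rejected_logps: torch.Tensor,
                 beta: float = 0.1) -> torch.Tensor:
        # How much more the policy prefers the chosen response over the rejected one...
        policy_margin = policy_chosen_logps - policy_rejected_logps
        # ...relative to the frozen reference model.
        ref_margin = ref_chosen_logps - ref_rejected_logps
        logits = beta * (policy_margin - ref_margin)
        # Maximize the probability that the policy's preference margin exceeds the reference's.
        return -F.logsigmoid(logits).mean()

    # Toy batch of 4 preference pairs with random log-probabilities.
    loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))

In practice the log-probabilities are computed over response tokens only and both models share a tokenizer; libraries such as Hugging Face TRL package this loop, but implementing it once by hand is the point of the exercise.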

Module 6: Specialized Capabilities and Applications (Week 10-11)

  • Topic 6.1: Long Context Modeling:
    • Techniques for extending the context window of LLMs (e.g., context length extension in Llama 3).
    • Analyzing the impact of context length on performance.
    • Strategies for efficient inference with long contexts.
    • Evaluating long context capabilities (e.g., using benchmarks like LongBench, ZeroSCROLLS).
    • Hands-on: Implementing and evaluating context length extension techniques (see the RoPE scaling sketch at the end of this module).
  • Topic 6.2: Multilingual LLMs:
    • Training and fine-tuning LLMs for multiple languages (inspired by Llama 3 and Qwen 2.5).
    • Strategies for improving cross-lingual transfer.
    • Evaluating multilingual capabilities (e.g., using multilingual benchmarks).
    • Hands-on: Fine-tuning an LLM on multilingual data.
  • Topic 6.3: Code Generation and Mathematical Reasoning:
    • Specialized techniques for code generation and debugging (e.g., using execution feedback as in Llama 3).
    • Enhancing mathematical reasoning capabilities (e.g., using techniques from DeepSeek-V3, Qwen 2.5, and Gemma 2).
    • Evaluating code generation and math reasoning abilities.
    • Hands-on: Fine-tuning an LLM for code generation and math reasoning.
  • Topic 6.4: Tool Use and Augmentation:
    • Integrating LLMs with external tools (e.g., search engines, code interpreters).
    • Training LLMs to use tools effectively (Llama 3).
    • Evaluating tool use capabilities.
    • Implementing and evaluating retrieval-augmented generation (RAG).
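
One concrete entry point for the Topic 6.1 hands-on exercise is position interpolation for RoPE: compressing position indices so that a model trained on a short context can attend over a longer one, followed by fine-tuning. The sketch below builds the cos/sin tables with an illustrative scaling factor; head dimension and context lengths are placeholders.

    import torch

    def rope_cos_sin(head_dim: int, max_pos: int, base: float = 10000.0, scale: float = 1.0):
        # Standard RoPE frequency table; scale > 1 applies position interpolation,
        # dividing positions so they stay within the range seen during pre-training.
        inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
        positions = torch.arange(max_pos, dtype=torch.float32) / scale
        angles = torch.outer(positions, inv_freq)   # (max_pos, head_dim // 2)
        return angles.cos(), angles.sin()

    # Table for an original 4k context vs. an interpolated table covering 16k positions.
    cos_4k, sin_4k = rope_cos_sin(head_dim=128, max_pos=4096)
    cos_16k, sin_16k = rope_cos_sin(head_dim=128, max_pos=16384, scale=4.0)

An alternative noted in Topic 1.2 is raising the RoPE base frequency (as in Llama 3 and Qwen 2.5); comparing the two approaches on the long-context benchmarks above is a natural extension of the exercise.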

Module 7: Deployment and Inference (Week 12)

  • Topic 7.1: Model Quantization and Optimization:
    • Techniques for quantizing LLMs (e.g., INT8, FP8 quantization).
    • Model pruning and other optimization strategies.
    • Optimizing inference for different hardware platforms (e.g., GPUs, TPUs).
    • Hands-on: Implementing and evaluating model quantization techniques (see the sketch at the end of this module).
  • Topic 7.2: Efficient Inference Strategies:
    • Techniques for improving inference speed (e.g., caching, speculative decoding).
    • Strategies for handling long sequences during inference.
    • Optimizing batch size and sequence length for different applications.
  • Topic 7.3: Serving and Deployment:
    • Deploying LLMs using serving frameworks (e.g., TorchServe, TensorFlow Serving).
    • Implementing system-level safety filters (e.g., Llama Guard).
    • Monitoring and logging deployed models.
    • Scaling LLM deployment to handle large numbers of requests.
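
As a small, self-contained illustration of the Topic 7.1 hands-on item, the sketch below applies PyTorch's post-training dynamic INT8 quantization to a toy dense model and measures the output error. Real LLM deployments typically use weight-only or FP8 schemes via dedicated libraries, but the workflow of quantizing and then comparing against the full-precision baseline is the same.

    import torch
    import torch.nn as nn

    # Toy stack of linear layers standing in for an LLM's feed-forward blocks.
    model = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512)).eval()

    # Dynamic quantization: weights stored in INT8, activations quantized on the fly.
    quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

    x = torch.randn(4, 512)
    with torch.no_grad():
        fp32_out, int8_out = model(x), quantized(x)
    print("max abs error:", (fp32_out - int8_out).abs().max().item())

Evaluation on the actual task (perplexity or downstream benchmarks), rather than raw tensor error, ultimately decides whether a quantization scheme is acceptable.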

Module 8: Emerging Trends and Future Directions (Week 13-14)

  • Topic 8.1: Multimodal LLMs:
    • Integrating vision and speech capabilities into LLMs (inspired by Llama 3's multimodal experiments).
    • Training and evaluating multimodal models.
    • Exploring different architectures for multimodal learning.
  • Topic 8.2: Advanced Reasoning and Planning:
    • Techniques for improving the reasoning and planning abilities of LLMs.
    • Integrating LLMs with symbolic reasoning systems.
    • Exploring methods for long-horizon planning and decision-making.
  • Topic 8.3: Research Frontiers:
    • Discussion of current research trends and open challenges in LLM development.
    • Exploring new architectures, training methods, and evaluation techniques.
    • Considering the ethical implications of advanced LLMs and their societal impact.
  • Topic 8.4: Final Project Presentations and Review
    • Students present their final projects.
    • Peer review and feedback.
    • Course wrap-up and discussion of future learning paths.

Assessment:

  • Weekly quizzes to test comprehension of key concepts.
  • Programming assignments involving implementation of core techniques.
  • Mid-term project: Fine-tuning an LLM for a specific task or domain, with a focus on optimization and efficiency.
  • Final project: Developing, training, evaluating, and potentially deploying an advanced LLM, incorporating concepts learned throughout the course. This could involve:
    • Implementing and evaluating a specific technique from one of the research papers.
    • Developing a novel approach to scaling, optimization, or alignment.
    • Building an application that leverages the capabilities of LLMs.
    • Creating a custom MoE model and exploring different training strategies.
    • Fine-tuning an LLM with a focus on safety and responsibility, including the implementation of safety filters and evaluation of the model's behavior.

Pedagogical Considerations:

  • Hands-on, Project-Based Learning: The course will emphasize practical implementation and experimentation, with a strong focus on building and evaluating models.
  • Research-Driven: The curriculum will be closely aligned with cutting-edge research, exposing learners to the latest techniques and challenges in LLM development.
  • Collaborative Learning: Encourage students to collaborate on projects and share their findings.
  • Ethical Considerations: Integrate discussions on the ethical implications of LLMs throughout the course.
  • Focus on Efficiency and Scalability: Emphasize techniques for optimizing model training and inference, and for scaling models to handle large datasets and complex tasks.
  • In-depth analysis: Encourage students to critically analyze research papers and understand the nuances of different approaches.
