Educational implementations of DeepSeek-V3.2 and DeepSeek-R1 architectures in Rust (using Candle) and Python (using PyTorch/MLX).
This repository provides from-scratch implementations of the key innovations that make DeepSeek models state-of-the-art:
- Multi-Query Attention (MQA) - Single KV head for memory-efficient inference
- Grouped-Query Attention (GQA) - Balanced KV sharing across head groups
- Multi-Head Latent Attention (MLA) - Compressed KV cache for efficient inference
- DeepSeek Sparse Attention (DSA) - Hybrid local + dilated global attention patterns
- Standard MoE - Top-k expert routing with load balancing
- DeepSeek MoE - Fine-grained experts with shared expert isolation
- 256-Expert MoE - Hierarchical routing for massive expert scaling
- Multi-Token Prediction (MTP) - Predict multiple future tokens simultaneously
- FP8 Mixed-Precision - Low-precision training with dynamic scaling
- FP8 Quantization - Simulated 8-bit inference for deployment
- GRPO Training - Group Relative Policy Optimization for RL
- DPO Training - Direct Preference Optimization
- SFT Pipeline - Supervised Fine-Tuning infrastructure
- Knowledge Distillation - Teacher-student model compression
- Agent & Tool-Use Training - Function calling and tool integration
- 5D Parallelism - Tensor, Pipeline, Data, Expert, and Sequence parallelism
- ZeRO Optimization - Memory-efficient distributed training
- DeepSeek-R1 Reasoning - Chain-of-thought reasoning with `<think>` tags
- Modal Cloud GPUs - Distributed training on A100/H100 GPUs
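The routing idea behind the MoE variants above can be sketched in a few lines of plain Python (a toy illustration, not the repository's implementation): each token scores all experts, the top-k scores are kept, and their gates are renormalized.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route_topk(logits, k=2):
    """Pick the top-k experts for one token and renormalize their gates."""
    probs = softmax(logits)
    top = sorted(range(len(logits)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in top)
    return [(i, probs[i] / total) for i in top]

# One token's affinity scores for 4 experts; experts 2 and 0 win.
gates = route_topk([1.0, -0.5, 2.0, 0.3], k=2)
```

DeepSeek MoE layers this with many fine-grained experts plus a shared expert that every token visits without routing.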
- Quick Start
- Prerequisites
- Installation
- Docker Setup
- Training Guide
- Ablation Studies
- Performance Benchmarks
- Project Structure
- Architecture Documentation
- Reproducibility
- Development
- FAQ
- Contributing
- License
- References
# 1. Clone and setup
git clone https://github.com/DevJadhav/deepseek-from-scratch.git
cd DeepSeek-From-Scratch
# 2. Install dependencies
curl -LsSf https://astral.sh/uv/install.sh | sh # Install UV if needed
uv sync
# 3. Download training data
uv run python scripts/download_tinystories.py
# 4. Train! (Choose one option)
# Option A: Local MLX (Apple Silicon - fastest for local dev)
uv run python -m deepseek.pipeline.cli run --backend mlx --max-steps 1000
# Option B: Modal Cloud GPU (Recommended for production)
uv pip install modal && uv run modal setup
uv run modal run src/deepseek/cloud/modal/ray_cluster.py::run_pytorch --scale initial --max-steps 1000
# Option C: Local PyTorch (CPU/CUDA)
uv run python -m deepseek.pipeline.cli run --backend pytorch --model-size tiny --max-steps 1000

# PyTorch demos (CUDA/MPS/CPU)
uv run python -m deepseek.torch.main
# MLX demos (Apple Silicon native)
uv run python -m deepseek.mlx.main
uv run python -m deepseek.mlx.benchmark
# Rust demos (Metal)
cd rust-src
cargo run --release

- macOS 12.3+ (for Metal/MPS) or Linux with CUDA
- Apple Silicon (M1/M2/M3/M4) recommended for best local performance
- 8GB+ RAM recommended (16GB+ for larger models)
| Tool | Purpose | Installation |
|---|---|---|
| Python 3.10+ | Python implementation | python.org |
| UV | Fast Python package manager | `curl -LsSf https://astral.sh/uv/install.sh \| sh` |
| Rust | Rust implementation | rustup.rs |
| Modal (optional) | Cloud GPU training | pip install modal && modal setup |
cd DeepSeek-From-Scratch
# Install with UV (fastest)
uv sync
# Or install with all optional extras
uv sync --all-extras # Includes MLX, CoreML, dev tools

Alternative (pip):
pip install torch numpy einops transformers
pip install mlx # Optional: Apple Silicon only
pip install coremltools # Optional: CoreML export

cd rust-src
# Build in release mode (Metal backend on macOS)
cargo build --release
# Build with CUDA support (Linux with NVIDIA GPU)
cargo build --release --features cuda
# Run tests
cargo test

The easiest way to get started is with VS Code Dev Containers:
- Install Docker Desktop
- Install VS Code Remote - Containers extension
- Open this folder in VS Code
- Click "Reopen in Container" when prompted
# Development environment
docker compose up deepseek-dev
# Training with GPU
docker compose up deepseek-training
# Multi-GPU training (4 GPUs)
docker compose up deepseek-multi-gpu
# TensorBoard monitoring
docker compose up tensorboard
# Visit http://localhost:6006
# Jupyter notebooks
docker compose up jupyter
# Visit http://localhost:8888

# Build the image
docker build -t deepseek-from-scratch .
# Run with GPU support
docker run --gpus all -it deepseek-from-scratch
# Run with volume mounts
docker run --gpus all -v $(pwd)/data:/app/data -v $(pwd)/checkpoints:/app/checkpoints deepseek-from-scratch

# Download TinyStories dataset
uv run python scripts/download_tinystories.py
# Data saved to: data/stories/

Best for: Production training, large-scale experiments
# Setup Modal (one-time)
uv pip install modal
uv run modal setup
# Run multi-GPU distributed training with PyTorch (8 A100 GPUs)
uv run modal run src/deepseek/cloud/modal/ray_cluster.py::run_pytorch --scale initial --max-steps 1000
# Run Rust backend verification (with CUDA)
uv run modal run src/deepseek/cloud/modal/ray_cluster.py::run_rust --scale initial --max-steps 100
# Pipeline CLI (alternative interface)
uv run python -m deepseek.pipeline.cli run --backend rust --gpus 3 --pp-size 3 --max-steps 3000
uv run python -m deepseek.pipeline.cli run --backend pytorch --gpus 3 --pp-size 3 --max-steps 3000

GPU Configuration: Standard config uses 8 A100-40GB GPUs with DualPipe (TP=2, PP=2, DP=2). For scaled runs (>8 GPUs), the framework automatically orchestrates sequential 8-GPU batches with checkpointing for fault tolerance.
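The 8-GPU DualPipe layout factors as a product of parallelism degrees; a minimal sketch of the arithmetic (illustrative, not the framework's config code):

```python
# Sketch: how the 8-GPU DualPipe layout factors into parallelism dimensions.
tp, pp, dp = 2, 2, 2          # tensor-, pipeline-, data-parallel degrees
world_size = tp * pp * dp     # total GPUs required
assert world_size == 8

# Each GPU can be addressed by a (tp_rank, pp_rank, dp_rank) coordinate:
coords = [(t, p, d) for t in range(tp) for p in range(pp) for d in range(dp)]
```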
Best for: Local development, quick iterations on Mac
# Memory-conscious config
uv run python -m deepseek.pipeline.cli run --backend mlx --max-steps 1500 --batch-size 2 --d-model 128
# Full config
uv run python -m deepseek.pipeline.cli run --backend mlx --model-size tiny --max-steps 5000

Best for: Linux with CUDA, debugging
uv run python -m deepseek.pipeline.cli run --backend pytorch --model-size tiny --max-steps 1000

The pipeline orchestrates a complete training workflow:
DATA_PREP → PRETRAIN → SFT → GRPO → DISTILLATION → EXPORT
| Stage | Description |
|---|---|
| DATA_PREP | Tokenize and shard dataset |
| PRETRAIN | MTP + MoE pretraining |
| SFT | Supervised Fine-Tuning (instruction tuning) |
| GRPO | Group Relative Policy Optimization (alignment) |
| DISTILLATION | Knowledge distillation (optional) |
| EXPORT | Save final model + config |
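The stage ordering above can be pictured as a tiny orchestrator (toy code with illustrative names; the real DAG lives in src/deepseek/pipeline/workflow.py):

```python
# Hypothetical sketch of the stage ordering; not the repository's orchestrator.
STAGES = ["DATA_PREP", "PRETRAIN", "SFT", "GRPO", "DISTILLATION", "EXPORT"]

def run_pipeline(skip=()):
    """Run stages in order, optionally skipping some (DISTILLATION is optional)."""
    executed = []
    for stage in STAGES:
        if stage in skip:
            continue
        executed.append(stage)  # real code would dispatch to a backend runner here
    return executed

run_pipeline(skip={"DISTILLATION"})
```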
The framework implements DeepSeek-style 5D parallelism:
| Dimension | Description | Default |
|---|---|---|
| PP (Pipeline) | Splits model layers across GPUs | 3 |
| DP (Data) | Replicates model, splits data | 1 |
| TP (Tensor) | Splits layers horizontally | 1 |
| EP (Expert) | Distributes MoE experts | 1 |
| SP (Sequence) | Splits long sequences | 1 |
Pipeline Parallelism Architecture (PP=3):
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│    GPU 0     │────▶│    GPU 1     │────▶│    GPU 2     │
│  Embed+L1-4  │     │     L5-8     │     │  L9-12+Head  │
└──────────────┘     └──────────────┘     └──────────────┘
        ▲                                        │
        └───────────── Gradient Flow ────────────┘
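The even layer split shown in the PP=3 diagram can be computed directly; a sketch assuming a 12-layer model whose depth divides evenly across stages:

```python
def split_layers(n_layers, pp_size):
    """Evenly assign transformer layers to pipeline stages (1-indexed, as in the diagram)."""
    per = n_layers // pp_size
    return [list(range(s * per + 1, (s + 1) * per + 1)) for s in range(pp_size)]

# Stage 0 additionally holds the embedding; the last stage holds the LM head.
stages = split_layers(12, 3)
```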
# Export to GGUF format
uv run python scripts/export_gguf.py --checkpoint checkpoints/final
# Export to CoreML (iOS/macOS)
uv run python deepseek-from-scratch-python/export_coreml.py

uv run python scripts/inference.py --checkpoint checkpoints/final --prompt "Once upon a time"

We provide comprehensive ablation study infrastructure to analyze component contributions.
# Run all ablations
uv run python scripts/ablation/run_all_ablations.py --output-dir results/ablations
# Individual ablations
uv run python scripts/ablation/run_attention_ablation.py # MLA vs GQA vs MHA
uv run python scripts/ablation/run_expert_ablation.py # 8 vs 64 vs 256 experts
uv run python scripts/ablation/run_balancing_ablation.py # Aux-loss-free vs aux loss
uv run python scripts/ablation/run_mtp_ablation.py # MTP depth D=0,1,2,3
uv run python scripts/ablation/run_precision_ablation.py # FP8 vs BF16 vs FP16

| Study | Best Configuration | Key Finding |
|---|---|---|
| Attention | MLA | 14Γ KV cache compression, no quality loss |
| Experts | 256 with K=8 | Diminishing returns beyond 256 |
| Balancing | Aux-loss-free | Cleaner gradients, better convergence |
| MTP Depth | D=1 | 1.4Γ speculative decoding speedup |
| Precision | FP8 per-block | 2.4Γ throughput, minimal accuracy loss |
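To see where the MTP speedup in the table comes from, here is a toy sketch of greedy speculative-decoding acceptance (illustrative only, not the repository's decoder):

```python
def accepted_prefix(draft, verified):
    """Count how many drafted tokens match the verifier's greedy choices."""
    n = 0
    for d, v in zip(draft, verified):
        if d != v:
            break
        n += 1
    return n

# With MTP depth D=1 the model drafts one lookahead token per step; when it
# is accepted, two tokens are emitted per forward pass -- the source of the
# roughly 1.4x speculative decoding speedup.
assert accepted_prefix([7, 3], [7, 3]) == 2   # draft fully accepted
assert accepted_prefix([7, 9], [7, 3]) == 1   # lookahead token rejected
```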
| Backend | Hardware | Time | Steps/sec | Final Loss |
|---|---|---|---|---|
| Rust+GPU | 3Γ H100 80GB | ~4 min | 13.5 | 1.18 |
| Python+GPU | 3Γ H100 80GB | ~5 min | 10.2 | 1.37 |
| MLX | Apple M1/M2/M3 | ~15 min | 3.3 | 1.85 |
Test Config: batch_size=4, seq_len=64, d_model=512
| Component | Rust (Metal) | Python (MPS) | MLX |
|---|---|---|---|
| MQA | 11.75ms | 0.95ms | 0.73ms |
| GQA | 11.00ms | 0.54ms | 0.82ms |
| MLA | 10.74ms | 0.96ms | 0.97ms |
| Component | Rust (Metal) | Python (MPS) | MLX |
|---|---|---|---|
| Standard MoE | 5.94ms | 134.87ms | - |
| DeepSeek MoE | 4.97ms | 49.85ms | 2.53ms |
| Component | Rust (Metal) | Python (MPS) | MLX |
|---|---|---|---|
| GRPO Loss | 0.04ms | 0.73ms | 0.66ms |
| DPO Loss | 0.01ms | 0.28ms | 1.08ms |
| KD Loss | 0.05ms | 0.61ms | 0.32ms |
# PyTorch benchmarks (CUDA/MPS/CPU)
cd deepseek-from-scratch-python
uv run python -m pytest tests/ -v
# MLX benchmarks (Apple Silicon native)
uv run python mlx_impl/benchmark.py
# Rust benchmarks (Metal)
cd Deepseek-from-scratch-in-rust
cargo run --release

DeepSeek-From-Scratch/
├── README.md                    # This file
├── LICENSE                      # Apache 2.0 License
├── pyproject.toml               # Python dependencies
├── uv.lock                      # Locked dependencies
│
├── src/deepseek/                # Main Python package
│   ├── torch/                   # PyTorch implementation (CUDA/MPS/CPU)
│   │   ├── model/               # Model components (attention, moe, mla, transformer)
│   │   ├── training/            # Training infrastructure (grpo, sft, fsdp)
│   │   ├── kernels/             # Triton kernels
│   │   └── utils/               # Utilities
│   │
│   ├── mlx/                     # MLX implementation (Apple Silicon native)
│   │   ├── attention.py         # MQA, GQA, MLA
│   │   ├── moe.py               # MoE implementations
│   │   ├── grpo.py              # GRPO training
│   │   ├── r1.py                # DeepSeek-R1 reasoning
│   │   └── ane_impl/            # Apple Neural Engine optimizations
│   │
│   ├── pipeline/                # Ray training orchestration
│   │   ├── cli.py               # Command-line interface
│   │   ├── config.py            # Configuration
│   │   ├── workflow.py          # Ray Workflow DAG
│   │   ├── stages/              # Pipeline stages (pretrain, sft, grpo)
│   │   └── runners/             # Backend runners (mlx, pytorch, rust, modal)
│   │
│   ├── cloud/modal/             # Modal cloud GPU integration
│   │   ├── app.py               # Modal app definition
│   │   ├── config.py            # 5D parallelism config
│   │   └── distributed_trainer.py
│   │
│   ├── common/                  # Shared utilities
│   └── tracking/                # Profiling and W&B integration
│
├── rust-src/                    # Rust/Candle implementation (Metal)
│   ├── Cargo.toml               # Rust dependencies
│   └── src/
│       ├── main.rs              # Entry point
│       ├── model/               # Model components
│       └── training/            # Training infrastructure
│
├── config/                      # Configuration files
│   ├── tiny_mlx_*.json          # MLX training configs
│   └── hydra/                   # Hydra configuration
│
├── tests/                       # Test suite
│   ├── torch/                   # PyTorch backend tests
│   ├── mlx/                     # MLX backend tests
│   ├── ane/                     # Apple Neural Engine tests
│   ├── pipeline/                # Pipeline tests
│   └── cloud/                   # Cloud integration tests
│
├── docs/                        # Architecture documentation (22+ files)
│
├── scripts/                     # Utility scripts
│   ├── download_tinystories.py  # Download training data
│   ├── export_gguf.py           # GGUF export
│   ├── inference.py             # Run inference
│   └── train_tiny.py            # Quick training script
│
├── monitoring/                  # Cost tracking and dashboards
│
└── checkpoints/                 # Model checkpoints (reproducibility examples)
The docs/ directory contains in-depth explanations of all architectural components:
- Multi-Query Attention (MQA)
- Grouped-Query Attention (GQA)
- Multi-Head Latent Attention (MLA)
- DeepSeek Attention
- Multi-Latent Attention Deep Dive
- Auxiliary-Loss-Free Load Balancing
- DualPipe Pipeline Parallelism
- Expert Specialization Analysis
- From Scratch to Production
For complete reproduction instructions, see REPRODUCIBILITY.md.
# 1. Setup environment
uv sync
# 2. Download data
uv run python scripts/download_tinystories.py
# 3. Train with specific seed for reproducibility
uv run python scripts/train_tiny.py --seed 42 --max-steps 1000
# 4. Run benchmarks
uv run python scripts/benchmark.py --config configs/tiny_test.json

| Metric | Expected | Tolerance |
|---|---|---|
| Training Loss | ~2.5 | Β±0.2 |
| Validation Loss | ~2.7 | Β±0.2 |
| Throughput (M1) | ~3K tok/s | Β±500 |
| Memory (tiny) | ~1GB | Β±200MB |
Last Verified: December 5, 2025
| Component | Status | Tests | Notes |
|---|---|---|---|
| Python (uv) | ✅ Passing | 1,221 passed, 50 skipped | Full test suite |
| Rust (Candle) | ✅ Passing | 302 passed, 17 ignored | Metal backend |
| PyTorch Backend | ✅ Working | All tests pass | MPS/CPU |
| MLX Backend | ✅ Working | All tests pass | Apple Silicon |
| ANE Backend | ✅ Working | All tests pass | Neural Engine |
| Triton Kernels | ✅ Working | All tests pass | Requires CUDA |
| CUDA Backend | ✅ Working | All tests pass | Requires NVIDIA GPU |
Package Manager: uv v0.7.8
Python Version: 3.12.10
Rust Edition: 2021
# Python tests (full suite)
uv run pytest tests/ -v
# Python tests by backend
uv run pytest tests/torch/ -v # PyTorch backend
uv run pytest tests/mlx/ -v # MLX backend
uv run pytest tests/ane/ -v # Apple Neural Engine
uv run pytest tests/pipeline/ -v # Pipeline orchestration
# Rust tests
cd rust-src
cargo test
# Rust tests with CUDA (on NVIDIA systems)
cd rust-src
cargo test --features cuda

# Python
uv run black .
uv run ruff check .
# Rust
cargo fmt
cargo clippy

# Python
uv run mypy ray_pipeline/

Q: What's the difference between PyTorch, MLX, and Rust implementations? A:
- PyTorch: Most complete, supports CUDA/MPS/CPU, best for research
- MLX: Optimized for Apple Silicon, fastest on Mac
- Rust: Best performance on Metal, best for production deployment
Q: Do I need a GPU to run this? A: No! All implementations support CPU. However, for training:
- Apple Silicon: MLX provides excellent performance
- NVIDIA GPU: PyTorch with CUDA is recommended
- Production: Rust with Metal or CUDA
Q: How much memory do I need? A: For the tiny model (~10M params):
- Minimum: 4GB RAM
- Recommended: 8GB+ RAM
- Full training: 16GB+ RAM or GPU memory
Q: Why is my training loss not decreasing? A: Common causes:
- Learning rate too high - try reducing by 10x
- Data not properly tokenized - check data pipeline
- Gradient explosion - enable gradient clipping
Q: How do I resume training from a checkpoint? A:
uv run python scripts/train_tiny.py --resume checkpoints/step_500

Q: How do I train on my own dataset?
A: See the data preparation guide in docs/11-training-pipeline.md. Key steps:
- Tokenize your data
- Create training shards
- Update config to point to your data
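As a toy illustration of the sharding step above (not the repository's data pipeline), a token stream can be cut into fixed-length training shards like this:

```python
def make_shards(token_ids, seq_len):
    """Chunk a token stream into fixed-length training shards, dropping the tail."""
    n = len(token_ids) // seq_len
    return [token_ids[i * seq_len:(i + 1) * seq_len] for i in range(n)]

# 10 tokens at seq_len=4 -> two full shards; the 2-token tail is dropped.
shards = make_shards(list(range(10)), seq_len=4)
```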
Q: What is Multi-Latent Attention (MLA)?
A: MLA compresses the KV cache by projecting keys and values to a lower-dimensional latent space before storage. This reduces memory by 14Γ compared to standard attention while maintaining quality. See docs/03-multi-head-latent-attention.md.
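The ~14× figure is easy to sanity-check with back-of-envelope arithmetic; the dimensions below are illustrative, not the exact model config:

```python
def kv_cache_elems(layers, heads, head_dim, seq_len):
    # A standard cache stores both K and V for every head at every layer.
    return layers * seq_len * 2 * heads * head_dim

def mla_cache_elems(layers, latent_dim, seq_len):
    # MLA stores only one compressed latent vector per token per layer.
    return layers * seq_len * latent_dim

# Illustrative dimensions (not the exact DeepSeek config):
std = kv_cache_elems(layers=30, heads=32, head_dim=128, seq_len=1024)
mla = mla_cache_elems(layers=30, latent_dim=576, seq_len=1024)
ratio = std / mla  # roughly 14x with these numbers
```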
Q: How does auxiliary-loss-free balancing work?
A: Instead of adding a loss term that affects gradients, we use learnable biases that only affect routing decisions (not gating weights). After each step, biases are adjusted based on load. See docs/blog/02_auxiliary_loss_free.md.
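A minimal sketch of that mechanism (toy code, not the repository's router): the bias shifts which experts are *selected*, but the gate weights still come from the raw affinity scores, so gradients stay clean.

```python
def route(scores, bias, k=1):
    """Top-k selection uses score + bias, but gate weights use raw scores only."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i] + bias[i], reverse=True)
    chosen = ranked[:k]
    total = sum(scores[i] for i in chosen)
    return [(i, scores[i] / total) for i in chosen]

def update_bias(bias, loads, target, step=0.01):
    """After each step, nudge overloaded experts down and underloaded ones up."""
    return [b - step if l > target else b + step for b, l in zip(bias, loads)]

# Expert 0 is overloaded, so its bias drops and future routing avoids it.
bias = update_bias([0.0, 0.0, 0.0], loads=[0.9, 0.05, 0.05], target=1/3)
```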
Q: Why use FP8 instead of FP16/BF16? A: FP8 provides:
- 2Γ memory reduction vs FP16
- 2-4Γ throughput improvement on modern hardware
- Minimal accuracy loss with per-block scaling
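Per-block scaling means each block of values gets its own scale chosen from the block's max. A simplified sketch of the idea (the FP8 rounding itself is omitted; 448 is the largest normal value in the E4M3 format):

```python
def quantize_block(block, max_q=448.0):
    """Scale a block so its largest magnitude maps to the FP8 E4M3 range.
    Simplified: real FP8 would also round each scaled value to 8 bits."""
    amax = max(abs(x) for x in block)
    scale = amax / max_q if amax > 0 else 1.0
    return [x / scale for x in block], scale

def dequantize_block(q, scale):
    return [x * scale for x in q]

vals = [0.1, -2.0, 0.5]
q, s = quantize_block(vals)
restored = dequantize_block(q, s)
```

A single tensor-wide scale would be dominated by its largest outlier; per-block scales keep each block's values well inside the representable range, which is why accuracy loss stays minimal.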
Contributions are welcome! Please see our Contributing Guidelines for details.
We also have a Code of Conduct that we expect all contributors to follow.
- Flash Attention integration
- KV-Cache implementation
- Real FP8 hardware kernels
- Distributed training improvements
- Model weight loading from HuggingFace
- Additional cloud GPU providers (RunPod, Lambda Labs)
- Documentation improvements
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
If you use this project in your research, please cite:
@software{deepseek_from_scratch,
title={DeepSeek From Scratch: Educational Implementation of DeepSeek-V3},
author={Jadhav, Dev},
year={2024},
url={https://github.com/DevJadhav/deepseek-from-scratch},
license={Apache-2.0},
note={Educational implementation of DeepSeek-V3 architecture including MLA, MoE, and MTP}
}

Also consider citing the original DeepSeek papers:
@article{deepseek_v3,
title={DeepSeek-V3 Technical Report},
author={DeepSeek-AI},
journal={arXiv preprint arXiv:2412.19437},
year={2024}
}
@article{deepseek_r1,
title={DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning},
author={DeepSeek-AI},
journal={arXiv preprint arXiv:2501.12948},
year={2025}
}

- DeepSeek-V3 Technical Report
- DeepSeek-R1 Technical Report
- Candle ML Framework
- MLX Framework
- Modal Cloud Platform
- Ray Framework
See CHANGELOG.md for a detailed history of changes.
This project is for educational purposes, demonstrating the key architectural innovations in DeepSeek models. Special thanks to:
- DeepSeek AI for their open research and technical reports
- Hugging Face for the Candle framework
- Apple for the MLX framework
- The open-source ML community