A complete pipeline for fine-tuning Language Models on German text using Aleph Alpha's synthetic dataset. Optimized for Apple M1 Macs with memory-efficient training and MLflow experiment tracking.
Fine-tunes existing LLMs (not training from scratch):
- Takes pre-trained models such as `microsoft/DialoGPT-small` or `gpt2`
- Fine-tunes them on German synthetic text data
- Uses LoRA (Low-Rank Adaptation) for efficient training on M1 Macs
- Tracks experiments with MLflow for reproducibility and comparison
- Produces models that can generate German text in different styles
The model learns to predict the next word in German text, giving it capabilities like:
- Text completion
- German text generation
- Style-aware generation (different prompt types from the dataset)
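To make "predict the next word" concrete, here is a minimal sketch that scores next-token candidates for a German prompt (the model and prompt are illustrative; any causal LM from the Hub works the same way):

```python
# Minimal next-token scoring sketch; model and prompt are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")

inputs = tokenizer("Die deutsche Sprache ist", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, seq_len, vocab_size)

# Top-5 candidates for the token that follows the prompt
top = torch.topk(logits[0, -1], k=5)
for token_id, score in zip(top.indices, top.values):
    print(f"{tokenizer.decode(token_id)!r}: {score:.2f}")
```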
Training uses Aleph Alpha's German synthetic dataset, which covers five prompt types:
| Prompt ID | Type | Description | Count |
|---|---|---|---|
| 0 | Rephrasing | Text rephrasing tasks | 995K (26.5%) |
| 1 | Summarization | Text summarization | 219K (5.8%) |
| 2 | Wikipedia Style | Rephrasing in Wikipedia style | 964K (25.7%) |
| 3 | Questions | Formulating questions | 810K (21.6%) |
| 4 | Lists | Extracting lists | 770K (20.5%) |
Total: 3.7M samples, 2.5GB parquet file
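Once the parquet file is downloaded (see setup below), a quick sanity check of the prompt-type distribution is straightforward. The column name `prompt_id` and the local path are assumptions; check `df.columns` and where the download actually landed:

```python
# Sanity-check the prompt-type distribution; "prompt_id" is an assumed
# column name, and the path depends on where the download landed.
import pandas as pd

df = pd.read_parquet("<download-dir>/synthetic/part.1.parquet")
print(f"{len(df):,} rows")
print(df["prompt_id"].value_counts(normalize=True))  # should roughly match the table above
```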
```bash
# Clone and setup
git clone <repository-url>
cd llm-finetuning

# Install dependencies with uv (includes dev tools)
uv sync

# Install pre-commit hooks for code quality
uv run pre-commit install

# Download the synthetic dataset (2.5GB)
uv run hf download Aleph-Alpha/Aleph-Alpha-GermanWeb --repo-type dataset --include "*synthetic/part.1.parquet*" --local-dir <download-dir>
```
Start with smaller samples for testing:
```bash
# Generate stratified samples (100, 500, 1K, 5K rows)
uv run src/data_preprocessing/generate_sample_dataset.py
```
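The sampler keeps each prompt type's share of the full dataset intact in the sample. A rough sketch of the idea (the real implementation is in `src/data_preprocessing/generate_sample_dataset.py`; the `prompt_id` column name is an assumption):

```python
# Stratified sampling sketch: sample each prompt type proportionally.
# Fraction-based sampling may return slightly more or fewer than n rows.
import pandas as pd

def stratified_sample(df: pd.DataFrame, n: int, seed: int = 42) -> pd.DataFrame:
    frac = n / len(df)
    return (
        df.groupby("prompt_id", group_keys=False)
          .apply(lambda g: g.sample(frac=frac, random_state=seed))
    )

df = pd.read_parquet("<download-dir>/synthetic/part.1.parquet")
stratified_sample(df, 1000).to_parquet("data/sample/sample_1000.parquet")
```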
```bash
# Train on 1000 samples with MLflow tracking (10-15 minutes)
uv run src/llm_pipeline.py \
    --data data/sample/sample_1000.parquet \
    --model microsoft/DialoGPT-small \
    --epochs 2

# Train without MLflow tracking
uv run src/llm_pipeline.py \
    --data data/sample/sample_1000.parquet \
    --model microsoft/DialoGPT-small \
    --epochs 2 \
    --no_mlflow
```
```bash
# Start MLflow UI (accessible at http://localhost:5000)
uv run mlflow ui --backend-store-uri experiments/mlruns

# Or specify a different port
uv run mlflow ui --backend-store-uri experiments/mlruns --port 5001
```
```bash
# Test the trained model
uv run src/models/test_llm.py \
    --model_path outputs/german_llm_microsoft_DialoGPT-small \
    --mode test

# Chat with your model (interactive)
uv run src/models/test_llm.py \
    --model_path outputs/german_llm_microsoft_DialoGPT-small \
    --mode chat

# Get model info
uv run src/models/test_llm.py \
    --model_path outputs/german_llm_microsoft_DialoGPT-small \
    --mode info
```
Recommended for M1 Mac (tested):
```bash
# Small and fast
--model microsoft/DialoGPT-small   # 117M params, conversational
--model gpt2                       # 124M params, general purpose
--model distilgpt2                 # 82M params, fastest

# Larger (if you have 16GB+ RAM)
--model microsoft/DialoGPT-medium  # 345M params
--model gpt2-medium                # 345M params
```
All training parameters are centralized in `configs/config.py` for easy modification:
```bash
# Basic training with default config
uv run src/llm_pipeline.py --data data/sample/sample_1000.parquet

# Override specific parameters
uv run src/llm_pipeline.py \
    --data data/sample/sample_5000.parquet \
    --model microsoft/DialoGPT-small \
    --epochs 3 \
    --batch_size 4 \
    --learning_rate 2e-4
```
Default Configuration Values:
- Model: `microsoft/DialoGPT-small` (117M parameters)
- Epochs: 3
- Batch Size: 4 (optimized for M1 Mac)
- Learning Rate: 2e-4
- Max Length: 512 tokens
- LoRA: Enabled (memory-efficient fine-tuning)
MLflow automatically organizes experiments:
```
experiments/mlruns/
├── german_llm_finetuning/          # Experiment name
│   ├── run_20240903_143022/        # Individual training run
│   │   ├── params/                 # Training parameters
│   │   ├── metrics/                # Training metrics
│   │   ├── artifacts/              # Model files & logs
│   │   └── meta.yaml               # Run metadata
│   └── run_20240903_145511/        # Another training run
└── ...
```
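Runs can also be queried programmatically, which is handy for comparing many experiments. A small sketch, assuming a recent MLflow version (the `params.*` and `metrics.*` columns depend on what each run logged):

```python
# List runs from the local tracking store shown above.
import mlflow

mlflow.set_tracking_uri("experiments/mlruns")
runs = mlflow.search_runs(experiment_names=["german_llm_finetuning"])
print(runs.columns)  # params.* / metrics.* columns depend on what was logged
print(runs[["run_id", "status", "start_time"]])
```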
```
llm-finetuning/
├── src/
│   ├── data_preprocessing/
│   │   └── generate_sample_dataset.py   # Create sample datasets
│   ├── models/
│   │   ├── llm_trainer.py               # Core LLM training logic
│   │   └── test_llm.py                  # Model testing & inference
│   ├── utils/
│   │   ├── helpers.py                   # Utility functions
│   │   └── mlflow_utilities.py          # MLflow tracking utilities
│   ├── llm_pipeline.py                  # Main training pipeline
│   └── constants.py                     # Project-wide constants
├── configs/
│   └── config.py                        # Centralized configuration
├── data/
│   └── sample/                          # Generated sample datasets
├── outputs/                             # Saved trained models
├── experiments/                         # MLflow experiments & tracking
│   └── mlruns/                          # MLflow run data & artifacts
├── tests/                               # Unit tests
└── pyproject.toml                       # Dependencies & tool config
```
The pipeline uses centralized configuration for better maintainability:
- `src/constants.py`: project paths, model recommendations, and static values
- `configs/config.py`: training parameters, model configurations, and runtime settings
```python
# LLM model settings
LLM_CONFIG = {
    "default_model": "microsoft/DialoGPT-small",
    "max_length": 512,
    "use_lora": True,
    # ... LoRA configuration
}

# Training parameters
TRAINING_CONFIG = {
    "num_epochs": 3,
    "batch_size": 4,
    "learning_rate": 2e-4,
    # ... other training settings
}

# Sampling configuration
SAMPLING_CONFIG = {
    "default_sample_size": 1000,
    "stratified_sampling": True,
    # ... sampling parameters
}
```
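A sketch of how CLI flags can be merged over these defaults (illustrative only; the actual wiring lives in `src/llm_pipeline.py`):

```python
# Merge CLI overrides over the config defaults (illustrative sketch).
import argparse
from configs.config import TRAINING_CONFIG

parser = argparse.ArgumentParser()
parser.add_argument("--epochs", type=int)
parser.add_argument("--batch_size", type=int)
args = parser.parse_args()

config = dict(TRAINING_CONFIG)  # start from the defaults
if args.epochs is not None:
    config["num_epochs"] = args.epochs
if args.batch_size is not None:
    config["batch_size"] = args.batch_size
```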
- Load Data: Sample German text from Aleph Alpha dataset
- Tokenize: Convert text to tokens the model understands
- Fine-tune: Update model weights to predict German text better
- Save: Store the fine-tuned model for later use
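Condensed into code, those four steps look roughly like this. It is a sketch using the Hugging Face `Trainer`; the project's real logic lives in `src/models/llm_trainer.py` and may differ, and the `text` column name is an assumption:

```python
import pandas as pd
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# 1. Load data ("text" column name is assumed)
df = pd.read_parquet("data/sample/sample_1000.parquet")
dataset = Dataset.from_pandas(df[["text"]].reset_index(drop=True))

# 2. Tokenize
tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
tokenizer.pad_token = tokenizer.eos_token
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
)

# 3. Fine-tune with the causal LM objective (predict the next token)
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="outputs/sketch", num_train_epochs=3,
                           per_device_train_batch_size=4, learning_rate=2e-4),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

# 4. Save the fine-tuned model for later use
trainer.save_model("outputs/sketch")
tokenizer.save_pretrained("outputs/sketch")
```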
We use LoRA for efficient training (a minimal setup sketch follows this list):
- Faster: Only trains ~1% of model parameters
- Memory Efficient: Perfect for M1 Macs
- Quality: 95%+ of full fine-tuning performance
- Flexible: Easy to swap different LoRA adapters
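A minimal LoRA setup with `peft`; the hyperparameters here are illustrative, and the project's actual values live in `configs/config.py`:

```python
# Attach a LoRA adapter; r / lora_alpha / lora_dropout are illustrative.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")
lora_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                         task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction is trainable
```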
Estimated RAM usage:
- `DialoGPT-small`: ~2-4GB
- `DialoGPT-medium`: ~6-8GB
- `gpt2-medium`: ~6-8GB
M1 Mac recommendations:
- 8GB RAM: Use small models, batch_size=2
- 16GB RAM: Use medium models, batch_size=4
- 32GB+ RAM: Any model, larger batch sizes
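For a back-of-envelope check of whether a model fits, multiply the parameter count by the bytes per parameter (training typically needs a further 2-4x on top of the weights for gradients, optimizer state, and activations):

```python
# Rough memory estimate: parameter count x bytes per parameter.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.0f}M params, "
      f"~{n_params * 4 / 1e9:.1f} GB for fp32 weights alone")
```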
```bash
# Run predefined tests
uv run src/models/test_llm.py --model_path outputs/your_model --mode test

# Chat with your model
uv run src/models/test_llm.py --model_path outputs/your_model --mode chat
# Example interaction:
#   > Enter prompt: Die deutsche Sprache ist
#   > Model: sehr komplex und hat viele grammatische Regeln...

# Get model information
uv run src/models/test_llm.py --model_path outputs/your_model --mode info
```
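You can also skip `test_llm.py` and generate text directly from the saved directory. This sketch assumes the directory holds a full model; if the pipeline saved only a LoRA adapter, load it with `peft`'s `AutoPeftModelForCausalLM` instead:

```python
# Direct generation from the saved model directory (assumes a full model).
from transformers import pipeline

path = "outputs/german_llm_microsoft_DialoGPT-small"
generator = pipeline("text-generation", model=path, tokenizer=path)
print(generator("Die deutsche Sprache ist", max_new_tokens=40)[0]["generated_text"])
```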
- Experiment with hyperparameters using MLflow tracking
- Compare different models and training configurations
- Deploy the model for inference
- Scale up training with larger datasets
When training completes, you'll see:
```
======================================================================
LLM TRAINING PIPELINE SUMMARY
======================================================================
Status: SUCCESS
Model: microsoft/DialoGPT-small
Training Samples: 400
Final Loss: 8.3107
Total Time: 425.3s
Model Path: ./outputs/german_llm_microsoft_DialoGPT-small

Next Steps:
1. Test your model:
   uv run src/models/test_llm.py --model_path ./outputs/your_model
2. View MLflow experiments:
   uv run mlflow ui --backend-store-uri experiments/mlruns
3. Try different prompts and see how it performs!
======================================================================
```
MLflow automatically tracks:
- All training parameters and hyperparameters
- Training and validation loss over time
- Model artifacts and metadata
- Training duration and system metrics
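Under the hood this boils down to a handful of MLflow calls. A sketch of the pattern (the project's actual implementation is in `src/utils/mlflow_utilities.py`; the parameter and metric names here are illustrative):

```python
# Typical tracking pattern; logged names and values are illustrative.
import mlflow

mlflow.set_tracking_uri("experiments/mlruns")
mlflow.set_experiment("german_llm_finetuning")

with mlflow.start_run():
    mlflow.log_params({"model": "microsoft/DialoGPT-small", "epochs": 3})
    for step, loss in enumerate([9.1, 8.7, 8.3]):
        mlflow.log_metric("train_loss", loss, step=step)
    mlflow.log_artifacts("outputs/german_llm_microsoft_DialoGPT-small")
```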
Access the MLflow UI to compare runs and analyze training performance!