Progressive curriculum learning for LLM training with fine-grained schedule control.
Curriculus helps you gradually mix and transition between datasets during training. Instead of throwing all your data at a model at once, you can start with simpler data (e.g., an easy split), smoothly transition to more complex data (e.g., a medium split), and finally move to task-specific data (e.g., a hard fine-tuning split).
The key insight: linear interpolation between probability schedules. You define milestones (e.g., "at 20%, start mixing medium in"), and the library handles the smooth transition with mathematically correct sampling.
Training on progressively more complex data can:
- ✅ Improve model convergence and final performance
- ✅ Reduce training instability and catastrophic forgetting
- ✅ Allow precise control over when each dataset is used
- ✅ Handle datasets of different sizes gracefully
Install:

```bash
pip install curriculus
```

With PyTorch support:

```bash
pip install "curriculus[torch]"
```

Quick start:

```python
from curriculus import Curriculus

# Your datasets
datasets = [
    {"name": "easy", "dataset": easy_data},
    {"name": "medium", "dataset": medium_data},
    {"name": "hard", "dataset": hard_data},
]

# Auto-generates: easy -> medium -> hard with train/test split
dataset_dict = Curriculus(datasets, train_ratio=0.8)

# Use with your trainer
for sample in dataset_dict["train"]:
    # sample comes from the appropriate dataset based on training progress
    pass
```

Explicit schedule control:

```python
from curriculus import Curriculus

# Explicit schedule: define milestones and weights
schedule = [
    (0.0, {"easy": 1.0, "medium": 0.0, "hard": 0.0}),
    (0.2, {"easy": 1.0, "medium": 0.0, "hard": 0.0}),  # Warmup
    (0.4, {"easy": 0.5, "medium": 0.5, "hard": 0.0}),  # Easing
    (0.6, {"easy": 0.0, "medium": 1.0, "hard": 0.0}),  # Pure medium
    (0.8, {"easy": 0.0, "medium": 0.5, "hard": 0.5}),  # Mix
    (1.0, {"easy": 0.0, "medium": 0.0, "hard": 1.0}),  # Pure hard
]

dataset_dict = Curriculus(
    datasets,
    schedule=schedule,
    total_steps=10000,
    oversampling=True,   # Repeat data if insufficient
    best_effort=True,    # Scale down gracefully if short (default)
    train_ratio=0.9,     # 90% train, 10% test
)

# Access splits
train_data = dataset_dict["train"]
test_data = dataset_dict["test"]
```

A schedule is a list of `(progress_percent, {dataset: weight})` tuples:
- progress_percent (0.0 to 1.0): Where you are in training
- weight: Probability of sampling from that dataset at this milestone
The library linearly interpolates between milestones. If you define:
- 0%: easy=1.0
- 100%: medium=1.0
Then at 50% progress, both have weight 0.5 (50/50 mix).
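For intuition, that interpolation can be sketched in a few lines of plain Python. This is a simplified model of the behaviour described above, not the library's actual implementation:

```python
def interpolate_weights(schedule, progress):
    """Linearly interpolate dataset weights between schedule milestones."""
    # Clamp to the first/last milestone outside the schedule's range
    if progress <= schedule[0][0]:
        return dict(schedule[0][1])
    if progress >= schedule[-1][0]:
        return dict(schedule[-1][1])
    # Find the two milestones that bracket `progress` and lerp between them
    for (p0, w0), (p1, w1) in zip(schedule, schedule[1:]):
        if p0 <= progress <= p1:
            t = (progress - p0) / (p1 - p0)
            names = set(w0) | set(w1)
            return {n: (1 - t) * w0.get(n, 0.0) + t * w1.get(n, 0.0) for n in names}

schedule = [(0.0, {"easy": 1.0, "medium": 0.0}),
            (1.0, {"easy": 0.0, "medium": 1.0})]
assert interpolate_weights(schedule, 0.5) == {"easy": 0.5, "medium": 0.5}  # 50/50 mix
```

Because each segment is linear, the mix changes smoothly rather than switching abruptly at milestone boundaries.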
If you don't have enough data:
- best_effort=True (default): Reduces the dataset's sampling probability to make it last
- oversampling=True: Repeats data to fulfill the schedule
- Both False: Raises an error
Example: If medium appears in the schedule but you only have 50% of the required samples:
- Best effort scales it down by 50%
- Other datasets naturally expand to fill the gap
- Training completes without crashing
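As a toy illustration of that renormalization, here is a hypothetical helper assuming per-dataset scale factors of available/needed; the library's internals may differ:

```python
def best_effort_weights(weights, scale_factors):
    """Scale down over-budget datasets, then renormalize so weights sum to 1."""
    scaled = {name: w * scale_factors.get(name, 1.0) for name, w in weights.items()}
    total = sum(scaled.values())
    return {name: w / total for name, w in scaled.items()}

# medium only has 50% of the samples the schedule demands at this point
weights = best_effort_weights({"easy": 0.5, "medium": 0.5}, {"medium": 0.5})
assert abs(weights["medium"] - 1 / 3) < 1e-9  # scaled down
assert abs(weights["easy"] - 2 / 3) < 1e-9    # expands to fill the gap
```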
Sizes are inferred automatically:
```python
datasets = [
    {"name": "A", "dataset": my_dataset},  # len() called automatically
]
```

Or specified manually:

```python
datasets = [
    {"name": "A", "dataset": huggingface_repo_id, "size": 50000},  # For streaming
]
```

Parameters:

- `datasets`: List of `{"name": ..., "dataset": ...}` dicts.
- `schedule`: List of `(progress, weights)` tuples. If None, auto-generates a sequential schedule.
- `total_steps`: Total training steps. If None, sums all dataset sizes.
- `oversampling`: If True, repeats data when insufficient. Default: False.
- `best_effort`: If True, scales down dataset usage gracefully. Default: True.
- `train_ratio`: Fraction of total steps for the train split (0.0 to 1.0). Default: 1.0 (train only).
- `split_names`: Tuple of (train_name, test_name). Default: ("train", "test").
Returns: `CurriculusSplits`, a mapping of split names to iterable datasets.

Each split exposes convenient helpers to explore and transform the stream:

- `peek`/`head`/`take` preview upcoming samples without exhausting the iterator.
- `columns`, `shape`, and `num_columns` surface lightweight schema metadata.
- `remove_column`/`rename_column(s)` and `map` enable lazy columnar transforms.
- `to_hf_iterable_dataset()` and `to_hf_dataset()` materialise into Hugging Face `datasets.IterableDataset` or `datasets.Dataset` objects when you need the full HF toolkit.
```python
from curriculus import Curriculus

# Step 1: Load your datasets
easy_data = load_dataset("my_dataset/easy")
medium_data = load_dataset("my_dataset/medium")
hard_data = load_dataset("my_dataset/hard")

# Step 2: Define the curriculum
datasets = [
    {"name": "easy", "dataset": easy_data},
    {"name": "medium", "dataset": medium_data},
    {"name": "hard", "dataset": hard_data},
]

# Step 3: Create dataset with 85% train split
curriculum_dict = Curriculus(
    datasets,
    total_steps=100_000,
    oversampling=True,
    train_ratio=0.85,
)

# Step 4: Use splits
for batch in DataLoader(curriculum_dict["train"], batch_size=32):
    loss = model.train_step(batch)

for batch in DataLoader(curriculum_dict["test"], batch_size=32):
    metrics = model.eval_step(batch)
```

Check your schedule without training:
```python
from curriculus import CurriculusPlanner

planner = CurriculusPlanner(
    datasets,
    schedule=my_schedule,
    total_steps=100_000,
    oversampling=False,
    best_effort=True,
)

print(planner.get_plan_summary())
# Output:
# Total Steps: 100000
# Dataset Budget:
#   easy: OK (1000000 available)
#   medium: SCALED (50000 available, 60000 needed (0.83x))
#   hard: OK (30000 available)
```

The library separates concerns:
- CurriculusPlanner: Validates schedules, calculates sample budgets, pre-flight checks
- Curriculus: Implements the actual sampling at training time
This allows you to validate your configuration before training starts, catching issues early.
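As a sketch of what such a pre-flight budget could compute, the area under each dataset's weight curve times the total step count gives the samples that dataset must supply. This is a simplified model using the trapezoid rule on each linear segment, not CurriculusPlanner's actual code:

```python
def required_samples(schedule, total_steps):
    """Estimate samples each dataset needs: integral of its weight curve x steps."""
    names = {n for _, weights in schedule for n in weights}
    needed = dict.fromkeys(names, 0.0)
    for (p0, w0), (p1, w1) in zip(schedule, schedule[1:]):
        for n in names:
            # Trapezoid rule: weights vary linearly between milestones
            needed[n] += (w0.get(n, 0.0) + w1.get(n, 0.0)) / 2 * (p1 - p0)
    return {n: v * total_steps for n, v in needed.items()}

schedule = [(0.0, {"easy": 1.0, "medium": 0.0}),
            (1.0, {"easy": 0.0, "medium": 1.0})]
budget = required_samples(schedule, 100_000)
assert budget == {"easy": 50_000.0, "medium": 50_000.0}
```

Comparing such a budget against each dataset's actual size is what lets problems surface before the first training step.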
`CurriculusPlanner` validates schedules and calculates sample budgets:

```python
planner = CurriculusPlanner(
    datasets,
    schedule=my_schedule,
    total_steps=100_000,
    oversampling=True,
    best_effort=True,
)

# Inspect
print(planner.scale_factors)       # Dict of scaling factors
print(planner.dataset_integrals)   # Area under each curve
print(planner.get_plan_summary())  # Human-readable plan
```

`Curriculus` iterates over mixed samples and exposes helpful adapters:
```python
dataset_splits = Curriculus(
    datasets,
    schedule=...,
    total_steps=100_000,
)

for sample in dataset_splits["train"]:
    # Sample is from the appropriate dataset based on progress
    pass

# Optional helpers
hf_iterable = dataset_splits["train"].to_hf_iterable_dataset()
hf_dataset = dataset_splits["train"].to_hf_dataset()

# Or directly on the dataset splits
hf_iterable = dataset_splits.to_hf_iterable_dataset()
hf_dataset = dataset_splits.to_hf_dataset()
```

`generate_sequential_schedule` auto-generates a simple crossfade schedule. It is called by default if you don't provide a schedule, and you will rarely need to use it directly.
```python
from curriculus import generate_sequential_schedule

schedule = generate_sequential_schedule(["dataset_A", "dataset_B", "dataset_C"])
# Result: A (100%) -> B (100%) -> C (100%)
```

Development setup:

```bash
# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# With coverage
pytest --cov=curriculus

# View HTML coverage report
pytest --cov=curriculus --cov-report=html
# Open htmlcov/index.html
```

Contributions welcome! Please:
- Fork the repo
- Create a feature branch
- Add tests for your changes
- Ensure tests pass: `pytest --cov=curriculus`
- Run the linter: `ruff check --fix .`
- Submit a pull request
MIT License. See LICENSE file for details.
If you use this library in research, please cite:
```bibtex
@software{curriculus2025,
  title={Curriculus: Progressive Curriculum Learning Datasets for LLM Training},
  author={Omar Kamali},
  year={2025},
  url={https://github.com/omarkamali/curriculus}
}
```

If you have more schedule demand than available data:

- Solution 1: Enable `best_effort=True` (default)
- Solution 2: Enable `oversampling=True`
- Solution 3: Increase the dataset size or reduce `total_steps`
If your schedule is invalid:

```python
# ❌ Bad
schedule = [(0.0, {"A": 0.8, "B": 0.1})]  # Sum = 0.9

# ✅ Good
schedule = [(0.0, {"A": 0.8, "B": 0.2})]  # Sum = 1.0
```

Check that your schedule includes all datasets. If a dataset doesn't appear in the schedule, it's never sampled.
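The weight-sum requirement can be checked up front with a few lines. This is an illustrative snippet, not part of the library's API:

```python
def check_schedule(schedule, tol=1e-9):
    """Raise if any milestone's weights don't sum to 1.0."""
    for progress, weights in schedule:
        total = sum(weights.values())
        if abs(total - 1.0) > tol:
            raise ValueError(
                f"Weights at progress {progress} sum to {total}, expected 1.0"
            )

check_schedule([(0.0, {"A": 0.8, "B": 0.2})])  # passes silently
try:
    check_schedule([(0.0, {"A": 0.8, "B": 0.1})])
except ValueError as e:
    print(e)  # reports the offending milestone and the bad sum
```

A small tolerance is used because floating-point weight sums rarely equal 1.0 exactly.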
Open an issue: https://github.com/omarkamali/curriculus/issues
Explore end-to-end walkthroughs in the examples/ directory:
- Sequential difficulty fade – examples/01_easy_medium_hard.ipynb
- Conversation length autoschedule – examples/02_ultrachat_bucket_autoschedule.ipynb