Zichen Tian, Antoine Ledent, Qianru Sun
Singapore Management University
Official implementation of mtLoRA (multi-task LoRA) from the paper "Scalable Multi-Task Low-Rank Model Adaptation" (ICLR 2026). Scaling multi-task LoRA to many tasks (15–25+) causes catastrophic performance collapse (e.g., 88.2% → 2.0% accuracy). We identify two root causes (uniform regularization disrupts shared knowledge; component-level adaptation amplifies gradient conflicts) and propose three novel designs:
- Spectral-Aware Regularization – selectively orthogonalizes low-SV noise while preserving high-SV shared knowledge
- Fine-Grained Routing – dimension-specific routing weights instead of scalar weights per LoRA expert
- Block-Level Adaptation – applies LoRA as a parallel path at the block level, bypassing conflict-amplifying non-linearities
(A) Block-Level Adaptation bypasses internal non-linearities to mitigate gradient conflict.
(B) Fine-Grained Routing assigns dimension-specific weights for superior expressive power.
One-click experiment reproduction powered by Claude Code. Open this project in Cursor or install the Claude Code CLI – the agent reads CLAUDE.md and handles environment setup, data download, and experiment execution automatically.
💬 "Help me reproduce Table 2 on my 2× L40 setup"
💬 "Set up the environment for my RTX 4090"
💬 "Run the BBH evaluation with spectral regularization λ=0.5"
+2.3% over SOTA across four large-scale benchmarks (15–27 tasks each) while using 47% fewer parameters and 24% less training time.
NLP results on LLaMA-2-7B (reproduced by this codebase):
| Method | Dolly-15k → MMLU | Flan-v2 → BBH | Params |
|---|---|---|---|
| LoRAHub | 42.0 | 34.9 | 75.5M (1.11%) |
| MMoELoRA | 42.1 | 35.4 | 75.5M (1.11%) |
| HydraLoRA | 42.4 | 36.9 | 75.5M (1.11%) |
| mtLoRA (Ours) | 44.5 | 38.5 | 39.8M (0.59%) |
Each design contributes meaningfully: block-level adaptation alone provides +2.1% with 50% fewer parameters:
| Block-Level | Spectral Reg. | Fine-Grained Routing | Params | Dolly-15k | BBH |
|---|---|---|---|---|---|
| | | | 75.5M (1.11%) | 41.6 | 35.5 |
| ✓ | | | 37.7M (0.56%) | 43.7 | 37.9 |
| ✓ | ✓ | | 37.7M (0.56%) | 43.6 | 38.4 |
| ✓ | | ✓ | 39.8M (0.59%) | 44.1 | 38.2 |
| ✓ | ✓ | ✓ | 39.8M (0.59%) | 44.5 | 38.5 |
- Python 3.10+ | PyTorch 2.1+ | CUDA 11.8+
- 1–2 GPUs with ≥16 GB VRAM (for LLaMA-2-7B with DDP)
# Create environment
conda env create -f environment.yml
conda activate mtlora
# Install our custom PEFT library
pip install -e ./peft
Blackwell GPUs (CUDA 12.4+)
conda env create -f environment_cu124.yml
conda activate mtlora
pip install torch==2.5.1 --index-url https://download.pytorch.org/whl/cu124
pip install -e ./peft
Base Model – Symlink LLaMA-2-7B (required for all experiments):
ln -s /path/to/llama-2-7b ./data/llama-2-7b
ln -s /path/to/llama-2-13b ./data/llama-2-13b # Only needed for Table S7
Training Data – Download from Hugging Face:
| Setup | Training Data | Evaluation | HF Source |
|---|---|---|---|
| BBH | Flan-v2 subset (30k examples) | BBH 3-shot (27 tasks) | Muennighoff/flan |
| MMLU | Dolly-15K (instruction tuning) | MMLU 5-shot (57 subjects) | databricks/databricks-dolly-15k |
Evaluation datasets (data/bbh/ and data/mmlu_dataset/) are already included.
| Script | Paper Reference | Description |
|---|---|---|
| `bash tables/0_main_ablation.sh` | Table 2 | Contribution of each key design |
| `bash tables/1_routing_granularity.sh` | Table 3 | Routing granularity ablation |
| `bash tables/2_block_level.sh` | Table 4 | Block-level adaptation ablation |
| `bash tables/3_llama13b.sh` | Table S7 | LLaMA-2-13B scalability |
Each script runs both BBH and MMLU experiments end-to-end (training + evaluation).
BBH Setup – Train on Flan-v2, evaluate on BBH (3-shot)
# Train
python train.py \
--method mtlora \
--model_name_or_path ./data/llama-2-7b \
--dataset_dir ./data/flan_v2_subset \
--output_dir ./output/custom_bbh \
--lora_rank 16 --lora_nums 16 --enable_blc \
--enable_block_adapter --block_adapter_type ffn \
--enable_spectral_reg --spectral_reg_lambda 1.0 \
--enable_fine_grained_routing --routing_group_size 2048 \
--bf16 --num_train_epochs 1
# Evaluate
python eval_bbh.py \
--model_name_or_path ./data/llama-2-7b \
--lora_checkpoint ./output/custom_bbh/sft_lora_model \
--output_dir ./output/custom_bbh/bbh_eval \
--num_few_shot 3
MMLU Setup – Train on Dolly-15K, evaluate on MMLU (5-shot)
# Train
python train.py \
--method mtlora \
--model_name_or_path ./data/llama-2-7b \
--dataset_dir ./data/dolly-15k-converted \
--output_dir ./output/custom_mmlu \
--lora_rank 16 --lora_nums 16 --enable_blc \
--enable_block_adapter --block_adapter_type ffn \
--enable_spectral_reg --spectral_reg_lambda 0.5 \
--enable_fine_grained_routing --routing_group_size 2048 \
--bf16 --num_train_epochs 1
# Evaluate
python eval_mmlu.py \
--model_name_or_path ./data/llama-2-7b \
--lora_checkpoint ./output/custom_mmlu/sft_lora_model \
--output_dir ./output/custom_mmlu/mmlu_5shot \
--num_few_shot 5 \
--mmlu_data_dir ./data/mmlu_dataset
Scripts for reproducing paper figures are in tables/analysis/:
| Script | Paper Figure | Content |
|---|---|---|
| `fig1a_routing_entropy.ipynb` | Figure 1(A) | Regularization–routing trade-off |
| `fig1b_spectral_conflict.ipynb` | Figure 1(B) | Spectral conflict analysis |
| `figS2_sv_spectrum.py` | Figure S2 | SV spectrum visualization |
| `figS3_gradient_perlayer.py` | Figure S3 | Per-layer gradient correlation |
| `figS4_routing_pattern.py` | Figure S4 | Routing weight patterns |
Multi-task LoRA suffers from a fundamental regularization–routing trade-off: strengthening regularization to reduce inter-task conflict inadvertently suppresses routing effectiveness. We trace this to two root causes and propose targeted solutions:
(A) Regularization-routing trade-off. (B) Shared knowledge concentrates in high-SV components. (C) Block-level adaptation reduces gradient conflict by 76%.
| Design | Root Cause Addressed | Key Idea |
|---|---|---|
| 🎯 Spectral-Aware Reg. | Uniform regularization disrupts shared knowledge | Weight by w(σ) = exp(−σ/σ̄): orthogonalize low-SV noise, preserve high-SV signal |
| 🔀 Fine-Grained Routing | Scalar routing ignores dimension heterogeneity | Router MLP outputs per-dimension weights Δᵢ ∈ ℝᵈ instead of scalars ωᵢ ∈ ℝ |
| 🧱 Block-Level Adaptation | Component-level LoRA amplifies gradient conflicts | Parallel adapter path bypasses Softmax: x′ = x + F(LN(x)) + Δ(LN(x)) |
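The spectral weighting w(σ) = exp(−σ/σ̄) can be illustrated with a short sketch. The function name, loss form, and shapes below are illustrative assumptions, not the exact penalty implemented in this codebase:

```python
import torch

def spectral_reg_loss(delta_w: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch of spectral-aware regularization.

    Scales a spectral penalty by w(sigma) = exp(-sigma / sigma_bar), so
    low-SV directions (task-specific noise) are penalized strongly while
    high-SV directions (shared knowledge) are exponentially damped.
    """
    # Singular values of the accumulated LoRA update Delta W = B @ A
    sigma = torch.linalg.svdvals(delta_w)
    w = torch.exp(-sigma / sigma.mean())   # ~1 for small sigma, -> 0 for large sigma
    # Weighted spectral energy: high-SV (shared) components contribute little
    return (w * sigma.pow(2)).sum()

delta_w = torch.randn(64, 64) * 0.01       # stand-in for a LoRA update matrix
loss = spectral_reg_loss(delta_w)
```

The key property is that the penalty vanishes on the dominant singular directions, so shared knowledge is preserved while near-noise directions are pushed toward zero.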
Overall architecture of mtLoRA. The mtLoRA module (right) is attached as a parallel path after each LayerNorm. A router MLP generates dimension-specific weights to dynamically compose task experts.
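A minimal sketch of the parallel-path design described above, combining block-level adaptation with fine-grained (group-wise) routing. Module names, shapes, and the softmax gating are assumptions for illustration; the official implementation lives in the bundled `peft` fork:

```python
import torch
import torch.nn as nn

class BlockAdapter(nn.Module):
    """Illustrative mtLoRA-style block adapter (not the official code).

    Each expert is a low-rank pair; a router MLP emits one weight per
    (expert, dimension group) rather than one scalar per expert,
    mirroring fine-grained routing with a --routing_group_size knob.
    """
    def __init__(self, d_model=128, rank=4, n_experts=4, group_size=32):
        super().__init__()
        assert d_model % group_size == 0
        self.g = d_model // group_size           # number of routing groups
        self.group_size = group_size
        self.n_experts = n_experts
        self.down = nn.ModuleList(nn.Linear(d_model, rank, bias=False) for _ in range(n_experts))
        self.up = nn.ModuleList(nn.Linear(rank, d_model, bias=False) for _ in range(n_experts))
        self.router = nn.Linear(d_model, n_experts * self.g)  # per-group gate logits

    def forward(self, x):                        # x: (batch, d_model), post-LayerNorm
        b = x.shape[0]
        gates = self.router(x).view(b, self.n_experts, self.g).softmax(dim=1)
        gates = gates.repeat_interleave(self.group_size, dim=2)  # expand to per-dimension
        out = torch.zeros_like(x)
        for i in range(self.n_experts):
            # Dimension-specific mixture of low-rank expert outputs
            out = out + gates[:, i, :] * self.up[i](self.down[i](x))
        return out

# Parallel path around an FFN block: x' = x + F(LN(x)) + Delta(LN(x))
d = 128
ln = nn.LayerNorm(d)
ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
adapter = BlockAdapter(d)
x = torch.randn(2, d)
h = ln(x)
x_out = x + ffn(h) + adapter(h)
```

Because the adapter runs in parallel to the whole block rather than inside attention/FFN sub-components, its gradients never pass through the block's internal non-linearities, which is the mechanism the paper credits for reduced gradient conflict.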
Method Selection
--method lora # Standard single LoRA
--method hydralora # HydraLoRA baseline (multi-expert, no mtLoRA extensions)
--method mtlora # Full mtLoRA (block adapter + spectral reg + FGR)
mtLoRA Components
# Block-Level Adaptation
--enable_block_adapter # Enable block-level instead of component-level
--block_adapter_type ffn # Options: attention, ffn, both
--block_adapter_style lowrank
# Spectral-Aware Regularization
--enable_spectral_reg # Enable spectral regularization
--spectral_reg_lambda 1.0 # Regularization strength
--spectral_reg_frequency 1 # SVD frequency (per epoch)
# Fine-Grained Routing
--enable_fine_grained_routing
--routing_group_size 2048 # Smaller = finer granularity (g = d/group_size)
Common Hyperparameters
--lora_rank 16 # LoRA rank
--lora_alpha 64 # LoRA alpha scaling
--learning_rate 0.0002
--per_device_train_batch_size 16
--num_train_epochs 1
--max_seq_length 512
Hardware Requirements
| Experiment | GPU Memory | Recommended |
|---|---|---|
| LLaMA-7B (single GPU) | ~24 GB | RTX PRO 6000 |
| LLaMA-7B (DDP, 2 GPU) | ~16 GB each | 2Γ L40 |
| LLaMA-13B | ~48 GB | A100-80GB |
For memory-constrained setups, reduce --per_device_train_batch_size and increase --gradient_accumulation_steps.
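For example (flag values illustrative), halving the per-device batch while doubling accumulation keeps the effective batch at 16 on a single GPU, at roughly half the activation memory:

```shell
# Effective batch = per_device_train_batch_size x gradient_accumulation_steps x num_GPUs
#                 = 8 x 2 x 1 = 16
python train.py \
  --per_device_train_batch_size 8 \
  --gradient_accumulation_steps 2 \
  ...  # remaining flags as in the training commands above
```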
If you find this work useful, please consider citing our paper:
@inproceedings{tian2026mtlora,
title = {Scalable Multi-Task Low-Rank Model Adaptation},
author = {Tian, Zichen and Ledent, Antoine and Sun, Qianru},
booktitle = {International Conference on Learning Representations (ICLR)},
year = {2026}
}
We gratefully acknowledge the support from the DSO research grant awarded by DSO National Laboratories, Singapore. This project is also partially supported by the Ministry of Education, Singapore, under its Tier-1 Academic Research Fund (No. 24-SIS-SMU-040). We thank the authors of HydraLoRA, MMoELoRA, and LoRAHub for their open-source implementations.
This project is licensed under the Apache License 2.0.
Keywords: mtLoRA, multi-task LoRA, scalable multi-task LoRA, multi-task low-rank adaptation, parameter-efficient fine-tuning (PEFT), LoRA, low-rank adaptation, mixture of LoRA experts, LLaMA, LLM fine-tuning, spectral regularization, block-level adaptation, fine-grained routing, ICLR 2026
