# mtLoRA: Scalable Multi-Task Low-Rank Model Adaptation

🌟 ICLR 2026 🌟

Zichen Tian, Antoine Ledent, Qianru Sun

Singapore Management University



Official implementation of mtLoRA (multi-task LoRA) from the paper "Scalable Multi-Task Low-Rank Model Adaptation" (ICLR 2026). Scaling multi-task LoRA to many tasks (15–25+) causes catastrophic performance collapse (e.g., 88.2% → 2.0% accuracy). We identify two root causes, namely that uniform regularization disrupts shared knowledge and that component-level adaptation amplifies gradient conflicts, and propose three novel designs:

  1. **Spectral-Aware Regularization**: selectively orthogonalizes low-SV noise while preserving high-SV shared knowledge
  2. **Fine-Grained Routing**: dimension-specific routing weights instead of scalar weights per LoRA expert
  3. **Block-Level Adaptation**: applies LoRA as a parallel path at the block level, bypassing conflict-amplifying non-linearities
*mtLoRA architecture. (A) Block-Level Adaptation bypasses internal non-linearities to mitigate gradient conflict. (B) Fine-Grained Routing assigns dimension-specific weights for greater expressive power.*
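The third design, block-level adaptation, can be sketched in a few lines of NumPy. This is a hedged illustration, not the repo's implementation: `block_forward`, the stand-in `ffn`, and the toy shapes are all hypothetical; the real module lives in the bundled custom PEFT library.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Plain LayerNorm without learned affine parameters (illustration only).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def block_forward(x, ffn, A, B):
    """Block-level adaptation: x' = x + F(LN(x)) + Delta(LN(x)).

    The low-rank path Delta(h) = h @ A @ B runs in parallel with the
    frozen block F, so its gradients never pass through F's internal
    non-linearities (e.g. an attention Softmax or the FFN activation).
    """
    h = layer_norm(x)
    return x + ffn(h) + h @ A @ B

# Toy example: hidden size d=8, LoRA rank r=2, frozen "block" is a fixed map.
rng = np.random.default_rng(0)
d, r = 8, 2
W_frozen = rng.standard_normal((d, d)) * 0.1
ffn = lambda h: np.maximum(h @ W_frozen, 0.0)   # stand-in frozen FFN
A = rng.standard_normal((d, r)) * 0.01           # trainable LoRA factor
B = rng.standard_normal((r, d)) * 0.01           # trainable LoRA factor

x = rng.standard_normal((4, d))                  # batch of 4 tokens
out = block_forward(x, ffn, A, B)
print(out.shape)                                 # (4, 8)
```

With `A` initialized to zero (the usual LoRA convention), the adapted block reduces exactly to the frozen residual block at the start of training.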

## 🤖 AI Agent Reproduction

One-click experiment reproduction powered by Claude Code. Open this project in Cursor or install the Claude Code CLI; the agent reads CLAUDE.md and handles environment setup, data download, and experiment execution automatically.

💬 "Help me reproduce Table 2 on my 2× L40 setup"
💬 "Set up the environment for my RTX 4090"
💬 "Run the BBH evaluation with spectral regularization λ=0.5"

## ✨ Highlights

**+2.3% over SOTA** across four large-scale benchmarks (15–27 tasks each), while using 47% fewer parameters and 24% less training time.

NLP results on LLaMA-2-7B (reproduced by this codebase):

| Method | Dolly-15k → MMLU | Flan-v2 → BBH | Params |
| --- | --- | --- | --- |
| LoRAHub | 42.0 | 34.9 | 75.5M (1.11%) |
| MMoELoRA | 42.1 | 35.4 | 75.5M (1.11%) |
| HydraLoRA | 42.4 | 36.9 | 75.5M (1.11%) |
| **mtLoRA (Ours)** | **44.5** | **38.5** | **39.8M (0.59%)** |

Each design contributes meaningfully; block-level adaptation alone provides +2.1% with 50% fewer parameters:

| Block-Level | Spectral Reg. | Fine-Grained Routing | Params | Dolly-15k | BBH |
| --- | --- | --- | --- | --- | --- |
| | | | 75.5M (1.11%) | 41.6 | 35.5 |
| ✓ | | | 37.7M (0.56%) | 43.7 | 37.9 |
| ✓ | ✓ | | 37.7M (0.56%) | 43.6 | 38.4 |
| ✓ | | ✓ | 39.8M (0.59%) | 44.1 | 38.2 |
| ✓ | ✓ | ✓ | 39.8M (0.59%) | 44.5 | 38.5 |

## 🚀 Getting Started

### Requirements

  • Python 3.10+  |  PyTorch 2.1+  |  CUDA 11.8+
  • 1–2 GPUs with ≥16 GB VRAM (for LLaMA-2-7B with DDP)

### Installation

```bash
# Create environment
conda env create -f environment.yml
conda activate mtlora

# Install our custom PEFT library
pip install -e ./peft
```

**Blackwell GPUs (CUDA 12.4+)**

```bash
conda env create -f environment_cu124.yml
conda activate mtlora
pip install torch==2.5.1 --index-url https://download.pytorch.org/whl/cu124
pip install -e ./peft
```

### Data Preparation

**Base Model**: symlink LLaMA-2-7B (required for all experiments):

```bash
ln -s /path/to/llama-2-7b ./data/llama-2-7b
ln -s /path/to/llama-2-13b ./data/llama-2-13b   # Only needed for Table S7
```

**Training Data**: download from Hugging Face:

| Setup | Training Data | Evaluation | HF Source |
| --- | --- | --- | --- |
| BBH | Flan-v2 subset (30k examples) | BBH 3-shot (27 tasks) | `Muennighoff/flan` |
| MMLU | Dolly-15K (instruction tuning) | MMLU 5-shot (57 subjects) | `databricks/databricks-dolly-15k` |

Evaluation datasets (`data/bbh/` and `data/mmlu_dataset/`) are already included.
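As a sketch of what the conversion step for Dolly-15K might look like: records in `databricks/databricks-dolly-15k` carry `instruction`, `context`, and `response` fields, while `train.py` reads a converted dataset from `./data/dolly-15k-converted`. The mapping below is hypothetical; the repo's own conversion script defines the actual schema.

```python
import json

def convert_dolly_record(rec):
    """Map a Dolly-15K record (instruction / context / response fields)
    to a prompt/completion pair. The exact format expected by train.py
    is defined by this repo; this mapping is only an illustration."""
    prompt = rec["instruction"]
    if rec.get("context"):
        prompt += "\n\nContext:\n" + rec["context"]
    return {"prompt": prompt, "completion": rec["response"]}

record = {
    "instruction": "Summarize the passage.",
    "context": "LoRA adapts large models with low-rank updates.",
    "response": "LoRA uses low-rank updates for adaptation.",
}
print(json.dumps(convert_dolly_record(record)))
```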


## 🔧 Reproduce Paper Results

### Main Tables

| Script | Paper Reference | Description |
| --- | --- | --- |
| `bash tables/0_main_ablation.sh` | Table 2 | Contribution of each key design |
| `bash tables/1_routing_granularity.sh` | Table 3 | Routing granularity ablation |
| `bash tables/2_block_level.sh` | Table 4 | Block-level adaptation ablation |
| `bash tables/3_llama13b.sh` | Table S7 | LLaMA-2-13B scalability |

Each script runs both BBH and MMLU experiments end-to-end (training + evaluation).

### Custom Experiments

**BBH Setup**: train on Flan-v2, evaluate on BBH (3-shot)

```bash
# Train
python train.py \
    --method mtlora \
    --model_name_or_path ./data/llama-2-7b \
    --dataset_dir ./data/flan_v2_subset \
    --output_dir ./output/custom_bbh \
    --lora_rank 16 --lora_nums 16 --enable_blc \
    --enable_block_adapter --block_adapter_type ffn \
    --enable_spectral_reg --spectral_reg_lambda 1.0 \
    --enable_fine_grained_routing --routing_group_size 2048 \
    --bf16 --num_train_epochs 1

# Evaluate
python eval_bbh.py \
    --model_name_or_path ./data/llama-2-7b \
    --lora_checkpoint ./output/custom_bbh/sft_lora_model \
    --output_dir ./output/custom_bbh/bbh_eval \
    --num_few_shot 3
```
**MMLU Setup**: train on Dolly-15K, evaluate on MMLU (5-shot)

```bash
# Train
python train.py \
    --method mtlora \
    --model_name_or_path ./data/llama-2-7b \
    --dataset_dir ./data/dolly-15k-converted \
    --output_dir ./output/custom_mmlu \
    --lora_rank 16 --lora_nums 16 --enable_blc \
    --enable_block_adapter --block_adapter_type ffn \
    --enable_spectral_reg --spectral_reg_lambda 0.5 \
    --enable_fine_grained_routing --routing_group_size 2048 \
    --bf16 --num_train_epochs 1

# Evaluate
python eval_mmlu.py \
    --model_name_or_path ./data/llama-2-7b \
    --lora_checkpoint ./output/custom_mmlu/sft_lora_model \
    --output_dir ./output/custom_mmlu/mmlu_5shot \
    --num_few_shot 5 \
    --mmlu_data_dir ./data/mmlu_dataset
```

### Analysis Figures

Scripts for reproducing paper figures are in `tables/analysis/`:

| Script | Paper Figure | Content |
| --- | --- | --- |
| `fig1a_routing_entropy.ipynb` | Figure 1(A) | Regularization–routing trade-off |
| `fig1b_spectral_conflict.ipynb` | Figure 1(B) | Spectral conflict analysis |
| `figS2_sv_spectrum.py` | Figure S2 | SV spectrum visualization |
| `figS3_gradient_perlayer.py` | Figure S3 | Per-layer gradient correlation |
| `figS4_routing_pattern.py` | Figure S4 | Routing weight patterns |

## 💡 Method Overview

Multi-task LoRA suffers from a fundamental regularization–routing trade-off: strengthening regularization to reduce inter-task conflict inadvertently suppresses routing effectiveness. We trace this to two root causes and propose targeted solutions:

*Motivating observations. (A) The regularization–routing trade-off. (B) Shared knowledge concentrates in high-SV components. (C) Block-level adaptation reduces gradient conflict by 76%.*

| Design | Root Cause Addressed | Key Idea |
| --- | --- | --- |
| 🎯 Spectral-Aware Reg. | Uniform regularization disrupts shared knowledge | Weight by w(σ) = exp(−σ/σ̄): orthogonalize low-SV noise, preserve high-SV signal |
| 🔀 Fine-Grained Routing | Scalar routing ignores dimension heterogeneity | Router MLP outputs per-dimension weights Πᵢ ∈ ℝᵍ instead of scalars πᵢ ∈ ℝ |
| 🧱 Block-Level Adaptation | Component-level LoRA amplifies gradient conflicts | Parallel adapter path bypasses Softmax: x' = x + F(LN(x)) + Δ(LN(x)) |
*Overall architecture of mtLoRA. The mtLoRA module (right) is attached as a parallel path after each LayerNorm. A router MLP generates dimension-specific weights to dynamically compose task experts.*
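The spectral weighting rule w(σ) = exp(−σ/σ̄) described above can be sketched as follows. This is a hypothetical NumPy illustration of the idea, not the training code: `spectral_reg` and its pairwise-overlap penalty are assumptions for demonstration; only the weighting rule itself comes from the paper.

```python
import numpy as np

def spectral_weights(delta_w):
    # Singular values of one expert's low-rank update Delta_W = B @ A,
    # weighted by w(sigma) = exp(-sigma / sigma_bar): large (shared-knowledge)
    # components get weight near 0, small (noisy) components weight near 1.
    s = np.linalg.svd(delta_w, compute_uv=False)
    return np.exp(-s / s.mean()), s

def spectral_reg(delta_w_i, delta_w_j, lam=1.0):
    """Hypothetical spectral-aware orthogonality penalty between two
    experts: overlap along direction k is down-weighted by w(sigma_k),
    so high-SV directions are left almost untouched while low-SV
    directions are pushed toward orthogonality."""
    U_i, s_i, _ = np.linalg.svd(delta_w_i, full_matrices=False)
    U_j, _, _ = np.linalg.svd(delta_w_j, full_matrices=False)
    w = np.exp(-s_i / s_i.mean())        # one weight per singular direction
    overlap = U_i.T @ U_j                # cosines between singular directions
    return lam * np.sum((w[:, None] * overlap) ** 2)

rng = np.random.default_rng(0)
dw1 = rng.standard_normal((8, 2)) @ rng.standard_normal((2, 8))  # rank-2 update
dw2 = rng.standard_normal((8, 2)) @ rng.standard_normal((2, 8))
print(spectral_reg(dw1, dw2, lam=1.0) >= 0.0)
```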

## ⚙️ Configuration Reference

### Method Selection

```bash
--method lora          # Standard single LoRA
--method hydralora     # HydraLoRA baseline (multi-expert, no mtLoRA extensions)
--method mtlora        # Full mtLoRA (block adapter + spectral reg + FGR)
```
### mtLoRA Components

```bash
# Block-Level Adaptation
--enable_block_adapter              # Enable block-level instead of component-level
--block_adapter_type ffn            # Options: attention, ffn, both
--block_adapter_style lowrank

# Spectral-Aware Regularization
--enable_spectral_reg               # Enable spectral regularization
--spectral_reg_lambda 1.0           # Regularization strength
--spectral_reg_frequency 1          # SVD frequency (per epoch)

# Fine-Grained Routing
--enable_fine_grained_routing
--routing_group_size 2048           # Smaller = finer granularity (g = d/group_size)
```
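To make the `g = d/group_size` relationship concrete, here is a hypothetical NumPy sketch of fine-grained routing: the router emits one weight per (expert, dimension group) instead of one scalar per expert, and each weight is broadcast over its group of dimensions. Function names and shapes are illustrative only.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def fine_grained_route(x, expert_outs, W_router, group_size):
    """Mix expert outputs with per-group weights (sketch).

    expert_outs: (batch, n_experts, d) outputs of each LoRA expert.
    W_router:    (d, n_experts * g) router weights, g = d // group_size.
    """
    batch, n_experts, d = expert_outs.shape
    g = d // group_size                        # number of groups per expert
    logits = x @ W_router                      # (batch, n_experts * g)
    pi = softmax(logits.reshape(batch, n_experts, g), axis=1)
    pi = np.repeat(pi, group_size, axis=-1)    # broadcast group weight over dims
    return (pi * expert_outs).sum(axis=1)      # per-dimension weighted mix

rng = np.random.default_rng(0)
batch, n_experts, d, group_size = 2, 4, 16, 4
x = rng.standard_normal((batch, d))
expert_outs = rng.standard_normal((batch, n_experts, d))
W_router = rng.standard_normal((d, n_experts * (d // group_size))) * 0.1
y = fine_grained_route(x, expert_outs, W_router, group_size)
print(y.shape)   # (2, 16)
```

With `group_size = d` this collapses to the usual one-scalar-per-expert routing, so the flag interpolates between scalar and fully dimension-wise routing.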
### Common Hyperparameters

```bash
--lora_rank 16                      # LoRA rank
--lora_alpha 64                     # LoRA alpha scaling
--learning_rate 0.0002
--per_device_train_batch_size 16
--num_train_epochs 1
--max_seq_length 512
```
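For reference, `--lora_rank` and `--lora_alpha` interact through the standard LoRA scaling rule: the low-rank update ΔW = BA is multiplied by α/r before being added to the frozen weight, so the defaults above apply a factor of 4.

```python
# Standard LoRA scaling factor alpha / rank with the default flags above.
lora_rank, lora_alpha = 16, 64
scaling = lora_alpha / lora_rank
print(scaling)  # 4.0
```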
### Hardware Requirements

| Experiment | GPU Memory | Recommended |
| --- | --- | --- |
| LLaMA-7B (single GPU) | ~24 GB | RTX PRO 6000 |
| LLaMA-7B (DDP, 2 GPUs) | ~16 GB each | 2× L40 |
| LLaMA-13B | ~48 GB | A100-80GB |

For memory-constrained setups, reduce `--per_device_train_batch_size` and increase `--gradient_accumulation_steps`.


✍️ Citation

If you find this work useful, please consider citing our paper:

```bibtex
@inproceedings{tian2026mtlora,
    title     = {Scalable Multi-Task Low-Rank Model Adaptation},
    author    = {Tian, Zichen and Ledent, Antoine and Sun, Qianru},
    booktitle = {International Conference on Learning Representations (ICLR)},
    year      = {2026}
}
```

## 🙏 Acknowledgment

We gratefully acknowledge the support from the DSO research grant awarded by DSO National Laboratories, Singapore. This project is also partially supported by the Ministry of Education, Singapore, under its Tier-1 Academic Research Fund (No. 24-SIS-SMU-040). We thank the authors of HydraLoRA, MMoELoRA, and LoRAHub for their open-source implementations.

## 📄 License

This project is licensed under the Apache License 2.0.


Keywords: mtLoRA, multi-task LoRA, scalable multi-task LoRA, multi-task low-rank adaptation, parameter-efficient fine-tuning (PEFT), LoRA, low-rank adaptation, mixture of LoRA experts, LLaMA, LLM fine-tuning, spectral regularization, block-level adaptation, fine-grained routing, ICLR 2026
