
DRIFT: Directional Reasoning Injection for Fine-Tuning MLLMs


[Overview figure] DRIFT transfers reasoning from DeepSeek-R1 into Qwen2.5-VL through gradient guidance.


Abstract

Multimodal large language models (MLLMs) are rapidly advancing, yet their reasoning ability often lags behind that of strong text-only counterparts. Existing methods to bridge this gap rely on supervised fine-tuning over large-scale multimodal reasoning data or reinforcement learning, both of which are resource-intensive. A promising alternative is model merging, which interpolates parameters between reasoning-enhanced LLMs and multimodal variants. However, our analysis shows that naive merging is not always a "free lunch": its effectiveness varies drastically across model families, with some (e.g., LLaVA, Idefics) benefiting while others (e.g., Qwen) suffer performance degradation. To address this, we propose Directional Reasoning Injection for Fine-Tuning (DRIFT) MLLMs, a lightweight method that transfers reasoning knowledge in the gradient space, without destabilizing multimodal alignment. DRIFT precomputes a reasoning prior as the parameter-space difference between reasoning and multimodal variants, then uses it to bias gradients during multimodal fine-tuning. This approach preserves the simplicity of standard supervised fine-tuning pipelines while enabling efficient reasoning transfer. Extensive experiments on multimodal reasoning benchmarks, including MathVista and MathVerse, demonstrate that DRIFT consistently improves reasoning performance over naive merging and supervised fine-tuning, while matching or surpassing training-heavy methods at a fraction of the cost.
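
For intuition, the core update can be sketched in a few lines of PyTorch. This is an illustrative sketch, not the repository's implementation: the name/shape matching used to extract the prior, the biasing rule g ← g − λ·Δ, and the strength lam are all assumptions made here for clarity.

import torch

@torch.no_grad()
def compute_reasoning_prior(reasoning_model, multimodal_model):
    # Reasoning prior: parameter-space difference on weights the two variants share.
    mm = dict(multimodal_model.named_parameters())
    return {
        name: p.detach() - mm[name].detach()
        for name, p in reasoning_model.named_parameters()
        if name in mm and p.shape == mm[name].shape
    }

def bias_gradients(model, prior, lam=0.1):
    # Call after loss.backward() and before optimizer.step().
    for name, p in model.named_parameters():
        if p.grad is not None and name in prior:
            # Optimizers step along -grad, so subtracting lam * prior nudges
            # the update toward the reasoning variant's parameters.
            p.grad.add_(prior[name].to(p.grad), alpha=-lam)

In a standard SFT loop, bias_gradients(model, prior) would sit between the backward pass and the optimizer step, leaving the rest of the pipeline unchanged.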


News

  • 2025-10-16 — Initial code release

Environment

DRIFT can be integrated into most LLM/VLM training stacks. This repository provides a reference implementation compatible with LLaMA-Factory.

  • Conda:
# create and activate environment
conda create -n drift python=3.12 -y
conda activate drift
cd LLaMA-Factory
pip install -e ".[torch,metrics]" --no-build-isolation

Quick Start

To train the model, first download the dataset locally:

conda activate drift
git lfs install
cd LLaMA-Factory
git clone https://huggingface.co/datasets/ChaoHuangCS/DRIFT-TL-Distill-4K
  • Then run an example training script:
llamafactory-cli train examples/train_full_merge/qwen2_5vl_full_sft_merge.yaml
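
For orientation, a LLaMA-Factory full-SFT config has roughly the shape below. This is a generic illustration built from LLaMA-Factory's standard fields, not the contents of qwen2_5vl_full_sft_merge.yaml; the DRIFT-specific options live in the shipped config.

# illustrative LLaMA-Factory-style fields (not the shipped config)
model_name_or_path: Qwen/Qwen2.5-VL-7B-Instruct
stage: sft
do_train: true
finetuning_type: full
dataset: your_dataset_name        # as registered in data/dataset_info.json
template: qwen2_vl
output_dir: saves/qwen2_5vl-drift
per_device_train_batch_size: 1
learning_rate: 1.0e-5
num_train_epochs: 1
bf16: true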

Datasets

Our dataset is available on Hugging Face: ChaoHuangCS/DRIFT-TL-Distill-4K

Quick load:

from datasets import load_dataset
ds = load_dataset("ChaoHuangCS/DRIFT-TL-Distill-4K")
print(ds)
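
To peek at a record, something like the following works, assuming the dataset exposes a train split (field names depend on the release):

# the split name "train" is an assumption; adjust to what print(ds) reports
sample = ds["train"][0]
for key, value in sample.items():
    print(key, str(value)[:100])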

Models

Our model is available on Hugging Face: ChaoHuangCS/DRIFT-VL-7B

Quick load:

from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
proc = AutoProcessor.from_pretrained("ChaoHuangCS/DRIFT-VL-7B", trust_remote_code=True)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained("ChaoHuangCS/DRIFT-VL-7B", torch_dtype="auto", device_map="auto", trust_remote_code=True)
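
A minimal inference sketch follows, building on the quick-load above. The image path, prompt, and generation settings are placeholders, not the repository's recommended configuration:

from PIL import Image

# hypothetical input image; replace with your own file
image = Image.open("math_problem.png")
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Solve the problem in the image step by step."},
]}]
text = proc.apply_chat_template(messages, add_generation_prompt=True)
inputs = proc(text=[text], images=[image], return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512)
print(proc.batch_decode(out[:, inputs["input_ids"].shape[-1]:], skip_special_tokens=True)[0])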

Evaluation

We use VLMEvalKit for evaluation. Please follow their instructions: https://github.com/open-compass/VLMEvalKit

Quick start:

# clone and install
git clone https://github.com/open-compass/VLMEvalKit.git
cd VLMEvalKit
pip install -e .

# then follow the repo's Quick Start to select datasets and model adapters
# example (refer to VLMEvalKit docs for exact flags/model tags):
# python run.py --model {{VLMEvalKit_MODEL_TAG}} --datasets {{DATASET_NAME}}

Results

  • Main benchmark results:

| Model | MathVista | MathVision | MathVerse | WeMath | LogicVista |
| --- | --- | --- | --- | --- | --- |
| R1-Onevision-7B | 64.1 | 29.9 | 40.0 | — | 61.8 |
| OpenVLThinker-7B | 65.3 | 23.0 | 38.1 | 35.2 | 44.5 |
| R1-VL-7B | 63.5 | 24.7 | 40.0 | — | — |
| X-REASONER | 69.0 | 29.6 | — | — | — |
| Qwen2.5-VL (SFT) | 68.7 | 25.1 | 42.0 | 33.3 | 45.6 |
| DRIFT (Ours) | 70.3 (+1.6) | 26.5 (+1.4) | 43.7 (+1.7) | 36.9 (+3.6) | 45.6 (+0.0) |

Cite Us

If you find this work useful, please cite:

@article{huang2025drift,
  title={Directional Reasoning Injection for Fine-Tuning {MLLMs}},
  author={Huang, Chao and Zhang, Zeliang and Liu, Jiang and Sun, Ximeng and Wu, Jialian and Yu, Xiaodong and Wang, Ze and Xu, Chenliang and Barsoum, Emad and Liu, Zicheng},
  journal={arXiv preprint arXiv:2510.15050},
  year={2025},
  url={https://arxiv.org/abs/2510.15050}
}


License

This project is licensed under the MIT License. See the LICENSE file for details.
