DRIFT transfers reasoning from DeepSeek-R1 into Qwen2.5-VL through gradient guidance.
Multimodal large language models (MLLMs) are rapidly advancing, yet their reasoning ability often lags behind that of strong text-only counterparts. Existing methods to bridge this gap rely on supervised fine-tuning over large-scale multimodal reasoning data or on reinforcement learning, both of which are resource-intensive. A promising alternative is *model merging*, which interpolates parameters between reasoning-enhanced LLMs and multimodal variants. However, our analysis shows that naive merging is not always a "free lunch": its effectiveness varies drastically across model families, with some (e.g., LLaVA, Idefics) benefiting while others (e.g., Qwen) suffer performance degradation. To address this, we propose Directional Reasoning Injection for Fine-Tuning (DRIFT), a lightweight method that transfers reasoning knowledge in the gradient space without destabilizing multimodal alignment. DRIFT precomputes a reasoning prior as the parameter-space difference between reasoning and multimodal variants, then uses it to bias gradients during multimodal fine-tuning. This approach preserves the simplicity of standard supervised fine-tuning pipelines while enabling efficient reasoning transfer. Extensive experiments on multimodal reasoning benchmarks, including MathVista and MathVerse, demonstrate that DRIFT consistently improves reasoning performance over naive merging and supervised fine-tuning, while matching or surpassing training-heavy methods at a fraction of the cost.
- 2025-10-16 — Initial code release
DRIFT can be integrated into most LLM/VLM training stacks. This repository provides a reference implementation compatible with LLaMA-Factory.
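For intuition, here is a minimal sketch of the idea described above: precompute a reasoning prior as a parameter-space difference, then use it to bias gradients during multimodal SFT. This is an illustration only, not the repository's implementation; `build_reasoning_prior`, `bias_gradients`, and `drift_lambda` are hypothetical names, and the exact weighting scheme used by DRIFT may differ.

```python
# Illustrative sketch of directional reasoning injection (not the reference
# implementation). Assumes the reasoning LLM and the MLLM's language backbone
# share parameter names and shapes.
import torch

@torch.no_grad()
def build_reasoning_prior(reasoning_llm, mllm_language_model):
    """Reasoning prior = parameter-space difference (reasoning minus multimodal)."""
    reasoning_params = dict(reasoning_llm.named_parameters())
    prior = {}
    for name, param in mllm_language_model.named_parameters():
        if name in reasoning_params and reasoning_params[name].shape == param.shape:
            prior[name] = (reasoning_params[name] - param).detach().cpu()
    return prior

def bias_gradients(mllm_language_model, prior, drift_lambda=0.1):
    """Bias gradients toward the reasoning direction during multimodal SFT.

    Subtracting drift_lambda * prior from each gradient makes the optimizer
    step drift the language backbone toward the reasoning weights while it
    continues to fit the multimodal data.
    """
    for name, param in mllm_language_model.named_parameters():
        if param.grad is not None and name in prior:
            direction = prior[name].to(device=param.grad.device, dtype=param.grad.dtype)
            param.grad.add_(direction, alpha=-drift_lambda)

# In a standard training loop:
#   loss.backward()
#   bias_gradients(model.language_model, prior, drift_lambda=0.1)
#   optimizer.step()
```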
- Conda:
# create and activate environment
conda create -n drift python=3.12 -y
conda activate drift
cd LLaMA-Factory
pip install -e ".[torch,metrics]" --no-build-isolation
To train the model, you may need to first download the dataset locally:
conda activate drift
git lfs install
cd LLaMA-Factory
git clone https://huggingface.co/datasets/ChaoHuangCS/DRIFT-TL-Distill-4K
- Then run an example script:
llamafactory-cli train examples/train_full_merge/qwen2_5vl_full_sft_merge.yaml
Our dataset is available on Hugging Face: ChaoHuangCS/DRIFT-TL-Distill-4K
Quick load:
from datasets import load_dataset
ds = load_dataset("ChaoHuangCS/DRIFT-TL-Distill-4K")
print(ds)
Our model is available on Hugging Face: ChaoHuangCS/DRIFT-VL-7B
Quick load:
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
proc = AutoProcessor.from_pretrained("ChaoHuangCS/DRIFT-VL-7B", trust_remote_code=True)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained("ChaoHuangCS/DRIFT-VL-7B", torch_dtype="auto", device_map="auto", trust_remote_code=True)
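As a quick sanity check, the loaded model can be run on a single image question. The sketch below reuses `proc` and `model` from the snippet above, assumes the optional `qwen-vl-utils` helper package is installed, and uses a placeholder image path and prompt.

```python
import torch
from qwen_vl_utils import process_vision_info  # optional helper for Qwen-VL inputs

# Placeholder image path and prompt; replace with your own problem image.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "path/to/problem.png"},
        {"type": "text", "text": "Solve the problem in the image step by step."},
    ],
}]

text = proc.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = proc(text=[text], images=image_inputs, videos=video_inputs,
              padding=True, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=512)

# Strip the prompt tokens and decode only the generated continuation.
answer = proc.batch_decode(output_ids[:, inputs["input_ids"].shape[1]:],
                           skip_special_tokens=True)[0]
print(answer)
```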
We use VLMEvalKit for evaluation. Please follow their instructions: https://github.com/open-compass/VLMEvalKit
Quick start:
# clone and install
git clone https://github.com/open-compass/VLMEvalKit.git
cd VLMEvalKit
pip install -e .
# then follow the repo's Quick Start to select datasets and model adapters
# example (refer to VLMEvalKit docs for exact flags/model tags):
# python run.py --model {{VLMEvalKit_MODEL_TAG}} --datasets {{DATASET_NAME}}
- Main benchmark:
| Model | MathVista | MathVision | MathVerse | WeMath | LogicVista |
|---|---|---|---|---|---|
| R1-Onevision-7B | 64.1 | 29.9 | 40.0 | — | 61.8 |
| OpenVLThinker-7B | 65.3 | 23.0 | 38.1 | 35.2 | 44.5 |
| R1-VL-7B | 63.5 | 24.7 | 40.0 | — | — |
| X-REASONER | 69.0 | 29.6 | — | — | — |
| Qwen2.5-VL (SFT) | 68.7 | 25.1 | 42.0 | 33.3 | 45.6 |
| DRIFT (Ours) | 70.3 (+1.6) | 26.5 (+1.4) | 43.7 (+1.7) | 36.9 (+3.6) | 45.6 (+0.0) |
If you find this work useful, please cite:
@article{huang2025drift,
title={Directional Reasoning Injection for Fine-Tuning {MLLMs}},
author={Huang, Chao and Zhang, Zeliang and Liu, Jiang and Sun, Ximeng and Wu, Jialian and Yu, Xiaodong and Wang, Ze and Xu, Chenliang and Barsoum, Emad and Liu, Zicheng},
journal={arXiv preprint arXiv:2510.15050},
year={2025},
url={https://arxiv.org/abs/2510.15050}
}
- This project builds on LLaMA-Factory. Thanks to the authors and contributors.
- Evaluation leverages VLMEvalKit.
This project is licensed under the MIT License. See the LICENSE file for details.
