- [2025-11]: We have created two fun slides (Doraemon & Pokemon) to explain OpenMMReasoner. Enjoy :) Credit to the amazing NotebookLM and Gemini-3.
- [2025-11]: 🏆 #1 Paper of the Day on Hugging Face Daily Papers (Nov. 24, 2025). Check out our OpenMMReasoner HF Daily Paper!
- [2025-11]: Join our WeChat group by scanning this QR code.
- [2025-11]: We release all of our code, model, data, and pipeline! Check out the OpenMMReasoner collection on Hugging Face.
Recent advancements in large reasoning models have fueled growing interest in extending such capabilities to multimodal domains. However, despite notable progress in visual reasoning, the lack of transparent and reproducible data curation and training strategies remains a major barrier to scalable research.
In this work, we introduce OpenMMReasoner, a fully transparent two-stage recipe for multimodal reasoning spanning supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we construct an 874K-sample cold-start dataset with rigorous step-by-step validation, providing a strong foundation for reasoning capabilities. The subsequent RL stage leverages a 74K-sample dataset across diverse domains to further sharpen and stabilize these abilities, resulting in a more robust and efficient learning process. Extensive evaluations demonstrate that our training recipe not only surpasses strong baselines but also highlights the critical role of data quality and training design in shaping multimodal reasoning performance. Notably, our method achieves an 11.6% improvement over the Qwen2.5-VL-7B-Instruct baseline across nine multimodal reasoning benchmarks, establishing a solid empirical foundation for future large-scale multimodal reasoning research.
Please follow the installation instructions in lmms-engine to prepare the environment for supervised fine-tuning.
We provide our source verl code, which is a detached fork from the original verl. You can choose to use either our version (included in this repository) or the original verl for RL training.
The installation steps are similar to the standard verl setup. Please follow the instructions from verl to install all requirements, using an updated version of vLLM. Additionally, you need to install math-verify to use our reward function:
```bash
pip install math-verify
```

For our RL training pipeline, we use the following package versions:
```
transformers==4.57.1
vllm==0.11.0
```
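To pin everything in one step, a minimal sketch (adjust to your CUDA and PyTorch setup):

```bash
# Install the pinned versions used in our RL pipeline plus the reward dependency
pip install "transformers==4.57.1" "vllm==0.11.0" math-verify
```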
Please follow the installation instructions in lmms-eval to set up the evaluation environment.
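A typical from-source setup looks roughly like this (the repository URL and editable install are assumptions; the lmms-eval README is authoritative):

```bash
# Assumed from-source install of the evaluation framework
git clone https://github.com/EvolvingLMMs-Lab/lmms-eval.git
cd lmms-eval
pip install -e .
```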
We have open-sourced our data processing pipeline and code for the community to follow. To install the requirements for the data pipeline:
```bash
cd ./data_pipeline
uv pip install -e .
```

We recommend using a separate environment if you run into dependency conflicts.
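For example, an isolated environment for the data pipeline could look like this (the environment name is hypothetical):

```bash
# Create and activate a dedicated virtual environment for the data pipeline
uv venv .venv-datapipeline
source .venv-datapipeline/bin/activate
uv pip install -e .
```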
We provide a convenient script to download all the required datasets from Hugging Face:
```bash
bash examples/openmmreasoner/download_data.sh [LOCAL_DIR]
```

This script will download both the SFT (874K samples) and RL (74K samples) datasets to your specified directory (defaults to `./data`).
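For example (the custom path below is hypothetical):

```bash
# Download to the default ./data directory
bash examples/openmmreasoner/download_data.sh

# Or download to a custom location
bash examples/openmmreasoner/download_data.sh /mnt/storage/openmmreasoner
```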
After installing lmms-engine, you can launch SFT training using either:
Option 1: Using a configuration YAML file
```bash
# Edit the dataset paths in sft_example_config.yaml
torchrun --nproc_per_node="8" \
    --nnodes="1" \
    --node_rank="0" \
    --master_addr="127.0.0.1" \
    --master_port="8000" \
    -m lmms_engine.launch.cli config_yaml=${CONFIG}
```
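Here, `${CONFIG}` should point at your edited YAML file; for example (the exact path of the example config is an assumption, adjust it to where the file lives in this repository):

```bash
# Hypothetical path to the example SFT config
export CONFIG=examples/openmmreasoner/sft_example_config.yaml
```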
Option 2: Using the launch script

```bash
# Edit the dataset paths and hyperparameters in the script
bash examples/openmmreasoner/sft_example_launch.sh
```

Troubleshooting:
- If you encounter OOM (Out of Memory) errors, reduce the `packing_length` parameter in your configuration.
- If mixing text and image data causes a hang, consider adding a blank dummy image for text-only samples in the m1 dataset.
We provide two example scripts for RL training:
Option 1: Local training
```bash
bash examples/openmmreasoner/gspo_n16.sh
```

Option 2: Training with Ray
To launch training in a multi-node environment, first set up Ray on your head and worker nodes, then submit the job as shown in the bash script.
```bash
bash examples/openmmreasoner/gspo_ray.sh
```

Make sure to update the `DATA_FOLDER` and `PROJECT_FOLDER` paths in the scripts before launching.
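As a rough sketch of the multi-node setup (the head-node address and port below are placeholders; Ray's own documentation is authoritative):

```bash
# On the head node: start Ray and note its address
ray start --head --port=6379

# On each worker node: join the cluster (replace with your head node's IP)
ray start --address="<head-node-ip>:6379"

# Back on the head node: submit the RL job
bash examples/openmmreasoner/gspo_ray.sh
```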
After setting up lmms-eval, use the provided evaluation script:
```bash
bash examples/openmmreasoner/eval.sh <CHECKPOINT_PATH> <TASK_NAME>
```

Image Tasks:
```bash
bash examples/openmmreasoner/eval.sh /path/to/checkpoint "mmmu_reasoning_reward,wemath_testmini_thinking,mmmu_pro_vision_cot_reward,mmmu_pro_standard_cot_reward,mathvista_testmini_cot_reward,mathvision_reason_testmini_reward,mathvision_reason_test_reward,mathverse_testmini_reward,logicvista_thinking,dynamath,charxiv_val_descriptive_cot,charxiv_val_reasoning_cot"
```

Text Tasks:
```bash
bash examples/openmmreasoner/eval.sh /path/to/checkpoint "gpqa_diamond_thinking,aime_agg8"
```

We use an LLM as a judge for both evaluation and RL reward calculation. Our default judge model is Qwen/Qwen3-235B-A22B-Instruct-2507.
Steps:
- Set up a server using vLLM or SGLang:
```bash
# Example with SGLang
python3 -m sglang.launch_server --model-path Qwen/Qwen3-235B-A22B-Instruct-2507 \
--tp-size 8 \
--dp-size 1 \
--served-model-name judge \
--port 8000 \
--host 0.0.0.0 --mem-fraction-static 0.75
```
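If you prefer vLLM, a comparable setup is sketched below using vLLM's OpenAI-compatible server (not the repository's provided script; adjust parallelism and memory settings to your hardware):

```bash
# Example with vLLM (sketch; serves the judge model on an OpenAI-compatible endpoint)
vllm serve Qwen/Qwen3-235B-A22B-Instruct-2507 \
    --tensor-parallel-size 8 \
    --served-model-name judge \
    --port 8000 \
    --host 0.0.0.0
```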
- Update the judge service address in your scripts:
  - For RL training: Update `OPENAI_BASE_URL` in `gspo_n16.sh` or `gspo_ray.sh`
  - For evaluation: Update `OPENAI_BASE_URL` in `eval.sh`
```bash
export OPENAI_API_KEY="EMPTY"
export OPENAI_BASE_URL="http://your-judge-server-address:8000/v1"
export OPENAI_MODEL_NAME="judge"
export USE_LLM_JUDGE="True"
```

To follow our data processing pipeline, we provide example scripts in `data_pipeline/examples/`. The pipeline supports two main operations: deduplication and distillation.
To deduplicate RL training data, follow these steps:
- Prepare the RL configuration: Create a YAML config file based on `data_pipeline/examples/example_rl_config.yaml`:
```yaml
datasets:
  - path: /path/to/your/dataset.parquet
    data_folder: "/path/to/images"
    data_type: parquet
```

- Run embedding: Generate embeddings for the dataset:
```bash
cd data_pipeline
bash examples/embed_data.sh /path/to/your_rl_config.yaml cache/embed rl
```

- Run deduplication: Remove duplicates based on embeddings:
```bash
bash examples/deduplicate_data.sh /path/to/your_rl_config.yaml cache/embed rl cache/deduplicate
```

To distill a dataset using a teacher model:
- Prepare the SFT configuration: Create a YAML config file based on `data_pipeline/examples/example_sft_config.yaml`:
```yaml
datasets:
  - path: /path/to/your/dataset.parquet
    data_folder: "/path/to/images"
    data_type: parquet
```

- Run distillation: Edit `data_pipeline/examples/distill_dataset.sh` to set your server addresses, then run:
```bash
cd data_pipeline
bash examples/distill_dataset.sh
```

Make sure to configure the model server and judge server URLs in the script before running.
Our OpenMMReasoner-7B (OMR-7B) model demonstrates strong performance across a comprehensive suite of multimodal reasoning benchmarks. With only 874K SFT samples and 74K RL samples—significantly less data than many competing methods—our model achieves state-of-the-art or highly competitive results on 9 out of 14 benchmark tasks. Notably, OMR-7B achieves 79.5% on MathVista testmini (best among all models), 63.8% on MathVerse testmini (best), and 79.0% on WeMath loose (best), demonstrating the effectiveness of our transparent two-stage training recipe. This performance validates our emphasis on data quality and rigorous training design over simply scaling dataset size.
| Model | SFT Data | RL Data | MathVista testmini | MathVision test | MathVision testmini | MathVerse testmini | DynaMath worst | WeMath loose | LogicVista test | MMMU val | MMMU-Pro standard | MMMU-Pro vision | CharXiv reas. | CharXiv desc. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| VLAA-Thinker-Qwen2.5-7B | 126k | 25k | 68.0 | 26.4 | - | 48.2 | 22.4 | - | 48.5 | - | - | - | - | - |
| ThinkLite-7B-VL | - | 11k | 71.6 | 24.6 | - | 42.9 | 16.5 | - | 42.7 | - | - | - | - | - |
| VL-Rethinker-7B | - | 39k | 73.7 | 28.4 | - | 46.4 | 17.8 | - | 42.7 | - | 41.7 | - | - | - |
| M2-Reasoning | 6.2M | 102k | 75.0 | 42.1 | - | 40.4 | - | - | 50.6 | - | - | - | - | - |
| MMR1 | 1.6M | 15k | 72.0 | 31.8 | 29.0† | 55.4 | 27.9† | 68.0† | 48.9 | 52.4† | 41.1† | 37.1† | 43.5† | 71.1† |
| OpenVLThinker-7B | 3.3k | 9.6k | 65.3 | 23.0 | 26.9† | 38.1 | 16.8 | 61.9† | 44.5 | 55.1† | 39.7† | 38.4† | 41.0† | 69.2† |
| MM-Eureka-Qwen-7B | - | 15.6k | 72.6 | 28.1 | 32.1† | 45.4 | 23.0 | 59.8† | 46.3 | 54.4† | 40.1† | 37.1† | 42.4† | 74.1† |
| OVR-7B | 2M | 300k | 72.1 | 51.8 | 38.2† | 54.6 | 33.5 | 64.8 | 54.8 | 51.8† | 50.2 | 29.1† | 44.5 | 73.6 |
| OMR-7B (ours) | 874k | 74k | 79.5 | 43.6 | 38.8 | 63.8 | 34.9 | 79.0 | 50.0 | 57.8 | 44.1 | 40.6 | 46.1 | 73.5 |
Note: Bold numbers indicate the best performance, and † indicates results reproduced using the authors' checkpoints.
If you find OpenMMReasoner useful for your research and applications, please cite using this BibTeX:
```bibtex
@misc{zhang2025openmmreasonerpushingfrontiersmultimodal,
title={OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe},
author={Kaichen Zhang and Keming Wu and Zuhao Yang and Kairui Hu and Bin Wang and Ziwei Liu and Xingxuan Li and Lidong Bing},
year={2025},
eprint={2511.16334},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2511.16334},
}
```

We gratefully acknowledge the following open-source projects that made this work possible:
- lmms-eval for providing the comprehensive evaluation framework for large multimodal models.
- lmms-engine for the SFT training infrastructure and tools.
- verl for the reinforcement learning training framework.
We thank the developers and contributors of these projects for their excellent work and for making their code publicly available.

