OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe

OpenMMReasoner Cover

Models · Data · Paper · Project Page · GitHub

Overview

Recent advancements in large reasoning models have fueled growing interest in extending such capabilities to multimodal domains. However, despite notable progress in visual reasoning, the lack of transparent and reproducible data curation and training strategies remains a major barrier to scalable research.

In this work, we introduce OpenMMReasoner, a fully transparent two-stage recipe for multimodal reasoning spanning supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we construct an 874K-sample cold-start dataset with rigorous step-by-step validation, providing a strong foundation for reasoning capabilities. The subsequent RL stage leverages a 74K-sample dataset across diverse domains to further sharpen and stabilize these abilities, resulting in a more robust and efficient learning process. Extensive evaluations demonstrate that our training recipe not only surpasses strong baselines but also highlights the critical role of data quality and training design in shaping multimodal reasoning performance. Notably, our method achieves an 11.6% improvement over the Qwen2.5-VL-7B-Instruct baseline across nine multimodal reasoning benchmarks, establishing a solid empirical foundation for future large-scale multimodal reasoning research.

Installation

1. SFT Training

Please follow the installation instructions in lmms-engine to prepare the environment for supervised fine-tuning.

2. RL Training

We provide our verl source code, a detached fork of the original verl. You can use either our version (included in this repository) or the original verl for RL training.

The installation steps are similar to the standard verl setup. Please follow the instructions from verl to install all requirements, using an updated version of vLLM. Additionally, install math-verify to use our reward function:

pip install math-verify
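
As a quick sanity check, here is a minimal sketch of how math-verify scores a predicted answer against a gold answer (illustrative only; the actual reward function is defined in our verl fork):

# Minimal math-verify usage sketch (illustrative; not the actual reward code)
from math_verify import parse, verify

gold = parse("$\\frac{1}{2}$")
pred = parse("$0.5$")
print(verify(gold, pred))  # True when the two expressions are mathematically equivalent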

For our RL training pipeline, we use the following package versions:

  • transformers==4.57.1
  • vllm==0.11.0

3. Evaluation

Please follow the installation instructions in lmms-eval to set up the evaluation environment.

4. Data Pipeline

We open-source our data processing pipeline and code for the community. To install the requirements for the data pipeline:

cd ./data_pipeline

uv pip install -e .

We recommend using a separate environment if you encounter requirement conflicts.

Getting Started

Data Preparation

We provide a convenient script to download all the required datasets from Hugging Face:

bash examples/openmmreasoner/download_data.sh [LOCAL_DIR]

This script will download both the SFT (874K samples) and RL (74K samples) datasets to your specified directory (defaults to ./data).

SFT Training

After installing lmms-engine, you can launch SFT training using either:

Option 1: Using a configuration YAML file

# Edit the dataset paths in sft_example_config.yaml
torchrun --nproc_per_node="8" \
    --nnodes="1" \
    --node_rank="0" \
    --master_addr="127.0.0.1" \
    --master_port="8000" \
    -m lmms_engine.launch.cli config_yaml=${CONFIG}

Option 2: Using the launch script

# Edit the dataset paths and hyperparameters in the script
bash examples/openmmreasoner/sft_example_launch.sh

Troubleshooting:

  • If you encounter OOM (Out of Memory) errors, reduce the packing_length parameter in your configuration.
  • If mixing text and image data causes a hang, consider adding a blank dummy image for text-only samples in the m1 dataset.
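
For the second point, a minimal sketch of generating a blank dummy image with PIL (the file name and image size are illustrative assumptions, not repo requirements):

# Sketch: create a small blank image to attach to text-only samples (assumes PIL is installed)
from PIL import Image

def make_dummy_image(path="dummy_blank.png", size=(32, 32)):
    # A plain white RGB placeholder; the exact size is an assumption
    Image.new("RGB", size, color=(255, 255, 255)).save(path)
    return path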

RL Training

We provide two example scripts for RL training:

Option 1: Local training

bash examples/openmmreasoner/gspo_n16.sh

Option 2: Training with Ray

To launch training in a multi-node environment, first set up Ray on your head and worker nodes, then submit the job as shown in the bash script.

bash examples/openmmreasoner/gspo_ray.sh

Make sure to update the DATA_FOLDER and PROJECT_FOLDER paths in the scripts before launching.
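
If you prefer to submit the job from Python rather than from the shell script, a minimal sketch using Ray's job submission API (the dashboard address and working_dir are assumptions about your cluster setup):

# Sketch: submit the RL job to a running Ray cluster (head node address is an assumption)
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://<head-node-ip>:8265")  # default Ray dashboard port
job_id = client.submit_job(
    entrypoint="bash examples/openmmreasoner/gspo_ray.sh",
    runtime_env={"working_dir": "."},  # ship the repo to the cluster
)
print(job_id)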

Evaluation

After setting up lmms-eval, use the provided evaluation script:

bash examples/openmmreasoner/eval.sh <CHECKPOINT_PATH> <TASK_NAME>

Image Tasks:

bash examples/openmmreasoner/eval.sh /path/to/checkpoint "mmmu_reasoning_reward,wemath_testmini_thinking,mmmu_pro_vision_cot_reward,mmmu_pro_standard_cot_reward,mathvista_testmini_cot_reward,mathvision_reason_testmini_reward,mathvision_reason_test_reward,mathverse_testmini_reward,logicvista_thinking,dynamath,charxiv_val_descriptive_cot,charxiv_val_reasoning_cot"

Text Tasks:

bash examples/openmmreasoner/eval.sh /path/to/checkpoint "gpqa_diamond_thinking,aime_agg8"

LLM Judge Setup

We use an LLM as judge for both evaluation and RL reward calculation. Our default judge model is Qwen/Qwen3-235B-A22B-Instruct-2507.

Steps:

  1. Set up a server using vLLM or SGLang:
# Example with SGLang
python3 -m sglang.launch_server --model-path Qwen/Qwen3-235B-A22B-Instruct-2507 \
     --tp-size 8 \
     --dp-size 1 \
     --served-model-name judge \
     --port 8000 \
     --host 0.0.0.0 --mem-fraction-static 0.75
  2. Update the judge service address in your scripts:
    • For RL training: Update OPENAI_BASE_URL in gspo_n16.sh or gspo_ray.sh
    • For evaluation: Update OPENAI_BASE_URL in eval.sh
export OPENAI_API_KEY="EMPTY"
export OPENAI_BASE_URL="http://your-judge-server-address:8000/v1"
export OPENAI_MODEL_NAME="judge"
export USE_LLM_JUDGE="True"
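
Before launching training or evaluation, you can verify that the judge endpoint is reachable with a minimal check like the following (assumes the openai Python package; the prompt is purely illustrative):

# Sketch: sanity-check the judge server via its OpenAI-compatible API
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY", "EMPTY"),
    base_url=os.environ.get("OPENAI_BASE_URL", "http://your-judge-server-address:8000/v1"),
)
resp = client.chat.completions.create(
    model=os.environ.get("OPENAI_MODEL_NAME", "judge"),
    messages=[{"role": "user", "content": "Reply with OK if you can read this."}],
)
print(resp.choices[0].message.content)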

Data Processing Pipeline

To follow our data processing pipeline, we provide example scripts in data_pipeline/examples/. The pipeline supports two main operations:

Deduplicating RL Data

To deduplicate RL training data, follow these steps:

  1. Prepare the RL configuration: Create a YAML config file based on data_pipeline/examples/example_rl_config.yaml:
datasets:
  - path: /path/to/your/dataset.parquet
    data_folder: "/path/to/images"
    data_type: parquet
  2. Run embedding: Generate embeddings for the dataset:
cd data_pipeline
bash examples/embed_data.sh /path/to/your_rl_config.yaml cache/embed rl
  3. Run deduplication: Remove duplicates based on embeddings:
bash examples/deduplicate_data.sh /path/to/your_rl_config.yaml cache/embed rl cache/deduplicate
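
For intuition, a conceptual sketch of the embedding-based deduplication idea (illustrative only; the actual implementation lives in data_pipeline, and the 0.95 threshold is an assumption, not the pipeline's setting):

# Sketch: greedy similarity-based deduplication over precomputed embeddings
import numpy as np

def deduplicate(embeddings: np.ndarray, threshold: float = 0.95) -> list:
    # embeddings: (N, D) array, assumed L2-normalized so a dot product equals cosine similarity
    kept = []
    for i, emb in enumerate(embeddings):
        if all(float(emb @ embeddings[j]) < threshold for j in kept):
            kept.append(i)  # keep a sample only if it is not too similar to any kept sample
    return kept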

Distilling Dataset

To distill a dataset using a teacher model:

  1. Prepare the SFT configuration: Create a YAML config file based on data_pipeline/examples/example_sft_config.yaml:
datasets:
  - path: /path/to/your/dataset.parquet
    data_folder: "/path/to/images"
    data_type: parquet
  2. Run distillation: Edit data_pipeline/examples/distill_dataset.sh to set your server addresses, then run:
cd data_pipeline
bash examples/distill_dataset.sh

Make sure to configure the model server and judge server URLs in the script before running.
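
For reference, a conceptual sketch of what a single distillation request could look like, assuming an OpenAI-compatible teacher endpoint (the server address, served model name, and prompt are illustrative assumptions, not the pipeline's actual code):

# Sketch: ask a teacher model for a step-by-step solution to one image-question pair
import base64
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://your-teacher-server:8000/v1")

def distill_one(image_path, question):
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="teacher",  # hypothetical served model name
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
                {"type": "text", "text": question + "\nThink step by step."},
            ],
        }],
    )
    return resp.choices[0].message.content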

Evaluation Results

Our OpenMMReasoner-7B (OMR-7B) model demonstrates strong performance across a comprehensive suite of multimodal reasoning benchmarks. With only 874K SFT samples and 74K RL samples—significantly less data than many competing methods—our model achieves state-of-the-art or highly competitive results on 9 out of 14 benchmark tasks. Notably, OMR-7B achieves 79.5% on MathVista testmini (best among all models), 63.8% on MathVerse testmini (best), and 79.0% on WeMath loose (best), demonstrating the effectiveness of our transparent two-stage training recipe. This performance validates our emphasis on data quality and rigorous training design over simply scaling dataset size.

| Model | SFT Data | RL Data | MathVista testmini | MathVision test | MathVision testmini | MathVerse testmini | DynaMath worst | WeMath loose | LogicVista test | MMMU val | MMMU-Pro standard | MMMU-Pro vision | CharXiv reas. | CharXiv desc. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| VLAA-Thinker-Qwen2.5-7B | 126k | 25k | 68.0 | 26.4 | - | 48.2 | 22.4 | - | 48.5 | - | - | - | - | - |
| ThinkLite-7B-VL | - | 11k | 71.6 | 24.6 | - | 42.9 | 16.5 | - | 42.7 | - | - | - | - | - |
| VL-Rethinker-7B | - | 39k | 73.7 | 28.4 | - | 46.4 | 17.8 | - | 42.7 | - | 41.7 | - | - | - |
| M2-Reasoning | 6.2M | 102k | 75.0 | 42.1 | - | 40.4 | - | - | 50.6 | - | - | - | - | - |
| MMR1 | 1.6M | 15k | 72.0 | 31.8 | 29.0† | 55.4 | 27.9† | 68.0† | 48.9 | 52.4† | 41.1† | 37.1† | 43.5† | 71.1† |
| OpenVLThinker-7B | 3.3k | 9.6k | 65.3 | 23.0 | 26.9† | 38.1 | 16.8 | 61.9† | 44.5 | 55.1† | 39.7† | 38.4† | 41.0† | 69.2† |
| MM-Eureka-Qwen-7B | - | 15.6k | 72.6 | 28.1 | 32.1† | 45.4 | 23.0 | 59.8† | 46.3 | 54.4† | 40.1† | 37.1† | 42.4† | **74.1†** |
| OVR-7B | 2M | 300k | 72.1 | **51.8** | 38.2† | 54.6 | 33.5 | 64.8 | **54.8** | 51.8† | **50.2** | 29.1† | 44.5 | 73.6 |
| OMR-7B (ours) | 874k | 74k | **79.5** | 43.6 | **38.8** | **63.8** | **34.9** | **79.0** | 50.0 | **57.8** | 44.1 | **40.6** | **46.1** | 73.5 |

Note: Bold numbers indicate the best performance, and † indicates results reproduced using the authors' checkpoints.

Citation

If you find OpenMMReasoner useful for your research and applications, please cite using this BibTeX:

@misc{zhang2025openmmreasonerpushingfrontiersmultimodal,
      title={OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe}, 
      author={Kaichen Zhang and Keming Wu and Zuhao Yang and Kairui Hu and Bin Wang and Ziwei Liu and Xingxuan Li and Lidong Bing},
      year={2025},
      eprint={2511.16334},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2511.16334}, 
}

Acknowledgements

We gratefully acknowledge the following open-source projects that made this work possible:

  • lmms-eval for providing the comprehensive evaluation framework for large multimodal models.
  • lmms-engine for the SFT training infrastructure and tools.
  • verl for the reinforcement learning training framework.

We thank the developers and contributors of these projects for their excellent work and for making their code publicly available.

⭐ Star History

Star History Chart
