Contributions are most welcome! If you have any suggestions or improvements, feel free to create an issue or open a pull request.
Date | Project | SFT | RL | Task |
---|---|---|---|---|
25.03 | MetaSpatial: Reinforcing 3D Spatial Reasoning in VLMs [🖥️Code](https://github.com/PzySeere/MetaSpatial) | - | GRPO | 3D spatial reasoning |
25.03 | CMMCoT: Enhancing Complex Multi-Image Comprehension via Multi-Modal Chain-of-Thought and Memory Augmentation [📑Paper] | 260k SFT data | - | Multi-Image Benchmark |
25.03 | VisualPRM: An Effective Process Reward Model for Multimodal Reasoning [📑Paper][Model][Data][Benchmark] | - | Process Reward Model | Math & MMMU |
25.03 | R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization [📑Paper][Project website][🖥️Code] | 155k R1-OneVision | GRPO | Math |
25.03 | MMR1: Advancing the Frontiers of Multimodal Reasoning [🖥️Code] | - | GRPO | Math |
25.03 (CVPR 2025) | GFlowVLM: Enhancing Multi-step Reasoning in Vision-Language Models with Generative Flow Networks [📑Paper] | - | GFlowNets | NumberLine (NL) and BlackJack (BJ) |
25.03 | VisRL: Intention-Driven Visual Perception via Reinforced Reasoning [📑Paper][🖥️Code] | warm-up | DPO | Various VQA |
25.03 | Visual-RFT: Visual Reinforcement Fine-Tuning [📑Paper][🖥️Code] | - | GRPO | Detection, Grounding, Classification |
25.03 | LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL [📑Paper][🖥️Code] | - | PPO | Math, Sokoban-Global, Football-Online |
25.03 | Boosting the Generalization and Reasoning of Vision Language Models with Curriculum Reinforcement Learning [📑Paper] | Self-Improvement Training | GRPO | Detection, Classification, Math |
25.03 | Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models [📑Paper][🖥️Code] | - | GRPO | Math |
25.03 | Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement [📑Paper][🖥️Code] | - | GRPO | RefCOCO&ReasonSeg |
25.03 | R1-Zero’s “Aha Moment” in Visual Reasoning on a 2B Non-SFT Model [📑Paper][🖥️Code] | - | GRPO | CVBench |
25.03 | MM-EUREKA: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning [📑Paper][🖥️Code] | - | RLOO | Math |
25.03 | Unified Reward Model for Multimodal Understanding and Generation [📑Paper][🖥️Code] | - | DPO | Various VQA & Generation |
25.03 | EasyR1: An Efficient, Scalable, Multi-Modality RL Training Framework [🖥️Code] | - | GRPO | Geometry3K |
25.02 | MM-RLHF: The Next Step Forward in Multimodal LLM Alignment [📑Paper][🖥️Code] | - | DPO with 120k fine-grained, human-annotated preference comparison pairs. | Reward & Various VQA |
25.02 | OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference [📑Paper][🖥️Code] | 200k sft data | DPO | Alignment & Various VQA |
25.02 | Multimodal Open R1 [🖥️Code] | - | GRPO | MathVista-mini, MMMU |
25.02 | VLM-R1: A stable and generalizable R1-style Large Vision-Language Model [🖥️Code] | - | GRPO | Referring Expression Comprehension |
25.02 | R1-V: Reinforcing Super Generalization Ability in Vision Language Models with Less Than $3 [🖥️Code] | - | GRPO | Item Counting, Number Related Reasoning and Geometry Reasoning |
25.01 | Virgo: A Preliminary Exploration on Reproducing o1-like MLLM [📑Paper][🖥️Code] | 2k text data from R1/QwQ and visual data from QVQ/SD | - | Math & MMMU |
25.01 | InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model [📑Paper][🖥️Code] | - | PPO | Reward & Various VQA |
25.01 | LlamaV-o1: Rethinking Step-By-Step Visual Reasoning in LLMs [📑Paper][🖥️Code] | LLaVA-CoT-100k & PixMo subset | - | VRC-Bench & Various VQA |
24.12 | Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search [📑Paper][🖥️Code] | 260k reasoning and reflection sft data by Collective MCTS | - | Various VQA |
24.11 | LLaVA-CoT: Let Vision Language Models Reason Step-by-Step [📑Paper][🖥️Code] | LLaVA-CoT-100k by GPT-4o | - | Various VQA |
24.11 | Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models [📑Paper][🖥️Code] | SFT for agents | Iterative DPO | Various VQA |
24.11 | Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization [📑Paper] | - | MPO | Various VQA |
24.10 | Improve Vision Language Model Chain-of-thought Reasoning [📑Paper][🖥️Code] | 193k CoT SFT data by GPT-4o | DPO | Various VQA |
24.03 | Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning [📑Paper][🖥️Code] | visual chain-of-thought dataset comprising 438k data items | - | Various VQA |
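A large fraction of the entries above (R1-V, VLM-R1, Visual-RFT, EasyR1, among others) train with GRPO on rule-based rewards. As a reading aid, here is a minimal sketch of the two core pieces: a format-plus-accuracy reward and the group-relative advantage that replaces a learned critic. The function names and reward scheme are illustrative assumptions, not any listed project's actual code.

```python
import re
import torch

def rule_reward(response: str, ground_truth: str) -> float:
    """Toy rule-based reward: +0.5 for matching a <think>...<answer> format,
    +0.5 if the extracted answer equals the ground truth."""
    reward = 0.0
    if re.fullmatch(r"(?s)<think>.*</think>\s*<answer>.*</answer>", response.strip()):
        reward += 0.5
    match = re.search(r"<answer>(.*?)</answer>", response, re.S)
    if match and match.group(1).strip() == ground_truth.strip():
        reward += 0.5
    return reward

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """GRPO's group-relative advantage: normalize each sampled response's
    reward by its group's mean and std, so no value network is needed.
    rewards: [num_prompts, group_size]."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# One prompt, four sampled responses scored by the rule-based reward.
responses = [
    "<think>2+2=4</think><answer>4</answer>",
    "<think>guess</think><answer>5</answer>",
    "4",  # no tags: fails both the format and the answer-extraction checks
    "<think>count</think><answer>4</answer>",
]
r = torch.tensor([[rule_reward(x, "4") for x in responses]])
print(grpo_advantages(r))  # correct, well-formatted samples get positive advantage
```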
Date | Project | SFT | RL | Task |
---|---|---|---|---|
25.03 | R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning [📑Paper][🖥️Code] | cold start | GRPO | Emotion recognition |
25.02 | video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model [📑Paper] | cold start | DPO | Various video QA |
25.02 | Open-R1-Video[🖥️Code] | - | GRPO | LongVideoBench |
25.02 | Video-R1: Towards Super Reasoning Ability in Video Understanding [🖥️Code] | - | GRPO | DVD-counting |
25.01 | Temporal Preference Optimization for Long-Form Video Understanding [📑Paper][🖥️Code] | - | DPO | Various video QA |
25.01 | Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding [📑Paper][🖥️Code] | main training | DPO | Video caption & QA |
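Several of the video entries above (video-SALMONN-o1, Temporal Preference Optimization, Tarsier2) rely on DPO rather than on-policy RL. The sketch below shows the standard DPO loss (Rafailov et al., 2023) over preferred/rejected response log-probabilities; it is a generic illustration assuming precomputed sequence log-probs, not any of these projects' implementations.

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log p_theta(chosen | prompt)
    policy_rejected_logps: torch.Tensor,  # log p_theta(rejected | prompt)
    ref_chosen_logps: torch.Tensor,       # same, under the frozen reference model
    ref_rejected_logps: torch.Tensor,
    beta: float = 0.1,
) -> torch.Tensor:
    """Standard DPO objective: push the policy's chosen-vs-rejected log-ratio
    above the reference model's, scaled by beta."""
    policy_logratio = policy_chosen_logps - policy_rejected_logps
    ref_logratio = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_logratio - ref_logratio)).mean()

# Toy batch of two preference pairs (sequence log-probs are illustrative).
loss = dpo_loss(
    torch.tensor([-12.0, -9.5]), torch.tensor([-13.0, -9.0]),
    torch.tensor([-12.5, -9.4]), torch.tensor([-12.8, -9.2]),
)
print(loss)
```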
Date | Project | Comment |
---|---|---|
25.03 | GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing [📑Paper] | A reasoning-guided framework for visual generation and editing. |
25.02 | C-Drag: Chain-of-Thought Driven Motion Controller for Video Generation [📑Paper] | Computes simple motion vectors with an LLM. |
25.01 | Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step [📑Paper] | Potential Assessment Reward Model for autoregressive image generation. |
25.01 | Imagine while Reasoning in Space: Multimodal Visualization-of-Thought [📑Paper] | Generates intermediate visual thoughts (Visualization-of-Thought) during spatial reasoning. |
25.01 | ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding [📑Paper] | Draws edits (e.g., boxes and lines) on the image as intermediate reasoning steps. |
24.12 | EVLM: Self-Reflective Multimodal Reasoning for Cross-Dimensional Visual Editing [📑Paper] | Reasons in text space via a captioning model. |
Date | Project | Comment |
---|---|---|
23.02 | Multimodal Chain-of-Thought Reasoning in Language Models [📑Paper][🖥️Code] | Two-stage framework: rationale generation first, then answer inference conditioned on it. |
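The key design in Multimodal-CoT is that two-stage pipeline: generate a rationale first, then infer the answer conditioned on it. Below is a minimal sketch of that control flow, assuming a generic `generate` callable as a hypothetical stand-in for the fused vision-language model; the prompt templates are illustrative, not the paper's.

```python
from typing import Callable

def multimodal_cot(generate: Callable[[str], str], question: str) -> str:
    """Two-stage Multimodal-CoT inference: (1) rationale generation,
    (2) answer inference conditioned on the generated rationale."""
    rationale = generate(f"Question: {question}\nRationale:")
    return generate(f"Question: {question}\nRationale: {rationale}\nAnswer:")

# Toy usage with a stub generator; a real setup would call the MLLM,
# fusing vision features into both stages.
print(multimodal_cot(lambda prompt: "stub", "What is in the image?"))
```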
Date | Project | Task |
---|---|---|
25.03 | SCIVERSE: Unveiling the Knowledge Comprehension and Visual Reasoning of LMMs on Multi-modal Scientific Problems [📑Paper] | SCIVERSE |
25.03 | Integrating Chain-of-Thought for Multimodal Alignment: A Study on 3D Vision-Language Learning [📑Paper][Data] | 3D-CoT |
25.02 | MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models [📑Paper][🖥️Code] | MM-IQ |
25.02 | MM-RLHF: The Next Step Forward in Multimodal LLM Alignment [📑Paper] | MM-RLHF-RewardBench, MM-RLHF-SafetyBench |
25.02 | MME-CoT: Benchmarking Chain-of-Thought in LMMs for Reasoning Quality, Robustness, and Efficiency [📑Paper][🖥️Code] | MME-CoT |
25.02 | OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference [📑Paper][🖥️Code] | MM-AlignBench |
25.01 | LlamaV-o1: Rethinking Step-By-Step Visual Reasoning in LLMs [📑Paper][🖥️Code] | VRC-Bench |
24.11 | VL-RewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models [📑Paper] | VL-RewardBench |
24.05 | M3CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought [📑Paper] | M3CoT |
Date | Project | Comment |
---|---|---|
24.11 | VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection [📑Paper][🖥️Code] | Various video QA |