Contributions are most welcome! If you have any suggestions or improvements, feel free to create an issue or open a pull request.
Date | Project | SFT | RL | Task |
---|---|---|---|---|
25.03 | MetaSpatial: Reinforcing 3D Spatial Reasoning in VLMs [🖥️Code](https://github.com/PzySeere/MetaSpatial) | - | GRPO | 3D spatial reasoning |
25.03 | CMMCoT: Enhancing Complex Multi-Image Comprehension via Multi-Modal Chain-of-Thought and Memory Augmentation [📑Paper] | 260k SFT data | - | Multi-Image Benchmark |
25.03 | VisualPRM: An Effective Process Reward Model for Multimodal Reasoning [📑Paper][Model][Data][Benchmark] | - | Process Reward Model | Math & MMMU |
25.03 | R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization [📑Paper][Project website][🖥️Code] | 155k R1-OneVision | GRPO | Math |
25.03 | MMR1: Advancing the Frontiers of Multimodal Reasoning [🖥️Code] | - | GRPO | Math |
25.03 (CVPR 2025) | GFlowVLM: Enhancing Multi-step Reasoning in Vision-Language Models with Generative Flow Networks [📑Paper] | - | GFlowNets | NumberLine (NL) and BlackJack (BJ) |
25.03 | VisRL: Intention-Driven Visual Perception via Reinforced Reasoning [📑Paper][🖥️Code] | warm-up | DPO | Various VQA |
25.03 | Visual-RFT: Visual Reinforcement Fine-Tuning [📑Paper][🖥️Code] | - | GRPO | Detection, Grounding, Classification |
25.03 | LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL [📑Paper][🖥️Code] | - | PPO | Math, Sokoban-Global, Football-Online |
25.03 | Boosting the Generalization and Reasoning of Vision Language Models with Curriculum Reinforcement Learning [📑Paper] | Self-Improvement Training | GRPO | Detection, Classification, Math |
25.03 | Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models [📑Paper][🖥️Code] | - | GRPO | Math |
25.03 | Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement [📑Paper][🖥️Code] | - | GRPO | RefCOCO&ReasonSeg |
25.03 | R1-Zero’s “Aha Moment” in Visual Reasoning on a 2B Non-SFT Model [📑Paper][🖥️Code] | - | GRPO | CVBench |
25.03 | MM-EUREKA: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning [📑Paper][🖥️Code] | - | RLOO | Math |
25.03 | Unified Reward Model for Multimodal Understanding and Generation [📑Paper][🖥️Code] | - | DPO | Various VQA & Generation |
25.03 | EasyR1: An Efficient, Scalable, Multi-Modality RL Training Framework [🖥️Code] | - | GRPO | Geometry3K |
25.02 | MM-RLHF: The Next Step Forward in Multimodal LLM Alignment [📑Paper][🖥️Code] | - | DPO with 120k fine-grained, human-annotated preference comparison pairs. | Reward & Various VQA |
25.02 | OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference [📑Paper][🖥️Code] | 200k sft data | DPO | Alignment & Various VQA |
25.02 | Multimodal Open R1 [🖥️Code] | - | GRPO | MathVista-mini, MMMU |
25.02 | VLM-R1: A stable and generalizable R1-style Large Vision-Language Model [🖥️Code] | - | GRPO | Referring Expression Comprehension |
25.02 | R1-V: Reinforcing Super Generalization Ability in Vision Language Models with Less Than $3 [🖥️Code] | - | GRPO | Item Counting, Number Related Reasoning and Geometry Reasoning |
25.01 | Virgo: A Preliminary Exploration on Reproducing o1-like MLLM [📑Paper][🖥️Code] | 2k text data from R1/QwQ and visual data from QVQ/SD | - | Math & MMMU |
25.01 | InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model [📑Paper][🖥️Code] | - | PPO | Reward & Various VQA |
25.01 | LlamaV-o1: Rethinking Step-By-Step Visual Reasoning in LLMs [📑Paper][🖥️Code] | LLaVA-CoT-100k & PixMo subset | - | VRC-Bench & Various VQA |
24.12 | Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search [📑Paper][🖥️Code] | 260k reasoning and reflection sft data by Collective MCTS | - | Various VQA |
24.11 | LLaVA-CoT: Let Vision Language Models Reason Step-by-Step [📑Paper][🖥️Code] | LLaVA-CoT-100k by GPT-4o | - | Various VQA |
24.11 | Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models [📑Paper][🖥️Code] | SFT for agents | Iterative DPO | Various VQA |
24.11 | Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization [📑Paper] | - | MPO | Various VQA |
24.10 | Improve Vision Language Model Chain-of-thought Reasoning [📑Paper][🖥️Code] | 193k CoT SFT data by GPT-4o | DPO | Various VQA |
24.03 | Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning [📑Paper][🖥️Code] | visual chain-of-thought dataset comprising 438k data items | - | Various VQA |
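A large fraction of the entries above (R1-V, VLM-R1, Visual-RFT, EasyR1, among others) train with GRPO on rule-based rewards. As a reading aid, here is a minimal sketch of the two core pieces: a format-plus-accuracy reward and the group-relative advantage that replaces a learned critic. The function names and reward scheme are illustrative assumptions, not any listed project's actual code.

```python
import re
import torch

def rule_reward(response: str, ground_truth: str) -> float:
    """Toy rule-based reward: +0.5 for matching a <think>...<answer> format,
    +0.5 if the extracted answer equals the ground truth."""
    reward = 0.0
    if re.fullmatch(r"(?s)<think>.*</think>\s*<answer>.*</answer>", response.strip()):
        reward += 0.5
    match = re.search(r"<answer>(.*?)</answer>", response, re.S)
    if match and match.group(1).strip() == ground_truth.strip():
        reward += 0.5
    return reward

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """GRPO's group-relative advantage: normalize each sampled response's
    reward by its group's mean and std, so no value network is needed.
    rewards: [num_prompts, group_size]."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# One prompt, four sampled responses scored by the rule-based reward.
responses = [
    "<think>2+2=4</think><answer>4</answer>",
    "<think>guess</think><answer>5</answer>",
    "4",  # no tags: fails both the format and the answer-extraction checks
    "<think>count</think><answer>4</answer>",
]
r = torch.tensor([[rule_reward(x, "4") for x in responses]])
print(grpo_advantages(r))  # correct, well-formatted samples get positive advantage
```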
Date | Project | SFT | RL | Task |
---|---|---|---|---|
25.03 | R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning [📑Paper][🖥️Code] | cold start | GRPO | Emotion recognition |
25.02 | video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model [📑Paper] | cold start | DPO | Various video QA |
25.02 | Open-R1-Video[🖥️Code] | - | GRPO | LongVideoBench |
25.02 | Video-R1: Towards Super Reasoning Ability in Video Understanding [🖥️Code] | - | GRPO | DVD-counting |
25.01 | Temporal Preference Optimization for Long-Form Video Understanding [📑Paper][🖥️Code] | - | DPO | Various video QA |
25.01 | Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding [📑Paper][🖥️Code] | main training | DPO | Video caption & QA |
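Several of the video entries above (video-SALMONN-o1, Temporal Preference Optimization, Tarsier2) rely on DPO rather than on-policy RL. The sketch below shows the standard DPO loss (Rafailov et al., 2023) over preferred/rejected response log-probabilities; it is a generic illustration assuming precomputed sequence log-probs, not any of these projects' implementations.

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log p_theta(chosen | prompt)
    policy_rejected_logps: torch.Tensor,  # log p_theta(rejected | prompt)
    ref_chosen_logps: torch.Tensor,       # same, under the frozen reference model
    ref_rejected_logps: torch.Tensor,
    beta: float = 0.1,
) -> torch.Tensor:
    """Standard DPO objective: push the policy's chosen-vs-rejected log-ratio
    above the reference model's, scaled by beta."""
    policy_logratio = policy_chosen_logps - policy_rejected_logps
    ref_logratio = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_logratio - ref_logratio)).mean()

# Toy batch of two preference pairs (sequence log-probs are illustrative).
loss = dpo_loss(
    torch.tensor([-12.0, -9.5]), torch.tensor([-13.0, -9.0]),
    torch.tensor([-12.5, -9.4]), torch.tensor([-12.8, -9.2]),
)
print(loss)
```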
Date | Project | Comment |
---|---|---|
25.03 | GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing [📑Paper] | A reasoning-guided framework for visual generation and editing. |
25.02 | C-Drag: Chain-of-Thought Driven Motion Controller for Video Generation [📑Paper] | Computes simple motion vectors with an LLM. |
25.01 | Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step [📑Paper] | Potential Assessment Reward Model for autoregressive image generation. |
25.01 | Imagine while Reasoning in Space: Multimodal Visualization-of-Thought [📑Paper] | Generates intermediate visual thoughts (Visualization-of-Thought) during spatial reasoning. |
25.01 | ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding [📑Paper] | Draws edits (e.g., boxes and lines) on the image as intermediate reasoning steps. |
24.12 | EVLM: Self-Reflective Multimodal Reasoning for Cross-Dimensional Visual Editing [📑Paper] | Reasons in text space via a captioning model. |
Date | Project | Comment |
---|---|---|
23.02 | Multimodal Chain-of-Thought Reasoning in Language Models [📑Paper][🖥️Code] | Two-stage framework: rationale generation first, then answer inference conditioned on it. |
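The key design in Multimodal-CoT is that two-stage pipeline: generate a rationale first, then infer the answer conditioned on it. Below is a minimal sketch of that control flow, assuming a generic `generate` callable as a hypothetical stand-in for the fused vision-language model; the prompt templates are illustrative, not the paper's.

```python
from typing import Callable

def multimodal_cot(generate: Callable[[str], str], question: str) -> str:
    """Two-stage Multimodal-CoT inference: (1) rationale generation,
    (2) answer inference conditioned on the generated rationale."""
    rationale = generate(f"Question: {question}\nRationale:")
    return generate(f"Question: {question}\nRationale: {rationale}\nAnswer:")

# Toy usage with a stub generator; a real setup would call the MLLM,
# fusing vision features into both stages.
print(multimodal_cot(lambda prompt: "stub", "What is in the image?"))
```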
Date | Project | Task |
---|---|---|
25.03 | SCIVERSE: Unveiling the Knowledge Comprehension and Visual Reasoning of LMMs on Multi-modal Scientific Problems [📑Paper] | SCIVERSE |
25.03 | Integrating Chain-of-Thought for Multimodal Alignment: A Study on 3D Vision-Language Learning [📑Paper][Data] | 3D-CoT |
25.02 | MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models [📑Paper][🖥️Code] | MM-IQ |
25.02 | MM-RLHF: The Next Step Forward in Multimodal LLM Alignment [📑Paper] | MM-RLHF-RewardBench, MM-RLHF-SafetyBench |
25.02 | MME-CoT: Benchmarking Chain-of-Thought in LMMs for Reasoning Quality, Robustness, and Efficiency [📑Paper][🖥️Code] | MME-CoT |
25.02 | OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference [📑Paper][🖥️Code] | MM-AlignBench |
25.01 | LlamaV-o1: Rethinking Step-By-Step Visual Reasoning in LLMs [📑Paper][🖥️Code] | VRC-Bench |
24.11 | VL-RewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models [📑Paper] | VL-RewardBench |
24.05 | M3CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought [📑Paper] | M3CoT |
Date | Project | Comment |
---|---|---|
24.11 | VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection [📑Paper][🖥️Code] | Various video QA |