# Awesome-Multimodal-Reasoning

Contributions are most welcome. If you have any suggestions or improvements, feel free to open an issue or submit a pull request.

## Contents

## Model

### Image MLLM

| Date | Project | SFT | RL | Task |
|------|---------|-----|----|------|
| 25.03 | MetaSpatial [🖥️Code](https://github.com/PzySeere/MetaSpatial) | - | GRPO | 3D spatial reasoning |
| 25.03 | CMMCoT: Enhancing Complex Multi-Image Comprehension via Multi-Modal Chain-of-Thought and Memory Augmentation [📑Paper] | 260k SFT data | - | Multi-Image Benchmark |
| 25.03 | VisualPRM: An Effective Process Reward Model for Multimodal Reasoning [📑Paper][Model][Data][Benchmark] | - | Process Reward Model | Math & MMMU |
| 25.03 | R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization [📑Paper][Project website][🖥️Code] | 155k R1-OneVision | GRPO | Math |
| 25.03 | MMR1: Advancing the Frontiers of Multimodal Reasoning [🖥️Code] | - | GRPO | Math |
| 25.03 | (CVPR 2025) GFlowVLM: Enhancing Multi-step Reasoning in Vision-Language Models with Generative Flow Networks [📑Paper] | - | GFlowNets | NumberLine (NL) and BlackJack (BJ) |
| 25.03 | VisRL: Intention-Driven Visual Perception via Reinforced Reasoning [📑Paper][🖥️Code] | warm-up | DPO | Various VQA |
| 25.03 | Visual-RFT: Visual Reinforcement Fine-Tuning [📑Paper][🖥️Code] | - | GRPO | Detection, Grounding, Classification |
| 25.03 | LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL [📑Paper][🖥️Code] | - | PPO | Math, Sokoban-Global, Football-Online |
| 25.03 | Boosting the Generalization and Reasoning of Vision Language Models with Curriculum Reinforcement Learning [📑Paper] | Self-Improvement Training | GRPO | Detection, Classification, Math |
| 25.03 | Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models [📑Paper][🖥️Code] | - | GRPO | Math |
| 25.03 | Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement [📑Paper][🖥️Code] | - | GRPO | RefCOCO & ReasonSeg |
| 25.03 | R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model [📑Paper][🖥️Code] | - | GRPO | CVBench |
| 25.03 | MM-EUREKA: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning [📑Paper][🖥️Code] | - | RLOO | Math |
| 25.03 | Unified Reward Model for Multimodal Understanding and Generation [📑Paper][🖥️Code] | - | DPO | Various VQA & Generation |
| 25.03 | EasyR1: An Efficient, Scalable, Multi-Modality RL Training Framework [🖥️Code] | - | GRPO | Geometry3K |
| 25.02 | MM-RLHF: The Next Step Forward in Multimodal LLM Alignment [📑Paper][🖥️Code] | - | DPO with 120k fine-grained, human-annotated preference pairs | Reward & Various VQA |
| 25.02 | OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference [📑Paper][🖥️Code] | 200k SFT data | DPO | Alignment & Various VQA |
| 25.02 | Multimodal Open R1 [🖥️Code] | - | GRPO | MathVista-mini, MMMU |
| 25.02 | VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model [🖥️Code] | - | GRPO | Referring Expression Comprehension |
| 25.02 | R1-V: Reinforcing Super Generalization Ability in Vision Language Models with Less Than $3 [🖥️Code] | - | GRPO | Item Counting, Number-Related Reasoning, Geometry Reasoning |
| 25.01 | Virgo: A Preliminary Exploration on Reproducing o1-like MLLM [📑Paper][🖥️Code] | 2k text data from R1/QwQ and visual data from QvQ/SD | - | Math & MMMU |
| 25.01 | InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model [📑Paper][🖥️Code] | - | PPO | Reward & Various VQA |
| 25.01 | LlamaV-o1: Rethinking Step-By-Step Visual Reasoning in LLMs [📑Paper][🖥️Code] | LLaVA-CoT-100k & PixMo subset | - | VRC-Bench & Various VQA |
| 24.12 | Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search [📑Paper][🖥️Code] | 260k reasoning-and-reflection SFT data via Collective MCTS | - | Various VQA |
| 24.11 | LLaVA-CoT: Let Vision Language Models Reason Step-by-Step [📑Paper][🖥️Code] | LLaVA-CoT-100k by GPT-4o | - | Various VQA |
| 24.11 | Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models [📑Paper][🖥️Code] | SFT for agent | Iterative DPO | Various VQA |
| 24.11 | Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization [📑Paper] | - | MPO | Various VQA |
| 24.10 | Improve Vision Language Model Chain-of-thought Reasoning [📑Paper][🖥️Code] | 193k CoT SFT data by GPT-4o | DPO | Various VQA |
| 24.03 | Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning [📑Paper][🖥️Code] | 438k-item visual chain-of-thought dataset | - | Various VQA |
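Many entries above train with GRPO (Group Relative Policy Optimization), which drops PPO's learned value baseline: several completions are sampled per prompt, and each completion's advantage is its (often rule-based) reward normalized against the group's mean and standard deviation. A minimal sketch of that advantage computation, illustrative only and not taken from any repository listed here:

```python
import statistics

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: normalize each rollout's reward
    against the mean/std of its own group of rollouts."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:  # all rollouts scored identically -> no learning signal
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]

# Four rollouts for one prompt, scored by a rule-based verifier (1 = correct);
# correct rollouts get positive advantage, incorrect ones negative.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))
```

The normalized advantages then weight the usual clipped policy-gradient objective, so only relative quality within a group matters.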

### Video MLLM

| Date | Project | SFT | RL | Task |
|------|---------|-----|----|------|
| 25.03 | R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning [📑Paper][🖥️Code] | cold start | GRPO | Emotion recognition |
| 25.02 | video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model [📑Paper] | cold start | DPO | Various video QA |
| 25.02 | Open-R1-Video [🖥️Code] | - | GRPO | LongVideoBench |
| 25.02 | Video-R1: Towards Super Reasoning Ability in Video Understanding [🖥️Code] | - | GRPO | DVD-counting |
| 25.01 | Temporal Preference Optimization for Long-Form Video Understanding [📑Paper][🖥️Code] | - | DPO | Various video QA |
| 25.01 | Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding [📑Paper][🖥️Code] | main training | DPO | Video caption & QA |
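Several image and video entries instead use DPO (Direct Preference Optimization), which fine-tunes directly on preference pairs without a separate reward model: the loss rewards the policy for widening its log-probability margin on the chosen response, relative to a frozen reference model, over the rejected one. A minimal per-pair sketch, illustrative only (`beta` is the usual KL-strength hyperparameter):

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair:
    -log sigmoid(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l)))."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# With zero margin the loss is log(2); favoring the chosen answer lowers it.
print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
```

Variants in the tables above (Iterative DPO, MPO) build on the same pairwise margin.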

### Image/Video Generation

| Date | Project | Comment |
|------|---------|---------|
| 25.03 | GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing [📑Paper] | A reasoning-guided framework for generation and editing. |
| 25.02 | C-Drag: Chain-of-Thought Driven Motion Controller for Video Generation [📑Paper] | Computes simple motion vectors with an LLM. |
| 25.01 | Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step [📑Paper] | Potential Assessment Reward Model for AR image generation. |
| 25.01 | Imagine while Reasoning in Space: Multimodal Visualization-of-Thought [📑Paper] | Visualization-of-Thought. |
| 25.01 | ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding [📑Paper] | Draw something! |
| 24.12 | EVLM: Self-Reflective Multimodal Reasoning for Cross-Dimensional Visual Editing [📑Paper] | Thinking in text space with a caption model. |

### LLM

| Date | Project | Comment |
|------|---------|---------|
| 23.02 | Multimodal Chain-of-Thought Reasoning in Language Models [📑Paper][🖥️Code] | - |

## Benchmark

| Date | Project | Benchmark |
|------|---------|-----------|
| 25.03 | SCIVERSE: Unveiling the Knowledge Comprehension and Visual Reasoning of LMMs on Multi-modal Scientific Problems [📑Paper] | SCIVERSE |
| 25.03 | Integrating Chain-of-Thought for Multimodal Alignment: A Study on 3D Vision-Language Learning [📑Paper][Data] | 3D-CoT |
| 25.02 | MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models [📑Paper][🖥️Code] | MM-IQ |
| 25.02 | MM-RLHF: The Next Step Forward in Multimodal LLM Alignment [📑Paper] | MM-RLHF-RewardBench, MM-RLHF-SafetyBench |
| 25.02 | MME-CoT: Benchmarking Chain-of-Thought in LMMs for Reasoning Quality, Robustness, and Efficiency [📑Paper][🖥️Code] | MME-CoT |
| 25.02 | OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference [📑Paper][🖥️Code] | MM-AlignBench |
| 25.01 | LlamaV-o1: Rethinking Step-By-Step Visual Reasoning in LLMs [📑Paper][🖥️Code] | VRC-Bench |
| 24.11 | VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models [📑Paper] | VLRewardBench |
| 24.05 | M3CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought [📑Paper] | M3CoT |

## Data

| Date | Project | Comment |
|------|---------|---------|
| 24.11 | VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection [📑Paper][🖥️Code] | Various video QA |
