Gary-code/Awesome-LVLM-paper
# 😎 Awesome-LVLMs

## Related Collection

Our paper reading list:

| Topic | Description |
| --- | --- |
| LVLM Model | Large multimodal models / foundation models |
| Multimodal Benchmark & Dataset | 😍 Interesting multimodal benchmarks and datasets |
| LVLM Agent | Agents & applications of LVLMs |
| LVLM Hallucination | Benchmarks & methods for hallucination |

πŸ—οΈ LVLM Models

Title Venue/Date Note Code Picture
Star
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
NeurIPS 2023 InstructBLIP Github instrucblip
Star
Visual Instruction Tuning
NeurIPS 2023 LLaVA GitHub llava
Star
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
2023-04 mPLUG Github image-20241221163809570
Star
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
2023-04 MiniGPT-4 Github minigpt-4
Star
TextBind: Multi-turn Interleaved Multimodal Instruction-following
2023-09 TextBind Github textbind
Star
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
2023-09 BLIP-Diffusion Github blip-diffusion
Star
NExT-GPT: Any-to-Any Multimodal LLM
2023-09 NeXT-GPT Github next-gpt
Star
Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions
ICLR 2024 Multi-image Reasoning Github VPG
Star
Ferret: Refer and Ground Anything Anywhere at Any Granularity
ICLR 2024 Grounding Github ferret
Star
LLaVA-OneVision: Easy Visual Task Transfer
Technical Report 2024-7 LLaVA-OV: Blog with details Project image-20241221110841873
Star
Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution
Technical Report 2024-10 Qwen2-VL: Dynamic resolution & Multi-images & Video Github image-20241221105930185
Star
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
Technical Report 2024-12 Deepseek-VL2: MOE
Tiny: 1B, Small: 3B DeepSeek-VL2: 5B
Github image-20241221110551477
Star
DeepSeek-V3 Technical Report
Technical Report 2024-12 🧠 671B MoE parameters
πŸš€ 37B activated
πŸ“š 14.8T tokens
Blog
Project image-20241228113108108
Star
Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search
2024-12 Monte Carlo Tree Search
MLLM
Project image-20250103103936495
Star
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models
ICLR 2025 Long Image Sequence Project image-20250124000043437
Star
Temporal Reasoning Transfer from Text to Video
ICLR 2025 Temporal Video Project image-20250124000955736
Star
VideoWorld: Exploring Knowledge Learning from Unlabeled Videos
2025-01 Without LLM to Learning the Video Project image-20250124000524236
Star
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
2025-01 Video LLaMA Series Project image-20250124002058498

## 📆 Multimodal Benchmark & Dataset

| Title | Venue/Date | Note | Code |
| --- | --- | --- | --- |
| MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI | CVPR 2024 | 11K multimodal reasoning questions | Project |
| M3CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought | ACL 2024 | Multimodal CoT: multi-step visual-modal reasoning | Project |
| Learning to Correction: Explainable Feedback Generation for Visual Commonsense Reasoning Distractor | MM 2024 | Multimodal correction | GitHub |
| Right this way: Can VLMs Guide Us to See More to Answer Questions? | NeurIPS 2024 | For visually impaired people | GitHub |
| FIRE: A Dataset for Feedback Integration and Refinement Evaluation of Multimodal Models | NeurIPS 2024 | Multimodal refinement; 100K samples | Project |
| Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model | EMNLP 2024 | Abstract-image reasoning benchmark | Project |
| Math-PUMA: Progressive Upward Multimodal Alignment to Enhance Mathematical Reasoning | AAAI 2025 | Math reasoning & weak-to-strong data | Project |
| Multimodal Situational Safety | ICLR 2025 | Multimodal safety benchmark | Project |
| MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos | ICLR 2025 | MMMU for video QA | Project |
| MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans? | ICLR 2025 | High-resolution images | Project |
| 2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining | 2025-01 | Educational videos to textbook | Project |
| Holistic Evaluation for Interleaved Text-and-Image Generation | EMNLP 2024 | Interleaved text-image generation benchmark | Project |
| A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation | 2024-11 | Interleaved T-I generation; more scenarios; judge model | Project |
| An Enhanced MultiModal ReAsoning Benchmark | 2025-01 | Multimodal CoT | Project |

πŸŽ›οΈ LVLM Agent

Title Venue/Date Note Code Picture
Star
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
2023-03 MM-REACT Github mm-react
Star
Visual Programming: Compositional visual reasoning without training
CVPR 2023 Best Paper VISPROG (Similar to ViperGPT) Github vp
Star
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace
2023-03 HuggingfaceGPT Github huggingface-gpt
Star
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models
2023-04 Chameleon Github chameleon
Star
IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models
2023-05 IdealGPT Github ideal-gpt
Star
AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn
2023-06 AssistGPT Github assist-gpt
Star
A Picture Is Worth a Graph: A Blueprint Debate Paradigm for Multimodal Reasoning
ACM MM 2024 Multi-Agent Debate Github image-20241221111626526
Star
Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models
NeurIPS 2024 Draw to facilitate reasoning Project image-20241225110818819

## 🤕 LVLM Hallucination

| Title | Venue/Date | Note | Code |
| --- | --- | --- | --- |
| Evaluating Object Hallucination in Large Vision-Language Models | EMNLP 2023 | Simple object hallucination evaluation (POPE) | GitHub |
| Evaluation and Analysis of Hallucination in Large Vision-Language Models | 2023-10 | Hallucination evaluation (HaELM) | GitHub |
| Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | 2023-06 | GPT4-Assisted Visual Instruction Evaluation (GAVIE) & LRV-Instruction | GitHub |
| Woodpecker: Hallucination Correction for Multimodal Large Language Models | 2023-10 | First work to correct hallucinations in LVLMs | GitHub |
| Can We Edit Multimodal Large Language Models? | EMNLP 2023 | Knowledge-editing benchmark | GitHub |
| Grounding Visual Illusions in Language: Do Vision-Language Models Perceive Illusions Like Humans? | EMNLP 2023 | Do VLMs share human illusions? | GitHub |
| VL-RewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models | 2024-11 | Vision-language generative reward models | Project |

## About

😎 A list of papers about large multimodal models.
