Gary-code/Awesome-LVLM-paper
# 😎 Awesome-LVLMs

## Related Collection

Our paper reading list:

| Topic | Description |
| --- | --- |
| LVLM Model | Large multimodal models / foundation models |
| Multimodal Benchmark & Dataset | 😍 Interesting multimodal benchmarks and datasets |
| LVLM Agent | Agents & applications of LVLMs |
| LVLM Hallucination | Benchmarks & methods for hallucination |

πŸ—οΈ LVLM Models

Title Venue/Date Note Code Picture
Star
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
NeurIPS 2023 InstructBLIP Github instrucblip
Star
Visual Instruction Tuning
NeurIPS 2023 LLaVA GitHub llava
Star
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
2023-04 mPLUG Github image-20241221163809570
Star
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
2023-04 MiniGPT-4 Github minigpt-4
Star
TextBind: Multi-turn Interleaved Multimodal Instruction-following
2023-09 TextBind Github textbind
Star
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
2023-09 BLIP-Diffusion Github blip-diffusion
Star
NExT-GPT: Any-to-Any Multimodal LLM
2023-09 NeXT-GPT Github next-gpt
Star
Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions
ICLR 2024 Multi-image Reasoning Github VPG
Star
Ferret: Refer and Ground Anything Anywhere at Any Granularity
ICLR 2024 Grounding Github ferret
Star
LLaVA-OneVision: Easy Visual Task Transfer
Technical Report 2024-7 LLaVA-OV: Blog with details Project image-20241221110841873
Star
Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution
Technical Report 2024-10 Qwen2-VL: Dynamic resolution & Multi-images & Video Github image-20241221105930185
Star
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
Technical Report 2024-12 Deepseek-VL2: MOE
Tiny: 1B, Small: 3B DeepSeek-VL2: 5B
Github image-20241221110551477
Star
DeepSeek-V3 Technical Report
Technical Report 2024-12 🧠 671B MoE parameters
πŸš€ 37B activated
πŸ“š 14.8T tokens
Blog
Project image-20241228113108108
Star
Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search
2024-12 Monte Carlo Tree Search
MLLM
Project image-20250103103936495
Star
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models
ICLR 2025 Long Image Sequence Project image-20250124000043437
Star
Temporal Reasoning Transfer from Text to Video
ICLR 2025 Temporal Video Project image-20250124000955736
Star
VideoWorld: Exploring Knowledge Learning from Unlabeled Videos
2025-01 Without LLM to Learning the Video Project image-20250124000524236
Star
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
2025-01 Video LLaMA Series Project image-20250124002058498

## 📆 Multimodal Benchmark & Dataset

| Title | Venue/Date | Note | Code |
| --- | --- | --- | --- |
| MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI | CVPR 2024 | 11K multimodal reasoning questions | Project |
| M3CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought | ACL 2024 | Multimodal CoT: multi-step visual-modal reasoning | Project |
| Learning to Correction: Explainable Feedback Generation for Visual Commonsense Reasoning Distractor | MM 2024 | Multimodal correction | GitHub |
| Right this way: Can VLMs Guide Us to See More to Answer Questions? | NeurIPS 2024 | For visually impaired people | GitHub |
| FIRE: A Dataset for Feedback Integration and Refinement Evaluation of Multimodal Models | NeurIPS 2024 | Multimodal refinement; 100K samples | Project |
| Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model | EMNLP 2024 | Abstract-image reasoning benchmark | Project |
| Math-PUMA: Progressive Upward Multimodal Alignment to Enhance Mathematical Reasoning | AAAI 2025 | Math reasoning & weak-to-strong data | Project |
| Multimodal Situational Safety | ICLR 2025 | Multimodal safety benchmark | Project |
| MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos | ICLR 2025 | MMMU for video QA | Project |
| MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans? | ICLR 2025 | High-resolution images | Project |
| 2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining | 2025-01 | Educational videos to textbook | Project |
| Holistic Evaluation for Interleaved Text-and-Image Generation | EMNLP 2024 | Interleaved text-image generation benchmark | Project |
| A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation | 2024-11 | Interleaved T-I generation; more scenarios; judge model | Project |
| An Enhanced MultiModal ReAsoning Benchmark | 2025-01 | Multimodal CoT | Project |

πŸŽ›οΈ LVLM Agent

Title Venue/Date Note Code Picture
Star
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
2023-03 MM-REACT Github mm-react
Star
Visual Programming: Compositional visual reasoning without training
CVPR 2023 Best Paper VISPROG (Similar to ViperGPT) Github vp
Star
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace
2023-03 HuggingfaceGPT Github huggingface-gpt
Star
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models
2023-04 Chameleon Github chameleon
Star
IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models
2023-05 IdealGPT Github ideal-gpt
Star
AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn
2023-06 AssistGPT Github assist-gpt
Star
A Picture Is Worth a Graph: A Blueprint Debate Paradigm for Multimodal Reasoning
ACM MM 2024 Multi-Agent Debate Github image-20241221111626526
Star
Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models
NeurIPS 2024 Draw to facilitate reasoning Project image-20241225110818819

## 🤕 LVLM Hallucination

| Title | Venue/Date | Note | Code |
| --- | --- | --- | --- |
| Evaluating Object Hallucination in Large Vision-Language Models | EMNLP 2023 | Simple object hallucination evaluation (POPE) | GitHub |
| Evaluation and Analysis of Hallucination in Large Vision-Language Models | 2023-10 | Hallucination evaluation (HaELM) | GitHub |
| Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | 2023-06 | GPT4-Assisted Visual Instruction Evaluation (GAVIE) & LRV-Instruction | GitHub |
| Woodpecker: Hallucination Correction for Multimodal Large Language Models | 2023-10 | First work to correct hallucinations in LVLMs | GitHub |
| Can We Edit Multimodal Large Language Models? | EMNLP 2023 | Knowledge-editing benchmark | GitHub |
| Grounding Visual Illusions in Language: Do Vision-Language Models Perceive Illusions Like Humans? | EMNLP 2023 | Do VLMs share human illusions? | GitHub |
| VL-RewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models | 2024-11 | Vision-language generative reward models | Project |

## About

😎 A list of papers about large multimodal models.
