InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output (2024.07.03)
Pan Zhang, Xiao-wen Dong, Yuhang Zang, Yuhang Cao, Rui Qian, etc
LLaRA: Supercharging Robot Learning Data for Vision-Language Policy (2024.06.28)
Xiang Li, Cristina Mata, Jong Sung Park, Kumara Kahatapitiya, Yoo Sung Jang, etc
Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs (2024.06.28)
Sukmin Yun, Haokun Lin, Rusiru Thushara, Mohammad Qazim Bhat, Yongxin Wang, etc
LLaVolta: Efficient Multi-modal Models via Stage-wise Visual Context Compression (2024.06.28)
Jieneng Chen, Luoxin Ye, Ju He, Zhao-Yang Wang, Daniel Khashabi, etc
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs (2024.06.24)
Shengbang Tong, Ellis L Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, etc
VoCo-LLaMA: Towards Vision Compression with Large Language Models (2024.06.18)
Xubing Ye, Yukang Gan, Xiaoke Huang, Yixiao Ge, Ying Shan, etc
Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models (2024.06.12)
Yi-Fan Zhang, Qingsong Wen, Chaoyou Fu, Xue Wang, Zhang Zhang, etc
An Empirical Study on Parameter-Efficient Fine-Tuning for MultiModal Large Language Models (2024.06.07)
Xiongtao Zhou, Jie He, Yuhua Ke, Guangyao Zhu, Víctor Gutiérrez-Basulto, etc
Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning (2024.06.04)
Alex Jinpeng Wang, Linjie Li, Yiqi Lin, Min Li, Lijuan Wang, etc
DeCo: Decoupling Token Compression from Semantic Abstraction in Multimodal Large Language Models (2024.05.31)
Linli Yao, Lei Li, Shuhuai Ren, Lean Wang, Yuanxin Liu, etc . - 【arXiv.org】
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis (2024.05.31)
Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, etc . - 【arXiv.org】
ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models (2024.05.24)
Chunjiang Ge, Sijie Cheng, Ziming Wang, Jiale Yuan, Yuan Gao, etc . - 【arXiv.org】
Prompt-Aware Adapter: Towards Learning Adaptive Visual Tokens for Multimodal Large Language Models (2024.05.24)
Yue Zhang, Hehe Fan, Yi Yang . - 【arXiv.org】
Probing Multimodal LLMs as World Models for Driving (2024.05.09)
Shiva Sreeram, T. Wang, Alaa Maalouf, G. Rosman, S. Karaman, etc . - 【arXiv.org】
Memory-Space Visual Prompting for Efficient Vision-Language Fine-Tuning (2024.05.09)
Shibo Jie, Yehui Tang, Ning Ding, Zhi-Hong Deng, Kai Han, etc . - 【arXiv.org】
Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers (2024.05.09)
Peng Gao, Le Zhuo, Ziyi Lin, Chris Liu, Junsong Chen, etc . - 【arXiv.org】
Auto-Encoding Morph-Tokens for Multimodal LLM (2024.05.03)
Kaihang Pan, Siliang Tang, Juncheng Li, Zhaoyu Fan, Wei Chow, etc . - 【arXiv.org】
EALD-MLLM: Emotion Analysis in Long-sequential and De-identity videos with Multi-modal Large Language Model (2024.05.01)
Deng Li, Xin Liu, Bohao Xing, Baiqiang Xia, Yuan Zong, etc . - 【arXiv.org】
CofiPara: A Coarse-to-fine Paradigm for Multimodal Sarcasm Target Identification with Large Multimodal Models (2024.05.01)
Hongzhan Lin, Zixin Chen, Ziyang Luo, Mingfei Cheng, Jing Ma, etc . - 【arXiv.org】
Training-Free Unsupervised Prompt for Vision-Language Models (2024.04.25)
Sifan Long, Linbin Wang, Zhen Zhao, Zichang Tan, Yiming Wu, etc . - 【arXiv.org】
AAPL: Adding Attributes to Prompt Learning for Vision-Language Models (2024.04.25)
Gahyeon Kim, Sohee Kim, Seokju Lee . - 【arXiv.org】
Cantor: Inspiring Multimodal Chain-of-Thought of MLLM (2024.04.24)
Timin Gao, Peixian Chen, Mengdan Zhang, Chaoyou Fu, Yunhang Shen, etc . - 【arXiv.org】
MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI (2024.04.24)
Kaining Ying, Fanqing Meng, Jin Wang, Zhiqiang Li, Han Lin, etc . - 【arXiv.org】
Unified Scene Representation and Reconstruction for 3D Large Language Models (2024.04.19)
Tao Chu, Pan Zhang, Xiao-wen Dong, Yuhang Zang, Qiong Liu, etc . - 【arXiv.org】
BRAVE: Broadening the visual encoding of vision-language models (2024.04.10)
Oğuzhan Fatih Kar, A. Tonioni, Petra Poklukar, Achin Kulshrestha, Amir Zamir, etc
ORacle: Large Vision-Language Models for Knowledge-Guided Holistic OR Domain Modeling (2024.04.10)
Ege Ozsoy, Chantal Pellegrini, Matthias Keicher, Nassir Navab
MoMA: Multimodal LLM Adapter for Fast Personalized Image Generation (2024.04.08)
Kunpeng Song, Yizhe Zhu, Bingchen Liu, Qing Yan, A. Elgammal, etc
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs (2024.04.08)
Keen You, Haotian Zhang, E. Schoop, Floris Weers, Amanda Swearngin, etc
MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens (2024.04.04)
Kirolos Ataallah, Xiaoqian Shen, Eslam Abdelrahman, Essam Sleiman, Deyao Zhu, etc
ViTamin: Designing Scalable Vision Models in the Vision-Language Era (2024.04.02)
Jieneng Chen, Qihang Yu, Xiaohui Shen, Alan Yuille, Liang-Chieh Chen
Segment Any 3D Object with Language (2024.04.02)
Seungjun Lee, Yuyang Zhao, Gim Hee Lee
Iterated Learning Improves Compositionality in Large Vision-Language Models (2024.04.02)
Chenhao Zheng, Jieyu Zhang, Aniruddha Kembhavi, Ranjay Krishna
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models (2024.03.27)
Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, etc . - 【arXiv.org】
Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models (2024.03.25)
Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, etc
Hierarchical Text-to-Vision Self Supervised Alignment for Improved Histopathology Representation Learning (2024.03.21)
Hasindri Watawana, Kanchana Ranasinghe, Tariq Mahmood, Muzammal Naseer, Salman Khan, etc
MyVLM: Personalizing VLMs for User-Specific Queries (2024.03.21)
Yuval Alaluf, Elad Richardson, Sergey Tulyakov, Kfir Aberman, Daniel Cohen-Or
MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? (2024.03.21)
Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, etc
PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model (2024.03.21)
Zheng Zhang, Yeyao Ma, Enming Zhang, Xiang Bai
SC-Tune: Unleashing Self-Consistent Referential Comprehension in Large Vision Language Models (2024.03.20)
Tongtian Yue, Jie Cheng, Longteng Guo, Xingyuan Dai, Zijia Zhao, etc
The First to Know: How Token Distributions Reveal Hidden Knowledge in Large Vision-Language Models? (2024.03.14)
Qinyu Zhao, Ming Xu, Kartik Gupta, Akshay Asthana, Liang Zheng, etc
3D-VLA: A 3D Vision-Language-Action Generative World Model (2024.03.14)
Haoyu Zhen, Xiaowen Qiu, Peihao Chen, Jincheng Yang, Xin Yan, etc
UniCode: Learning a Unified Codebook for Multimodal Large Language Models (2024.03.14)
Sipeng Zheng, Bohan Zhou, Yicheng Feng, Ye Wang, Zongqing Lu
DeepSeek-VL: Towards Real-World Vision-Language Understanding (2024.03.08)
Haoyu Lu, Wen Liu, Bo Zhang, Bing-Li Wang, Kai Dong, etc
VLM-PL: Advanced Pseudo Labeling approach Class Incremental Object Detection with Vision-Language Model (2024.03.08)
Junsu Kim, Yunhoe Ku, Jihyeon Kim, Junuk Cha, Seungryul Baek
TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document (2024.03.07)
Yuliang Liu, Biao Yang, Qiang Liu, Zhang Li, Zhiyin Ma, etc
Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models (2024.03.05)
Gen Luo, Yiyi Zhou, Yuxin Zhang, Xiawu Zheng, Xiaoshuai Sun, etc
RegionGPT: Towards Region Understanding Vision Language Model (2024.03.04)
Qiushan Guo, Shalini De Mello, Hongxu Yin, Wonmin Byeon, Ka Chun Cheung, etc
Non-autoregressive Sequence-to-Sequence Vision-Language Models (2024.03.04)
Kunyu Shi, Qi Dong, Luis Goncalves, Zhuowen Tu, S. Soatto
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers (2024.02.29)
Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, etc
Tower: An Open Multilingual Large Language Model for Translation-Related Tasks (2024.02.27)
Duarte M. Alves, José P. Pombal, Nuno M. Guerreiro, Pedro H. Martins, João Alves, etc . - 【arXiv.org】
ShapeLLM: Universal 3D Object Understanding for Embodied Interaction (2024.02.27)
Zekun Qi, Runpei Dong, Shaochen Zhang, Haoran Geng, Chunrui Han, etc . - 【arXiv.org】
VRP-SAM: SAM with Visual Reference Prompt (2024.02.27)
Yanpeng Sun, Jiahui Chen, Shan Zhang, Xinyu Zhang, Qiang Chen, etc . - 【arXiv.org】
GROUNDHOG: Grounding Large Language Models to Holistic Segmentation (2024.02.26)
Yichi Zhang, Ziqiao Ma, Xiaofeng Gao, Suhaila Shakiah, Qiaozi Gao, etc
Genie: Generative Interactive Environments (2024.02.23)
Jake Bruce, Michael Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, etc
LLMBind: A Unified Modality-Task Integration Framework (2024.02.22)
Bin Zhu, Peng Jin, Munan Ning, Bin Lin, Jinfa Huang, etc . - 【arXiv.org】
How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts (2024.02.20)
Yusu Qian, Haotian Zhang, Yinfei Yang, Zhe Gan
Video ReCap: Recursive Captioning of Hour-Long Videos (2024.02.20)
Md Mohaiminul Islam, Ngan Ho, Xitong Yang, Tushar Nagarajan, Lorenzo Torresani, etc
An Empirical Study Into What Matters for Calibrating Vision-Language Models (2024.02.12)
Weijie Tu, Weijian Deng, Dylan Campbell, Stephen Gould, Tom Gedeon . - 【arXiv.org】
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models (2024.02.08)
Peng Gao, Renrui Zhang, Chris Liu, Longtian Qiu, Siyuan Huang, etc . - 【arXiv.org】
Binding Touch to Everything: Learning Unified Multimodal Tactile Representations (2024.01.31)
Fengyu Yang, Chao Feng, Ziyang Chen, Hyoungseob Park, Daniel Wang, etc . - 【arXiv.org】
InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model (2024.01.29)
Xiao-wen Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, etc . - 【arXiv.org】
Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities (2024.01.25)
Yiyuan Zhang, Xiaohan Ding, Kaixiong Gong, Yixiao Ge, Ying Shan, etc . - 【arXiv.org】
MM-LLMs: Recent Advances in MultiModal Large Language Models (2024.01.24)
Duzhen Zhang, Yahan Yu, Chenxing Li, Jiahua Dong, Dan Su, etc . - 【arXiv.org】
Incorporating Visual Experts to Resolve the Information Loss in Multimodal Large Language Models (2024.01.06)
Xin He, Longhui Wei, Lingxi Xie, Qi Tian . - 【arXiv.org】
Learning to Prompt with Text Only Supervision for Vision-Language Models (2024.01.04)
Muhammad Uzair Khattak, Muhammad Ferjad Naeem, Muzammal Naseer, L. V. Gool, F. Tombari . - 【arXiv.org】
Instruct-Imagen: Image Generation with Multi-modal Instruction (2024.01.03)
Hexiang Hu, Kelvin C.K. Chan, Yu-Chuan Su, Wenhu Chen, Yandong Li, etc . - 【arXiv.org】
G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model (2023.12.18)
Jiahui Gao, Renjie Pi, Jipeng Zhang, Jiacheng Ye, Wanjun Zhong, etc
Lyrics: Boosting Fine-grained Language-Vision Alignment and Comprehension via Semantic-aware Visual Objects (2023.12.08)
Junyu Lu, Ruyi Gan, Di Zhang, Xiaojun Wu, Ziwei Wu, etc . - 【arXiv.org】
LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models (2023.12.05)
Hao Zhang, Hongyang Li, Feng Li, Tianhe Ren, Xueyan Zou, etc
Sequential Modeling Enables Scalable Learning for Large Vision Models (2023.12.01)
Yutong Bai, Xinyang Geng, K. Mangalam, Amir Bar, Alan Yuille, etc
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models (2023.11.28)
Yanwei Li, Chengyao Wang, Jiaya Jia . - 【arXiv.org】
MeshGPT: Generating Triangle Meshes with Decoder-Only Transformers (2023.11.27)
Yawar Siddiqui, A. Alliegro, Alexey Artemov, Tatiana Tommasi, Daniele Sirigatti, etc . - 【arXiv.org】
An Embodied Generalist Agent in 3D World (2023.11.18)
Jiangyong Huang, Silong Yong, Xiaojian Ma, Xiongkun Linghu, Puhao Li, etc . - 【arXiv.org】
Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning (2023.11.17)
Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, S. Azadi, etc . - 【arXiv.org】
MedAgents: Large Language Models as Collaborators for Zero-shot Medical Reasoning (2023.11.16)
Xiangru Tang, Anni Zou, Zhuosheng Zhang, Yilun Zhao, Xingyao Zhang, etc . - 【arXiv.org】
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding (2023.11.14)
Peng Jin, Ryuichi Takanobu, Caiwan Zhang, Xiaochun Cao, Li Yuan . - 【arXiv.org】
EviPrompt: A Training-Free Evidential Prompt Generation Method for Segment Anything Model in Medical Images (2023.11.10)
Yinsong Xu, Jiaqi Tang, Aidong Men, Qingchao Chen . - 【arXiv.org】
u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model (2023.11.09)
Jinjin Xu, Liwu Xu, Yuzhe Yang, Xiang Li, Yanchun Xie, etc . - 【arXiv.org】
Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges (2023.11.06)
Chenhang Cui, Yiyang Zhou, Xinyu Yang, Shirley Wu, Linjun Zhang, etc . - 【arXiv.org】
Levels of AGI: Operationalizing Progress on the Path to AGI (2023.11.04)
Meredith Ringel Morris, Jascha Narain Sohl-Dickstein, Noah Fiedel, T. Warkentin, Allan Dafoe, etc . - 【arXiv.org】
Woodpecker: Hallucination Correction for Multimodal Large Language Models (2023.10.24)
Shukang Yin, Chaoyou Fu, Sirui Zhao, Tong Xu, Hao Wang, etc
3D-GPT: Procedural 3D Modeling with Large Language Models (2023.10.19)
Chunyi Sun, Junlin Han, Weijian Deng, Xinlong Wang, Zishan Qin, etc . - 【arXiv.org】
BiLL-VTG: Bridging Large Language Models and Lightweight Visual Tools for Video-based Texts Generation (2023.10.16)
Ji Qi, Kaixuan Ji, Jifan Yu, Duokang Wang, Bin Xu, etc . - 【arXiv.org】
MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens (2023.10.03)
Kaizhi Zheng, Xuehai He, Xin Eric Wang . - 【arXiv.org】
Kosmos-2.5: A Multimodal Literate Model (2023.09.20)
Tengchao Lv, Yupan Huang, Jingye Chen, Lei Cui, Shuming Ma, etc . - 【arXiv.org】
Investigating the Catastrophic Forgetting in Multimodal Large Language Models (2023.09.19)
Yuexiang Zhai, Shengbang Tong, Xiao Li, Mu Cai, Qing Qu, etc . - 【arXiv.org】
Physically Grounded Vision-Language Models for Robotic Manipulation (2023.09.05)
Jensen Gao, Bidipta Sarkar, F. Xia, Ted Xiao, Jiajun Wu, etc
Point-Bind & Point-LLM: Aligning Point Cloud with Multi-modality for 3D Understanding, Generation, and Instruction Following (2023.09.01)
Ziyu Guo, Renrui Zhang, Xiangyang Zhu, Yiwen Tang, Xianzheng Ma, etc
PointLLM: Empowering Large Language Models to Understand Point Clouds (2023.08.31)
Runsen Xu, Xiaolong Wang, Tai Wang, Yilun Chen, Jiangmiao Pang, etc . - 【arXiv.org】
PE-MED: Prompt Enhancement for Interactive Medical Image Segmentation (2023.08.26)
Ao Chang, Xing Tao, Xin Yang, Yuhao Huang, Xinrui Zhou, etc . - 【arXiv.org】
SeamlessM4T-Massively Multilingual & Multimodal Machine Translation (2023.08.22)
Seamless Communication, Loïc Barrault, Yu-An Chung, Mariano Cora Meglioli, David Dale, etc . - 【arXiv.org】
Chat-3D: Data-efficiently Tuning Large Language Model for Universal Dialogue of 3D Scenes (2023.08.17)
Zehan Wang, Haifeng Huang, Yang Zhao, Ziang Zhang, Zhou Zhao . - 【arXiv.org】
VisIT-Bench: A Benchmark for Vision-Language Instruction Following Inspired by Real-World Use (2023.08.12)
Yonatan Bitton, Hritik Bansal, Jack Hessel, Rulin Shao, Wanrong Zhu, etc . - 【arXiv.org】
3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment (2023.08.08)
Ziyu Zhu, Xiaojian Ma, Yixin Chen, Zhidong Deng, Siyuan Huang, etc . - 【arXiv.org】
UniVTG: Towards Unified Video-Language Temporal Grounding (2023.07.31)
Kevin Lin, Pengchuan Zhang, Joya Chen, Shraman Pramanick, Difei Gao, etc . - 【arXiv.org】
Med-Flamingo: a Multimodal Medical Few-shot Learner (2023.07.27)
Michael Moor, Qian Huang, Shirley Wu, Michihiro Yasunaga, C. Zakka, etc . - 【arXiv.org】
OBJECT 3DIT: Language-guided 3D-aware Image Editing (2023.07.20)
Oscar Michel, Anand Bhattad, Eli VanderBilt, Ranjay Krishna, Aniruddha Kembhavi, etc . - 【arXiv.org】
Brain2Music: Reconstructing Music from Human Brain Activity (2023.07.20)
Timo I. Denk, Yu Takagi, Takuya Matsuyama, A. Agostinelli, Tomoya Nakai, etc . - 【arXiv.org】
(Ab)using Images and Sounds for Indirect Instruction Injection in Multi-Modal LLMs (2023.07.19)
Eugene Bagdasaryan, Tsung-Yin Hsieh, Ben Nassi, Vitaly Shmatikov . - 【arXiv.org】
MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots (2023.07.16)
Gelei Deng, Yi Liu, Yuekang Li, Kailong Wang, Ying Zhang, etc
HyperDreamBooth: HyperNetworks for Fast Personalization of Text-to-Image Models (2023.07.13)
Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Wei Wei, Tingbo Hou, etc
Multimodal Prompt Learning for Product Title Generation with Extremely Limited Labels (2023.07.05)
Bang Yang, Fenglin Liu, Zheng Li, Qingyu Yin, Chenyu You, etc . - 【Annual Meeting of the Association for Computational Linguistics】
Multimodal Prompt Retrieval for Generative Visual Question Answering (2023.06.30)
Timothy Ossowski, Junjie Hu . - 【arXiv.org】
SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs (2023.06.30)
Lijun Yu, Yong Cheng, Zhiruo Wang, Vivek Kumar, Wolfgang Macherey, etc . - 【arXiv.org】
Multimodal Prompt Learning in Emotion Recognition Using Context and Audio Information (2023.06.28)
Eunseo Jeong, Gyu-Min Kim, Sangwoo Kang . - 【Mathematics】
Palm: Predicting Actions through Language Models @ Ego4D Long-Term Action Anticipation Challenge 2023 (2023.06.28)
Daoji Huang, Otmar Hilliges, L. Gool, Xi Wang . - 【arXiv.org】
Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic (2023.06.27)
Ke Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, etc . - 【arXiv.org】
DocEdit: Language-Guided Document Editing (2023.06.26)
Puneet Mathur, R. Jain, Jiuxiang Gu, Franck Dernoncourt, Dinesh Manocha, etc . - 【AAAI Conference on Artificial Intelligence】
PromptIR: Prompting for All-in-One Blind Image Restoration (2023.06.22)
Vaishnav Potlapalli, Syed Waqas Zamir, Salman Khan, F. Khan
A. Youssef . - 【Ultrasound in Obstetrics and Gynecology】
Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration (2023.06.15)
Chenyang Lyu, Minghao Wu, Longyue Wang, Xinting Huang, Bingshuai Liu, etc . - 【arXiv.org】
Towards AGI in Computer Vision: Lessons Learned from GPT and Large Language Models (2023.06.14)
Lingxi Xie, Longhui Wei, Xiaopeng Zhang, Kaifeng Bi, Xiaotao Gu, etc . - 【arXiv.org】
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding (2023.06.05)
Han Zhang, Xin Li, Lidong Bing . - 【arXiv.org】
HeadSculpt: Crafting 3D Head Avatars with Text (2023.06.05)
Xiaoping Han, Yukang Cao, K. Han, Xiatian Zhu, Jiankang Deng, etc . - 【arXiv.org】
Meta-Learning For Vision-and-Language Cross-lingual Transfer (2023.05.24)
Hanxu Hu, Frank Keller
LayoutGPT: Compositional Visual Planning and Generation with Large Language Models (2023.05.24)
Weixi Feng, Wanrong Zhu, Tsu-jui Fu, Varun Jampani, Arjun Akula, etc
Geewook Kim, Hodong Lee, Daehee Kim, Haeji Jung, Sanghee Park, etc
EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought (2023.05.24)
Yao Mu, Qinglong Zhang, Mengkang Hu, Wenhai Wang, Mingyu Ding, etc
OverPrompt: Enhancing ChatGPT Capabilities through an Efficient In-Context Learning Approach (2023.05.24)
Jiazheng Li, Runcong Zhao, Yulan He, Lin Gui
In-Context Demonstration Selection with Cross Entropy Difference (2023.05.24)
Dan Iter, Reid Pryzant, Ruochen Xu, Shuohang Wang, Yang Liu, etc
Abductive Commonsense Reasoning Exploiting Mutually Exclusive Explanations (2023.05.24)
Wenting Zhao, Justin T. Chiu, Claire Cardie, Alexander M. Rush
Goat: Fine-tuned LLaMA Outperforms GPT-4 on Arithmetic Tasks (2023.05.23)
Tiedong Liu, Bryan Kian Hsiang Low
Masked Path Modeling for Vision-and-Language Navigation (2023.05.23)
Zi-Yi Dou, Feng Gao, Nanyun Peng . - 【arXiv.org】
ChatCoT: Tool-Augmented Chain-of-Thought Reasoning on Chat-based Large Language Models (2023.05.23)
Z. Chen, Kun Zhou, Beichen Zhang, Zheng Gong, Wayne Xin Zhao, etc
LM-Switch: Lightweight Language Model Conditioning in Word Embedding Space (2023.05.22)
Chi Han, Jialiang Xu, Manling Li, Y. Fung, Chenkai Sun, etc
Enhancing Cross-lingual Natural Language Inference by Soft Prompting with Multilingual Verbalizer (2023.05.22)
Shuang Li, Xuming Hu, Aiwei Liu, Yawen Yang, Fukun Ma, etc
A Benchmark on Extremely Weakly Supervised Text Classification: Reconcile Seed Matching and Prompting Approaches (2023.05.22)
Zihan Wang, Tianle Wang, Dheeraj Mekala, Jingbo Shang
Enhance Reasoning Ability of Visual-Language Models via Large Language Models (2023.05.22)
Yueting Yang, Xintong Zhang, Wenjuan Han
VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending (2023.05.22)
Xingjian He, Sihan Chen, Fan Ma, Zhicheng Huang, Xiaojie Jin, etc . - 【arXiv.org】
InstructVid2Vid: Controllable Video Editing with Natural Language Instructions (2023.05.21)
Bosheng Qin, Juncheng Li, Siliang Tang, Tat-Seng Chua, Yueting Zhuang
Logic-LM: Empowering Large Language Models with Symbolic Solvers for Faithful Logical Reasoning (2023.05.20)
Liangming Pan, Alon Albalak, Xinyi Wang, William Yang Wang
LogiCoT: Logical Chain-of-Thought Instruction-Tuning Data Collection with GPT-4 (2023.05.20)
Hanmeng Liu, Zhiyang Teng, Leyang Cui, Chaoli Zhang, Qiji Zhou, etc
SelfzCoT: a Self-Prompt Zero-shot CoT from Semantic-level to Code-level for a Better Utilization of LLMs (2023.05.19)
IokTong Lei, ZhiDong Deng . - 【arXiv.org】
Controlling the Extraction of Memorized Data from Large Language Models via Prompt-Tuning (2023.05.19)
Mustafa Safa Ozdayi, Charith S. Peris, Jack G. M. FitzGerald, Christophe Dupuy, Jimit Majmudar, etc . - 【arXiv.org】
RCOT: Detecting and Rectifying Factual Inconsistency in Reasoning by Reversing Chain-of-Thought (2023.05.19)
Tianci Xue, Ziqi Wang, Zhenhailong Wang, Chi Han, Pengfei Yu, etc . - 【arXiv.org】
TreePrompt: Learning to Compose Tree Prompts for Explainable Visual Grounding (2023.05.19)
Chenchi Zhang, Jun Xiao, Lei Chen, Jian Shao, Long Chen . - 【arXiv.org】
Efficient Prompting via Dynamic In-Context Learning (2023.05.18)
Wangchunshu Zhou, Yuchen Jiang, Ryan Cotterell, Mrinmaya Sachan . - 【arXiv.org】
MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers (2023.05.12)
L. Yu, Daniel Simig, Colin Flaherty, Armen Aghajanyan, Luke Zettlemoyer, etc
Prompt Tuning Inversion for Text-Driven Image Editing Using Diffusion Models (2023.05.08)
Wenkai Dong, Song Xue, Xiaoyue Duan, Shumin Han . - 【arXiv.org】
Prompt What You Need: Enhancing Segmentation in Rainy Scenes with Anchor-based Prompting (2023.05.06)
Xiaoyuan Guo, Xiang Wei, Q. Su, Hui-Huang Zhao, Shunli Zhan . - 【arXiv.org】
Edit Everything: A Text-Guided Generative System for Images Editing (2023.04.27)
Defeng Xie, Ruichen Wang, Jian Ma, Chen Chen, Haonan Lu, etc
ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System (2023.04.27)
Junke Wang, Dongdong Chen, Chong Luo, Xiyang Dai, Lu Yuan, etc
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality (2023.04.27)
Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, etc
Promptify: Text-to-Image Generation through Interactive Prompt Exploration with Large Language Models (2023.04.18)
Stephen Brade, Bryan Wang, Maurício Sousa, Sageev Oore, Tovi Grossman . - 【arXiv.org】
Towards Robust Prompts on Vision-Language Models (2023.04.17)
Jindong Gu, A. Beirami, Xuezhi Wang, Alex Beutel, Philip H. S. Torr, etc
Visual Instruction Tuning (2023.04.17)
Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee
Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved With Text (2023.04.14)
Wanrong Zhu, Jack Hessel, Anas Awadalla, S. Gadre, Jesse Dodge, etc
Segment Everything Everywhere All at Once (2023.04.13)
Xueyan Zou, Jianwei Yang, Hao Zhang, Feng Li, Linjie Li, etc . - 【arXiv.org】
Efficient Multimodal Fusion via Interactive Prompting (2023.04.13)
Yaowei Li, Ruijie Quan, Linchao Zhu, Yezhou Yang
ChatGPT Beyond English: Towards a Comprehensive Evaluation of Large Language Models in Multilingual Learning (2023.04.12)
Viet Dac Lai, Nghia Trung Ngo, Amir Pouran Ben Veyseh, Hieu Man, Franck Dernoncourt, etc
Prompt Pre-Training with Twenty-Thousand Classes for Open-Vocabulary Visual Recognition (2023.04.10)
Shuhuai Ren, Aston Zhang, Yi Zhu, Shuai Zhang, Shuai Zheng, etc . - 【arXiv.org】
Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions (2023.04.09)
Jun Chen, Deyao Zhu, Kilichbek Haydarov, Xiang Li, Mohamed Elhoseiny . - 【arXiv.org】
Vita-CLIP: Video and text adaptive CLIP via Multimodal Prompting (2023.04.06)
TagGPT: Large Language Models are Zero-shot Multimodal Taggers (2023.04.06)
Segment Anything (2023.04.05)
A. Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, etc . - 【arXiv.org】
ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding with GPT and Prototype Guidance (2023.03.29)
Ziyu Guo, Yiwen Tang, Renrui Zhang, Dong Wang, Zhigang Wang, etc . - 【arXiv.org】
TaskMatrix.AI: Completing Tasks by Connecting Foundation Models with Millions of APIs (2023.03.29)
Yaobo Liang, Chenfei Wu, Ting Song, Wenshan Wu, Yan Xia, etc . - 【arXiv.org】
MEDIMP: Medical Images and Prompts for renal transplant representation learning (2023.03.22)
Leo Milecki, Vicky Kalogeiton, Sylvain Bodard, Dany Anglicheau, Jean-Michel Correas, etc
CLIP goes 3D: Leveraging Prompt Tuning for Language Grounded 3D Recognition (2023.03.20)
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action (2023.03.20)
Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, etc . - 【ArXiv】
Visual Prompt Multi-Modal Tracking (2023.03.20)
Audio Visual Language Maps for Robot Navigation (2023.03.13)
Chen Huang, Oier Mees, Andy Zeng, W. Burgard . - 【ArXiv】
ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions (2023.03.12)
Deyao Zhu, Jun Chen, Kilichbek Haydarov, Xiaoqian Shen, Wenxuan Zhang, etc . - 【arXiv.org】
Text-Visual Prompting for Efficient 2D Temporal Video Grounding (2023.03.09)
Yimeng Zhang, Xin Chen, Jinghan Jia, Sijia Liu, Ke Ding . - 【Computer Vision and Pattern Recognition】
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models (2023.03.08)
Chenfei Wu, Sheng-Kai Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, etc
Multimodal Parameter-Efficient Few-Shot Class Incremental Learning (2023.03.08)
Marco D’Alessandro, Alberto Alonso, Enrique Calabrés, M. Galar . - 【ArXiv】
Multitask Prompt Tuning Enables Parameter-Efficient Transfer Learning (2023.03.06)
Zhen Wang, R. Panda, Leonid Karlinsky, R. Feris, Huan Sun, etc
Multimodal Prompting with Missing Modalities for Visual Recognition (2023.03.06)
Yi-Lun Lee, Yi-Hsuan Tsai, Wei-Chen Chiu, Chen-Yu Lee . - 【ArXiv】
Multimodal Chain-of-Thought Reasoning in Language Models (2023.02.02)
Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, G. Karypis, etc . - 【ArXiv】
LVP-M3: Language-aware Visual Prompt for Multilingual Multimodal Machine Translation (2022.10.19)
Hongcheng Guo, Jiaheng Liu, Haoyang Huang, Jian Yang, Zhoujun Li, etc . - 【Conference on Empirical Methods in Natural Language Processing】
CoHOZ: Contrastive Multimodal Prompt Tuning for Hierarchical Open-set Zero-shot Recognition (2022.10.10)
Ning Liao, Yifeng Liu, Li Xiaobo, Chenyi Lei, Guoxin Wang, etc . - 【Proceedings of the 30th ACM International Conference on Multimedia】
VIMA: General Robot Manipulation with Multimodal Prompts (2022.10.06)
Yunfan Jiang, Agrim Gupta, Zichen Zhang, Guanzhi Wang, Yongqiang Dou, etc . - 【ArXiv】
Learning to Prompt for Vision-Language Models (2022.09.01)
Kaiyang Zhou, Jingkang Yang, Chen Change Loy, Ziwei Liu
Visual Prompt Tuning (2022.03.23)
Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, S. Belongie, etc . - 【European Conference on Computer Vision】
Multimodal Few-Shot Learning with Frozen Language Models (2021.06.25)
Maria Tsimpoukelli, Jacob Menick, Serkan Cabi, S. Eslami, Oriol Vinyals, etc . - 【Neural Information Processing Systems】
MPT: Multimodal Prompt Tuning for Event Detection
Similarity-Aware Multimodal Prompt Learning for Fake News Detection
Ye Jiang, Xiaomin Yu, Yimin Wang, Xiaoman Xu, Xingyi Song, etc . - 【SSRN Electronic Journal】