📄 Multimodal Prompt

Paper List

InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output (2024.07.03)

Pan Zhang, Xiao-wen Dong, Yuhang Zang, Yuhang Cao, Rui Qian, etc


LLaRA: Supercharging Robot Learning Data for Vision-Language Policy (2024.06.28)

Xiang Li, Cristina Mata, Jong Sung Park, Kumara Kahatapitiya, Yoo Sung Jang, etc


Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs (2024.06.28)

Sukmin Yun, Haokun Lin, Rusiru Thushara, Mohammad Qazim Bhat, Yongxin Wang, etc


LLaVolta: Efficient Multi-modal Models via Stage-wise Visual Context Compression (2024.06.28)

Jieneng Chen, Luoxin Ye, Ju He, Zhao-Yang Wang, Daniel Khashabi, etc


Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs (2024.06.24)

Shengbang Tong, Ellis L Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, etc


VoCo-LLaMA: Towards Vision Compression with Large Language Models (2024.06.18)

Xubing Ye, Yukang Gan, Xiaoke Huang, Yixiao Ge, Ying Shan, etc


Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models (2024.06.12)

Yi-Fan Zhang, Qingsong Wen, Chaoyou Fu, Xue Wang, Zhang Zhang, etc


An Empirical Study on Parameter-Efficient Fine-Tuning for MultiModal Large Language Models (2024.06.07)

Xiongtao Zhou, Jie He, Yuhua Ke, Guangyao Zhu, Víctor Gutiérrez-Basulto, etc


Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning (2024.06.04)

Alex Jinpeng Wang, Linjie Li, Yiqi Lin, Min Li, Lijuan Wang, etc


DeCo: Decoupling Token Compression from Semantic Abstraction in Multimodal Large Language Models (2024.05.31)

Linli Yao, Lei Li, Shuhuai Ren, Lean Wang, Yuanxin Liu, etc . - 【arXiv.org】


Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis (2024.05.31)

Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, etc . - 【arXiv.org】


ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models (2024.05.24)

Chunjiang Ge, Sijie Cheng, Ziming Wang, Jiale Yuan, Yuan Gao, etc . - 【arXiv.org】


Prompt-Aware Adapter: Towards Learning Adaptive Visual Tokens for Multimodal Large Language Models (2024.05.24)

Yue Zhang, Hehe Fan, Yi Yang . - 【arXiv.org】


Probing Multimodal LLMs as World Models for Driving (2024.05.09)

Shiva Sreeram, T. Wang, Alaa Maalouf, G. Rosman, S. Karaman, etc . - 【arXiv.org】


Memory-Space Visual Prompting for Efficient Vision-Language Fine-Tuning (2024.05.09)

Shibo Jie, Yehui Tang, Ning Ding, Zhi-Hong Deng, Kai Han, etc . - 【arXiv.org】


Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers (2024.05.09)

Peng Gao, Le Zhuo, Ziyi Lin, Chris Liu, Junsong Chen, etc . - 【arXiv.org】


Auto-Encoding Morph-Tokens for Multimodal LLM (2024.05.03)

Kaihang Pan, Siliang Tang, Juncheng Li, Zhaoyu Fan, Wei Chow, etc . - 【arXiv.org】


EALD-MLLM: Emotion Analysis in Long-sequential and De-identity videos with Multi-modal Large Language Model (2024.05.01)

Deng Li, Xin Liu, Bohao Xing, Baiqiang Xia, Yuan Zong, etc . - 【arXiv.org】


CofiPara: A Coarse-to-fine Paradigm for Multimodal Sarcasm Target Identification with Large Multimodal Models (2024.05.01)

Hongzhan Lin, Zixin Chen, Ziyang Luo, Mingfei Cheng, Jing Ma, etc . - 【arXiv.org】


Training-Free Unsupervised Prompt for Vision-Language Models (2024.04.25)

Sifan Long, Linbin Wang, Zhen Zhao, Zichang Tan, Yiming Wu, etc . - 【arXiv.org】


AAPL: Adding Attributes to Prompt Learning for Vision-Language Models (2024.04.25)

Gahyeon Kim, Sohee Kim, Seokju Lee . - 【arXiv.org】


Cantor: Inspiring Multimodal Chain-of-Thought of MLLM (2024.04.24)

Timin Gao, Peixian Chen, Mengdan Zhang, Chaoyou Fu, Yunhang Shen, etc . - 【arXiv.org】


MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI (2024.04.24)

Kaining Ying, Fanqing Meng, Jin Wang, Zhiqiang Li, Han Lin, etc . - 【arXiv.org】


Unified Scene Representation and Reconstruction for 3D Large Language Models (2024.04.19)

Tao Chu, Pan Zhang, Xiao-wen Dong, Yuhang Zang, Qiong Liu, etc . - 【arXiv.org】


BRAVE: Broadening the visual encoding of vision-language models (2024.04.10)

Oğuzhan Fatih Kar, A. Tonioni, Petra Poklukar, Achin Kulshrestha, Amir Zamir, etc


ORacle: Large Vision-Language Models for Knowledge-Guided Holistic OR Domain Modeling (2024.04.10)

Ege Ozsoy, Chantal Pellegrini, Matthias Keicher, Nassir Navab


MoMA: Multimodal LLM Adapter for Fast Personalized Image Generation (2024.04.08)

Kunpeng Song, Yizhe Zhu, Bingchen Liu, Qing Yan, A. Elgammal, etc


Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs (2024.04.08)

Keen You, Haotian Zhang, E. Schoop, Floris Weers, Amanda Swearngin, etc


MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens (2024.04.04)

Kirolos Ataallah, Xiaoqian Shen, Eslam Abdelrahman, Essam Sleiman, Deyao Zhu, etc


ViTamin: Designing Scalable Vision Models in the Vision-Language Era (2024.04.02)

Jieneng Chen, Qihang Yu, Xiaohui Shen, Alan Yuille, Liang-Chieh Chen


Segment Any 3D Object with Language (2024.04.02)

Seungjun Lee, Yuyang Zhao, Gim Hee Lee


Iterated Learning Improves Compositionality in Large Vision-Language Models (2024.04.02)

Chenhao Zheng, Jieyu Zhang, Aniruddha Kembhavi, Ranjay Krishna


Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models (2024.03.27)

Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, etc . - 【arXiv.org】


Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models (2024.03.25)

Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, etc


Hierarchical Text-to-Vision Self Supervised Alignment for Improved Histopathology Representation Learning (2024.03.21)

Hasindri Watawana, Kanchana Ranasinghe, Tariq Mahmood, Muzammal Naseer, Salman Khan, etc


MyVLM: Personalizing VLMs for User-Specific Queries (2024.03.21)

Yuval Alaluf, Elad Richardson, Sergey Tulyakov, Kfir Aberman, Daniel Cohen-Or


MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? (2024.03.21)

Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, etc


PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model (2024.03.21)

Zheng Zhang, Yeyao Ma, Enming Zhang, Xiang Bai


SC-Tune: Unleashing Self-Consistent Referential Comprehension in Large Vision Language Models (2024.03.20)

Tongtian Yue, Jie Cheng, Longteng Guo, Xingyuan Dai, Zijia Zhao, etc


The First to Know: How Token Distributions Reveal Hidden Knowledge in Large Vision-Language Models? (2024.03.14)

Qinyu Zhao, Ming Xu, Kartik Gupta, Akshay Asthana, Liang Zheng, etc


3D-VLA: A 3D Vision-Language-Action Generative World Model (2024.03.14)

Haoyu Zhen, Xiaowen Qiu, Peihao Chen, Jincheng Yang, Xin Yan, etc


UniCode: Learning a Unified Codebook for Multimodal Large Language Models (2024.03.14)

Sipeng Zheng, Bohan Zhou, Yicheng Feng, Ye Wang, Zongqing Lu


DeepSeek-VL: Towards Real-World Vision-Language Understanding (2024.03.08)

Haoyu Lu, Wen Liu, Bo Zhang, Bing-Li Wang, Kai Dong, etc


VLM-PL: Advanced Pseudo Labeling Approach for Class Incremental Object Detection with Vision-Language Model (2024.03.08)

Junsu Kim, Yunhoe Ku, Jihyeon Kim, Junuk Cha, Seungryul Baek


TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document (2024.03.07)

Yuliang Liu, Biao Yang, Qiang Liu, Zhang Li, Zhiyin Ma, etc


Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models (2024.03.05)

Gen Luo, Yiyi Zhou, Yuxin Zhang, Xiawu Zheng, Xiaoshuai Sun, etc


RegionGPT: Towards Region Understanding Vision Language Model (2024.03.04)

Qiushan Guo, Shalini De Mello, Hongxu Yin, Wonmin Byeon, Ka Chun Cheung, etc


Non-autoregressive Sequence-to-Sequence Vision-Language Models (2024.03.04)

Kunyu Shi, Qi Dong, Luis Goncalves, Zhuowen Tu, S. Soatto


Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers (2024.02.29)

Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, etc


Tower: An Open Multilingual Large Language Model for Translation-Related Tasks (2024.02.27)

Duarte M. Alves, José P. Pombal, Nuno M. Guerreiro, Pedro H. Martins, João Alves, etc . - 【arXiv.org】


ShapeLLM: Universal 3D Object Understanding for Embodied Interaction (2024.02.27)

Zekun Qi, Runpei Dong, Shaochen Zhang, Haoran Geng, Chunrui Han, etc . - 【arXiv.org】


VRP-SAM: SAM with Visual Reference Prompt (2024.02.27)

Yanpeng Sun, Jiahui Chen, Shan Zhang, Xinyu Zhang, Qiang Chen, etc . - 【arXiv.org】


GROUNDHOG: Grounding Large Language Models to Holistic Segmentation (2024.02.26)

Yichi Zhang, Ziqiao Ma, Xiaofeng Gao, Suhaila Shakiah, Qiaozi Gao, etc


Genie: Generative Interactive Environments (2024.02.23)

Jake Bruce, Michael Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, etc


LLMBind: A Unified Modality-Task Integration Framework (2024.02.22)

Bin Zhu, Peng Jin, Munan Ning, Bin Lin, Jinfa Huang, etc . - 【arXiv.org】


How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts (2024.02.20)

Yusu Qian, Haotian Zhang, Yinfei Yang, Zhe Gan


Video ReCap: Recursive Captioning of Hour-Long Videos (2024.02.20)

Md Mohaiminul Islam, Ngan Ho, Xitong Yang, Tushar Nagarajan, Lorenzo Torresani, etc


An Empirical Study Into What Matters for Calibrating Vision-Language Models (2024.02.12)

Weijie Tu, Weijian Deng, Dylan Campbell, Stephen Gould, Tom Gedeon . - 【arXiv.org】


SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models (2024.02.08)

Peng Gao, Renrui Zhang, Chris Liu, Longtian Qiu, Siyuan Huang, etc . - 【arXiv.org】


Binding Touch to Everything: Learning Unified Multimodal Tactile Representations (2024.01.31)

Fengyu Yang, Chao Feng, Ziyang Chen, Hyoungseob Park, Daniel Wang, etc . - 【arXiv.org】


InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model (2024.01.29)

Xiao-wen Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, etc . - 【arXiv.org】


Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities (2024.01.25)

Yiyuan Zhang, Xiaohan Ding, Kaixiong Gong, Yixiao Ge, Ying Shan, etc . - 【arXiv.org】


MM-LLMs: Recent Advances in MultiModal Large Language Models (2024.01.24)

Duzhen Zhang, Yahan Yu, Chenxing Li, Jiahua Dong, Dan Su, etc . - 【arXiv.org】


Incorporating Visual Experts to Resolve the Information Loss in Multimodal Large Language Models (2024.01.06)

Xin He, Longhui Wei, Lingxi Xie, Qi Tian . - 【arXiv.org】


Learning to Prompt with Text Only Supervision for Vision-Language Models (2024.01.04)

Muhammad Uzair Khattak, Muhammad Ferjad Naeem, Muzammal Naseer, L. V. Gool, F. Tombari . - 【arXiv.org】


Instruct-Imagen: Image Generation with Multi-modal Instruction (2024.01.03)

Hexiang Hu, Kelvin C.K. Chan, Yu-Chuan Su, Wenhu Chen, Yandong Li, etc . - 【arXiv.org】


G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model (2023.12.18)

Jiahui Gao, Renjie Pi, Jipeng Zhang, Jiacheng Ye, Wanjun Zhong, etc


Lyrics: Boosting Fine-grained Language-Vision Alignment and Comprehension via Semantic-aware Visual Objects (2023.12.08)

Junyu Lu, Ruyi Gan, Di Zhang, Xiaojun Wu, Ziwei Wu, etc . - 【arXiv.org】


LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models (2023.12.05)

Hao Zhang, Hongyang Li, Feng Li, Tianhe Ren, Xueyan Zou, etc


Sequential Modeling Enables Scalable Learning for Large Vision Models (2023.12.01)

Yutong Bai, Xinyang Geng, K. Mangalam, Amir Bar, Alan Yuille, etc


LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models (2023.11.28)

Yanwei Li, Chengyao Wang, Jiaya Jia . - 【arXiv.org】


MeshGPT: Generating Triangle Meshes with Decoder-Only Transformers (2023.11.27)

Yawar Siddiqui, A. Alliegro, Alexey Artemov, Tatiana Tommasi, Daniele Sirigatti, etc . - 【arXiv.org】


An Embodied Generalist Agent in 3D World (2023.11.18)

Jiangyong Huang, Silong Yong, Xiaojian Ma, Xiongkun Linghu, Puhao Li, etc . - 【arXiv.org】


Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning (2023.11.17)

Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, S. Azadi, etc . - 【arXiv.org】


MedAgents: Large Language Models as Collaborators for Zero-shot Medical Reasoning (2023.11.16)

Xiangru Tang, Anni Zou, Zhuosheng Zhang, Yilun Zhao, Xingyao Zhang, etc . - 【arXiv.org】


Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding (2023.11.14)

Peng Jin, Ryuichi Takanobu, Caiwan Zhang, Xiaochun Cao, Li Yuan . - 【arXiv.org】


EviPrompt: A Training-Free Evidential Prompt Generation Method for Segment Anything Model in Medical Images (2023.11.10)

Yinsong Xu, Jiaqi Tang, Aidong Men, Qingchao Chen . - 【arXiv.org】


u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model (2023.11.09)

Jinjin Xu, Liwu Xu, Yuzhe Yang, Xiang Li, Yanchun Xie, etc . - 【arXiv.org】


Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges (2023.11.06)

Chenhang Cui, Yiyang Zhou, Xinyu Yang, Shirley Wu, Linjun Zhang, etc . - 【arXiv.org】


Levels of AGI: Operationalizing Progress on the Path to AGI (2023.11.04)

Meredith Ringel Morris, Jascha Narain Sohl-Dickstein, Noah Fiedel, T. Warkentin, Allan Dafoe, etc . - 【arXiv.org】


Woodpecker: Hallucination Correction for Multimodal Large Language Models (2023.10.24)

Shukang Yin, Chaoyou Fu, Sirui Zhao, Tong Xu, Hao Wang, etc


3D-GPT: Procedural 3D Modeling with Large Language Models (2023.10.19)

Chunyi Sun, Junlin Han, Weijian Deng, Xinlong Wang, Zishan Qin, etc . - 【arXiv.org】


BiLL-VTG: Bridging Large Language Models and Lightweight Visual Tools for Video-based Texts Generation (2023.10.16)

Ji Qi, Kaixuan Ji, Jifan Yu, Duokang Wang, Bin Xu, etc . - 【arXiv.org】


MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens (2023.10.03)

Kaizhi Zheng, Xuehai He, Xin Eric Wang . - 【arXiv.org】


Kosmos-2.5: A Multimodal Literate Model (2023.09.20)

Tengchao Lv, Yupan Huang, Jingye Chen, Lei Cui, Shuming Ma, etc . - 【arXiv.org】


Investigating the Catastrophic Forgetting in Multimodal Large Language Models (2023.09.19)

Yuexiang Zhai, Shengbang Tong, Xiao Li, Mu Cai, Qing Qu, etc . - 【arXiv.org】


Physically Grounded Vision-Language Models for Robotic Manipulation (2023.09.05)

Jensen Gao, Bidipta Sarkar, F. Xia, Ted Xiao, Jiajun Wu, etc . - 【arXiv.org】


Point-Bind & Point-LLM: Aligning Point Cloud with Multi-modality for 3D Understanding, Generation, and Instruction Following (2023.09.01)

Ziyu Guo, Renrui Zhang, Xiangyang Zhu, Yiwen Tang, Xianzheng Ma, etc


PointLLM: Empowering Large Language Models to Understand Point Clouds (2023.08.31)

Runsen Xu, Xiaolong Wang, Tai Wang, Yilun Chen, Jiangmiao Pang, etc . - 【arXiv.org】


PE-MED: Prompt Enhancement for Interactive Medical Image Segmentation (2023.08.26)

Ao Chang, Xing Tao, Xin Yang, Yuhao Huang, Xinrui Zhou, etc . - 【arXiv.org】


SeamlessM4T-Massively Multilingual & Multimodal Machine Translation (2023.08.22)

Seamless Communication, Loïc Barrault, Yu-An Chung, Mariano Cora Meglioli, David Dale, etc . - 【arXiv.org】


Chat-3D: Data-efficiently Tuning Large Language Model for Universal Dialogue of 3D Scenes (2023.08.17)

Zehan Wang, Haifeng Huang, Yang Zhao, Ziang Zhang, Zhou Zhao . - 【arXiv.org】


VisIT-Bench: A Benchmark for Vision-Language Instruction Following Inspired by Real-World Use (2023.08.12)

Yonatan Bitton, Hritik Bansal, Jack Hessel, Rulin Shao, Wanrong Zhu, etc . - 【arXiv.org】


3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment (2023.08.08)

Ziyu Zhu, Xiaojian Ma, Yixin Chen, Zhidong Deng, Siyuan Huang, etc . - 【arXiv.org】


UniVTG: Towards Unified Video-Language Temporal Grounding (2023.07.31)

Kevin Lin, Pengchuan Zhang, Joya Chen, Shraman Pramanick, Difei Gao, etc . - 【arXiv.org】


Med-Flamingo: a Multimodal Medical Few-shot Learner (2023.07.27)

Michael Moor, Qian Huang, Shirley Wu, Michihiro Yasunaga, C. Zakka, etc . - 【arXiv.org】


OBJECT 3DIT: Language-guided 3D-aware Image Editing (2023.07.20)

Oscar Michel, Anand Bhattad, Eli VanderBilt, Ranjay Krishna, Aniruddha Kembhavi, etc . - 【arXiv.org】


Brain2Music: Reconstructing Music from Human Brain Activity (2023.07.20)

Timo I. Denk, Yu Takagi, Takuya Matsuyama, A. Agostinelli, Tomoya Nakai, etc . - 【arXiv.org】


(Ab)using Images and Sounds for Indirect Instruction Injection in Multi-Modal LLMs (2023.07.19)

Eugene Bagdasaryan, Tsung-Yin Hsieh, Ben Nassi, Vitaly Shmatikov . - 【arXiv.org】


MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots (2023.07.16)

Gelei Deng, Yi Liu, Yuekang Li, Kailong Wang, Ying Zhang, etc


HyperDreamBooth: HyperNetworks for Fast Personalization of Text-to-Image Models (2023.07.13)

Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Wei Wei, Tingbo Hou, etc


Multimodal Prompt Learning for Product Title Generation with Extremely Limited Labels (2023.07.05)

Bang Yang, Fenglin Liu, Zheng Li, Qingyu Yin, Chenyu You, etc . - 【Annual Meeting of the Association for Computational Linguistics】


Multimodal Prompt Retrieval for Generative Visual Question Answering (2023.06.30)

Timothy Ossowski, Junjie Hu . - 【arXiv.org】


SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs (2023.06.30)

Lijun Yu, Yong Cheng, Zhiruo Wang, Vivek Kumar, Wolfgang Macherey, etc . - 【arXiv.org】


Multimodal Prompt Learning in Emotion Recognition Using Context and Audio Information (2023.06.28)

Eunseo Jeong, Gyu-Min Kim, Sangwoo Kang . - 【Mathematics】


Palm: Predicting Actions through Language Models @ Ego4D Long-Term Action Anticipation Challenge 2023 (2023.06.28)

Daoji Huang, Otmar Hilliges, L. Gool, Xi Wang . - 【arXiv.org】


Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic (2023.06.27)

Ke Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, etc . - 【arXiv.org】


DocEdit: Language-Guided Document Editing (2023.06.26)

Puneet Mathur, R. Jain, Jiuxiang Gu, Franck Dernoncourt, Dinesh Manocha, etc . - 【AAAI Conference on Artificial Intelligence】


PromptIR: Prompting for All-in-One Blind Image Restoration (2023.06.22)

Vaishnav Potlapalli, Syed Waqas Zamir, Salman Khan, F. Khan


Unleashing the AI revolution: exploring the capabilities and challenges of large language models and text-to-image AI programs (2023.06.17)

A. Youssef . - 【Ultrasound in Obstetrics and Gynecology】


Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration (2023.06.15)

Chenyang Lyu, Minghao Wu, Longyue Wang, Xinting Huang, Bingshuai Liu, etc . - 【arXiv.org】


Towards AGI in Computer Vision: Lessons Learned from GPT and Large Language Models (2023.06.14)

Lingxi Xie, Longhui Wei, Xiaopeng Zhang, Kaifeng Bi, Xiaotao Gu, etc . - 【arXiv.org】


Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding (2023.06.05)

Han Zhang, Xin Li, Lidong Bing . - 【arXiv.org】


HeadSculpt: Crafting 3D Head Avatars with Text (2023.06.05)

Xiaoping Han, Yukang Cao, K. Han, Xiatian Zhu, Jiankang Deng, etc . - 【arXiv.org】


Meta-Learning For Vision-and-Language Cross-lingual Transfer (2023.05.24)

Hanxu Hu, Frank Keller


LayoutGPT: Compositional Visual Planning and Generation with Large Language Models (2023.05.24)

Weixi Feng, Wanrong Zhu, Tsu-jui Fu, Varun Jampani, Arjun Akula, etc


Cream: Visually-Situated Natural Language Understanding with Contrastive Reading Model and Frozen Large Language Models (2023.05.24)

Geewook Kim, Hodong Lee, Daehee Kim, Haeji Jung, Sanghee Park, etc


EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought (2023.05.24)

Yao Mu, Qinglong Zhang, Mengkang Hu, Wenhai Wang, Mingyu Ding, etc


OverPrompt: Enhancing ChatGPT Capabilities through an Efficient In-Context Learning Approach (2023.05.24)

Jiazheng Li, Runcong Zhao, Yulan He, Lin Gui


In-Context Demonstration Selection with Cross Entropy Difference (2023.05.24)

Dan Iter, Reid Pryzant, Ruochen Xu, Shuohang Wang, Yang Liu, etc


Abductive Commonsense Reasoning Exploiting Mutually Exclusive Explanations (2023.05.24)

Wenting Zhao, Justin T. Chiu, Claire Cardie, Alexander M. Rush


Goat: Fine-tuned LLaMA Outperforms GPT-4 on Arithmetic Tasks (2023.05.23)

Tiedong Liu, Bryan Kian Hsiang Low


Masked Path Modeling for Vision-and-Language Navigation (2023.05.23)

Zi-Yi Dou, Feng Gao, Nanyun Peng . - 【arXiv.org】


ChatCoT: Tool-Augmented Chain-of-Thought Reasoning on Chat-based Large Language Models (2023.05.23)

Z. Chen, Kun Zhou, Beichen Zhang, Zheng Gong, Wayne Xin Zhao, etc


LM-Switch: Lightweight Language Model Conditioning in Word Embedding Space (2023.05.22)

Chi Han, Jialiang Xu, Manling Li, Y. Fung, Chenkai Sun, etc


Enhancing Cross-lingual Natural Language Inference by Soft Prompting with Multilingual Verbalizer (2023.05.22)

Shuang Li, Xuming Hu, Aiwei Liu, Yawen Yang, Fukun Ma, etc


A Benchmark on Extremely Weakly Supervised Text Classification: Reconcile Seed Matching and Prompting Approaches (2023.05.22)

Zihan Wang, Tianle Wang, Dheeraj Mekala, Jingbo Shang


Enhance Reasoning Ability of Visual-Language Models via Large Language Models (2023.05.22)

Yueting Yang, Xintong Zhang, Wenjuan Han


VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending (2023.05.22)

Xingjian He, Sihan Chen, Fan Ma, Zhicheng Huang, Xiaojie Jin, etc . - 【arXiv.org】


InstructVid2Vid: Controllable Video Editing with Natural Language Instructions (2023.05.21)

Bosheng Qin, Juncheng Li, Siliang Tang, Tat-Seng Chua, Yueting Zhuang


Logic-LM: Empowering Large Language Models with Symbolic Solvers for Faithful Logical Reasoning (2023.05.20)

Liangming Pan, Alon Albalak, Xinyi Wang, William Yang Wang


LogiCoT: Logical Chain-of-Thought Instruction-Tuning Data Collection with GPT-4 (2023.05.20)

Hanmeng Liu, Zhiyang Teng, Leyang Cui, Chaoli Zhang, Qiji Zhou, etc


SelfzCoT: a Self-Prompt Zero-shot CoT from Semantic-level to Code-level for a Better Utilization of LLMs (2023.05.19)

IokTong Lei, ZhiDong Deng . - 【arXiv.org】


Controlling the Extraction of Memorized Data from Large Language Models via Prompt-Tuning (2023.05.19)

Mustafa Safa Ozdayi, Charith S. Peris, Jack G. M. FitzGerald, Christophe Dupuy, Jimit Majmudar, etc . - 【arXiv.org】


RCOT: Detecting and Rectifying Factual Inconsistency in Reasoning by Reversing Chain-of-Thought (2023.05.19)

Tianci Xue, Ziqi Wang, Zhenhailong Wang, Chi Han, Pengfei Yu, etc . - 【arXiv.org】


TreePrompt: Learning to Compose Tree Prompts for Explainable Visual Grounding (2023.05.19)

Chenchi Zhang, Jun Xiao, Lei Chen, Jian Shao, Long Chen . - 【arXiv.org】


Efficient Prompting via Dynamic In-Context Learning (2023.05.18)

Wangchunshu Zhou, Yuchen Jiang, Ryan Cotterell, Mrinmaya Sachan . - 【arXiv.org】


MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers (2023.05.12)

L. Yu, Daniel Simig, Colin Flaherty, Armen Aghajanyan, Luke Zettlemoyer, etc


Prompt Tuning Inversion for Text-Driven Image Editing Using Diffusion Models (2023.05.08)

Wenkai Dong, Song Xue, Xiaoyue Duan, Shumin Han . - 【arXiv.org】


Prompt What You Need: Enhancing Segmentation in Rainy Scenes with Anchor-based Prompting (2023.05.06)

Xiaoyuan Guo, Xiang Wei, Q. Su, Hui-Huang Zhao, Shunli Zhan . - 【arXiv.org】


Edit Everything: A Text-Guided Generative System for Images Editing (2023.04.27)

Defeng Xie, Ruichen Wang, Jian Ma, Chen Chen, Haonan Lu, etc


ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System (2023.04.27)

Junke Wang, Dongdong Chen, Chong Luo, Xiyang Dai, Lu Yuan, etc


mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality (2023.04.27)

Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, etc


Promptify: Text-to-Image Generation through Interactive Prompt Exploration with Large Language Models (2023.04.18)

Stephen Brade, Bryan Wang, Maurício Sousa, Sageev Oore, Tovi Grossman . - 【arXiv.org】


Towards Robust Prompts on Vision-Language Models (2023.04.17)

Jindong Gu, A. Beirami, Xuezhi Wang, Alex Beutel, Philip H. S. Torr, etc


Visual Instruction Tuning (2023.04.17)

Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee


Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved With Text (2023.04.14)

Wanrong Zhu, Jack Hessel, Anas Awadalla, S. Gadre, Jesse Dodge, etc


Segment Everything Everywhere All at Once (2023.04.13)

Xueyan Zou, Jianwei Yang, Hao Zhang, Feng Li, Linjie Li, etc . - 【arXiv.org】


Efficient Multimodal Fusion via Interactive Prompting (2023.04.13)

Yaowei Li, Ruijie Quan, Linchao Zhu, Yezhou Yang


ChatGPT Beyond English: Towards a Comprehensive Evaluation of Large Language Models in Multilingual Learning (2023.04.12)

Viet Dac Lai, Nghia Trung Ngo, Amir Pouran Ben Veyseh, Hieu Man, Franck Dernoncourt, etc


Prompt Pre-Training with Twenty-Thousand Classes for Open-Vocabulary Visual Recognition (2023.04.10)

Shuhuai Ren, Aston Zhang, Yi Zhu, Shuai Zhang, Shuai Zheng, etc . - 【arXiv.org】


Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions (2023.04.09)

Jun Chen, Deyao Zhu, Kilichbek Haydarov, Xiang Li, Mohamed Elhoseiny . - 【arXiv.org】


Vita-CLIP: Video and text adaptive CLIP via Multimodal Prompting (2023.04.06)


TagGPT: Large Language Models are Zero-shot Multimodal Taggers (2023.04.06)


Segment Anything (2023.04.05)

A. Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, etc . - 【arXiv.org】


ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding with GPT and Prototype Guidance (2023.03.29)

Ziyu Guo, Yiwen Tang, Renrui Zhang, Dong Wang, Zhigang Wang, etc . - 【arXiv.org】


TaskMatrix.AI: Completing Tasks by Connecting Foundation Models with Millions of APIs (2023.03.29)

Yaobo Liang, Chenfei Wu, Ting Song, Wenshan Wu, Yan Xia, etc . - 【arXiv.org】


MEDIMP: Medical Images and Prompts for renal transplant representation learning (2023.03.22)

Leo Milecki, Vicky Kalogeiton, Sylvain Bodard, Dany Anglicheau, Jean-Michel Correas, etc


CLIP goes 3D: Leveraging Prompt Tuning for Language Grounded 3D Recognition (2023.03.20)


MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action (2023.03.20)

Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, etc . - 【ArXiv】


Visual Prompt Multi-Modal Tracking (2023.03.20)


Audio Visual Language Maps for Robot Navigation (2023.03.13)

Chen Huang, Oier Mees, Andy Zeng, W. Burgard . - 【ArXiv】


ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions (2023.03.12)

Deyao Zhu, Jun Chen, Kilichbek Haydarov, Xiaoqian Shen, Wenxuan Zhang, etc . - 【arXiv.org】


Text-Visual Prompting for Efficient 2D Temporal Video Grounding (2023.03.09)

Yimeng Zhang, Xin Chen, Jinghan Jia, Sijia Liu, Ke Ding . - 【Computer Vision and Pattern Recognition】


Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models (2023.03.08)

Chenfei Wu, Sheng-Kai Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, etc


Multimodal Parameter-Efficient Few-Shot Class Incremental Learning (2023.03.08)

Marco D’Alessandro, Alberto Alonso, Enrique Calabrés, M. Galar . - 【ArXiv】


Multitask Prompt Tuning Enables Parameter-Efficient Transfer Learning (2023.03.06)

Zhen Wang, R. Panda, Leonid Karlinsky, R. Feris, Huan Sun, etc


Multimodal Prompting with Missing Modalities for Visual Recognition (2023.03.06)

Yi-Lun Lee, Yi-Hsuan Tsai, Wei-Chen Chiu, Chen-Yu Lee . - 【ArXiv】


Multimodal Chain-of-Thought Reasoning in Language Models (2023.02.02)

Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, G. Karypis, etc . - 【ArXiv】


LVP-M3: Language-aware Visual Prompt for Multilingual Multimodal Machine Translation (2022.10.19)

Hongcheng Guo, Jiaheng Liu, Haoyang Huang, Jian Yang, Zhoujun Li, etc . - 【Conference on Empirical Methods in Natural Language Processing】


CoHOZ: Contrasive Multimodal prompt Tuning for Hierarchical Open-set Zero-shot Recognition (2022.10.10)

Ning Liao, Yifeng Liu, Li Xiaobo, Chenyi Lei, Guoxin Wang, etc . - 【Proceedings of the 30th ACM International Conference on Multimedia】


VIMA: General Robot Manipulation with Multimodal Prompts (2022.10.06)

Yunfan Jiang, Agrim Gupta, Zichen Zhang, Guanzhi Wang, Yongqiang Dou, etc . - 【ArXiv】


Learning to Prompt for Vision-Language Models (2022.09.01)

Kaiyang Zhou, Jingkang Yang, Chen Change Loy, Ziwei Liu


Visual Prompt Tuning (2022.03.23)

Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, S. Belongie, etc . - 【European Conference on Computer Vision】


Multimodal Few-Shot Learning with Frozen Language Models (2021.06.25)

Maria Tsimpoukelli, Jacob Menick, Serkan Cabi, S. Eslami, Oriol Vinyals, etc . - 【Neural Information Processing Systems】


MPT: Multimodal Prompt Tuning for Event Detection


Similarity-Aware Multimodal Prompt Learning for Fake News Detection

Ye Jiang, Xiaomin Yu, Yimin Wang, Xiaoman Xu, Xingyi Song, etc . - 【SSRN Electronic Journal】

CONTINUE...