Adapter modules with support for multimodal fusion of information (text, video, audio, etc.) using a pre-trained BERT base model. For a more detailed review of the architecture, refer to my master's thesis PDF: in section 4.2, "Parameter-Efficient Transfer Learning for Multimodal Tasks", I describe the architectural changes made so that the BERT model can support multimodal inputs.
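As a rough sketch of the adapter idea (not the exact architecture from the thesis), a Houlsby-style bottleneck adapter can be written in a few lines of PyTorch; the hidden size matches BERT base, while the bottleneck size and names are illustrative assumptions:

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter (Houlsby et al.): down-project, apply a
    non-linearity, up-project, and add a residual connection so the
    block behaves close to identity at initialization."""
    def __init__(self, hidden_size: int = 768, bottleneck_size: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck_size)  # down-projection
        self.act = nn.GELU()                                 # non-linearity
        self.up = nn.Linear(bottleneck_size, hidden_size)    # up-projection

    def forward(self, hidden_states):
        # Residual connection keeps the pre-trained representation intact.
        return hidden_states + self.up(self.act(self.down(hidden_states)))
```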
A journal paper presenting results obtained with this architecture is in preparation.
The proposed architecture was used to run experiments on a multimodal movie-genre classification task (Moviescope). Multimodal-Adapter was compared with MMBT and showed on-par performance while significantly reducing the number of parameters modified during finetuning. For more details about the experiments, please refer to section 5.5 "Multimodal Adapter Experiments" here.
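To illustrate where the parameter savings come from (a hedged sketch, reusing the illustrative `Adapter` class above; the exact counts in the thesis may differ), the pre-trained backbone is frozen and only the adapter weights remain trainable:

```python
import torch.nn as nn
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

# Freeze every pre-trained BERT parameter; none of these are
# modified during finetuning.
for param in model.parameters():
    param.requires_grad = False

# One illustrative Adapter per BERT layer; only these weights would be trained.
adapters = nn.ModuleList(
    [Adapter() for _ in range(model.config.num_hidden_layers)]
)

trainable = sum(p.numel() for p in adapters.parameters())
frozen = sum(p.numel() for p in model.parameters())
print(f"trainable adapter params: {trainable:,} / frozen BERT params: {frozen:,}")
```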
The text input is represented by the BERT hidden vectors.
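For reference, these hidden vectors can be obtained with the Hugging Face `transformers` library (an assumption about tooling; this is a minimal example, not the thesis code):

```python
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("A movie plot synopsis.", return_tensors="pt")
outputs = model(**inputs)

# Token-level hidden vectors, shape (batch, seq_len, 768) for BERT base.
text_hidden_states = outputs.last_hidden_state
```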
- Adapters: "Parameter-Efficient Transfer Learning for NLP" by Houlsby et al.
- AdapterFusion: "AdapterFusion: Non-Destructive Task Composition for Transfer Learning" by Pfeiffer et al.
- MMBT: "Supervised Multimodal Bitransformers for Classifying Images and Text" by Kiela et al.
- GMU: "Gated Multimodal Units for Information Fusion" by Arevalo et al.
- Moviescope: "Moviescope: Large-scale Analysis of Movies using Multiple Modalities" by Cascante-Bonilla et al.