Adapter modules with support for multimodal fusion of information (text, video, audio, etc.) using a pre-trained BERT base model. For a more detailed review of the architecture, refer to my master's thesis PDF: in section 4.2, "Parameter-Efficient Transfer Learning for Multimodal Tasks", I describe the architectural changes made so that the BERT model can support multimodal inputs.
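As a rough sketch of the adapter idea (not the exact architecture from the thesis), a Houlsby-style bottleneck adapter can be written in a few lines of PyTorch; the hidden size matches BERT base, while the bottleneck size and names are illustrative assumptions:

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter (Houlsby et al.): down-project, apply a
    non-linearity, up-project, and add a residual connection so the
    block behaves close to identity at initialization."""
    def __init__(self, hidden_size: int = 768, bottleneck_size: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck_size)  # down-projection
        self.act = nn.GELU()                                 # non-linearity
        self.up = nn.Linear(bottleneck_size, hidden_size)    # up-projection

    def forward(self, hidden_states):
        # Residual connection keeps the pre-trained representation intact.
        return hidden_states + self.up(self.act(self.down(hidden_states)))
```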
A journal paper presenting results obtained with this architecture is in preparation.
The proposed architecture was used to run experiments on a multimodal movie-genre classification task (Moviescope). Multimodal-Adapter was compared with MMBT and showed on-par performance while significantly reducing the number of parameters modified during finetuning. For more details about the experiments, please refer to section 5.5 "Multimodal Adapter Experiments" here.
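To illustrate where the parameter savings come from (a hedged sketch, reusing the illustrative `Adapter` class above; the exact counts in the thesis may differ), the pre-trained backbone is frozen and only the adapter weights remain trainable:

```python
import torch.nn as nn
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

# Freeze every pre-trained BERT parameter; none of these are
# modified during finetuning.
for param in model.parameters():
    param.requires_grad = False

# One illustrative Adapter per BERT layer; only these weights would be trained.
adapters = nn.ModuleList(
    [Adapter() for _ in range(model.config.num_hidden_layers)]
)

trainable = sum(p.numel() for p in adapters.parameters())
frozen = sum(p.numel() for p in model.parameters())
print(f"trainable adapter params: {trainable:,} / frozen BERT params: {frozen:,}")
```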
The text input is represented by the BERT hidden vectors.
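For reference, these hidden vectors can be obtained with the Hugging Face `transformers` library (an assumption about tooling; this is a minimal example, not the thesis code):

```python
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("A movie plot synopsis.", return_tensors="pt")
outputs = model(**inputs)

# Token-level hidden vectors, shape (batch, seq_len, 768) for BERT base.
text_hidden_states = outputs.last_hidden_state
```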
- Adapters: "Parameter-Efficient Transfer Learning for NLP" by Houlsby et al.
- AdapterFusion: "AdapterFusion: Non-Destructive Task Composition for Transfer Learning" by Pfeiffer et al.
- MMBT: "Supervised Multimodal Bitransformers for Classifying Images and Text" by Kiela et al.
- GMU: "Gated Multimodal Units for Information Fusion" by Arevalo et al.
- Moviescope: "Moviescope: Large-scale Analysis of Movies using Multiple Modalities" by Cascante-Bonilla et al.