PyTorch implementation of the paper "Multimodal Transformer for Unaligned Multimodal Language Sequences".
Original author's implementation is here.
Data files (containing processed MOSI, MOSEI and IEMOCAP datasets) can be downloaded from here.
To retrieve the meta information and the raw data, please refer to the SDK for these datasets.
- Python 3.6
- PyTorch (>=1.0.0) and torchvision
- CUDA 10.0 or above
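The requirements above can be checked before training with a short script. This is only a sketch: the helper names `version_tuple` and `meets_minimum` are made up for illustration and are not part of this repository, and the check assumes that `torch.__version__` and `torch.version.cuda` report the installed versions (the latter is `None` on CPU-only builds).

```python
import re
import sys


def version_tuple(version):
    """Return the leading numeric components, e.g. '1.13.1+cu117' -> (1, 13, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError("no numeric version found in %r" % (version,))
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(version, minimum):
    """Compare two version strings, padding the shorter one with zeros."""
    have, want = version_tuple(version), version_tuple(minimum)
    width = max(len(have), len(want))
    have = have + (0,) * (width - len(have))
    want = want + (0,) * (width - len(want))
    return have >= want


if __name__ == "__main__":
    print("Python >= 3.6:", sys.version_info >= (3, 6))
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
    else:
        print("PyTorch >= 1.0.0:", meets_minimum(torch.__version__, "1.0.0"))
        cuda = torch.version.cuda  # None when PyTorch was built without CUDA
        print("CUDA >= 10.0:", cuda is not None and meets_minimum(cuda, "10.0"))
```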
- Create (empty) folders for data and pre-trained models:
mkdir data pre_trained_models
and put the downloaded data in 'data/'.
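If you prefer to do this step from Python (for example on a system without `mkdir`), the following sketch creates the same two folders. The function name `ensure_dirs` is hypothetical and not part of this repository.

```python
from pathlib import Path


def ensure_dirs(root=".", names=("data", "pre_trained_models")):
    """Create the folders used by this README; existing folders are left untouched."""
    created = []
    for name in names:
        path = Path(root) / name
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


if __name__ == "__main__":
    for folder in ensure_dirs():
        print(folder.resolve())
```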
- Run the following command:
python main.py [--FLAGS]
Note that the default arguments are for the unaligned version of MOSEI. For other datasets, please refer to the supplementary material.
nohup python main.py &
- unaligned version of MOSEI
Output: nohup.out
MAE: 0.6139981
Correlation Coefficient: 0.6773945850196033
mult_acc_7: 0.48873148744365746
mult_acc_5: 0.5028976175144881
F1 score: 0.8201431177436439
Accuracy: 0.8200330214639515
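To collect these numbers programmatically, a small parser along the following lines may help. It is a sketch that assumes the metrics appear in `nohup.out` as `name: value` lines exactly as shown above; `parse_metrics` is a hypothetical helper, not a function from this repository. If a metric is printed more than once, the last value is kept.

```python
import re

# A metric line is "<name>: <number>"; lines such as "Output: nohup.out" do not match.
METRIC_LINE = re.compile(
    r"^\s*([A-Za-z][\w ]*?)\s*:\s*([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)\s*$"
)


def parse_metrics(text):
    """Map each metric name to its value as a float (last occurrence wins)."""
    metrics = {}
    for line in text.splitlines():
        match = METRIC_LINE.match(line)
        if match:
            metrics[match.group(1)] = float(match.group(2))
    return metrics


if __name__ == "__main__":
    with open("nohup.out") as handle:
        for name, value in parse_metrics(handle.read()).items():
            print(name, value)
```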
The Transformer requires no CTC module. However, as we describe in the paper, a CTC module offers an alternative that allows other kinds of sequence models (e.g., recurrent architectures) to be applied to unaligned multimodal streams.
If you want to use the CTC module, please install warp-ctc from here.
The quick version:
git clone https://github.com/SeanNaren/warp-ctc.git
cd warp-ctc
mkdir build; cd build
cmake ..
make
cd ../pytorch_binding
python setup.py install
export WARP_CTC_PATH=/home/xxx/warp-ctc/build
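Before training with the CTC module, it can be worth confirming that `WARP_CTC_PATH` is set and points to an existing directory. The snippet below is a minimal sketch; `warp_ctc_build_dir` is a hypothetical helper and does not check that the build itself succeeded.

```python
import os
from pathlib import Path


def warp_ctc_build_dir(env=None):
    """Return WARP_CTC_PATH as a Path if it is set and is a directory, else None."""
    env = os.environ if env is None else env
    value = env.get("WARP_CTC_PATH")
    if not value:
        return None
    path = Path(value)
    return path if path.is_dir() else None


if __name__ == "__main__":
    found = warp_ctc_build_dir()
    if found is None:
        print("WARP_CTC_PATH is unset or does not point to a directory")
    else:
        print("warp-ctc build directory:", found)
```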
Some portions of the code were adapted from the fairseq repo.
@inproceedings{tsai2019multimodal,
title={Multimodal Transformer for Unaligned Multimodal Language Sequences},
author={Tsai, Yao-Hung Hubert and Bai, Shaojie and Liang, Paul Pu and Kolter, J Zico and Morency, Louis-Philippe and Salakhutdinov, Ruslan},
booktitle={Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL)},
year={2019}
}