Multimodal joint embedding model for image/video, text, audio, depth, IMU, and thermal images. Given an input from any of the six modalities, the model returns an embedding of the same size, which can be used for cross-modal and multimodal tasks.
- Developed by: Meta AI
- Model type: Multimodal model
- Language(s) (NLP): en
- License: CC BY-NC-SA 4.0
- Resources for more information:
This model is intended only for research purposes. It provides a joint embedding space for different modalities -- image/video, text, audio, depth, IMU and thermal images. We hope that these joint embeddings can be used for a variety of cross-modal research tasks, e.g., cross-modal retrieval and combining embeddings from different modalities.
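To make the two uses named above concrete, here is a minimal sketch of cross-modal retrieval and of combining embeddings in a shared space. It uses made-up 3-dimensional toy vectors rather than real model outputs, and the helper names (`retrieve`, `combine`) are hypothetical, not part of the released code. It assumes that similarity is measured by cosine similarity and that embeddings are combined by summing their normalized vectors.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def retrieve(query, gallery):
    """Return gallery keys sorted from most to least similar to the query."""
    return sorted(gallery, key=lambda k: cosine_similarity(query, gallery[k]), reverse=True)

def combine(a, b):
    """Combine two embeddings by summing their normalized vectors (an assumption)."""
    a, b = l2_normalize(a), l2_normalize(b)
    return l2_normalize([x + y for x, y in zip(a, b)])

# Toy vectors standing in for real embeddings (illustrative values only).
image_gallery = {
    "dog.jpg": [0.9, 0.1, 0.0],
    "beach.jpg": [0.0, 0.2, 0.9],
    "dog_on_beach.jpg": [0.6, 0.1, 0.6],
}
audio_query = [1.0, 0.0, 0.1]  # stands in for the embedding of a barking sound
text_query = [0.0, 0.1, 1.0]   # stands in for the embedding of the text "a beach"

# Audio-to-image retrieval: rank images by similarity to the audio embedding.
ranking_audio = retrieve(audio_query, image_gallery)

# Combined query: audio + text embeddings, then retrieve images.
ranking_combined = retrieve(combine(audio_query, text_query), image_gallery)

print(ranking_audio)
print(ranking_combined)
```

Because every modality maps to a vector of the same size, the same `retrieve` function works regardless of which modality produced the query or the gallery.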
This model is NOT intended to be used in any real-world application -- commercial or otherwise. It may produce harmful associations with different inputs. The model needs to be investigated and likely re-trained on specific data for any such application. The model is expected to work better on web-based visual data since it was trained on such data. The text encoder is likely to work only on English-language text because of the underlying training datasets.
Open-domain joint embedding models are prone to producing specific biases, as documented in studies of CLIP. Since our model uses such models as initialization, it will exhibit such biases too. Moreover, for learning joint embeddings for other modalities such as audio, thermal, depth, and IMU, we leverage datasets that are relatively small. These joint embeddings are thus limited to the concepts present in the datasets. For example, the thermal datasets we used are limited to outdoor street scenes, while the depth datasets are limited to indoor scenes.
ImageBind uses image-paired data for training -- (image, X) where X is one of text, audio, depth, IMU or thermal data. In particular, we initialize and freeze the image and text encoders using an OpenCLIP ViT-H encoder. We train audio embeddings using AudioSet, depth embeddings using the SUN RGB-D dataset, IMU embeddings using the Ego4D dataset and thermal embeddings using the LLVIP dataset. We provide the exact training data details in the paper.
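As an illustration of how (image, X) pairs can align a new modality to a fixed image embedding, the sketch below computes a symmetric contrastive loss over a batch of paired embeddings. This card does not state the training objective, so the InfoNCE-style loss here is an assumption, as are the temperature value of 0.07 and the function names; consult the paper for the actual objective and hyperparameters.

```python
import math

def info_nce(anchor_embs, target_embs, temperature=0.07):
    """One-directional InfoNCE over a batch of paired, L2-normalized embeddings.

    Row i of anchor_embs is paired with row i of target_embs; every other
    row in the batch serves as a negative. The temperature is an assumed value.
    """
    n = len(anchor_embs)
    total = 0.0
    for i in range(n):
        logits = [
            sum(a * b for a, b in zip(anchor_embs[i], target_embs[j])) / temperature
            for j in range(n)
        ]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / n

def symmetric_info_nce(image_embs, other_embs, temperature=0.07):
    """Sum of the image-to-X and X-to-image losses."""
    return (info_nce(image_embs, other_embs, temperature)
            + info_nce(other_embs, image_embs, temperature))

# Toy batch: image embeddings (from a frozen encoder) and embeddings of modality X.
image_batch = [[1.0, 0.0], [0.0, 1.0]]
aligned_x = [[1.0, 0.0], [0.0, 1.0]]     # each X embedding matches its image
misaligned_x = [[0.0, 1.0], [1.0, 0.0]]  # each X embedding matches the wrong image

loss_aligned = symmetric_info_nce(image_batch, aligned_x)
loss_misaligned = symmetric_info_nce(image_batch, misaligned_x)
print(loss_aligned, loss_misaligned)
```

With the image encoder frozen, as described above, only the encoder for modality X would receive gradient updates from such a loss, which pulls the X embeddings toward the existing image embedding space.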
Please refer to the research paper and GitHub repo for exact details on this.
We evaluate the model on a variety of classification benchmarks for each modality. The evaluation details are presented in the paper. The model's performance is measured using standard classification metrics such as accuracy and mAP.
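For readers unfamiliar with the two metrics named above, the following is a minimal sketch of how accuracy and mean average precision (mAP) are commonly computed. It is not the evaluation code used for this model; the function names and the toy inputs are illustrative assumptions, and benchmark-specific protocols may differ in their details.

```python
def accuracy(predictions, labels):
    """Fraction of samples whose predicted class equals the true class."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)

def average_precision(scores, relevant):
    """Average precision for one class.

    scores:   per-sample confidence for this class
    relevant: 1 if the sample truly belongs to the class, else 0
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if relevant[i]:
            hits += 1
            precision_sum += hits / rank  # precision at this recall point
    return precision_sum / hits if hits else 0.0

def mean_average_precision(scores_per_class, relevant_per_class):
    """Mean of the per-class average precision values."""
    aps = [average_precision(s, r) for s, r in zip(scores_per_class, relevant_per_class)]
    return sum(aps) / len(aps)

# Toy inputs (illustrative values only).
acc = accuracy([0, 1, 2, 1], [0, 1, 1, 1])
ap = average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
m_ap = mean_average_precision(
    [[0.9, 0.8, 0.7, 0.6], [0.1, 0.9]],
    [[1, 0, 1, 0], [0, 1]],
)
print(acc, ap, m_ap)
```

Accuracy suits single-label benchmarks, while mAP is the usual choice when a sample can carry several labels at once.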
BibTeX:
@inproceedings{girdhar2023imagebind,
title={ImageBind: One Embedding Space To Bind Them All},
author={Girdhar, Rohit and El-Nouby, Alaaeldin and Liu, Zhuang
and Singh, Mannat and Alwala, Kalyan Vasudev and Joulin, Armand and Misra, Ishan},
booktitle={CVPR},
year={2023}
}
Please reach out to the authors at: rgirdhar@meta.com imisra@meta.com alaaelnouby@gmail.com
Our GitHub repo provides a simple example of extracting embeddings from images, audio, and other modalities.