This repository is the official implementation of the paper BVQA: Connecting Language and Vision Through Multimodal Attention for Open-Ended Question Answering
Abstract view of the proposed BVQA data generation method. The left block shows the context, such as the caption used to prompt the LLM, and the right block shows the responses (i.e., QA pairs) generated by the LLM. Note that the image itself is not used to prompt the LLM; it is shown here only as a reference.
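For illustration, the caption-to-prompt step described above might look like the sketch below. This is a minimal sketch under an assumed prompt wording; `build_prompt` and `llm_generate` are hypothetical names, and the actual prompt template and LLM interface used for BVQA may differ.

```python
# Minimal sketch of caption-driven QA generation (illustrative only).
# The prompt wording and the `llm_generate` callable are assumptions,
# not the exact template or API used to build BVQA.

def build_prompt(caption: str, num_pairs: int = 3) -> str:
    """Wrap an image caption in an instruction asking an LLM for QA pairs."""
    return (
        "You are given the caption of an image. Without seeing the image, "
        f"generate {num_pairs} question-answer pairs grounded in the caption.\n"
        f"Caption: {caption}\n"
        "Format each pair as 'Q: ... / A: ...'."
    )

def generate_qa_pairs(caption: str, llm_generate) -> str:
    """`llm_generate` is a placeholder for any text-generation callable."""
    return llm_generate(build_prompt(caption))
```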
Overall framework of our proposed Bengali VQA model, MCRAN. The model takes an image and a question as input and generates embeddings with their corresponding encoders. The top block in the middle part produces two cross-modal attentive representations: ICAR (Image-Weighted Cross-modal Representation) and TCAR (Text-Weighted Cross-modal Representation). In contrast, the bottom block creates a token-level multimodal attentive representation (MMAR). Finally, our method fuses these three attentive knowledge vectors through a gating mechanism to obtain richer multimodal features.
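The gating step at the end of this pipeline can be pictured with a short sketch. The module below only illustrates gated fusion over the three representations named above (ICAR, TCAR, MMAR); the hidden size and the exact gate design are assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Illustrative gating over three attentive representations (ICAR, TCAR, MMAR).
    The hidden size and gating formulation are assumptions, not MCRAN's exact design."""

    def __init__(self, dim: int = 768):
        super().__init__()
        # One scalar gate per representation, conditioned on their concatenation.
        self.gate = nn.Linear(3 * dim, 3)

    def forward(self, icar, tcar, mmar):
        # icar, tcar, mmar: (batch, dim) pooled attentive representations
        stacked = torch.stack([icar, tcar, mmar], dim=1)                 # (batch, 3, dim)
        weights = torch.softmax(self.gate(torch.cat([icar, tcar, mmar], dim=-1)), dim=-1)
        fused = (weights.unsqueeze(-1) * stacked).sum(dim=1)             # (batch, dim)
        return fused
```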
- To reproduce the results, you need to install `Python=3.10.x`. All the models are implemented using `Pytorch=2.4.0`. `MCRAN` requires a GPU with 16 GB of RAM.
- If you use any IDE, first clone the repository (`git clone <url>`). Then, create a virtual environment and activate it.

  ```
  conda create -n Bengali-VQA Python=3.10.12
  conda activate Bengali-VQA
  ```

- Install all the dependencies.

  ```
  pip install -r requirements.txt
  ```
The dataset can be downloaded from this link: BVQA-Dataset. The folder contains all the images and the Excel files for the training, validation, and test sets. The Excel files have the following columns (a loading sketch follows the list):

- `filename`: image names
- `questions`: question for the image
- `answers`: answer for the corresponding question
- `enc_answers`: encoded version of the answers
- `category`: category of the question (i.e., yes/no, counting, and other)
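A quick way to inspect a split, assuming the Excel files sit under `Dataset/` and the columns match the list above (the file name `train.xlsx` is an assumption):

```python
import pandas as pd

# File name is an assumption; point this at the actual training split in Dataset/.
train_df = pd.read_excel("Dataset/train.xlsx")

print(train_df[["filename", "questions", "answers", "enc_answers", "category"]].head())
print(train_df["category"].value_counts())  # distribution over yes/no, counting, other
```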
You can also run the following command in the conda terminal to download the dataset:

```
bash download_dataset.sh
```
Ensure you follow the folder organization given below. Folders must be organized as follows in the `Bengali-VQA` directory.
```
├── Dataset
|   ├── Images
|   |   ├── .jpg
|   |   └── .png
|   └── .xlsx files
├── Scripts
|   ├── Ablation        # Folder
|   ├── Baselines       # Folder
|   └── mcran.py
└── requirements.txt
```
Run the following command to train MCRAN on the `BVQA` dataset. If you are not in the `Scripts` folder, change into it first.

```
cd Scripts
python mcran.py \
    --nlayer 2 \
    --heads 6 \
    --learning_rate 1e-5 \
    --epochs 15
```
Arguments (a parser sketch based on these defaults follows the list):

- `--nlayer`: Specifies the number of transformer layers to use (default: `1`).
- `--heads`: Sets the number of attention heads (default: `8`).
- `--epochs`: Specifies the number of training epochs (default: `10`).
- `--learning_rate`: Sets the learning rate (default: `1e-5`).
- `--model_name`: Specifies the saved model name (default: `mcran.pth`).
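For reference, the flags above map onto an argument parser roughly like the sketch below. This is inferred from the documented defaults and is not the actual contents of `mcran.py`.

```python
import argparse

def get_args():
    # Defaults mirror the values documented above; the real mcran.py may expose more options.
    parser = argparse.ArgumentParser(description="Train MCRAN on the BVQA dataset")
    parser.add_argument("--nlayer", type=int, default=1, help="number of transformer layers")
    parser.add_argument("--heads", type=int, default=8, help="number of attention heads")
    parser.add_argument("--epochs", type=int, default=10, help="number of training epochs")
    parser.add_argument("--learning_rate", type=float, default=1e-5, help="learning rate")
    parser.add_argument("--model_name", type=str, default="mcran.pth", help="saved model name")
    return parser.parse_args()
```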
After evaluation, the script reports the accuracy for each question category: Yes/No, Number, Other, and All.
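The per-category breakdown can be computed from a table of predictions along these lines; the `prediction` column name is an assumption, while `answers` and `category` follow the dataset columns listed earlier.

```python
import pandas as pd

def category_accuracy(results: pd.DataFrame) -> pd.Series:
    """results: one row per test question with ground-truth 'answers', the question
    'category', and a model 'prediction' column (an assumed column name)."""
    correct = results["answers"] == results["prediction"]
    per_category = correct.groupby(results["category"]).mean()  # accuracy per category
    per_category["All"] = correct.mean()                        # overall accuracy
    return per_category
```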
The `Ablation` folder contains the following scripts:

- `layer_head_ablation.py`: ablation of the number of transformer layers and attention heads.
- `img_encoder_ablation.py`: ablation of different image encoders in MCRAN. Tested with three encoders: `ResNet`, `ConvNext`, and `EfficientNet` (see the backbone sketch after this list).
- `txt_encoder_ablation.py`: ablation of the text encoder in MCRAN. Only tested with the `Multilingual-DistillBERT` model.
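Swapping the visual backbone, as the image-encoder ablation does, can be sketched with `torchvision`. The specific model variants and the way the classification heads are removed below are assumptions, not necessarily what `img_encoder_ablation.py` does.

```python
import torch.nn as nn
from torchvision import models

def build_image_encoder(name: str = "resnet") -> nn.Module:
    """Return a feature extractor; the backbone choices mirror the ablation
    (ResNet, ConvNext, EfficientNet), but the exact variants are assumptions."""
    if name == "resnet":
        backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        backbone.fc = nn.Identity()                  # drop the classification head
    elif name == "convnext":
        backbone = models.convnext_tiny(weights=models.ConvNeXt_Tiny_Weights.DEFAULT)
        backbone.classifier[-1] = nn.Identity()      # keep pooling/flatten, drop the linear head
    elif name == "efficientnet":
        backbone = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.DEFAULT)
        backbone.classifier = nn.Identity()          # drop dropout + linear head
    else:
        raise ValueError(f"Unknown encoder: {name}")
    return backbone
```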
The `Baselines` folder contains the following scripts:

- `initial_baselines.ipynb`: contains the implementation of baselines such as Vanilla VQA, MFB, MFH, and TDA. All of them were implemented using the TensorFlow framework.
- `hauvqa.py`: implementation of the VQA model developed for the Hausa language. [Paper'23]
- `medvqa.py`: implementation of an attention-based model developed for medical VQA. [Paper'23]
- `vgclf.py`: another attention-based model developed for medical VQA. [Paper'25]
- `mclip.py`: implementation of a fine-tuned multilingual CLIP model. [Paper'23]
Illustration of some case studies from the BVQA test set where the proposed MCRAN performs well. For each question, we first show the actual answer, followed by the answers predicted by the state-of-the-art methods and our proposed method. A red cross mark denotes a false prediction, while a green tick mark denotes a correct prediction.
If you find our work useful for your research and applications, please cite it using this BibTeX:
```bibtex
@article{bhuyan2025bvqa,
  title={BVQA: Connecting Language and Vision Through Multimodal Attention for Open-Ended Question Answering},
  author={Bhuyan, Md Shalha Mucha and Hossain, Eftekhar and Sathi, Khaleda Akhter and Hossain, Md Azad and Dewan, M Ali Akber},
  journal={IEEE Access},
  year={2025},
  publisher={IEEE}
}
```