Open Ended Bengali Visual Question Answering

This repository is the official implementation of the paper BVQA: Connecting Language and Vision Through Multimodal Attention for Open-Ended Question Answering

Pipeline of BVQA


Abstract view of the proposed BVQA data generation method. The left block shows the context (e.g., the caption) used to prompt the LLM, and the right block shows the responses (i.e., QA pairs) generated by the LLM. Note that the image itself is not used to prompt the LLM; it is only shown here as a reference.
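As a rough illustration of this caption-only prompting setup, the sketch below shows how QA pairs might be requested and parsed. The prompt wording and the call_llm helper are assumptions for illustration; they are not the exact prompt or code used in the paper.

    # Hypothetical sketch: build a caption-only prompt for Bengali QA-pair generation.
    # `call_llm` stands in for whatever LLM API is used; it is not part of this repository.

    def build_prompt(caption: str, num_pairs: int = 3) -> str:
        """Compose a prompt that asks the LLM for open-ended Bengali QA pairs."""
        return (
            f"Context (image caption): {caption}\n"
            f"Generate {num_pairs} open-ended question-answer pairs in Bengali "
            "that can be answered from this caption alone.\n"
            "Format: Q: <question> | A: <answer>"
        )

    def generate_qa_pairs(caption: str, call_llm) -> list[tuple[str, str]]:
        """Parse the LLM response into (question, answer) tuples."""
        response = call_llm(build_prompt(caption))
        pairs = []
        for line in response.splitlines():
            if "Q:" in line and "| A:" in line:
                q, a = line.split("| A:", 1)
                pairs.append((q.replace("Q:", "").strip(), a.strip()))
        return pairs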


Multimodal Cross-Attention Network for Bengali-VQA


Overall framework of our proposed Bengali VQA model, MCRAN. The model takes an image and a question as input and generates embeddings with their corresponding encoders. The top block in the middle part produces two cross-modal attentive representations: ICAR (Image Weighted Cross-modal Representation) and TCAR (Text Weighted Cross-modal Representation). In contrast, the bottom block creates a token-level multimodal attentive representation (MMAR). Finally, our method fuses these three attentive knowledge vectors through a gating mechanism to obtain richer multimodal features.
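As a minimal sketch of the final gating step, the snippet below fuses three attentive vectors with a learned softmax gate. The hidden size, the exact gate form, and the class count are assumptions for illustration and do not reproduce the released mcran.py.

    import torch
    import torch.nn as nn

    class GatedFusion(nn.Module):
        """Illustrative sketch: fuse ICAR, TCAR, and MMAR with a learned gate.
        Dimensions and gating form are assumptions, not the released code."""

        def __init__(self, dim: int = 768, num_classes: int = 1000):
            super().__init__()
            self.gate = nn.Linear(3 * dim, 3)        # one gate weight per representation
            self.classifier = nn.Linear(dim, num_classes)

        def forward(self, icar, tcar, mmar):
            # icar, tcar, mmar: (batch, dim) attentive vectors from the two blocks.
            stacked = torch.stack([icar, tcar, mmar], dim=1)                      # (batch, 3, dim)
            weights = torch.softmax(self.gate(torch.cat([icar, tcar, mmar], dim=-1)), dim=-1)
            fused = (weights.unsqueeze(-1) * stacked).sum(dim=1)                  # (batch, dim)
            return self.classifier(fused)                                         # answer logits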

Instructions

  • To reproduce the results, you need Python 3.10.x. All models are implemented with PyTorch 2.4.0. MCRAN requires a GPU with 16 GB of memory.

  • Clone the repository (git clone <url>). Then create a conda environment and activate it.

    conda create -n Bengali-VQA python=3.10.12
    conda activate Bengali-VQA
    
  • Install all the dependencies.

    pip install -r requirements.txt
    

Dataset

The dataset can be downloaded from this link: BVQA-Dataset. The folder contains all the images and the Excel files for the training, validation, and test sets. Each Excel file has the following columns (a minimal loading sketch follows the list):

  • filename: image file name
  • questions: question asked about the image
  • answers: answer to the corresponding question
  • enc_answers: encoded version of the answer
  • category: category of the question (i.e., yes/no, counting, and other)
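The following sketch loads one split with pandas and inspects the columns listed above; the file name train.xlsx is an assumption about how the split files are named.

    import pandas as pd

    # Assumed split file name; adjust to the actual .xlsx names in the Dataset folder.
    df = pd.read_excel("Dataset/train.xlsx")

    # Columns described above: filename, questions, answers, enc_answers, category.
    print(df[["filename", "questions", "answers", "category"]].head())
    print(df["category"].value_counts())  # distribution over yes/no, counting, other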

You can also run the following command in the conda terminal to download the dataset.

bash download_dataset.sh

Ensure you follow the given folder organization.

Folder Organization

Folders need to be organized as follows in the Bengali-VQA directory.

├── Dataset
│   ├── Images
│   │   ├── .jpg
│   │   └── .png
│   └── .xlsx files
├── Scripts
│   ├── Ablation   # Folder
│   ├── Baselines  # Folder
│   └── mcran.py
└── requirements.txt

Training and Evaluation of MCRAN

Run the following commands to train MCRAN on the BVQA dataset. If you are not already in the Scripts folder, change into it first.

cd Scripts

python mcran.py \
  --nlayer 2 \
  --heads 6 \
  --learning_rate 1e-5 \
  --epochs 15

Arguments

  • --nlayer: Specifies the number of transformer layers to use (default: 1).
  • --heads: Sets the number of attention heads (default: 8).
  • --epochs: Specifies the number of training epochs (default: 10).
  • --learning_rate: Sets the learning rate (default: 1e-5).
  • --model_name: Specifies the saved model name (default: mcran.pth).

After evaluation, the script reports the accuracy for each question category: Yes/No, Number, Other, and All.
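For reference, category-wise accuracy can be computed along these lines; the column names below (category, answers, predictions) are illustrative and are not fixed by mcran.py.

    import pandas as pd

    def category_accuracy(df: pd.DataFrame) -> dict:
        """df is assumed to hold per-sample columns 'category', 'answers' (ground truth),
        and 'predictions' (model outputs); these names are assumptions for illustration."""
        scores = {}
        for cat, group in df.groupby("category"):
            scores[cat] = (group["predictions"] == group["answers"]).mean()
        scores["All"] = (df["predictions"] == df["answers"]).mean()
        return scores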

Ablation

The Ablation folder contains the following scripts:

  • layer_head_ablation.py: ablation over the number of transformer layers and attention heads.
  • img_encoder_ablation.py: ablation of different image encoders in MCRAN, tested with three encoders: ResNet, ConvNeXt, and EfficientNet.
  • txt_encoder_ablation.py: ablation of the text encoder in MCRAN, tested only with the multilingual DistilBERT model.

Baselines

The Baselines folder contains the following scripts:

  • initial_baselines.ipynb: implementations of baselines such as Vanilla VQA, MFB, MFH, and TDA, all implemented using the TensorFlow framework.
  • hauvqa.py: implementation of the VQA model developed for the Hausa language. [Paper'23]
  • medvqa.py: implementation of an attention-based model developed for medical VQA. [Paper'23]
  • vgclf.py: another attention-based model developed for medical VQA. [Paper'25]
  • mclip.py: implementation of a fine-tuned multilingual CLIP model. [Paper'23]

Case Study


Illustration of some case studies from the test set of BVQA where the proposed MCRAN performs well. First, we present the actual answer to a question, followed by the answers predicted by the state-of-the-art methods and our proposed method. The red cross mark denotes an incorrect prediction, while the green tick mark denotes a correct prediction.

Citation

If you find our work useful for your research and applications, please cite it using this BibTeX:

@article{bhuyan2025bvqa,
  title={BVQA: Connecting Language and Vision Through Multimodal Attention for Open-Ended Question Answering},
  author={Bhuyan, Md Shalha Mucha and Hossain, Eftekhar and Sathi, Khaleda Akhter and Hossain, Md Azad and Dewan, M Ali Akber},
  journal={IEEE Access},
  year={2025},
  publisher={IEEE}
}