Open Ended Bengali Visual Question Answering

This repository is the official implementation of the paper BVQA: Connecting Language and Vision Through Multimodal Attention for Open-Ended Question Answering

Pipeline of BVQA


Abstract view of the proposed BVQA data generation method. The left block shows the context (e.g., the caption) used to prompt the LLM, and the right block shows the responses (i.e., QA pairs) generated by the LLM. Note that the image itself is not used to prompt the LLM; it is only shown here as a reference.
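As a rough illustration of this caption-only prompting setup, the sketch below shows how QA pairs might be requested and parsed. The prompt wording and the call_llm helper are assumptions for illustration; they are not the exact prompt or code used in the paper.

    # Hypothetical sketch: build a caption-only prompt for Bengali QA-pair generation.
    # `call_llm` stands in for whatever LLM API is used; it is not part of this repository.

    def build_prompt(caption: str, num_pairs: int = 3) -> str:
        """Compose a prompt that asks the LLM for open-ended Bengali QA pairs."""
        return (
            f"Context (image caption): {caption}\n"
            f"Generate {num_pairs} open-ended question-answer pairs in Bengali "
            "that can be answered from this caption alone.\n"
            "Format: Q: <question> | A: <answer>"
        )

    def generate_qa_pairs(caption: str, call_llm) -> list[tuple[str, str]]:
        """Parse the LLM response into (question, answer) tuples."""
        response = call_llm(build_prompt(caption))
        pairs = []
        for line in response.splitlines():
            if "Q:" in line and "| A:" in line:
                q, a = line.split("| A:", 1)
                pairs.append((q.replace("Q:", "").strip(), a.strip()))
        return pairs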


Multimodal Cross-Attention Network for Bengali-VQA


Overall framework of our proposed Bengali VQA model, MCRAN. The model takes an image and a question as input and generates embeddings with their corresponding encoders. The top block in the middle part produces two cross-modal attentive representations: ICAR (Image Weighted Cross-modal Representation) and TCAR (Text Weighted Cross-modal Representation). In contrast, the bottom block creates a token-level multimodal attentive representation (MMAR). Finally, our method fuses these three attentive knowledge vectors through a gating mechanism to obtain richer multimodal features.
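As a minimal sketch of the final gating step, the snippet below fuses three attentive vectors with a learned softmax gate. The hidden size, the exact gate form, and the class count are assumptions for illustration and do not reproduce the released mcran.py.

    import torch
    import torch.nn as nn

    class GatedFusion(nn.Module):
        """Illustrative sketch: fuse ICAR, TCAR, and MMAR with a learned gate.
        Dimensions and gating form are assumptions, not the released code."""

        def __init__(self, dim: int = 768, num_classes: int = 1000):
            super().__init__()
            self.gate = nn.Linear(3 * dim, 3)        # one gate weight per representation
            self.classifier = nn.Linear(dim, num_classes)

        def forward(self, icar, tcar, mmar):
            # icar, tcar, mmar: (batch, dim) attentive vectors from the two blocks.
            stacked = torch.stack([icar, tcar, mmar], dim=1)                      # (batch, 3, dim)
            weights = torch.softmax(self.gate(torch.cat([icar, tcar, mmar], dim=-1)), dim=-1)
            fused = (weights.unsqueeze(-1) * stacked).sum(dim=1)                  # (batch, dim)
            return self.classifier(fused)                                         # answer logits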

Instructions

  • To reproduce the results, you need Python 3.10.x. All models are implemented with PyTorch 2.4.0. MCRAN requires a GPU with 16 GB of memory.

  • Clone the repository (git clone <url>). Then create a conda environment and activate it.

    conda create -n Bengali-VQA python=3.10.12
    conda activate Bengali-VQA
    
  • Install all the dependencies.

    pip install -r requirements.txt
    

Dataset

The dataset can be downloaded from this link: BVQA-Dataset. The folder contains all the images and the Excel files for the training, validation, and test sets. Each Excel file has the following columns (a minimal loading sketch follows the list):

  • filename: image file name
  • questions: question asked about the image
  • answers: answer to the corresponding question
  • enc_answers: encoded version of the answer
  • category: category of the question (i.e., yes/no, counting, and other)
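The following sketch loads one split with pandas and inspects the columns listed above; the file name train.xlsx is an assumption about how the split files are named.

    import pandas as pd

    # Assumed split file name; adjust to the actual .xlsx names in the Dataset folder.
    df = pd.read_excel("Dataset/train.xlsx")

    # Columns described above: filename, questions, answers, enc_answers, category.
    print(df[["filename", "questions", "answers", "category"]].head())
    print(df["category"].value_counts())  # distribution over yes/no, counting, other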

You can also run the following command in the conda terminal to download the dataset.

bash download_dataset.sh

Ensure you follow the given folder organization.

Folder Organization

Folders need to be organized as follows in the Bengali-VQA directory.

├── Dataset
│   ├── Images
│   │   ├── .jpg
│   │   └── .png
│   └── .xlsx files
├── Scripts
│   ├── Ablation   # Folder
│   ├── Baselines  # Folder
│   └── mcran.py
└── requirements.txt

Training and Evaluation of MCRAN

Run the following commands to train MCRAN on the BVQA dataset. If you are not already in the Scripts folder, change into it first.

cd Scripts

python mcran.py \
  --nlayer 2 \
  --heads 6 \
  --learning_rate 1e-5 \
  --epochs 15

Arguments

  • --nlayer: Specifies the number of transformer layers to use (default: 1).
  • --heads: Sets the number of attention heads (default: 8).
  • --epochs: Specifies the number of training epochs (default: 10).
  • --learning_rate: Sets the learning rate (default: 1e-5).
  • --model_name: Specifies the saved model name (default: mcran.pth).

After evaluation, the script reports the accuracy for each question category: Yes/No, Number, Other, and All.
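For reference, category-wise accuracy can be computed along these lines; the column names below (category, answers, predictions) are illustrative and are not fixed by mcran.py.

    import pandas as pd

    def category_accuracy(df: pd.DataFrame) -> dict:
        """df is assumed to hold per-sample columns 'category', 'answers' (ground truth),
        and 'predictions' (model outputs); these names are assumptions for illustration."""
        scores = {}
        for cat, group in df.groupby("category"):
            scores[cat] = (group["predictions"] == group["answers"]).mean()
        scores["All"] = (df["predictions"] == df["answers"]).mean()
        return scores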

Ablation

The Ablation folder contains the following scripts:

  • layer_head_ablation.py: ablation over the number of transformer layers and attention heads.
  • img_encoder_ablation.py: ablation of different image encoders in MCRAN, tested with three encoders: ResNet, ConvNeXt, and EfficientNet.
  • txt_encoder_ablation.py: ablation of the text encoder in MCRAN, tested only with the multilingual DistilBERT model.

Baselines

The Baselines folder contains the following scripts:

  • initial_baselines.ipynb: implementations of baselines such as Vanilla VQA, MFB, MFH, and TDA, all implemented using the TensorFlow framework.
  • hauvqa.py: implementation of the VQA model developed for the Hausa language. [Paper'23]
  • medvqa.py: implementation of an attention-based model developed for medical VQA. [Paper'23]
  • vgclf.py: another attention-based model developed for medical VQA. [Paper'25]
  • mclip.py: implementation of a fine-tuned multilingual CLIP model. [Paper'23]

Case Study


Illustration of some case studies from the test set of BVQA where the proposed MCRAN performs well. First, we present the actual answer to a question, followed by the answers predicted by the state-of-the-art methods and our proposed method. The red cross mark denotes an incorrect prediction, while the green tick mark denotes a correct prediction.

Citation

If you find our work useful for your research and applications, please cite it using this BibTeX:

@article{bhuyan2025bvqa,
  title={BVQA: Connecting Language and Vision Through Multimodal Attention for Open-Ended Question Answering},
  author={Bhuyan, Md Shalha Mucha and Hossain, Eftekhar and Sathi, Khaleda Akhter and Hossain, Md Azad and Dewan, M Ali Akber},
  journal={IEEE Access},
  year={2025},
  publisher={IEEE}
}