
Reflective LLaVA (ReflectiVA)

Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering

(CVPR 2025)


This repository contains the reference code for the paper Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering.

🎯 Project web page | Paper | 🤗 HuggingFace Model | 🤗 HuggingFace Dataset

Table of Contents

  1. Citation
  2. Overview
  3. Installation
  4. Model
  5. Dataset
  6. Training
  7. Knowledge Bases and Reproducibility
  8. Inference
  9. Acknowledgements

Citation

Please cite this work with the following BibTeX:

@inproceedings{cocchi2024augmenting,
  title={{Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering}},
  author={Cocchi, Federico and Moratelli, Nicholas and Cornia, Marcella and Baraldi, Lorenzo and Cucchiara, Rita},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2025}
}

Overview

Multimodal LLMs (MLLMs) are the natural extension of large language models to handle multimodal inputs, combining text and image data. They have recently garnered attention due to their capability to address complex tasks involving both modalities. However, their effectiveness is limited to the knowledge acquired during training, which restricts their practical utility. In this work, we introduce a novel method to enhance the adaptability of MLLMs by integrating external knowledge sources. Our proposed model, Reflective LLaVA (ReflectiVA), utilizes reflective tokens to dynamically determine the need for external knowledge and predict the relevance of information retrieved from an external database, ultimately enabling the MLLM to manage external knowledge while preserving fluency and performance on tasks where external knowledge is not needed.
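
To make the mechanism concrete, the sketch below shows how reflective tokens could gate retrieval at inference time. This is a minimal illustration under assumed names: the <RET>/<REL> token strings and the generate_token, generate_answer, and search helpers are hypothetical and do not reflect the repository's actual API.

def reflectiva_answer(mllm, image, question, knowledge_base, top_k=5):
    # 1) A reflective token first signals whether external knowledge is needed
    #    for this (image, question) pair. (Token name is an assumption.)
    if mllm.generate_token(image, question) != "<RET>":
        # No retrieval needed: answer directly, keeping standard MLLM behavior.
        return mllm.generate_answer(image, question, context=None)

    # 2) Query the external knowledge base for candidate passages.
    candidates = knowledge_base.search(image, question, top_k=top_k)

    # 3) A second reflective token labels each retrieved passage as relevant
    #    or noisy, so only useful evidence conditions the final answer.
    relevant = [p for p in candidates
                if mllm.generate_token(image, question, passage=p) == "<REL>"]

    return mllm.generate_answer(image, question, context=relevant)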

Installation

To create the conda environment named reflectiva, use the following commands. This environment provides all the packages needed to run the code in this repository.

conda create -n reflectiva python==3.8.16
conda activate reflectiva
pip install -r requirements.txt

Model

You can access the official model weights for the ReflectiVA model on 🤗 Hugging Face.

Dataset

The official training dataset can be accessed on 🤗 Hugging Face.

cd <data_local_path>
pip install huggingface_hub

python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='aimagelab/ReflectiVA-Data', repo_type='dataset', local_dir='<data_local_path>')"

Please note that the JSON file includes only the relative paths to the images. To access the actual images, you’ll need to download them from their original sources: Infoseek, Encyclopedic-VQA, and LLaVA.
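
Once the images are downloaded, one way to resolve the relative paths is sketched below. This is only an illustration: the field names (source, image), the file names, and the per-source root directories are assumptions and may differ from the actual JSON layout.

import json

# Hypothetical local roots for the downloaded Infoseek, Encyclopedic-VQA,
# and LLaVA images; adjust these to your own setup.
IMAGE_ROOTS = {
    "infoseek": "/data/infoseek/images",
    "encyclopedic": "/data/evqa/images",
    "llava": "/data/llava/images",
}

with open("reflectiva_train.json") as f:      # assumed annotation file name
    samples = json.load(f)

for sample in samples:
    source = sample.get("source", "llava")    # assumed field naming the image source
    sample["image"] = f"{IMAGE_ROOTS[source]}/{sample['image']}"

with open("reflectiva_train_abs.json", "w") as f:
    json.dump(samples, f)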

Data Infoseek

You can use this link to download the evaluation data for Infoseek.

Data Encyclopedic-VQA

You can find the evaluation data for Encyclopedic-VQA at this link. Additionally, the images used for evaluation can be extracted from this zip file.

Training

Before starting the training of ReflectiVA, make sure to set up the environment and download the dataset to your local machine. Additionally, update the absolute paths in the functions whose names start with fill_abs_path so that they point to the image locations on your machine. Once everything is set up, you can launch the training job with the following command:

cd ./ReflectiVA
bash scripts/train_reflectiva.sh

Knowledge Bases and Reproducibility

Our work relies on two main knowledge bases. To improve the reproducibility of our approach, we provide access to both knowledge bases and to the FAISS indexes built on them for the best configuration presented in the paper. Specifically, the embeddings are generated with the EVA-CLIP model.

For Infoseek, you can find the index and json file inside this zip file. Similarly, the index and json file for Encyclopedic-VQA are available here.
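
For reference, the snippet below sketches how one of these indexes could be queried once downloaded. It assumes the index stores L2-normalized EVA-CLIP image embeddings, that the JSON file is a list aligned with the index rows, and that EVA-CLIP is loaded through open_clip; the file names and the model tag are assumptions, not guaranteed to match the released files.

import json
import faiss
import torch
import open_clip
from PIL import Image

# Load the released FAISS index and its JSON mapping (file names are assumptions).
index = faiss.read_index("infoseek_kb.index")
with open("infoseek_kb.json") as f:
    kb_entries = json.load(f)  # assumed: list of KB entries aligned with index rows

# Load EVA-CLIP through open_clip (model tag/checkpoint are assumptions;
# use the same checkpoint as in the paper).
model, _, preprocess = open_clip.create_model_and_transforms(
    "EVA02-L-14", pretrained="merged2b_s4b_b131k")
model.eval()

# Embed the query image, normalize it, and retrieve the top-5 nearest KB entries.
image = preprocess(Image.open("query.jpg")).unsqueeze(0)
with torch.no_grad():
    query = model.encode_image(image)
    query = query / query.norm(dim=-1, keepdim=True)

scores, ids = index.search(query.cpu().numpy().astype("float32"), 5)
retrieved = [kb_entries[i] for i in ids[0]]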

Please refer to the paper for more information about the knowledge bases.

Inference

Before running inference, unzip the data and update the paths in the .sh files to match your local cluster setup and the files downloaded in the previous step.

Inference code for Infoseek:

sbatch scripts/ReflectiVA_infoseek.sh

Inference code for Encyclopedic-VQA:

sbatch scripts/ReflectiVA_evqa.sh

Acknowledgements

We acknowledge the CINECA award under the ISCRA initiative, for the availability of high-performance computing resources and support. This work has been conducted under a research grant co-funded by Altilia s.r.l., and supported by the PNRR-M4C2 project “FAIR - Future Artificial Intelligence Research”, funded by the European Commission, and by the PNRR project “Italian Strengthening of ESFRI RI Resilience” (ITSERR) funded by the European Union - NextGenerationEU (CUP B53C22001770006).

We are thankful to the authors of LLaVA and lmms-eval for releasing their models and code as open-source contributions.

Finally, we would also like to thank Davide Caffagni and Sara Sarto for their valuable support and insights.
