This is the official repository for our CVPR paper: Cross-modal Information Flow in Multimodal Large Language Models
1. `git clone https://github.com/FightingFighting/cross-modal-information-flow-in-MLLM.git`
2. `cd cross-modal-information-flow-in-MLLM`
3. Please follow [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT) to install the environment `llava`.

After installing the `llava` environment, you will find an `LLaVA-NeXT` folder inside `cross-modal-information-flow-in-MLLM`.
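For reference, here is a minimal sketch of that installation, assuming the conda-based setup described in the LLaVA-NeXT README (treat that repository's instructions as authoritative):

```bash
# A sketch following the LLaVA-NeXT README; run from inside
# cross-modal-information-flow-in-MLLM so the LLaVA-NeXT folder ends up there.
git clone https://github.com/LLaVA-VL/LLaVA-NeXT.git
cd LLaVA-NeXT
conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip   # enable PEP 660 support
pip install -e ".[train]"
```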
Our datasets are collected from GQA and are provided in the `datasets` folder. For the images, please download them from here.
- Open `scripts/informationFlow.sh`.
- Settings (an example configuration is sketched after the option lists below):
  - `current_window`: how many layers you want to block for the attention knockout at a time;
  - `current_block_desc`: which kind of information flow you want to block;
  - `model_path`: which model you want to explore;
  - `convmode`: the conversation mode; different models use different `convmode` values;
  - `dataset`: which task you want to explore;
  - `imagefolder`: the image folder.
- Run `sbatch scripts/informationFlow.sh`.
`current_block_desc` can be chosen from:

- `"Question->Last"`
- `"Image->Last"`
- `"Image->Question"`
- `"Last->Last"`
- `"Image Central Object->Question"`
- `"Image Without Central Object->Question"`
`model_path` and `convmode` can be chosen from:

- `model_path="liuhaotian/llava-v1.6-vicuna-7b"` with `convmode="vicuna_v1"`
- `model_path="lmms-lab/llama3-llava-next-8b"` with `convmode="llava_llama_3"`
- `model_path="liuhaotian/llava-v1.5-7b"` with `convmode="vicuna_v1"`
- `model_path="liuhaotian/llava-v1.5-13b"` with `convmode="vicuna_v1"`
`dataset` can be chosen from:

- `datasets/GQA_val_correct_question_with_choose_ChooseAttr.csv`
- `datasets/GQA_val_correct_question_with_positionQuery_QueryAttr.csv`
- `datasets/GQA_val_correct_question_with_existThatOr_LogicalObj.csv`
- `datasets/GQA_val_correct_question_with_twoCommon_CompareAttr.csv`
- `datasets/GQA_val_correct_question_with_relChooser_ChooseRel.csv`
- `datasets/GQA_val_correct_question_with_categoryThatThisChoose_objThisChoose_ChooseCat.csv`
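For illustration, a sketch of one valid combination of the settings above, edited directly inside `scripts/informationFlow.sh` (the `current_window` value and the image folder path are hypothetical placeholders):

```bash
# Illustrative settings inside scripts/informationFlow.sh
current_window=9                    # hypothetical: block 9 layers at a time
current_block_desc="Image->Question"
model_path="liuhaotian/llava-v1.5-7b"
convmode="vicuna_v1"
dataset="datasets/GQA_val_correct_question_with_choose_ChooseAttr.csv"
imagefolder="/path/to/gqa/images"   # hypothetical: point to your downloaded GQA images
```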
- Open `scripts/last_position_answer_prob.sh`.
- Settings (same meaning as above; an illustrative configuration follows this list):
  - `model_path`: which model you want to explore;
  - `convmode`: the conversation mode; different models use different `convmode` values;
  - `dataset`: which task you want to explore;
  - `imagefolder`: the image folder.
- Run `sbatch scripts/last_position_answer_prob.sh`.
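As above, a sketch of one valid combination for this script (the image folder path is a hypothetical placeholder):

```bash
# Illustrative settings inside scripts/last_position_answer_prob.sh
model_path="lmms-lab/llama3-llava-next-8b"
convmode="llava_llama_3"
dataset="datasets/GQA_val_correct_question_with_twoCommon_CompareAttr.csv"
imagefolder="/path/to/gqa/images"   # hypothetical: point to your downloaded GQA images
```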
If you want to merge several lines into one figure, you can run `python vil/merge_lineplot.py`. For example, if you already have the results for the information flows `Question->Last`, `Image->Last`, and `Last->Last`, and you want to merge these three lines into one figure, you can run `python vil/merge_lineplot.py`, as in the sketch below.
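A sketch of that workflow, assuming one job submission per information flow with `current_block_desc` edited in `scripts/informationFlow.sh` between submissions:

```bash
# Run the attention knockout once per information flow, editing
# current_block_desc in scripts/informationFlow.sh before each submission.
sbatch scripts/informationFlow.sh   # with current_block_desc="Question->Last"
sbatch scripts/informationFlow.sh   # with current_block_desc="Image->Last"
sbatch scripts/informationFlow.sh   # with current_block_desc="Last->Last"
# Once all jobs have finished, merge the three lines into a single figure:
python vil/merge_lineplot.py
```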
If this project is helpful for you, please cite our paper:
@article{zhang2024cross,
title={Cross-modal Information Flow in Multimodal Large Language Models},
author={Zhang, Zhi and Yadav, Srishti and Han, Fengze and Shutova, Ekaterina},
journal={arXiv preprint arXiv:2411.18620},
year={2024}
}
The code is built upon https://github.com/google-research/google-research/tree/master/dissecting_factual_predictions and LLaVA.
The datasets we use are collected from GQA.