
FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Visual Language Models

🌐 Project Page   |   📑 arXiv

FrameFusion reduces the number of tokens in Large Vision-Language Models (LVLMs) by combining similarity-based merging with importance-based pruning. It achieves a 70% vision token reduction, 3.4–4.4× LLM speedups, and 1.6–1.9× end-to-end speedups with minimal performance impact.
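At a glance, FrameFusion first merges highly similar successive vision tokens and then prunes the remaining tokens by importance. The snippet below is a toy, self-contained sketch of that two-step idea on random tensors; it is not the repository's implementation, and the function name reduce_tokens_sketch, the averaging-based merge, and the importance scores are illustrative assumptions only (see framefusion/main.py for the actual algorithm).

import torch
import torch.nn.functional as F

def reduce_tokens_sketch(tokens, importance,
                         similarity_lower_bound=0.6, keep_ratio=0.3):
    """Toy illustration of similarity-based merging + importance-based pruning.

    tokens:     (N, D) vision-token hidden states
    importance: (N,)   per-token importance scores (e.g. attention from text tokens)
    """
    # 1) Similarity-based merging: fold a token into its predecessor when their
    #    cosine similarity exceeds the threshold.
    normed = F.normalize(tokens, dim=-1)
    sim_to_prev = (normed[1:] * normed[:-1]).sum(dim=-1)            # (N-1,)
    merge_mask = torch.cat([torch.zeros(1, dtype=torch.bool),
                            sim_to_prev > similarity_lower_bound])  # (N,)

    kept_tokens, kept_importance = [], []
    for tok, imp, merge in zip(tokens, importance, merge_mask):
        if merge and kept_tokens:
            kept_tokens[-1] = (kept_tokens[-1] + tok) / 2           # average into previous token
            kept_importance[-1] = torch.maximum(kept_importance[-1], imp)
        else:
            kept_tokens.append(tok)
            kept_importance.append(imp)
    kept_tokens = torch.stack(kept_tokens)
    kept_importance = torch.stack(kept_importance)

    # 2) Importance-based pruning: keep only the top-k most important survivors.
    k = min(len(kept_tokens), max(1, int(keep_ratio * len(tokens))))
    keep_idx = kept_importance.topk(k).indices.sort().values
    return kept_tokens[keep_idx]

# Example: 8 random "vision tokens" with random importance scores
tokens = torch.randn(8, 16)
importance = torch.rand(8)
print(reduce_tokens_sketch(tokens, importance).shape)
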

Demo video: demo.mp4

This demo can be reproduced with script/demo/llava_video_compare.py.

Feel free to star the repo or cite the paper if you find it interesting.

@article{fu2024framefusion,
  title={FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Visual Language Models},
  author={Fu, Tianyu and Liu, Tengxuan and Han, Qinghao and Dai, Guohao and Yan, Shengen and Yang, Huazhong and Ning, Xuefei and Wang, Yu},
  journal={arXiv preprint arXiv:2501.01986},
  year={2024}
}

News

  • [2025/08] Updated the webpage; check out our interactive demos on the project page

  • [2025/06] Our paper was accepted to ICCV'25

  • [2025/05] Added support for Qwen2-VL and InternVL2.5

  • [2025/04] Added support for the NVILA model family

Environment Setup

General

Create a new environment:

conda create -n framefusion python=3.10
conda activate framefusion

Install FrameFusion:

pip install -e .

Working with Other Models

Important: the NVILA and Llava-Video codebases conflict with each other. FrameFusion supports both, but please install only one of them in a given environment to avoid dependency conflicts.

Llava-Video

To install Llava-Video LVLM dependencies:

  1. Clone the LLaVA-NeXT repository:
    git clone https://github.com/LLaVA-VL/LLaVA-NeXT.git
    cd LLaVA-NeXT
  2. Install via:
    pip install -e .[llava-video]

NVILA

To install NVILA dependencies:

  1. Clone the VILA repository:
    git clone https://github.com/NVlabs/VILA.git
    cd VILA
  2. Run the environment setup script to install dependencies into the current conda environment:
    ./environment_setup.sh
  3. Install via:
    pip install -e .

Qwen2-VL

After the standard installation, install the Qwen2-VL extra:

pip install -e .[qwen2-vl]

Then reinstall transformers==4.51.3 to ensure version compatibility. For all other models, continue using transformers==4.45.2.

How to

Run an example

We provide an example that runs inference on a video with the LLaVA-Video-7B model, with or without FrameFusion, in script/playground/example_llava.py.

python script/playground/example_llava.py

Apply FrameFusion

You can apply FrameFusion in your own code to any Hugging Face model supported by the interface, with just a few lines. Here is an example:

from llava.model.builder import load_pretrained_model
from framefusion.interface import apply_framefusion

# set attn_implementation to be sdpa
tokenizer, model, image_processor, max_length = load_pretrained_model("lmms-lab/LLaVA-Video-7B-Qwen2", None, "llava_qwen", torch_dtype="bfloat16", attn_implementation='sdpa', device_map="auto")

# apply FrameFusion
apply_framefusion(model, cost=0.3, similarity_lower_bound=0.6, ratio_lower_bound=0.1)

# use the model as usual

Evaluate FrameFusion

We use lmms-eval to evaluate FrameFusion. To apply FrameFusion, clone the official lmms-eval repository, install it from source, and insert the following lines into evaluator.py after the standard model initialization of lm (around line 187):

from framefusion.interface import apply_framefusion
model_to_compress = getattr(lm, "_model", lm.model)
apply_framefusion(model_to_compress, cost=0.3, similarity_lower_bound=<S_th from our paper>, ratio_lower_bound=0.1)

Please refer to our paper for the recommended similarity_lower_bound (S_th) values for different models.

Adapt to new models

Understand Code Structure

  • framefusion/: The main package for FrameFusion.
    • models/: The adapter for different models.
    • main.py: The main implementation of FrameFusion.
    • interface.py: The interface for applying FrameFusion.
  • script/: Scripts for running experiments.
    • evaluate/: Scripts for evaluating model performance.
    • playground/: Scripts for running misc experiments.
  • example/: Example input videos.

Modify the code

  1. Add a new model adapter in framefusion/models/; it applies FrameFusion after the attention module.

    Three model functions are required: llm_forward, decoder_forward, and attention_forward. These forward functions are easily adapted from the corresponding modeling_<MODEL>.py functions in Hugging Face Transformers, and all modifications are marked with ### comments. For the LLM part, see framefusion/models/qwen2/modeling_qwen2.py as an example.

  2. Register the model in framefusion/interface.py; this applies FrameFusion to the correct model class (a hypothetical registration sketch is shown after this list).

  3. Add a new example in script/playground/ that shows how to apply FrameFusion to the model.
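
As a concrete illustration of steps 1 and 2, the sketch below shows one possible way to register a Qwen2-based backbone by monkey-patching its forward methods with the three adapter functions. This is a hypothetical sketch, not the registration code used in framefusion/interface.py: the helper name register_qwen2_like and the attribute layout (model.model.layers[i].self_attn, following the standard Hugging Face Qwen2 structure) are assumptions that may differ for your model.

from types import MethodType

# Adapter functions written in step 1 (see framefusion/models/qwen2/modeling_qwen2.py)
from framefusion.models.qwen2.modeling_qwen2 import (
    llm_forward, decoder_forward, attention_forward,
)

def register_qwen2_like(model):
    """Hypothetical registration helper: swap in the FrameFusion-aware forwards.

    Attribute names assume the standard Hugging Face Qwen2 layout
    (model.model.layers[i].self_attn); adjust them for your model.
    """
    llm = model.model  # inner language-model backbone (name is model-specific)
    llm.forward = MethodType(llm_forward, llm)
    for layer in llm.layers:
        layer.forward = MethodType(decoder_forward, layer)
        layer.self_attn.forward = MethodType(attention_forward, layer.self_attn)

In the real interface, apply_framefusion dispatches to the adapter matching the model class, so follow the existing entries in framefusion/interface.py for the exact pattern.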

Happy to help

If you have any questions about applying FrameFusion to a new model, please feel free to open an issue. We are happy to help and to expand the adapters to more models.

Supported Model List

  • MiniCPM-V

  • Llava-Video

  • NVILA

  • Qwen2-VL (please use transformers==4.51.3 when running Qwen2-VL series models)

  • InternVL2_5
