LLaVA-UHD

A Large Multimodal Model Perceiving Any Aspect Ratio and High-Resolution Images

This repository hosts the code, data, and model weights of LLaVA-UHD, a novel framework that enables Large Multimodal Models (LMMs) to efficiently perceive images of any aspect ratio and high resolution. Notably, our model built on LLaVA-1.5 336×336 supports images of 6 times larger resolution (i.e., 672×1088) and achieves a 5.7-point accuracy improvement on TextVQA. Moreover, the model can be trained efficiently in academic settings, within ~1 day on 8 A100 GPUs. Visit our 📃 paper here!

Overview

Figure: The LLaVA-UHD framework.

LLaVA-UHD includes three key components for handling native-resolution images:

  • An image modularization strategy that divides native-resolution images into smaller, variable-sized slices for efficient and extensible encoding (see the sketch after this list).

  • A novel compression module (spatially constrained resampler) that further condenses the image tokens produced by the visual encoder.

  • A spatial schema that organizes slice tokens for the LLM.

Comprehensive experiments show that LLaVA-UHD outperforms established LMMs trained with 2-3 orders of magnitude more data on 8 benchmarks, and delivers better and more robust performance with limited training data.
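
For intuition, here is a minimal, purely illustrative Python sketch of how a variable-sized slice grid could be chosen; the function name, the heuristic, and the slice cap are assumptions for illustration, not the exact algorithm used in this repository (see the paper for the real modularization strategy):

import math

def choose_slice_grid(width, height, patch_size=336, max_slices=6):
    """Pick a (cols, rows) grid whose slices stay close to the encoder's square shape."""
    # number of slices is roughly image area / encoder's native area, capped at max_slices
    n = min(max_slices, max(1, math.ceil(width * height / patch_size ** 2)))
    best, best_dev = (1, n), float("inf")
    for cols in range(1, n + 1):
        rows = math.ceil(n / cols)
        # deviation of each slice's aspect ratio from 1:1 (the encoder's pretraining shape)
        dev = abs(math.log((width / cols) / (height / rows)))
        if dev < best_dev:
            best, best_dev = (cols, rows), dev
    return best

# e.g. a 672x1088 image with a 336x336 encoder maps to a 2x3 grid of slices
print(choose_slice_grid(672, 1088))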

Release

- [2024/07/29] 🔥 LLaVA-UHD achieves performance improvements over LLaVA-1.5 on 8 common benchmarks. Our novel projector, the spatially constrained resampler, achieves high feature compression and efficient convergence. Model checkpoints are available on Hugging Face.

- [2024/07/01] 📢 LLaVA-UHD is accepted by ECCV 2024.

Environment Preparation

  1. To reproduce the results of the paper, please set up the Python environment using the following code:
conda create -n llava-uhd python=3.10
conda activate llava-uhd
pip install -r requirements.txt
sh install.sh
  2. Download the checkpoints of CLIP-ViT-L/14 and Vicuna-13B-v1.5 and put them into ./pretrained_models. In the checkpoint directory of vicuna-13b-v1.5, set 'do_sample' in 'generation_config.json' to true; otherwise an error occurs when saving training checkpoints (a minimal way to apply this edit is sketched below).
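
A minimal way to apply that change, assuming the checkpoint sits under ./pretrained_models/vicuna-13b-v1.5 as in step 2 (adjust the path if you stored it elsewhere):

import json, pathlib

# assumed checkpoint location from step 2; adjust if needed
cfg_path = pathlib.Path("./pretrained_models/vicuna-13b-v1.5/generation_config.json")
cfg = json.loads(cfg_path.read_text())
cfg["do_sample"] = True  # avoids an error when saving training checkpoints
cfg_path.write_text(json.dumps(cfg, indent=2))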

If something goes wrong, please refer to the issues of LLaVA or open an issue in our repository.

Data Preparation

  1. Pretraining Data: Download the 558K subset of the LAION-CC-SBU dataset with BLIP captions that we use in the paper here, and put the data into ./playground/data. You can also refer to the documentation of LLaVA for detailed data organization.

  2. Fine-tuning Data: Please download the annotation file of the final mixture of our instruction-tuning data, llava_v1_5_mix665k.json, and download the images from the constituent datasets:

    • COCO: train2017
    • GQA: images
    • OCR-VQA: download script; we save all files as .jpg
    • TextCaps: train_val_images
    • VisualGenome: part1, part2

    Download the dataset images as in the fine-tuning process of LLaVA-1.5 and place them in ./playground/data (a layout check is sketched below).
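
As a sanity check after downloading, the following minimal Python sketch verifies the expected layout under ./playground/data; the folder names below mirror LLaVA-1.5's documented data organization and are assumptions here, so adjust them to match how you actually unpack the datasets:

import pathlib

root = pathlib.Path("./playground/data")
# assumed layout mirroring LLaVA-1.5's data documentation (adjust as needed)
expected = [
    "llava_v1_5_mix665k.json",
    "coco/train2017",
    "gqa/images",
    "ocr_vqa/images",
    "textvqa/train_images",   # TextCaps images
    "vg/VG_100K",
    "vg/VG_100K_2",
]
for rel in expected:
    print(("ok     " if (root / rel).exists() else "MISSING"), rel)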

Training Script

Please refer to train.sh for the pretraining and fine-tuning scripts (commented in the file). If you want to do end-to-end pretraining, fine-tuning, and evaluation, please run the following command.

sh train.sh

Evaluation Code

The evaluation script is in eval.sh; you can run:

sh eval.sh dir_name_in_checkpoints_new
# e.g. sh eval.sh llava-uhd-144-13b
# llava-uhd-144-13b is a directory name under ./checkpoints_new

For details of the data organization, please refer to here for help. We provide the same script to complete the testing.

Citation

If you find LLaVA-UHD useful for your research and applications, please cite using this BibTeX:

@inproceedings{guo2024llava-uhd,
  title={{LLaVA-UHD}: an LMM Perceiving Any Aspect Ratio and High-Resolution Images},
  author={Guo, Zonghao and Xu, Ruyi and Yao, Yuan and Cui, Junbo and Ni, Zanlin and Ge, Chunjiang and Chua, Tat-Seng and Liu, Zhiyuan and Huang, Gao},
  booktitle={ECCV},
  year={2024}
}
