The official implementation for:
Training-free Zero-shot Composed Image Retrieval via Weighted Modality Fusion and Similarity
We introduce a training-free approach for ZS-CIR.
Our approach, Weighted Modality fusion and similarity for CIR (WeiMoCIR), operates under the assumption that image and text modalities can be effectively combined using a simple weighted average. This allows the query representation to be constructed directly from the reference image and text modifier.
To further enhance retrieval performance, we employ multimodal large language models (MLLMs) to generate image captions for the database images and incorporate these textual captions into the similarity computation by combining them with image information using a weighted average.
Our approach is simple, easy to implement, and its effectiveness is validated through experiments on the FashionIQ and CIRR datasets.
Download the Slide
Overview of the proposed WeiMoCIR, a training-free approach for zero-shot composed image retrieval (ZS-CIR).
Leveraging pretrained VLMs and MLLMs, our method comprises three modules (the fusion step is sketched in code after this list):
- Weighted Modality Fusion for Query Composition
- Enhanced Representations through MLLM-generated image captions
- Weighted Modality Similarity, which integrates both query-to-image and query-to-caption similarities for retrieval.
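As a rough illustration of the fusion step, the sketch below blends pre-extracted reference-image and text-modifier features with a weight alpha and re-normalizes the result. The function name, the feature dimension, and the convention of which modality alpha multiplies are illustrative assumptions, not the repository's actual API.

```python
import torch
import torch.nn.functional as F

def compose_query(image_feat: torch.Tensor, text_feat: torch.Tensor, alpha: float = 0.95) -> torch.Tensor:
    # Weighted modality fusion: blend the reference-image feature and the
    # text-modifier feature into one query feature, then re-normalize.
    # Which modality alpha weights is an assumption; check the paper/code.
    query = alpha * text_feat + (1.0 - alpha) * image_feat
    return F.normalize(query, dim=-1)

# Toy usage with random unit-norm 256-d features.
image_feat = F.normalize(torch.randn(256), dim=-1)
text_feat = F.normalize(torch.randn(256), dim=-1)
query_feat = compose_query(image_feat, text_feat, alpha=0.95)
```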
First, clone the repository to a desired location.
Prerequisites
The following commands will create a local Anaconda environment with the necessary packages installed.
conda create -n wei_mo_cir -y python=3.8
conda activate wei_mo_cir
pip install -r requirements.txt
pip install git+https://github.com/openai/CLIP.git
export PYTHONPATH=$(pwd)
Download Pre-trained Weights
We use pre-trained BLIP models with ViT-B and ViT-L backbones. To download the BLIP checkpoints, please refer to the following links:
- BLIP w/ ViT-B (129M)
- BLIP w/ ViT-B fine-tuned on Image-Text Retrieval (COCO)
- BLIP w/ ViT-B fine-tuned on Image-Text Retrieval (Flickr30k)
- BLIP w/ ViT-L (129M)
- BLIP w/ ViT-L fine-tuned on Image-Text Retrieval (COCO)
- BLIP w/ ViT-L fine-tuned on Image-Text Retrieval (Flickr30k)
The CLIP models are downloaded automatically from the Hugging Face model hub, so you don't need to download them manually (see the loading sketch after the list below).
Here are the Hugging Face model IDs:
- CLIP-ViT-B-32: laion/CLIP-ViT-B-32-laion2B-s34B-b79K
- CLIP-ViT-L-14: laion/CLIP-ViT-L-14-laion2B-s32B-b82K
- CLIP-ViT-H-14: laion/CLIP-ViT-H-14-laion2B-s32B-b79K
- CLIP-ViT-G-14: Geonmo/CLIP-Giga-config-fixed
- CLIP-ViT-G-14: laion/CLIP-ViT-bigG-14-laion2B-39B-b160k (the config file has since been fixed, so the results are identical to Geonmo/CLIP-Giga-config-fixed)
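As a minimal loading sketch, assuming the Hugging Face transformers CLIPModel/CLIPProcessor interface (the repository's own loader may differ), any of the IDs above can be passed to from_pretrained and the weights will be fetched and cached automatically:

```python
from transformers import CLIPModel, CLIPProcessor

# Any model ID from the list above; weights are downloaded and cached on first use.
clip_name = "laion/CLIP-ViT-B-32-laion2B-s34B-b79K"
model = CLIPModel.from_pretrained(clip_name)
processor = CLIPProcessor.from_pretrained(clip_name)  # image preprocessing + tokenizer
```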
The downloaded BLIP models should be placed in the models folder.
models/
model_base.pth
model_base_retrieval_coco.pth
model_base_retrieval_flickr.pth
model_large.pth
model_large_retrieval_coco.pth
model_large_retrieval_flickr.pth
FashionIQ Dataset
The FashionIQ dataset can be downloaded from the following link:
The dataset should be placed in the fashionIQ_dataset folder.
fashionIQ_dataset/
labeled_images_cir_cleaned.json
captions/
cap.dress.test.json
cap.dress.train.json
cap.dress.val.json
...
image_splits/
split.dress.test.json
split.dress.train.json
split.dress.val.json
...
images/
245600258X.png
978980539X.png
...
CIRR Dataset
The CIRR dataset can be downloaded from the following link:
The dataset should be placed in the cirr_dataset folder.
cirr_dataset/
train/
0/
train-10108-0-img0.png
train-10108-0-img1.png
train-10108-1-img0.png
...
1/
train-10056-0-img0.png
train-10056-0-img1.png
train-10056-1-img0.png
...
...
dev/
dev-0-0-img0.png
dev-0-0-img1.png
dev-0-1-img0.png
...
test1/
test1-0-0-img0.png
test1-0-0-img1.png
test1-0-1-img0.png
...
cirr/
captions/
cap.rc2.test1.json
cap.rc2.train.json
cap.rc2.val.json
image_splits/
split.rc2.test1.json
split.rc2.train.json
split.rc2.val.json
Note
- Please modify the requirements.txt file if you use a different version of torch or a different CUDA version.
- Make sure to set PYTHONPATH to the current directory; otherwise, the code will not be able to find the necessary modules.
Our code is based on the validation script of the Bi-BlipCIR repository.
The main differences are:
- We modified the element-wise sum into a weighted sum with an additional weight alpha that balances the text and image features.
- We use an MLLM to generate captions for the database images and use them as the text features of each index image.
- We compute an additional distance between the fused query feature and the MLLM-generated captions of each index image, and use the mean of these text distances as the index text similarity.
- Finally, we combine the index image similarity and the index text similarity with the weight beta to rank the final retrieval candidates (see the sketch after this list).
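A minimal sketch of the weighted modality similarity described above, assuming pre-extracted features with one feature vector per database image and C caption features per image; the function name, tensor shapes, and the convention of which term beta multiplies are assumptions, not the code in src/validate.py.

```python
import torch
import torch.nn.functional as F

def weighted_similarity(query: torch.Tensor,       # (D,)      fused query feature
                        index_img: torch.Tensor,   # (N, D)    database image features
                        index_caps: torch.Tensor,  # (N, C, D) MLLM caption features per image
                        beta: float = 0.2) -> torch.Tensor:
    # Cosine similarity of the query to each database image, plus the mean
    # similarity to that image's MLLM-generated captions, mixed with beta.
    query = F.normalize(query, dim=-1)
    sim_img = F.normalize(index_img, dim=-1) @ query             # (N,)
    sim_cap = (F.normalize(index_caps, dim=-1) @ query).mean(1)  # (N,)
    return (1.0 - beta) * sim_img + beta * sim_cap               # higher score = better candidate

# Ranking example: indices of the top-10 candidates.
# scores = weighted_similarity(query_feat, img_feats, cap_feats, beta=0.2)
# top10 = scores.topk(10).indices
```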
The main code changes are in the src/validate.py, src/validate_clip.py, and src/utils.py files.
For FashionIQ dataset
To reproduce the results of the BLIP ViT-B and BLIP ViT-L models, you can use these checkpoints:
- models/model_base.pth
- models/model_base_retrieval_coco.pth
- models/model_base_retrieval_flickr.pth
- models/model_large.pth
- models/model_large_retrieval_coco.pth
- models/model_large_retrieval_flickr.pth
For example, to reproduce the ablation study results with the BLIP model fine-tuned for retrieval on the COCO dataset:
python src/validate.py --dataset fashionIQ \
--blip-pretrained-path models/model_base_retrieval_coco.pth \
--combining-function sum \
--text_captions_path fashionIQ_dataset/labeled_images_cir_cleaned.json \
--blip-vit base \
--alpha 0.95 --beta 0.2
[!NOTE] You should change --blip-vit to large for the BLIP ViT-L models.
To reproduce the results of the CLIP models, set clip_name to one of the following:
- VIT-B32: laion/CLIP-ViT-B-32-laion2B-s34B-b79K
- VIT-L14: laion/CLIP-ViT-L-14-laion2B-s32B-b82K
- VIT-H14: laion/CLIP-ViT-H-14-laion2B-s32B-b79K
- VIT-G14: laion/CLIP-ViT-bigG-14-laion2B-39B-b160k
python src/validate_clip.py --dataset FashionIQ \
--clip_name laion/CLIP-ViT-bigG-14-laion2B-39B-b160k \
--text_captions_path fashionIQ_dataset/labeled_images_cir_cleaned.json \
--alpha 0.8 --beta 0.1
For CIRR dataset
To reproduce the results of the BLIP ViT-B and BLIP ViT-L models, you can use the same checkpoints as for the FashionIQ dataset:
- models/model_base.pth
- models/model_base_retrieval_coco.pth
- models/model_base_retrieval_flickr.pth
- models/model_large.pth
- models/model_large_retrieval_coco.pth
- models/model_large_retrieval_flickr.pth
For example, to reproduce the results of the BLIP model fine-tuned for retrieval on the COCO dataset:
python src/cirr_test_submission.py --submission-name submit_blip_vit_base_coco \
--combining-function sum \
--blip-pretrained-path models/model_base_retrieval_coco.pth \
--text_captions_path cirr_dataset/cirr_labeled_images_cir_cleaned.json \
--blip-vit base \
--alpha 0.95 --beta 0.2
[!NOTE] You should change --blip-vit to large for the BLIP ViT-L models.
To reproduce the results of the CLIP models, set clip_name to one of the following:
- VIT-B32: laion/CLIP-ViT-B-32-laion2B-s34B-b79K
- VIT-L14: laion/CLIP-ViT-L-14-laion2B-s32B-b82K
- VIT-H14: laion/CLIP-ViT-H-14-laion2B-s32B-b79K
- VIT-G14: laion/CLIP-ViT-bigG-14-laion2B-39B-b160k
For example, to reproduce the results with CLIP ViT-L/14:
python src/cirr_test_submission_clip.py --submission-name submit_clip_vit_l \
--combining-function sum \
--clip_name laion/CLIP-ViT-L-14-laion2B-s32B-b82K \
--text_captions_path cirr_dataset/cirr_labeled_images_cir_cleaned.json \
--alpha 0.8 --beta 0.1
The experiments are divided into three folders:
- src/ablation_experiment/: contains the code for the ablation experiment.
- src/cirr_experiment/: contains the code for the CIRR dataset experiment.
- src/fashioniq_experiment/: contains the code for the FashionIQ dataset experiment.
Feel free to explore the code and run the experiments.
We use the same MIT License as the Bi-BlipCIR, CLIP4Cir and BLIP.
Special thanks to Bi-BlipCIR; we build on its code to evaluate the performance of our proposed method. If you find this code useful for your research, please consider citing our paper:
@misc{wu2024-weimocir,
title={Training-free Zero-shot Composed Image Retrieval via Weighted Modality Fusion and Similarity},
author={Ren-Di Wu and Yu-Yen Lin and Huei-Fang Yang},
year={2024},
eprint={2409.04918},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2409.04918},
}