ICIP, 2025
Mingxuan Liu1,*, Tyler L. Hayes2, Massimiliano Mancini1, Elisa Ricci1,3, Riccardo Volpi4, Gabriela Csurka2 (*Corresponding Author)
1University of Trento 2NAVER LABS Europe 3Fondazione Bruno Kessler 4Arsenale Bioyards
- Python 3.8+
- CUDA 11.0+
- PyTorch 1.9+
- Clone the repository:

```bash
git clone https://github.com/OatmealLiu/VocAda.git
cd VocAda
```

- Create and activate the conda environment:

```bash
conda create -n spotdet python=3.8
conda activate spotdet
```

- Install PyTorch and Detectron2:

```bash
# Install PyTorch (adjust CUDA version as needed)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Install Detectron2
python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'
```

- Install other dependencies:

```bash
pip install -r req.txt
```

- Install additional dependencies:

```bash
pip install ipython wandb einops mss opencv-python timm dataclasses ftfy regex fasttext scikit-learn lvis nltk Pillow datasets openai tenacity sentence-transformers
pip install git+https://github.com/openai/CLIP.git
```

- Download and organize your datasets in the `datasets/` directory
- Update dataset paths in configuration files as needed
- For the COCO dataset, ensure the following structure:

```
datasets/
├── coco/
│   ├── val2017/
│   └── zero-shot/
│       └── instances_val2017_all_2_oriorder.json
```
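If you still need the images, below is a minimal sketch of fetching COCO val2017 into this layout; it assumes the official COCO image host, and the zero-shot annotation JSON is not part of the official release and must be obtained separately.

```bash
# Sketch: download COCO val2017 images into the expected layout.
# The zero-shot annotation file (instances_val2017_all_2_oriorder.json)
# is NOT part of the official COCO release and must be obtained separately.
mkdir -p datasets/coco/zero-shot
wget http://images.cocodataset.org/zips/val2017.zip -P datasets/coco/
unzip datasets/coco/val2017.zip -d datasets/coco/
```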
The SpotDet framework operates in two main stages. Stage 1 handles vocabulary adaptation:
Generate captions for images using vision-language models:
```bash
python run_stage1.py \
    --query-mode "captioning" \
    --dataset-name "coco" \
    --model-path "/path/to/llava/model" \
    --image-folder "./datasets/coco/val2017" \
    --image-anno-path "./datasets/coco/zero-shot/instances_val2017_all_2_oriorder.json" \
    --question-file "./stage1_questions/list_all_objects.jsonl" \
    --answers-folder "./stage1_answers/coco" \
    --answers-file "answered_annotations_coco" \
    --num-chunks 1 \
    --chunk-idx 1
```

Generate relevant object categories using different methods:
Embedding-based proposal:
```bash
python run_stage1.py \
    --query-mode "proposing" \
    --pipeline-proposing "embedding" \
    --dataset-name "coco" \
    --embedding-model-name "sbert" \
    --embedding-model-size "sbert_base" \
    --proposing-thresh 0.15
```

LLM-based proposal:
```bash
python run_stage1.py \
    --query-mode "proposing" \
    --pipeline-proposing "llm" \
    --dataset-name "coco" \
    --llm-model-name "gpt-3.5-turbo-0125" \
    --llm-temperature 1.0
```

Tagging-based proposal:
```bash
python run_stage1.py \
    --query-mode "proposing" \
    --pipeline-proposing "tagging" \
    --dataset-name "coco" \
    --tags-file "datasets/tags/coco_rampp_tags_openset.json" \
    --pretrained-rampp-path "/path/to/rampp/model"
```

Merge the outputs of the different proposal methods:

```bash
python run_stage1.py \
    --query-mode "merging" \
    --dataset-name "coco" \
    --answers-folder "./stage1_answers/coco" \
    --answers-file "answered_annotations_coco"
```

Expand the vocabulary with synonyms:

```bash
python run_stage1.py \
    --query-mode "add_synonyms" \
    --dataset-name "coco" \
    --llm-model-name "gpt-3.5-turbo-0125" \
    --answers-folder "./stage1_answers/coco" \
    --answers-file "answered_annotations_coco"
```

Merge the LLM-based proposals:

```bash
python run_stage1.py \
    --query-mode "merging_llm_proposals" \
    --dataset-name "coco" \
    --answers-folder "./stage1_answers/coco" \
    --answers-file "answered_annotations_coco"
```

Merge the LLM-based proposals using CLIP embeddings:

```bash
python run_stage1.py \
    --query-mode "merging_llm_proposals_with_clip" \
    --dataset-name "coco" \
    --embedding-model-name "clip" \
    --embedding-model-size "ViT-L/14" \
    --proposing-thresh 0.15 \
    --answers-folder "./stage1_answers/coco" \
    --answers-file "answered_annotations_coco"
```

The SpotDet framework provides several key modules for vocabulary adaptation:
- Captioner: Image captioning using vision-language models
- Proposer: Category proposal methods (embedding, LLM, tagging)
- Add Synonyms: Vocabulary expansion with synonyms
- Merging: Combining different proposal methods
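To make the role of the Proposer concrete, here is a minimal, self-contained sketch of what an embedding-based proposal step can look like. It is not the repository's actual API: a Sentence-BERT model (the `all-MiniLM-L6-v2` checkpoint stands in for the `sbert` options above) scores each candidate class against an image caption, and classes above the threshold (0.15 in the commands above) are kept.

```python
# Illustrative sketch only -- not the repository's API.
# Embedding-based proposal: keep vocabulary classes whose Sentence-BERT
# similarity to the image caption exceeds a threshold.
from sentence_transformers import SentenceTransformer, util


def propose_categories(caption, vocabulary, model, thresh=0.15):
    """Return vocabulary entries whose cosine similarity to the caption passes thresh."""
    cap_emb = model.encode(caption, convert_to_tensor=True)
    voc_emb = model.encode(vocabulary, convert_to_tensor=True)
    sims = util.cos_sim(cap_emb, voc_emb)[0]  # one similarity score per class
    return [cls for cls, sim in zip(vocabulary, sims) if sim.item() > thresh]


if __name__ == "__main__":
    sbert = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical stand-in checkpoint
    caption = "A man riding a skateboard next to a dog on the beach."
    vocabulary = ["person", "skateboard", "dog", "traffic light", "microwave"]
    print(propose_categories(caption, vocabulary, sbert))
```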
The scripts_OV-COCO/ directory contains ready-to-use scripts for different experiments:
- Stage1-a/: Captioning scripts
- Stage1-b/: Proposal generation scripts
Example usage:
```bash
# Run captioning
bash scripts_OV-COCO/Stage1-a/a_chunk_1-10_llava_captioning.sh

# Run embedding-based proposal
bash scripts_OV-COCO/Stage1-b/b_chunk_1-10_embedding_proposing.sh
```

```
SpotDet/
├── SpotDet/                       # Core SpotDet framework
│   ├── captioner.py               # Image captioning modules
│   ├── proposer/                  # Category proposal methods
│   │   ├── similarity_proposer.py # Embedding-based proposals
│   │   ├── llm_proposer.py        # LLM-based proposals
│   │   ├── tagger_proposer.py     # Tagging-based proposals
│   │   └── gtruth_proposer.py     # Ground truth proposals
│   ├── add_synonyms.py            # Synonym generation
│   └── utils.py                   # Utility functions
├── datasets/                      # Dataset configurations and metadata
├── scripts_OV-COCO/               # Experiment scripts
└── run_stage1.py                  # Main vocabulary adaptation pipeline
```
- COCO: COCO-80 (80 classes)
- Objects365: Objects365 v2
- LLaVA: LLaVA-1.6-Mistral-7B or LLaVA-1.6-34B
- CLIP: ViT-B/32, ViT-B/16, ViT-L/14, ViT-L/14@336
- OpenAI GPT: GPT-3.5-turbo, GPT-4
- Local LLMs: LLaMA3-8B, LLaMA3-70B
- CLIP: ViT-B/32, ViT-B/16, ViT-L/14, ViT-L/14@336
- Sentence-BERT: sbert_mini, sbert_base, sbert_search
- OpenAI Embeddings: emb3_small, emb3_large
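As a rough example of how these choices map onto the flags shown above, switching the embedding-based proposer from Sentence-BERT to a CLIP backbone might look like the sketch below; the exact strings accepted by `--embedding-model-name` and `--embedding-model-size` should be checked in `run_stage1.py`.

```bash
# Sketch: embedding-based proposal with a CLIP backbone instead of SBERT
# (verify the accepted model-name/size strings in run_stage1.py).
python run_stage1.py \
    --query-mode "proposing" \
    --pipeline-proposing "embedding" \
    --dataset-name "coco" \
    --embedding-model-name "clip" \
    --embedding-model-size "ViT-B/16" \
    --proposing-thresh 0.15
```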
@article{liu2025test,
title={Test-time Vocabulary Adaptation for Language-driven Object Detection},
author={Liu, Mingxuan and Hayes, Tyler L and Mancini, Massimiliano and Ricci, Elisa and Volpi, Riccardo and Csurka, Gabriela},
journal={arXiv preprint arXiv:2506.00333},
year={2025}
}