ICIP, 2025
Mingxuan Liu1,*, Tyler L. Hayes2, Massimiliano Mancini1, Elisa Ricci1,3, Riccardo Volpi4, Gabriela Csurka2 (*Corresponding Author)
1University of Trento 2NAVER LABS Europe 3Fondazione Bruno Kessler 4Arsenale Bioyards
- Python 3.8+
- CUDA 11.0+
- PyTorch 1.9+
- Clone the repository:

```bash
git clone https://github.com/OatmealLiu/VocAda.git
cd VocAda
```

- Create and activate the conda environment:

```bash
conda create -n spotdet python=3.8
conda activate spotdet
```

- Install PyTorch and Detectron2:

```bash
# Install PyTorch (adjust CUDA version as needed)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Install Detectron2
python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'
```

- Install other dependencies:

```bash
pip install -r req.txt
```

- Install additional dependencies:

```bash
pip install ipython wandb einops mss opencv-python timm dataclasses ftfy regex fasttext scikit-learn lvis nltk Pillow datasets openai tenacity sentence-transformers
pip install git+https://github.com/openai/CLIP.git
```

- Download and organize your datasets in the `datasets/` directory
- Update dataset paths in configuration files as needed
- For the COCO dataset, ensure the following structure:

```
datasets/
├── coco/
│   ├── val2017/
│   └── zero-shot/
│       └── instances_val2017_all_2_oriorder.json
```
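If you still need the images, below is a minimal sketch of fetching COCO val2017 into this layout; it assumes the official COCO image host, and the zero-shot annotation JSON is not part of the official release and must be obtained separately.

```bash
# Sketch: download COCO val2017 images into the expected layout.
# The zero-shot annotation file (instances_val2017_all_2_oriorder.json)
# is NOT part of the official COCO release and must be obtained separately.
mkdir -p datasets/coco/zero-shot
wget http://images.cocodataset.org/zips/val2017.zip -P datasets/coco/
unzip datasets/coco/val2017.zip -d datasets/coco/
```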
The SpotDet framework operates in two main stages. Stage 1 handles vocabulary adaptation:
Generate captions for images using vision-language models:
```bash
python run_stage1.py \
    --query-mode "captioning" \
    --dataset-name "coco" \
    --model-path "/path/to/llava/model" \
    --image-folder "./datasets/coco/val2017" \
    --image-anno-path "./datasets/coco/zero-shot/instances_val2017_all_2_oriorder.json" \
    --question-file "./stage1_questions/list_all_objects.jsonl" \
    --answers-folder "./stage1_answers/coco" \
    --answers-file "answered_annotations_coco" \
    --num-chunks 1 \
    --chunk-idx 1
```

Generate relevant object categories using different methods:
Embedding-based proposal:
```bash
python run_stage1.py \
    --query-mode "proposing" \
    --pipeline-proposing "embedding" \
    --dataset-name "coco" \
    --embedding-model-name "sbert" \
    --embedding-model-size "sbert_base" \
    --proposing-thresh 0.15
```

LLM-based proposal:
```bash
python run_stage1.py \
    --query-mode "proposing" \
    --pipeline-proposing "llm" \
    --dataset-name "coco" \
    --llm-model-name "gpt-3.5-turbo-0125" \
    --llm-temperature 1.0
```

Tagging-based proposal:
```bash
python run_stage1.py \
    --query-mode "proposing" \
    --pipeline-proposing "tagging" \
    --dataset-name "coco" \
    --tags-file "datasets/tags/coco_rampp_tags_openset.json" \
    --pretrained-rampp-path "/path/to/rampp/model"
```

Merge the outputs of the different proposal methods:

```bash
python run_stage1.py \
    --query-mode "merging" \
    --dataset-name "coco" \
    --answers-folder "./stage1_answers/coco" \
    --answers-file "answered_annotations_coco"
```

Expand the vocabulary with synonyms:

```bash
python run_stage1.py \
    --query-mode "add_synonyms" \
    --dataset-name "coco" \
    --llm-model-name "gpt-3.5-turbo-0125" \
    --answers-folder "./stage1_answers/coco" \
    --answers-file "answered_annotations_coco"
```

Merge the LLM-based proposals:

```bash
python run_stage1.py \
    --query-mode "merging_llm_proposals" \
    --dataset-name "coco" \
    --answers-folder "./stage1_answers/coco" \
    --answers-file "answered_annotations_coco"
```

Merge the LLM-based proposals using CLIP embeddings:

```bash
python run_stage1.py \
    --query-mode "merging_llm_proposals_with_clip" \
    --dataset-name "coco" \
    --embedding-model-name "clip" \
    --embedding-model-size "ViT-L/14" \
    --proposing-thresh 0.15 \
    --answers-folder "./stage1_answers/coco" \
    --answers-file "answered_annotations_coco"
```

The SpotDet framework provides several key modules for vocabulary adaptation:
- Captioner: Image captioning using vision-language models
- Proposer: Category proposal methods (embedding, LLM, tagging)
- Add Synonyms: Vocabulary expansion with synonyms
- Merging: Combining different proposal methods
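To make the role of the Proposer concrete, here is a minimal, self-contained sketch of what an embedding-based proposal step can look like. It is not the repository's actual API: a Sentence-BERT model (the `all-MiniLM-L6-v2` checkpoint stands in for the `sbert` options above) scores each candidate class against an image caption, and classes above the threshold (0.15 in the commands above) are kept.

```python
# Illustrative sketch only -- not the repository's API.
# Embedding-based proposal: keep vocabulary classes whose Sentence-BERT
# similarity to the image caption exceeds a threshold.
from sentence_transformers import SentenceTransformer, util


def propose_categories(caption, vocabulary, model, thresh=0.15):
    """Return vocabulary entries whose cosine similarity to the caption passes thresh."""
    cap_emb = model.encode(caption, convert_to_tensor=True)
    voc_emb = model.encode(vocabulary, convert_to_tensor=True)
    sims = util.cos_sim(cap_emb, voc_emb)[0]  # one similarity score per class
    return [cls for cls, sim in zip(vocabulary, sims) if sim.item() > thresh]


if __name__ == "__main__":
    sbert = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical stand-in checkpoint
    caption = "A man riding a skateboard next to a dog on the beach."
    vocabulary = ["person", "skateboard", "dog", "traffic light", "microwave"]
    print(propose_categories(caption, vocabulary, sbert))
```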
The scripts_OV-COCO/ directory contains ready-to-use scripts for different experiments:
- Stage1-a/: Captioning scripts
- Stage1-b/: Proposal generation scripts
Example usage:
```bash
# Run captioning
bash scripts_OV-COCO/Stage1-a/a_chunk_1-10_llava_captioning.sh

# Run embedding-based proposal
bash scripts_OV-COCO/Stage1-b/b_chunk_1-10_embedding_proposing.sh
```

```
SpotDet/
├── SpotDet/                       # Core SpotDet framework
│   ├── captioner.py               # Image captioning modules
│   ├── proposer/                  # Category proposal methods
│   │   ├── similarity_proposer.py # Embedding-based proposals
│   │   ├── llm_proposer.py        # LLM-based proposals
│   │   ├── tagger_proposer.py     # Tagging-based proposals
│   │   └── gtruth_proposer.py     # Ground truth proposals
│   ├── add_synonyms.py            # Synonym generation
│   └── utils.py                   # Utility functions
├── datasets/                      # Dataset configurations and metadata
├── scripts_OV-COCO/               # Experiment scripts
└── run_stage1.py                  # Main vocabulary adaptation pipeline
```
- COCO: COCO-80 (80 classes)
- Objects365: Objects365 v2
- LLaVA: LLaVA-1.6-Mistral-7B or LLaVA-1.6-34B
- CLIP: ViT-B/32, ViT-B/16, ViT-L/14, ViT-L/14@336
- OpenAI GPT: GPT-3.5-turbo, GPT-4
- Local LLMs: LLaMA3-8B, LLaMA3-70B
- CLIP: ViT-B/32, ViT-B/16, ViT-L/14, ViT-L/14@336
- Sentence-BERT: sbert_mini, sbert_base, sbert_search
- OpenAI Embeddings: emb3_small, emb3_large
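As a rough example of how these choices map onto the flags shown above, switching the embedding-based proposer from Sentence-BERT to a CLIP backbone might look like the sketch below; the exact strings accepted by `--embedding-model-name` and `--embedding-model-size` should be checked in `run_stage1.py`.

```bash
# Sketch: embedding-based proposal with a CLIP backbone instead of SBERT
# (verify the accepted model-name/size strings in run_stage1.py).
python run_stage1.py \
    --query-mode "proposing" \
    --pipeline-proposing "embedding" \
    --dataset-name "coco" \
    --embedding-model-name "clip" \
    --embedding-model-size "ViT-B/16" \
    --proposing-thresh 0.15
```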
@article{liu2025test,
title={Test-time Vocabulary Adaptation for Language-driven Object Detection},
author={Liu, Mingxuan and Hayes, Tyler L and Mancini, Massimiliano and Ricci, Elisa and Volpi, Riccardo and Csurka, Gabriela},
journal={arXiv preprint arXiv:2506.00333},
year={2025}
}