This file documents the collection of models reported in our paper. In all cases, training uses four 32GB V100 GPUs.
The "Name" column contains a link to the config file. To train a model, run
`python train_net_auto.py --num-gpus 4 --config-file /path/to/config/name.yaml`
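For example, to train the `lvis-base_r50_4x_clip_gpt3_descriptions` model from the table below (the `configs/` directory here is an assumption; use the actual location of the config file in your checkout):

```
python train_net_auto.py --num-gpus 4 --config-file configs/lvis-base_r50_4x_clip_gpt3_descriptions.yaml
```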
To evaluate a model with trained or pretrained weights, run
`python train_net_auto.py --num-gpus 4 --config-file /path/to/config/name.yaml --eval-only MODEL.WEIGHTS /path/to/weight.pth`
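For example, to evaluate the same model with a downloaded checkpoint (both paths are illustrative; substitute the config and weight files you actually use):

```
python train_net_auto.py --num-gpus 4 --config-file configs/lvis-base_r50_4x_clip_gpt3_descriptions.yaml --eval-only MODEL.WEIGHTS /path/to/downloaded/model.pth
```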
Name | APr | mAP | Weights |
---|---|---|---|
lvis-base_r50_4x_clip_gpt3_descriptions | 19.3 | 30.3 | model |
lvis-base_r50_4x_clip_image_exemplars_avg | 14.8 | 28.8 | model |
lvis-base_r50_4x_clip_image_exemplars_agg | 18.3 | 29.2 | model |
lvis-base_r50_4x_clip_multi_modal_avg | 20.7 | 30.5 | model |
lvis-base_r50_4x_clip_multi_modal_agg | 19.2 | 30.6 | model |
lvis-base_in-l_r50_4x_4x_clip_gpt3_descriptions | 25.8 | 32.6 | model |
lvis-base_in-l_r50_4x_4x_clip_image_exemplars_avg | 21.6 | 31.3 | model |
lvis-base_in-l_r50_4x_4x_clip_image_exemplars_agg | 23.8 | 31.3 | model |
lvis-base_in-l_r50_4x_4x_clip_multi_modal_avg | 26.5 | 32.8 | model |
lvis-base_in-l_r50_4x_4x_clip_multi_modal_agg | 27.3 | 33.1 | model |
- The open-vocabulary LVIS setup is LVIS without rare-class annotations in training. We evaluate the rare classes as novel classes in testing.
- All models use CLIP embeddings as classifiers. This is why the box-supervised models obtain non-zero mAP on novel classes (see the sketch after this list).
- The models with `in-l` use the classes that overlap between ImageNet-21K and LVIS as image-labeled data.
- The models trained on `in-l` require the corresponding models without `in-l` (indicated by `MODEL.WEIGHTS` in the config files). Please train or download the models without `in-l` and place them under `${mm-ovod_ROOT}/output/..` before training the models using `in-l` (check the config file).
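To make the CLIP-embeddings-as-classifiers point concrete, below is a minimal sketch, not the repository's actual code, of how fixed CLIP text embeddings can serve as a detector's classifier head. The class names, the prompt template, and the random region features are all placeholders; it assumes the `clip` package from openai/CLIP is installed.

```python
# Minimal sketch: CLIP text embeddings as a fixed classifier head.
# Any class name can be embedded at test time, which is why models
# trained with box supervision on base classes still score non-zero
# mAP on novel (rare) classes.
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git

model, _ = clip.load("ViT-B/32", device="cpu")

class_names = ["cat", "dog", "fire hydrant"]  # stand-ins for LVIS classes
tokens = clip.tokenize([f"a photo of a {c}" for c in class_names])
with torch.no_grad():
    text_embeds = model.encode_text(tokens).float()
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)

# Detector region features projected to CLIP dimension (random here).
region_feats = torch.randn(10, text_embeds.shape[1])
region_feats = region_feats / region_feats.norm(dim=-1, keepdim=True)

# Classification logits are cosine similarities between region features
# and class embeddings (the real model also applies a learned scale).
logits = region_feats @ text_embeds.t()
print(logits.shape)  # (10, 3)
```

Because the classifier is just a matrix of class embeddings, rows for novel classes can be appended at test time without retraining; the paper's variants differ only in how those embeddings are built (GPT-3 descriptions, image exemplars, or both).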