TRAIN_AND_VALIDATE.md

We provide off-the-shelf scripts in the scripts folder.

Training LanguageBind

| Cache of pretrained weight | Baidu Yun | Google Cloud | Peking University Yun |
|---|---|---|---|
| Large | Link | Link | Link |
| Huge | Link | - | Link |

For example, to train LanguageBind on Depth-Language with 8 GPUs (1 node x 8 GPUs):

  • First, download the cache of pretrained weights above and set CACHE_DIR=path/to/LanguageBind.
  • Second, set the paths to ANNOTATION and DATA here according to the dataset preparation.
  • Then run:
CACHE_DIR="/path/to/LanguageBind"
ANNOTATION="path/to/data"
cd /path/to/LanguageBind
TORCH_DISTRIBUTED_DEBUG=DETAIL HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 torchrun --nnodes=1 --nproc_per_node 8 \
    -m main  \
    --train-data ${ANNOTATION} \
    --train-num-samples 3020000 \
    --clip-type "dl" --max-depth 10 \
    --do_train \
    --lock-text --lock-image --text-type "polish_mplug" \
    --init-temp 0.07 --learn-temp \
    --model "ViT-L-14" --cache-dir ${CACHE_DIR} \
    --convert_to_lora --lora_r 2 \
    --lr 5e-4 --coef-lr 1e-3 \
    --beta1 0.9 --beta2 0.98 --wd 0.2 --eps 1e-6 \
    --num-frames 1 --force-patch-dropout 0.5 \
    --epochs 1 --batch-size 128 --accum-freq 1 --warmup 200 \
    --precision "amp" --workers 10 --video-decode-backend "imgs" \
    --save-frequency 1 --log-every-n-steps 20 --report-to "tensorboard" --resume "latest" \
    --do_eval \
    --val_d_cls_data "NYUV2"
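
As a rough sanity check on the settings above, the following sketch works out the global batch size and the approximate number of optimizer steps in one epoch. It assumes that --batch-size is a per-GPU value and that it is multiplied by the number of processes and by --accum-freq; the helper names global_batch and steps_per_epoch are illustrative and are not part of LanguageBind.

```shell
#!/usr/bin/env bash
# Bookkeeping sketch for the training command above (hypothetical helpers).
# Assumption: --batch-size is per GPU and scales with process count and accum-freq.

# global_batch NPROC BATCH_SIZE ACCUM_FREQ -> samples consumed per optimizer step
global_batch() {
    echo $(( $1 * $2 * $3 ))
}

# steps_per_epoch NUM_SAMPLES GLOBAL_BATCH -> full optimizer steps per epoch (floor)
steps_per_epoch() {
    echo $(( $1 / $2 ))
}

gb=$(global_batch 8 128 1)
echo "global batch size: ${gb}"
echo "approx. steps per epoch: $(steps_per_epoch 3020000 "${gb}")"
```

With 8 processes, a batch size of 128 and an accumulation frequency of 1, this prints a global batch size of 1024 and about 2949 steps, which puts --warmup 200 at roughly 7% of the epoch.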

Validating LanguageBind

For example, to validate LanguageBind on Depth-Language with 1 GPU:

  • First, set RESUME to the checkpoint you want to evaluate.
  • Second, prepare the downstream dataset.
  • Then run:
CACHE_DIR="/path/to/LanguageBind"
RESUME="depth_language.pt"
ANNOTATION="path/to/data"
cd /path/to/LanguageBind
TORCH_DISTRIBUTED_DEBUG=DETAIL HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 torchrun --nproc_per_node 1 \
    -m main  \
    --train-data ${ANNOTATION} \
    --train-num-samples 3020000 \
    --clip-type "dl" --max-depth 10 \
    --lock-text --lock-image --text-type "polish_mplug" \
    --init-temp 0.07 --learn-temp \
    --model "ViT-L-14" --cache-dir ${CACHE_DIR} \
    --convert_to_lora --lora_r 2 \
    --lr 5e-4 --coef-lr 1e-3 \
    --beta1 0.9 --beta2 0.98 --wd 0.2 --eps 1e-6 \
    --num-frames 1 --force-patch-dropout 0.5 \
    --epochs 1 --batch-size 128 --accum-freq 1 --warmup 200 \
    --precision "amp" --workers 10 --video-decode-backend "imgs" \
    --save-frequency 1 --log-every-n-steps 20 --report-to "tensorboard" --resume ${RESUME} \
    --do_eval \
    --val_d_cls_data "NYUV2"
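
Because the validation run depends entirely on the file named in RESUME, it can help to check that file before launching torchrun. The helper below is a minimal sketch of such a check; check_resume is a hypothetical name and is not provided by the repository.

```shell
#!/usr/bin/env bash
# Pre-flight check for the checkpoint passed to --resume (hypothetical helper).

# check_resume PATH: succeed only if PATH is an existing, non-empty regular file
check_resume() {
    local ckpt="$1"
    if [ ! -f "${ckpt}" ]; then
        echo "checkpoint not found: ${ckpt}" >&2
        return 1
    fi
    if [ ! -s "${ckpt}" ]; then
        echo "checkpoint is empty: ${ckpt}" >&2
        return 1
    fi
    echo "using checkpoint: ${ckpt}"
}
```

One way to use it is to run `check_resume "${RESUME}"` first and start the validation command only if it succeeds.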

Downstream datasets

Depth

The NYU V2 dataset is downloaded from this repo, and we reformat it to conform to the standard ImageNet format. We also provide the processed data as follows. Change the data_root here.

| Datasets | Baidu Yun | Google Cloud | Peking University Yun |
|---|---|---|---|
| NYU | Link | Link | Link |
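
In the ImageNet-style layout used here, each split directory (for example val) contains one sub-directory per class, and the files of that class sit inside it. The sketch below lists the file count per class so that an empty or missing class folder is easy to spot after reformatting; count_per_class is a hypothetical helper, not a script shipped with LanguageBind.

```shell
#!/usr/bin/env bash
# Inspect an ImageNet-style split directory: <split>/<class>/<files> (hypothetical helper).

# count_per_class SPLIT_DIR: print "<class><TAB><file count>" for each class folder
count_per_class() {
    local split_dir="$1"
    local class_dir
    for class_dir in "${split_dir}"/*/; do
        # Skip the unexpanded glob when SPLIT_DIR has no sub-directories.
        [ -d "${class_dir}" ] || continue
        printf '%s\t%s\n' "$(basename "${class_dir}")" \
            "$(find "${class_dir}" -type f | wc -l | tr -d ' ')"
    done
}
```

For the depth data this could be called as `count_per_class /path/to/downstream_datasets/Depth/nyuv2/data/val`, where the leading path is a placeholder for your own data_root.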

Video

Video datasets are downloaded from this repo; the expected layout is shown under Folder structure. Change the data_root here.

Audio

Audio datasets are downloaded from this repo, and AudioSet from here. We reformat them to conform to the standard ImageNet format. Change the data_root here1 and here2.

Infrared (Thermal)

We download LLVIP from the official website and FLIR from here. We reformat them to conform to the standard ImageNet format. Change the data_root here. We also provide the processed data as follows.

| Datasets | Baidu Yun | Google Cloud | Peking University Yun |
|---|---|---|---|
| LLVIP | Link | Link | Link |
| FLIR V1 | Link | Link | Link |
| FLIR V2 | Link | Link | Link |

Folder structure

downstream_datasets
├── Audio
│   ├── audiocaps
│   │   └── audio
│   │       ├── test
│   │       ├── train
│   │       └── val
│   ├── audioset
│   │   ├── balanced_train_segments
│   │   ├── eval_segments
│   │   └── unbalanced_train_segments
│   │       ├── unbalanced_train_segments_part00
│   │       ├── unbalanced_train_segments_part01
│   │       ├── ...
│   │       └── unbalanced_train_segments_part40
│   ├── clotho
│   │   ├── CLOTHO_retrieval_dataset
│   │   └── evaluation
│   ├── esc50
│   │   └── test
│   │       ├── airplane
│   │       ├── breathing
│   │       ├── ...
│   │       └── wind
│   ├── laionaudio
│   │   ├── audios
│   │   ├── freesound_no_overlap
│   │   └── jsons
│   └── vggsound
│       └── test
│           ├── air\ conditioning\ noise
│           ├── air\ horn
│           ├── ...
│           └── zebra\ braying
├── Depth
│   ├── nyuv2
│   │   ├── data
│   │   │   └── val
│   │   │       ├── bathroom
│   │   │       ├── bedroom
│   │   │       ├── bookstore
│   │   │       ├── classroom
│   │   │       ├── dining_room
│   │   │       ├── home_office
│   │   │       ├── kitchen
│   │   │       ├── living_room
│   │   │       ├── office
│   │   │       └── others
├── Thermal
│   ├── flirv1
│   │   └── val
│   │       ├── bicycle
│   │       ├── car
│   │       ├── dog
│   │       └── person
│   ├── flirv2
│   │   └── val
│   │       ├── bike
│   │       ├── bus
│   │       ├── car
│   │       ├── hydrant
│   │       ├── light
│   │       ├── motor
│   │       ├── other\ vehicle
│   │       ├── person
│   │       ├── sign
│   │       ├── skateboard
│   │       ├── stroller
│   │       └── truck
│   ├── llvip
│   │   ├── train
│   │   │   ├── background
│   │   │   └── person
│   │   └── val
│   │       ├── background
│   │       └── person
└── VideoTextRetrieval
    ├── vtRetdata
    │   ├── ActivityNet
    │   │   └── Videos
    │   │       └── Activity_Videos
    │   ├── Didemo
    │   │   └── videos
    │   ├── MSRVTT
    │   │   └── MSRVTT_Videos
    │   └── MSVD
    │       └── MSVD_Videos
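
To confirm that a local copy matches the layout above before running evaluation, a small check like the following can report which expected directories are absent. This is only a sketch: missing_dirs is a hypothetical helper, and the list covers a handful of paths taken from the tree rather than every folder.

```shell
#!/usr/bin/env bash
# Report expected sub-directories that are absent under a data root (hypothetical helper).

# missing_dirs ROOT: print each listed relative path that is not a directory under ROOT
missing_dirs() {
    local root="$1"
    local rel
    for rel in \
        "Audio/audiocaps/audio" \
        "Audio/esc50/test" \
        "Depth/nyuv2/data/val" \
        "Thermal/flirv1/val" \
        "Thermal/flirv2/val" \
        "Thermal/llvip/val" \
        "VideoTextRetrieval/vtRetdata/MSRVTT/MSRVTT_Videos" \
        "VideoTextRetrieval/vtRetdata/MSVD/MSVD_Videos"
    do
        [ -d "${root}/${rel}" ] || echo "${rel}"
    done
}
```

Calling `missing_dirs /path/to/downstream_datasets` prints nothing when all listed directories are present; the path is a placeholder for your own data_root.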