Here we provide code and some basic instructions to reproduce some of the baselines on CATER.
Generate the data, or copy over the pre-generated data, to a directory with fast I/O, and unzip the file. For the non-local code, we need to generate LMDBs. Do so by:
```bash
$ cd baselines/video-nonlocal-net
$ vim process_data/cater/gen_lmdbs.py  # <-- change the path to the lists folder for which to generate LMDBs
$ python process_data/cater/gen_lmdbs.py  # <-- this will generate the LMDBs in the same folder as the lists
```
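For reference, `gen_lmdbs.py` essentially serializes each entry from the list files into an LMDB database. Below is a minimal sketch of that pattern with the Python `lmdb` package, assuming a hypothetical `video_path label` list format and pickled records (the actual script writes records in the serialized format the non-local dataloader expects):

```python
import lmdb
import pickle

def write_lmdb(list_file, lmdb_path, map_size=1 << 40):
    """Pack (video_path, label) records from a list file into an LMDB.

    Simplified sketch: the real gen_lmdbs.py stores records in whatever
    serialized format the non-local dataloader expects.
    """
    env = lmdb.open(lmdb_path, map_size=map_size)
    with env.begin(write=True) as txn:
        with open(list_file) as f:
            for idx, line in enumerate(f):
                video_path, label = line.strip().split()  # assumed list format
                key = '{:08d}'.format(idx).encode('ascii')
                txn.put(key, pickle.dumps((video_path, int(label))))
    env.close()

# Hypothetical paths; point these at your own lists folder.
write_lmdb('lists/localize/train.txt', 'lists/localize/train_lmdb')
```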
Training follows the same paradigm as the non-local codebase. We provide sample .yaml config files in the configs_cater folder. To launch training, we provide a nifty little launch.py script; you will likely need to edit it to point to the right paths. The pretrained model used for initialization can be downloaded from the non-local codebase [direct link]. Then, launch the training as follows:
```bash
$ vim configs_cater/001_I3D_NL_localize_imagenetPretrained_32f_8SR.yaml  # <-- change the first 2 lines to point to the data as processed above
$ python launch.py -c configs_cater/001_I3D_NL_localize_imagenetPretrained_32f_8SR.yaml
```
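If you would rather script the config edit than do it in vim, the same change can be made with PyYAML; note that the key names below (`TRAIN`/`TEST`, `DATA_DIR`) are hypothetical stand-ins for whatever the first two lines of the config actually contain:

```python
import yaml  # pip install pyyaml

cfg_path = 'configs_cater/001_I3D_NL_localize_imagenetPretrained_32f_8SR.yaml'
with open(cfg_path) as f:
    cfg = yaml.safe_load(f)

# Hypothetical keys: substitute whatever the config actually uses to point
# at the LMDBs generated above. Note that safe_dump will not preserve the
# original comments or key ordering.
cfg['TRAIN']['DATA_DIR'] = '/fast/lists/localize/train_lmdb'
cfg['TEST']['DATA_DIR'] = '/fast/lists/localize/val_lmdb'

with open(cfg_path, 'w') as f:
    yaml.safe_dump(cfg, f)
```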
This will run both training and testing. The final trained model for this run is available here. Configs for other experiments can be created based on the provided config and run the same way.
The above experiment uses the standard non-local testing paradigm: a 30-crop average. However, for a task like CATER, training a temporal model on top of the clip features can make more sense than averaging them. Hence, we also try an LSTM-based aggregation, which can be run as follows (after training the non-local model):
```bash
$ python launch.py -c configs_cater/001_I3D_NL_localize_imagenetPretrained_32f_8SR.yaml -t test  # <-- test the model and store the features; these are used by the LSTM step below
$ python launch.py -c configs_cater/001_I3D_NL_localize_imagenetPretrained_32f_8SR.yaml -t lstm  # <-- train/test the LSTM. We saw some random variation in LSTM training, so this script trains/tests the model 3 times and averages the numbers for a more stable estimate of performance
```
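Conceptually, the LSTM step replaces the 30-crop average with a small recurrent model run over the per-clip features dumped by the test step. Below is a minimal PyTorch sketch of this kind of aggregation; the feature and hidden dimensions are hypothetical, the 3-run averaging is omitted, and the 36 classes correspond to the cells of the 6x6 localization grid:

```python
import torch
import torch.nn as nn

class ClipLSTM(nn.Module):
    """Aggregate a sequence of per-clip features into one prediction."""
    def __init__(self, feat_dim=2048, hidden_dim=512, num_classes=36):
        super(ClipLSTM, self).__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, clip_feats):  # (batch, num_clips, feat_dim)
        out, _ = self.lstm(clip_feats)
        return self.fc(out[:, -1])  # classify from the last clip's state

# Hypothetical shapes: 4 videos, 10 clips each, 2048-D clip features.
feats = torch.randn(4, 10, 2048)
logits = ClipLSTM()(feats)
print(logits.shape)  # torch.Size([4, 36])
```

Classifying from the hidden state after the last clip lets the model weigh late clips more heavily, which suits localization since the target is the snitch's final position.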
The expected performance for these models, as reported in the paper, is given in the table below (please expect some random variation across training runs):
Expt | Config | Localize (top-1 accuracy) |
---|---|---|
R3D + NL, 32 frame | configs_cater/001_I3D_NL_localize_imagenetPretrained_32f_8SR.yaml | 28.8 |
R3D + NL, 32 frame, LSTM | configs_cater/001_I3D_NL_localize_imagenetPretrained_32f_8SR.yaml | 45.5 |
This code was tested with PyTorch 0.4. Set the paths in main.py and run it to run the tracking baseline. It uses the DaSiamRPN codebase with a pretrained model.
```bash
$ cd tracking
$ source activate pytorch0.4  # <-- needs PyTorch 0.4 and OpenCV (with ffmpeg) installed
$ python main.py  # <-- run the tracking; change parameters in the script as needed
```
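At a high level, the tracking baseline follows the snitch through the video and reads the answer off the tracker's final position. A rough sketch of that last step, assuming a hypothetical `track_video` helper that runs DaSiamRPN and returns the final (x, y, w, h) box on a 320x240 frame, and ignoring any camera projection main.py may apply to map image coordinates onto the ground-plane grid:

```python
def box_to_grid_class(box, frame_w=320, frame_h=240, grid=6):
    """Quantize a tracked box's center into a 6x6 grid cell index (0-35)."""
    x, y, w, h = box
    cx, cy = x + w / 2.0, y + h / 2.0
    col = min(int(cx / frame_w * grid), grid - 1)
    row = min(int(cy / frame_h * grid), grid - 1)
    return row * grid + col

# Hypothetical usage: track_video would run DaSiamRPN over the frames and
# return the last predicted (x, y, w, h) box.
# final_box = track_video('videos/CATER_new_000001.avi')
# print(box_to_grid_class(final_box))
```

Expected performance: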
Expt | Config | Localize (top-1 accuracy) |
---|---|---|
Tracking | - | 33.9 |