
Fail to reproduce the results #12

Open · cocoshe opened this issue Oct 1, 2024 · 4 comments

Comments

@cocoshe commented Oct 1, 2024

I tried to reproduce the results with the following command, modified from the README:

CUDA_VISIBLE_DEVICES=7,6,5,4 python train_net_dshmp.py   --config-file configs/dshmp_swin_tiny.yaml     --num-gpus 4 --dist-url auto     MODEL.WEIGHTS ./model_final_86143f.pkl     OUTPUT_DIR ./ckpt_reproduce_4_cards

Then I ran inference with this command:

python train_net_dshmp.py     --config-file configs/dshmp_swin_tiny.yaml     --num-gpus 4 --dist-url auto --eval-only     MODEL.WEIGHTS ./ckpt_reproduce_4_cards/model_final.pth     OUTPUT_DIR output_dshmp_4_card_valid_u  DATASETS.TEST '("mevis_val",)'

I checked the offline score on valid_u:

J: 0.48
F: 0.59
J&F: 0.54

And the online score on valid turned out as follows:

[screenshot: online evaluation score of my reproduced checkpoint]

The official checkpoint gives a higher online score than my checkpoint trained with the command from the README:
[screenshot: online evaluation score of the official checkpoint]

I am wondering what went wrong with my training process. I simply followed the instructions in the README but failed to reproduce the results.

@heshuting555 (Owner)

Thank you for your interest in our work!

You are using different numbers of GPUs, resulting in different batch sizes, so you need to adjust the learning rate!
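To make that concrete, here is a rough sketch assuming the common linear scaling rule (this is an illustration, not a value taken from the repo):

# Rough sketch of linear LR scaling when the global batch size changes.
# Reference values mirror SOLVER in dshmp_swin_tiny.yaml; adjust to your setup.
reference_batch = 8                  # IMS_PER_BATCH for the 8-GPU runs
reference_lr = 5e-5                  # BASE_LR for the 8-GPU runs
new_batch = 4                        # e.g. 4 GPUs x 1 sample per GPU
scaled_lr = reference_lr * new_batch / reference_batch
print(scaled_lr)                     # 2.5e-05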

@cocoshe (Author) commented Oct 1, 2024

Thank you for your interest in our work!

You are using different numbers of GPUs, resulting in different batch sizes, so you need to adjust the learning rate!

Thanks for your timely reply, and sorry for omitting the information about my training devices.

I noticed that you train the model on 8 x 3090 GPUs (per the README), which takes about 17 hours, while I train on 4 devices.

However, if I just change --num-gpus 8 to --num-gpus 4, something goes wrong, since your training setting seems to be "batch size 8 on 8 devices, i.e., 1 sample per device" (is that right?). So, to make the code run without throwing an error, I modified the config file configs/dshmp_swin_tiny.yaml.

Specifically, in dshmp_swin_tiny.yaml, I changed IMS_PER_BATCH from 8 to 4 so that it works with my 4-device training.

SOLVER:
  IMS_PER_BATCH: 8
  BASE_LR: 0.00005
  STEPS: (40000, 50000)
  MAX_ITER: 55000
  WARMUP_FACTOR: 1.0
  WARMUP_ITERS: 10
  WEIGHT_DECAY: 0.05
  OPTIMIZER: "ADAMW"
  BACKBONE_MULTIPLIER: 0.1
  CLIP_GRADIENTS:
    ENABLED: True
    CLIP_TYPE: "full_model"
    CLIP_VALUE: 0.01
    NORM_TYPE: 2.0

So I'm not sure what "adjust the learning rate" means here, since each device only ever processes 1 sample, i.e., the per-device batch size is always 1. Or do you mean the LR should be calculated from the "global batch" rather than the per-device "local batch"?

Can you give me more instructions on how to fix this? Thanks again~
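For context, my current understanding (an assumption on my side, please correct me if wrong) is that detectron2 treats SOLVER.IMS_PER_BATCH as the global batch size, split evenly across GPUs by the train loader, roughly like this:

# Assumption, not repo code: IMS_PER_BATCH is the *global* batch size in
# detectron2 and is divided evenly across all processes.
from detectron2.utils import comm

ims_per_batch = 4                            # my modified global batch
world_size = comm.get_world_size()           # 4 when launched with --num-gpus 4
per_gpu_batch = ims_per_batch // world_size  # -> 1 sample per GPU
print(per_gpu_batch)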

@cocoshe (Author) commented Oct 1, 2024

Additionally, I just noticed the REFERENCE_WORLD_SIZE param and the auto_scale_workers function in the detectron2 (d2) framework; they are described in the official documentation.

In short, auto_scale_workers adjusts the hyperparameters automatically when REFERENCE_WORLD_SIZE is set.

If I understand it correctly, it would be better to set REFERENCE_WORLD_SIZE: 8 under SOLVER in dshmp_swin_tiny.yaml, since your training setting is "total batch size 8 on 8 devices, 1 sample per device (local batch size 1)". Then someone who trains on 4 devices would not need to change any hyperparameters manually and could avoid parameter-setting mistakes (like mine): REFERENCE_WORLD_SIZE: 8 states explicitly that you trained with IMS_PER_BATCH: 8 on 8 devices, and if the number of GPUs changes, the other parameters are rescaled accordingly (sketched below).
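Here is a minimal sketch of what I expect the auto-scaling to do for a 4-GPU run (my reading of detectron2's DefaultTrainer.auto_scale_workers; the exact rescaled numbers are my assumption, not something verified against this repo):

# Sketch of detectron2's auto-scaling, applied to the SOLVER settings
# from dshmp_swin_tiny.yaml; not DsHmP code.
from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer

cfg = get_cfg()
cfg.SOLVER.IMS_PER_BATCH = 8            # reference: 8 GPUs x 1 sample
cfg.SOLVER.BASE_LR = 0.00005
cfg.SOLVER.STEPS = (40000, 50000)
cfg.SOLVER.MAX_ITER = 55000
cfg.SOLVER.WARMUP_ITERS = 10
cfg.SOLVER.REFERENCE_WORLD_SIZE = 8     # the line I propose adding to the config

cfg = DefaultTrainer.auto_scale_workers(cfg, 4)   # launched with --num-gpus 4
print(cfg.SOLVER.IMS_PER_BATCH)         # 4
print(cfg.SOLVER.BASE_LR)               # 2.5e-05
print(cfg.SOLVER.MAX_ITER)              # 110000
print(cfg.SOLVER.STEPS)                 # (80000, 100000)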

With this, the questions about training on fewer than 8 devices raised in issue #11 may no longer be a problem.

Anyway, I just added REFERENCE_WORLD_SIZE: 8 to the config and started training on 4 devices; it has been running for a while without errors. I will check again in a few hours and hope everything goes well.

If anything above is wrong, please let me know, and I would also welcome any other advice on the process.

@cocoshe (Author) commented Oct 4, 2024


The results turned out to be:

offline:

J: 0.5016960205487214
F: 0.6057336038759262
J&F: 0.5537148122123239
time: 141.9649 s

online:

0.4510879405 < 0.46 (reported in the paper)
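(As a sanity check on the offline numbers, assuming J&F is simply the mean of J and F:)

# Sanity check: J&F as the mean of region similarity J and contour accuracy F.
j = 0.5016960205487214
f = 0.6057336038759262
print((j + f) / 2)   # ~0.553715, matching the J&F reported above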
