
Fail to reproduce the results #12

Open · cocoshe opened this issue Oct 1, 2024 · 4 comments

Comments

@cocoshe commented Oct 1, 2024

I tried to reproduce the results with the following command, modified from the README:

CUDA_VISIBLE_DEVICES=7,6,5,4 python train_net_dshmp.py   --config-file configs/dshmp_swin_tiny.yaml     --num-gpus 4 --dist-url auto     MODEL.WEIGHTS ./model_final_86143f.pkl     OUTPUT_DIR ./ckpt_reproduce_4_cards

Then I ran inference with this command:

python train_net_dshmp.py     --config-file configs/dshmp_swin_tiny.yaml     --num-gpus 4 --dist-url auto --eval-only     MODEL.WEIGHTS ./ckpt_reproduce_4_cards/model_final.pth     OUTPUT_DIR output_dshmp_4_card_valid_u  DATASETS.TEST '("mevis_val",)'

I checked the offline score on valid_u:

J: 0.48
F: 0.59
J&F: 0.54

And the online score on valid turned out as follows:

[screenshot: online evaluation score of my reproduced checkpoint]

The official checkpoint gives a higher online score than my checkpoint trained with the command from the README:
[screenshot: online evaluation score of the official checkpoint]

I am wondering what went wrong with my training process. I simply followed the instructions in the README but failed to reproduce the results.

@heshuting555 (Owner)

Thank you for your interest in our work!

You are using different numbers of GPUs, resulting in different batch sizes, so you need to adjust the learning rate!
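To make that concrete, here is a rough sketch assuming the common linear scaling rule (this is an illustration, not a value taken from the repo):

# Rough sketch of linear LR scaling when the global batch size changes.
# Reference values mirror SOLVER in dshmp_swin_tiny.yaml; adjust to your setup.
reference_batch = 8                  # IMS_PER_BATCH for the 8-GPU runs
reference_lr = 5e-5                  # BASE_LR for the 8-GPU runs
new_batch = 4                        # e.g. 4 GPUs x 1 sample per GPU
scaled_lr = reference_lr * new_batch / reference_batch
print(scaled_lr)                     # 2.5e-05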

@cocoshe (Author) commented Oct 1, 2024

Thank you for your interest in our work!

You are using different numbers of GPUs, resulting in different batch sizes, so you need to adjust the learning rate!

Thanks for your timely reply, and sorry for omitting the information about my training devices.

I noticed that you train the model on 8 x 3090 GPUs (per the README), which takes about 17 hours, while I train on 4 devices.

However, if I just change --num-gpus 8 to --num-gpus 4, something goes wrong, since your training setting seems to be "batch size 8 on 8 devices, i.e., 1 sample per device" (is that right?). So, to make the code run without throwing an error, I modified the config file configs/dshmp_swin_tiny.yaml.

Specifically, in dshmp_swin_tiny.yaml, I changed IMS_PER_BATCH from 8 to 4 so that it works with my 4-device training.

SOLVER:
  IMS_PER_BATCH: 8
  BASE_LR: 0.00005
  STEPS: (40000, 50000)
  MAX_ITER: 55000
  WARMUP_FACTOR: 1.0
  WARMUP_ITERS: 10
  WEIGHT_DECAY: 0.05
  OPTIMIZER: "ADAMW"
  BACKBONE_MULTIPLIER: 0.1
  CLIP_GRADIENTS:
    ENABLED: True
    CLIP_TYPE: "full_model"
    CLIP_VALUE: 0.01
    NORM_TYPE: 2.0

So I'm not sure what "adjust the learning rate" means here, since each device only ever processes 1 sample, i.e., the per-device batch size is always 1. Or do you mean the LR should be calculated from the "global batch" rather than the per-device "local batch"?

Can you give me more instructions on how to fix this? Thanks again~
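For context, my current understanding (an assumption on my side, please correct me if wrong) is that detectron2 treats SOLVER.IMS_PER_BATCH as the global batch size, split evenly across GPUs by the train loader, roughly like this:

# Assumption, not repo code: IMS_PER_BATCH is the *global* batch size in
# detectron2 and is divided evenly across all processes.
from detectron2.utils import comm

ims_per_batch = 4                            # my modified global batch
world_size = comm.get_world_size()           # 4 when launched with --num-gpus 4
per_gpu_batch = ims_per_batch // world_size  # -> 1 sample per GPU
print(per_gpu_batch)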

@cocoshe (Author) commented Oct 1, 2024

Additionally, I just noticed the REFERENCE_WORLD_SIZE param and the auto_scale_workers function in the detectron2 (d2) framework; they are described in the official documentation.

In short, auto_scale_workers adjusts the hyperparameters automatically when REFERENCE_WORLD_SIZE is set.

If I understand it correctly, it would be better to set REFERENCE_WORLD_SIZE: 8 under SOLVER in dshmp_swin_tiny.yaml, since your training setting is "total batch size 8 on 8 devices, 1 sample per device (local batch size 1)". Then someone who trains on 4 devices would not need to change any hyperparameters manually and could avoid parameter-setting mistakes (like mine): REFERENCE_WORLD_SIZE: 8 states explicitly that you trained with IMS_PER_BATCH: 8 on 8 devices, and if the number of GPUs changes, the other parameters are rescaled accordingly (sketched below).
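Here is a minimal sketch of what I expect the auto-scaling to do for a 4-GPU run (my reading of detectron2's DefaultTrainer.auto_scale_workers; the exact rescaled numbers are my assumption, not something verified against this repo):

# Sketch of detectron2's auto-scaling, applied to the SOLVER settings
# from dshmp_swin_tiny.yaml; not DsHmP code.
from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer

cfg = get_cfg()
cfg.SOLVER.IMS_PER_BATCH = 8            # reference: 8 GPUs x 1 sample
cfg.SOLVER.BASE_LR = 0.00005
cfg.SOLVER.STEPS = (40000, 50000)
cfg.SOLVER.MAX_ITER = 55000
cfg.SOLVER.WARMUP_ITERS = 10
cfg.SOLVER.REFERENCE_WORLD_SIZE = 8     # the line I propose adding to the config

cfg = DefaultTrainer.auto_scale_workers(cfg, 4)   # launched with --num-gpus 4
print(cfg.SOLVER.IMS_PER_BATCH)         # 4
print(cfg.SOLVER.BASE_LR)               # 2.5e-05
print(cfg.SOLVER.MAX_ITER)              # 110000
print(cfg.SOLVER.STEPS)                 # (80000, 100000)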

With this, the questions about training on fewer than 8 devices raised in issue #11 may no longer be a problem.

Anyway, I just added REFERENCE_WORLD_SIZE: 8 to the config and started training on 4 devices; it has been running for a while without errors. I will check again in a few hours and hope everything goes well.

If anything above is wrong, please let me know, and I would also welcome any other advice on the process.

@cocoshe (Author) commented Oct 4, 2024


The results turned out to be:

offline:

J: 0.5016960205487214
F: 0.6057336038759262
J&F: 0.5537148122123239
time: 141.9649 s

online:

0.4510879405 < 0.46 (reported in the paper)
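(As a sanity check on the offline numbers, assuming J&F is simply the mean of J and F:)

# Sanity check: J&F as the mean of region similarity J and contour accuracy F.
j = 0.5016960205487214
f = 0.6057336038759262
print((j + f) / 2)   # ~0.553715, matching the J&F reported above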
