Invalid gradient when finetuning and learning rate with gradient clip setting #65

skill-diver · 2024-07-25T16:10:50Z

Hi Author,

Thank you for sharing this project and for kindly answering my previous questions. I have a few questions about training:

1. What are your default learning rate and gradient clipping settings when training from scratch?
2. I tried replacing the DINO part of the encoder with another ViT, and performance got worse, so I decided to fine-tune your weights instead. However, I get invalid gradients if I use the learning rate and gradient clipping from the code. To work around this, I unfreeze layers progressively over the epochs, but performance improves very slowly.
3. Do you have any better suggestions for avoiding invalid gradients when fine-tuning your model with a new ViT? My current idea is to train from scratch, but would that take too much time?

Thank you so much.

Parskatt · 2024-07-25T16:29:46Z

It differs depending on the encoder and decoder; the settings should be in the train experiment. Grad clip is 0.01, I think. Basically, you can set the grad clip threshold very low so that all gradients are clipped. This helped a bit with stability.
The model is trained with a step LR schedule, so that at the end the LR is 1/10 of the original one. If you want to fine-tune, I suggest using that LR.
It's probably difficult to replace the ViT without training from scratch, since the features will be different.
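
A minimal sketch of the two settings mentioned above, in PyTorch. The model, the data, and `base_lr` are placeholders chosen for illustration and are not taken from the repository; only the clip value of 0.01 and the "1/10 of the original LR" rule come from the reply, and the clip value was itself given from memory.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Placeholder model; the real matcher would go here.
model = nn.Linear(8, 2)

base_lr = 1e-4               # assumed from-scratch LR, for illustration only
finetune_lr = base_lr / 10   # the LR the step schedule ends at
grad_clip_norm = 0.01        # value recalled in the reply above

optimizer = torch.optim.AdamW(model.parameters(), lr=finetune_lr)

def train_step(x, y):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    # Rescales all gradients in place so their global norm is at most
    # grad_clip_norm, and returns the norm measured before clipping.
    norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip_norm)
    optimizer.step()
    return loss.item(), norm_before.item()

loss, norm_before = train_step(torch.randn(16, 8), torch.randn(16, 2))
```

With a threshold this low, almost every step is clipped, so the update direction is kept while its magnitude is bounded.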

Sorry for the lack of detail; I'm on my phone and can't check things right now.

If you have issues with stability, you could check which params give NaNs and manually use fp32 there.
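
One way to do that check, sketched in PyTorch. The helper name `nonfinite_grad_params` and the toy model are hypothetical, and the NaN is injected by hand only so that the check has something to find.

```python
import torch
from torch import nn

def nonfinite_grad_params(model: nn.Module) -> list[str]:
    """Return names of parameters whose gradient contains NaN or Inf."""
    bad = []
    for name, p in model.named_parameters():
        if p.grad is not None and not torch.isfinite(p.grad).all():
            bad.append(name)
    return bad

# Toy model standing in for the real network.
model = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 1))
model(torch.randn(2, 4)).sum().backward()

# Simulate an overflow in one layer's gradient.
model[2].weight.grad[0, 0] = float("nan")

print(nonfinite_grad_params(model))
```

Calling this right after `loss.backward()` points to the modules involved. Those modules could then be run in fp32, for example by casting their inputs with `.float()` inside a `torch.autocast(..., enabled=False)` region.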

You might also want to freeze the batchnorm layers of the network; I've found that batchnorm can cause a lot of issues.
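
A sketch of what freezing batchnorm could look like in PyTorch. The helper `freeze_batchnorm` and the small network are hypothetical and only illustrate the idea; this is not the repository's own code.

```python
import torch
from torch import nn

BN_TYPES = (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d, nn.SyncBatchNorm)

def freeze_batchnorm(model: nn.Module) -> None:
    """Put batchnorm layers in eval mode and stop training their affine params."""
    for m in model.modules():
        if isinstance(m, BN_TYPES):
            m.eval()  # use stored running stats and stop updating them
            for p in m.parameters():
                p.requires_grad_(False)

# Toy network standing in for the real model.
net = nn.Sequential(nn.Conv2d(3, 4, 3), nn.BatchNorm2d(4), nn.ReLU())
net.train()
freeze_batchnorm(net)  # must be repeated after every later net.train() call
out = net(torch.randn(2, 3, 8, 8))
```

Note that `net.train()` switches batchnorm layers back to training mode, so the helper has to be called again after it.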

skill-diver · 2024-07-27T00:30:28Z

How many days did it take to train the RoMa model? I also found that replacing DINO with another ViT gives bad training results.

Devoe-97 · 2024-11-11T11:22:37Z

How many days did it take to train the RoMa model? I also found that replacing DINO with another ViT gives bad training results.

I'm stuck on the same problem. Do you have any ideas on how to solve it?

Parskatt · 2024-11-11T11:24:18Z

It was trained for 4 days with 4 A100 GPUs. You can also avoid issues by using bfloat16 instead of float16.
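
A minimal sketch of switching mixed precision to bfloat16 with `torch.autocast`. The model and data are placeholders, and the device selection is an assumption made so that the snippet also runs without a GPU.

```python
import torch
from torch import nn

# Use the GPU only if it supports bfloat16; otherwise fall back to CPU.
use_cuda = torch.cuda.is_available() and torch.cuda.is_bf16_supported()
device_type = "cuda" if use_cuda else "cpu"

model = nn.Linear(8, 2).to(device_type)  # placeholder model
x = torch.randn(4, 8, device=device_type)

# The forward pass runs in bfloat16 where autocast considers it safe.
with torch.autocast(device_type=device_type, dtype=torch.bfloat16):
    out = model(x)

# Compute the loss in fp32; parameters and their gradients stay in fp32.
loss = out.float().pow(2).mean()
loss.backward()
```

bfloat16 has the same exponent range as float32, so activations and gradients overflow far less readily than in float16, and a `GradScaler` is usually not needed.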

Devoe-97 · 2024-11-11T11:27:18Z

3. My current idea is to train from scratch

The gradient is NaN when training from scratch. Is there any solution to this?
