Hi, in my experiment I used the Moving-MNIST dataset, but I ran into some problems during training that I couldn't find an answer to:
I tried training a small network with only num_latent_scale=1 and num_groups_per_scale=1. I then noticed that no gradients were generated for some parameters, including prior.ftr0, and training stopped with an error.
If I increase num_groups_per_scale from 1 to 2 or more, I still get NaN in some of the gradients in the first iteration, but they then go away and training continues without errors.
Could you give me any hints or clues as to why this behavior happens? Thank you in advance!
Hi, getting no gradient for num_latent_scale=1 and num_groups_per_scale=1 is odd. By no gradients, do you mean the gradients were zero or None? If they were zero, do you see any change after training for some time?
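For reference, here is a minimal sketch of how you could check this right after loss.backward(); `model` stands for whatever module you are training and is not from the NVAE code:

```python
# Hypothetical check: run after loss.backward() to see which parameters
# received no gradient at all (None) versus an all-zero gradient.
for name, param in model.named_parameters():
    if param.grad is None:
        print(f"{name}: grad is None")
    elif param.grad.abs().sum() == 0:
        print(f"{name}: grad is all zeros")
```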
Getting NaN in the gradients is normal, especially at the beginning of training. We use mixed precision, which means most operations are cast to FP16. Because of the lower precision, NaNs can appear easily, and it is the job of autocast and grad_scalar to skip those gradient updates and scale the loss so that training is not affected.
You can disable mixed precision by passing enabled=False to autocast() at this line:
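For context, the usual PyTorch pattern looks roughly like the sketch below (not the exact NVAE training loop; `model`, `optimizer`, and the data loop are placeholders). GradScaler detects inf/NaN gradients produced under FP16 and skips that optimizer step, and passing enabled=False to autocast() runs everything in FP32:

```python
import torch
from torch.cuda.amp import autocast, GradScaler

grad_scalar = GradScaler()  # NVAE refers to its scaler as grad_scalar

for x in train_loader:                      # placeholder data loop
    optimizer.zero_grad()
    # Set enabled=False here to run the forward pass in full FP32
    # and avoid FP16-related NaNs.
    with autocast(enabled=True):
        loss = model(x)                     # placeholder forward returning a scalar loss
    grad_scalar.scale(loss).backward()      # scale the loss to avoid FP16 underflow
    grad_scalar.step(optimizer)             # skips the step if inf/NaN grads are found
    grad_scalar.update()                    # adjusts the scale factor for the next step
```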