As per discussion in the January e3nn meeting, it is currently difficult to train models with large L. We suspect this is due to different paths in the network (in1 x in2 -> out... wash, rinse, repeat) having very different sensitivities to inputs and convolutions, which requires rigorous regularization.
Several strategies were suggested:
Choose different learning rates for the parameters of different paths or output L (@mariogeiger)
Change the learning rates over time for different paths (@mariogeiger) (a sketch of both ideas is below)
Please add to the thread if I missed anything.
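For concreteness, here is a minimal sketch of what both suggestions could look like with a plain PyTorch optimizer. The `l_of_param` helper, the `_l<k>` parameter-naming convention, and the geometric decay of the rate with L are assumptions for illustration, not anything e3nn provides.

```python
import re

import torch


def l_of_param(name: str) -> int:
    """Hypothetical helper: infer the output L of the path a parameter belongs to
    from its name. The "_l<k>" naming convention is an assumption about how the
    model is built, not something e3nn guarantees."""
    match = re.search(r"_l(\d+)", name)
    return int(match.group(1)) if match else 0


def per_l_param_groups(model: torch.nn.Module, base_lr: float = 1e-2, decay: float = 0.5):
    """One optimizer parameter group per output L, with the learning rate shrunk
    geometrically as L grows (lr_L = base_lr * decay**L)."""
    groups = {}
    for name, p in model.named_parameters():
        groups.setdefault(l_of_param(name), []).append(p)
    return [{"params": params, "lr": base_lr * decay ** l}
            for l, params in sorted(groups.items())]


# optimizer = torch.optim.Adam(per_l_param_groups(model))
#
# The second suggestion (changing the per-path rates over time) can reuse the same
# grouping: torch.optim.lr_scheduler.LambdaLR accepts one schedule per parameter
# group, so each L can get its own decay curve.
```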
Thanks Tess! One more idea comes to mind, from a similar discussion I had with a friend some time ago.
(This may be obvious to people but I mainly flesh it out so I have a record of my thoughts).
If we think of training a neural network as a 'maximum likelihood learning' procedure, we are finding the weights w* that maximise the probability of observing the training data given the weights, p(D|w*).
But it's also possible to use Bayes' theorem to think in terms of a distribution over the weights, w, in which case we have:
p(w|D) = p(D|w) p(w) / p(D)
Typically when people play this game they use p(w) = exp(-a E_w) / Z_w(a) and then argue that smaller weights should generalise better, so they use E_w = 1/2 \sum_i^W w_i^2, which makes p(w) a Gaussian with mean zero (Z_w is just a normaliser). But there's no reason why E_w has to be the same for all weights, and in fact we're trying to encode the prior that lower-L weights should be more important, so we could use Gaussians with higher means for lower values of L.
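Written out, the corresponding maximum a posteriori objective is just the usual loss plus a per-L, weight-decay-like penalty (the per-L means \mu_L and precisions a_L are simply my notation for the idea above):

```latex
-\log p(w \mid D)
  \;=\; -\log p(D \mid w)
  \;+\; \sum_{L} \frac{a_L}{2} \sum_{i \,\in\, \text{paths with output } L} \left(w_i - \mu_L\right)^2
  \;+\; \text{const}.
```

With every \mu_L = 0 and a single a this reduces to ordinary weight decay; letting \mu_L (or 1/a_L) be larger for small L is one way to express the "low L matters more" prior.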
At the moment I can't think of an intelligent way to set the means, and perhaps they needn't be fixed but could be hyperparameters themselves, related in a specific way. One appealing relation (although I admit I can't see exactly how this would work) is a kind of 'principal component analysis' between the weights and the posterior. Specifically, we want the L=0 weights to align with the first principal component and therefore explain most of the variance in the posterior, L=1 with the second component... well, you get the picture.
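Whatever the means end up being, dropping the prior into training is straightforward; a minimal sketch, reusing the same hypothetical mapping from parameter name to output L as in the learning-rate sketch above:

```python
import torch


def l_dependent_gaussian_prior(model: torch.nn.Module, l_of_param, mu, a):
    """Negative log of the L-dependent Gaussian prior (up to constants):
    sum_i a_{L(i)}/2 * (w_i - mu_{L(i)})^2, where L(i) is the output L of the
    path that parameter i belongs to.

    l_of_param: caller-supplied function mapping a parameter name to its L.
    mu, a:      per-L prior means and precisions (dicts or lists indexed by L).
    """
    penalty = 0.0
    for name, p in model.named_parameters():
        l = l_of_param(name)
        penalty = penalty + 0.5 * a[l] * ((p - mu[l]) ** 2).sum()
    return penalty


# In the training loop this is just one extra term in the loss, e.g.
#   loss = task_loss + l_dependent_gaussian_prior(
#       model, l_of_param, mu={0: 0.5, 1: 0.2, 2: 0.0}, a={0: 1e-4, 1: 1e-4, 2: 1e-4})
# With mu_L = 0 for every L this reduces to ordinary (per-group) weight decay.
```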
Anyway, just a brain dump for now, but I'm happy to discuss further if anyone is interested in fleshing this out.
At the same time, I think that if the network is deep we need to ensure that information propagates well through it, as explained in the paper Deep Information Propagation.
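One cheap way to check this at initialization is to record how the size of the activations changes with depth. A rough sketch (forward hooks on every submodule are just one way to do this, and nothing here is specific to e3nn):

```python
import torch


def activation_stats(model: torch.nn.Module, example_batch):
    """Record the standard deviation of each submodule's output on one forward
    pass, to see whether the signal grows or shrinks with depth (cf. the
    'Deep Information Propagation' paper). Assumes modules output tensors."""
    stats, handles = {}, []

    def make_hook(name):
        def hook(_module, _inputs, output):
            if isinstance(output, torch.Tensor):
                stats[name] = output.detach().float().std().item()
        return hook

    for name, module in model.named_modules():
        handles.append(module.register_forward_hook(make_hook(name)))
    with torch.no_grad():
        model(example_batch)
    for h in handles:
        h.remove()
    return stats  # e.g. {"layers.0": 1.02, "layers.1": 0.31, ...}
```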