As per discussion in the January e3nn meeting, it is currently difficult to train models with large L. We suspect this is due to different paths in the network (in1 x in2 -> out... wash, rinse, repeat) having very different sensitivities to inputs and convolutions, which requires rigorous regularization.
Several strategies were suggested:
Choose different learning rates for the parameters of different paths or output L (@mariogeiger)
Change the learning rates over time for different paths (@mariogeiger) (a sketch of both ideas is below)
Please add to the thread if I missed anything.
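For concreteness, here is a minimal sketch of what both suggestions could look like with a plain PyTorch optimizer. The `l_of_param` helper, the `_l<k>` parameter-naming convention, and the geometric decay of the rate with L are assumptions for illustration, not anything e3nn provides.

```python
import re

import torch


def l_of_param(name: str) -> int:
    """Hypothetical helper: infer the output L of the path a parameter belongs to
    from its name. The "_l<k>" naming convention is an assumption about how the
    model is built, not something e3nn guarantees."""
    match = re.search(r"_l(\d+)", name)
    return int(match.group(1)) if match else 0


def per_l_param_groups(model: torch.nn.Module, base_lr: float = 1e-2, decay: float = 0.5):
    """One optimizer parameter group per output L, with the learning rate shrunk
    geometrically as L grows (lr_L = base_lr * decay**L)."""
    groups = {}
    for name, p in model.named_parameters():
        groups.setdefault(l_of_param(name), []).append(p)
    return [{"params": params, "lr": base_lr * decay ** l}
            for l, params in sorted(groups.items())]


# optimizer = torch.optim.Adam(per_l_param_groups(model))
#
# The second suggestion (changing the per-path rates over time) can reuse the same
# grouping: torch.optim.lr_scheduler.LambdaLR accepts one schedule per parameter
# group, so each L can get its own decay curve.
```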
Thanks Tess! One more idea comes to mind, from a similar discussion I had with a friend some time ago.
(This may be obvious to people but I mainly flesh it out so I have a record of my thoughts).
If we think of training a neural network as a 'maximum likelihood learning' procedure, we are finding the weights w* that maximise the probability of observing the training data given the weights, p(D|w*).
But it's also possible to use Bayes' theorem to think in terms of a distribution over the weights, w, in which case we have:
p(w|D) = p(D|w) p(w) / p(D)
Typically when people play this game they use p(w) = exp(-a E_w) / Z_w(a) and then argue that smaller weights should generalise better, so they use E_w = 1/2 \sum_i^W w_i^2, which makes p(w) a Gaussian with mean zero (Z_w is just a normaliser). But there's no reason why E_w has to be the same for all weights, and in fact we're trying to encode the prior that lower-L weights should be more important, so we could use Gaussians with higher means for lower values of L.
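Written out, the corresponding maximum a posteriori objective is just the usual loss plus a per-L, weight-decay-like penalty (the per-L means \mu_L and precisions a_L are simply my notation for the idea above):

```latex
-\log p(w \mid D)
  \;=\; -\log p(D \mid w)
  \;+\; \sum_{L} \frac{a_L}{2} \sum_{i \,\in\, \text{paths with output } L} \left(w_i - \mu_L\right)^2
  \;+\; \text{const}.
```

With every \mu_L = 0 and a single a this reduces to ordinary weight decay; letting \mu_L (or 1/a_L) be larger for small L is one way to express the "low L matters more" prior.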
At the moment I can't think of an intelligent way to set the means, and perhaps they needn't be fixed but could be hyperparameters themselves, related in a specific way. One appealing relation (although I admit I can't see exactly how this would work) is a kind of 'principal component analysis' between the weights and the posterior. Specifically, we want the L=0 weights to align with the first principal component and therefore explain most of the variance in the posterior, L=1 with the second component... well, you get the picture.
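Whatever the means end up being, dropping the prior into training is straightforward; a minimal sketch, reusing the same hypothetical mapping from parameter name to output L as in the learning-rate sketch above:

```python
import torch


def l_dependent_gaussian_prior(model: torch.nn.Module, l_of_param, mu, a):
    """Negative log of the L-dependent Gaussian prior (up to constants):
    sum_i a_{L(i)}/2 * (w_i - mu_{L(i)})^2, where L(i) is the output L of the
    path that parameter i belongs to.

    l_of_param: caller-supplied function mapping a parameter name to its L.
    mu, a:      per-L prior means and precisions (dicts or lists indexed by L).
    """
    penalty = 0.0
    for name, p in model.named_parameters():
        l = l_of_param(name)
        penalty = penalty + 0.5 * a[l] * ((p - mu[l]) ** 2).sum()
    return penalty


# In the training loop this is just one extra term in the loss, e.g.
#   loss = task_loss + l_dependent_gaussian_prior(
#       model, l_of_param, mu={0: 0.5, 1: 0.2, 2: 0.0}, a={0: 1e-4, 1: 1e-4, 2: 1e-4})
# With mu_L = 0 for every L this reduces to ordinary (per-group) weight decay.
```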
Anyway, just a brain dump for now, but I'm happy to discuss further if anyone is interested in fleshing this out.
At the same time, I think that if the network is deep we need to ensure that information propagates well through it, as explained in the paper Deep Information Propagation.
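One cheap way to check this at initialization is to record how the size of the activations changes with depth. A rough sketch (forward hooks on every submodule are just one way to do this, and nothing here is specific to e3nn):

```python
import torch


def activation_stats(model: torch.nn.Module, example_batch):
    """Record the standard deviation of each submodule's output on one forward
    pass, to see whether the signal grows or shrinks with depth (cf. the
    'Deep Information Propagation' paper). Assumes modules output tensors."""
    stats, handles = {}, []

    def make_hook(name):
        def hook(_module, _inputs, output):
            if isinstance(output, torch.Tensor):
                stats[name] = output.detach().float().std().item()
        return hook

    for name, module in model.named_modules():
        handles.append(module.register_forward_hook(make_hook(name)))
    with torch.no_grad():
        model(example_batch)
    for h in handles:
        h.remove()
    return stats  # e.g. {"layers.0": 1.02, "layers.1": 0.31, ...}
```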