
Enumerate and make plan to develop strategies for regularizing "paths" in network #174

Open
blondegeek opened this issue Jan 8, 2021 · 2 comments
Labels
enhancement New feature or request

Comments

@blondegeek
Member

As per discussion in the January e3nn meeting, it is currently difficult to train models with large L. We suspect this is due to different paths in the network (in1 x in2 -> out... wash, rinse, repeat) having very different sensitivities to inputs and convolutions, which requires rigorous regularization.

Several strategies were suggested:

  • Choose different learning rates for parameters of different paths of or output L (@mariogeiger)
  • Change learning rates in time for different paths (@mariogeiger)
  • Initialize weights based on path (@mariogeiger)
    • e.g. Start with a purely scalar network that learns to include higher tensor contributions
  • Start off with only scalar network, train and then gradually add higher L's (@JoshRackers and @muhrin)
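As a rough sketch of the first two suggestions, each path "in1 x in2 -> out" could get its own learning rate, optionally decayed over time. The 1 / (2·L_out + 1) scaling and the decay schedule below are illustrative assumptions, not anything e3nn currently does:

```python
# Sketch: per-path learning rates keyed by the output angular momentum L_out.
# The scaling rule base_lr / (2*L_out + 1) and the 1/(1 + decay*step) decay
# are illustrative assumptions, not e3nn defaults.
base_lr = 1e-2

def path_lr(L_out, step, decay=1e-4):
    """Learning rate for a path with output angular momentum L_out at a given step."""
    return base_lr / (2 * L_out + 1) / (1.0 + decay * step)

# At step 0 the scalar (L_out = 0) path trains 5x faster than the L_out = 2 path,
# and every path's rate shrinks as training progresses.
lrs = {L: path_lr(L, step=0) for L in range(3)}
```

In PyTorch this maps naturally onto optimizer parameter groups: one group per path, each with its own `lr` that a scheduler can update per step.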

Please add to the thread if I missed anything.

@blondegeek blondegeek added the enhancement New feature or request label Jan 8, 2021
@muhrin
Contributor

muhrin commented Jan 9, 2021

Thanks Tess! One more idea comes to mind from a similar discussion I had with a friend some time ago.
(This may be obvious to people but I mainly flesh it out so I have a record of my thoughts).

If we think of training a neural network as a 'maximum likelihood learning' procedure, we are finding the weights w* that maximise the probability of observing the training data given the weights, p(D|w*).

But it's also possible to use Bayes' theorem to think in terms of a distribution of weights, w, in which case we have:

p(w|D) = p(D|w) p(w) / p(D)

Typically when people play this game they use p(w) = exp(-a E_w) / Z_w(a) and then argue that smaller weights should generalise better, and so use E_w = 1/2 \sum_i^W w_i^2, which makes p(w) a Gaussian with mean zero (Z_w is just a normaliser). But there's no reason why E_w has to be the same for all weights; in fact, our prior is that lower-L weights should be more important, so we could use Gaussians with higher means for lower values of L.
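Concretely, such an L-dependent Gaussian prior just adds a quadratic penalty to the training loss, pulling each weight toward an L-dependent mean. A minimal pure-Python sketch, where the mean schedule m_L = 1/(L+1) and the strength alpha are my own placeholder assumptions:

```python
# Negative log of an L-dependent Gaussian prior:
#   E_w = sum_L (alpha/2) * ||w_L - m_L||^2,
# with larger means m_L for lower L, so scalar weights are encouraged to stay large.
# The schedule m_L = 1 / (L + 1) and alpha = 0.1 are illustrative assumptions.
alpha = 0.1

def prior_mean(L):
    """Assumed prior mean for weights of paths with output angular momentum L."""
    return 1.0 / (L + 1)

def prior_penalty(weights_by_L, a=alpha):
    """weights_by_L: dict mapping L -> list of weights in paths with that output L."""
    return sum(
        0.5 * a * sum((w - prior_mean(L)) ** 2 for w in ws)
        for L, ws in weights_by_L.items()
    )

def prior_grad(weights_by_L, a=alpha):
    """Gradient of the penalty, a * (w - m_L), to be added to the data-loss gradient."""
    return {L: [a * (w - prior_mean(L)) for w in ws] for L, ws in weights_by_L.items()}
```

The penalty vanishes exactly when every weight sits at its prior mean, and its gradient pulls high-L weights toward zero more strongly (relative to their mean) than L=0 weights.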

At the moment I can't think of an intelligent way to set the means, and perhaps they needn't be fixed but could be hyperparameters themselves, related in some specific way. One appealing relation (although I admit I can't see exactly how this would work) is a type of 'principal component analysis' between the weights and the posterior quantity. Specifically, we want L=0 weights to align with the principal component and therefore explain most of the variance in the posterior, L=1 with the second component... well, you get the picture.

Anyway, just a brain dump for now, but I'm happy to discuss further if anyone is interested in fleshing this out.

P.S. I found this source useful for understanding Bayesian methods for NN: https://www.cs.cmu.edu/afs/cs/academic/class/15782-f06/slides/bayesian.pdf

@mariogeiger
Member

I share this statistical prior point of view.

At the same time, I think that if the network is deep we need to ensure that information propagates well through it, as explained in the paper "Deep Information Propagation".
