Code for the paper *Learning sparse features can lead to overfitting in neural networks*, which appeared at NeurIPS 2022.
Details for running the experiments can be found in `experiments.md`.
## Run `main.py`
**Architecture.** Train a one-hidden-layer fully-connected neural network on the mean squared error loss.
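For orientation, a minimal PyTorch sketch of such a setup (the layer sizes, activation, learning rate, and training loop below are placeholders, not the ones used by `main.py`):

```python
import torch
import torch.nn as nn

# Hypothetical sizes; main.py sets the real ones through its arguments.
d, h, n = 3, 100, 256

# One-hidden-layer fully-connected network.
model = nn.Sequential(nn.Linear(d, h), nn.ReLU(), nn.Linear(h, 1))

x = torch.randn(n, d)               # placeholder inputs
y = x.norm(dim=1, keepdim=True)     # placeholder target (the norm target described below)

opt = torch.optim.SGD(model.parameters(), lr=1e-2)
for _ in range(1000):
    opt.zero_grad()
    loss = ((model(x) - y) ** 2).mean()   # mean squared error
    loss.backward()
    opt.step()
```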
**Data-set.** Data samples `x` can be drawn from
- the `d`-dimensional normal distribution, `args.dataset = normal`;
- the uniform distribution inside the sphere, `uniform`;
- or the uniform distribution on the spherical surface, `sphere`.
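As a rough illustration, the three options could be sampled as follows with NumPy (the repository's own sampling code may use different conventions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 3

# args.dataset = normal: standard d-dimensional Gaussian.
x_normal = rng.standard_normal((n, d))

# args.dataset = sphere: uniform on the spherical surface
# (normalize Gaussian samples to unit norm).
g = rng.standard_normal((n, d))
x_sphere = g / np.linalg.norm(g, axis=1, keepdims=True)

# args.dataset = uniform: uniform inside the unit ball
# (uniform direction, radius distributed as u**(1/d)).
u = rng.random((n, 1))
x_uniform = x_sphere * u ** (1.0 / d)
```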
The target is either
- the sample norm `||x||`, `args.target = norm`;
- or a Gaussian random field computed through an (approximately) infinite-width teacher network (`args.target = teacher`) with `relu` or `abs` activation function raised to some power `a`.

(Figure: examples of the Gaussian random field defined on the sphere in `d=3`.)
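One way to generate such a target is a very wide random teacher network whose units apply `relu` or `abs` raised to the power `a`; by the central limit theorem its output approaches a Gaussian random field as the width grows. The weight distribution, normalization, and default exponent below are assumptions, not necessarily those of `main.py`:

```python
import numpy as np

rng = np.random.default_rng(0)
d, H = 3, 100_000      # input dimension and (large) teacher width
a = 1.0                # exponent applied to the activation (hypothetical default)

# Random teacher parameters: first-layer weights on the unit sphere,
# Gaussian output coefficients.
w = rng.standard_normal((H, d))
w /= np.linalg.norm(w, axis=1, keepdims=True)
c = rng.standard_normal(H)

def teacher(x, act=np.abs):
    """Approximate Gaussian random field from a wide random network."""
    pre = x @ w.T                  # (n, H) pre-activations
    feats = act(pre) ** a          # 'abs' -> np.abs; 'relu' -> lambda z: np.maximum(z, 0)
    return feats @ c / np.sqrt(H)  # averaging over many units gives a near-Gaussian field

# Evaluate on a few points of the sphere in d = 3.
x = rng.standard_normal((5, d))
x /= np.linalg.norm(x, axis=1, keepdims=True)
print(teacher(x))
```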
**Algorithm.** Full-batch gradient descent can be performed with
- the alpha-trick, by setting `args.alpha` larger (lazy) or smaller (feature) than one;
- a regularization `args.l` on the l2 norm of the parameters (`args.reg = 'l2'`), on the path norm `||w1|| * |w2|` (`args.reg = 'l1'`), or on the l1 norm `|w2|` obtained by fixing the first-layer weights on the unit sphere `||w1|| = 1` (`args.reg = 'l1'` and `args.w1_norm1 = 1`).
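Below is a sketch of how the alpha-trick and these penalties could be written; the exact scaling with `alpha` and the precise definition of each penalty are conventions of `main.py` that are only guessed here:

```python
import copy
import torch
import torch.nn as nn

alpha = 10.0   # > 1: lazy regime, < 1: feature-learning regime (scaling convention assumed)
l = 1e-4       # regularization strength, cf. args.l

model = nn.Sequential(nn.Linear(3, 100), nn.ReLU(), nn.Linear(100, 1, bias=False))
model0 = copy.deepcopy(model)          # frozen copy of the network at initialization
for p in model0.parameters():
    p.requires_grad_(False)

def predict(x):
    # A common form of the alpha-trick: rescaled predictor that is zero at initialization.
    # Some formulations also divide the fitting loss by alpha**2.
    return alpha * (model(x) - model0(x))

def penalty():
    w1 = model[0].weight                                   # one row per hidden unit
    w2 = model[2].weight                                   # output weights
    l2 = sum((p ** 2).sum() for p in model.parameters())   # args.reg = 'l2'
    path = (w1.norm(dim=1) * w2.abs().squeeze(0)).sum()    # args.reg = 'l1' (path norm)
    # With args.w1_norm1 = 1 the rows of w1 stay on the unit sphere,
    # so the path norm reduces to the l1 norm of w2.
    return l * path                                        # pick the term matching args.reg

x = torch.randn(64, 3)
y = x.norm(dim=1, keepdim=True)
loss = ((predict(x) - y) ** 2).mean() + penalty()
loss.backward()
```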
Additionally, conic gradient descent [Chizat and Bach, 2018] can be performed by setting `args.conic_gd = 1`.
## Run `main_krr.py`
**Student kernel.** The analytical NTK of an infinite-width one-hidden-layer neural network.
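For reference, a sketch of this kernel for `relu` activation using the standard arc-cosine closed form (the normalization and whether biases are included may differ in `main_krr.py`):

```python
import numpy as np

def relu_ntk(x1, x2):
    """NTK of a one-hidden-layer ReLU network at infinite width (arc-cosine form)."""
    n1 = np.linalg.norm(x1, axis=1, keepdims=True)        # (n, 1)
    n2 = np.linalg.norm(x2, axis=1, keepdims=True)        # (m, 1)
    u = np.clip((x1 @ x2.T) / (n1 * n2.T), -1.0, 1.0)
    theta = np.arccos(u)
    # E[relu(w.x) relu(w.x')] and E[relu'(w.x) relu'(w.x')] for w ~ N(0, I):
    k1 = (n1 * n2.T) * (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2 * np.pi)
    k0 = (np.pi - theta) / (2 * np.pi)
    # Second-layer contribution (k1) plus first-layer contribution ((x.x') * k0).
    return k1 + (x1 @ x2.T) * k0

x = np.random.default_rng(0).standard_normal((4, 3))
K = relu_ntk(x, x)   # (4, 4) Gram matrix
```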
**Data-set.** See above.
**Ridge.** The regularization (ridge) parameter can be set with `args.l` (default: `0`).
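A minimal sketch of the ridge-regression solve itself, with a stand-in linear kernel in place of the NTK; how the ridge term is normalized (e.g. whether it is scaled by the number of samples) is defined in `main_krr.py`:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
l = 1e-3   # ridge parameter, cf. args.l (default 0, i.e. interpolation)

# Placeholder data and kernel; main_krr.py uses the analytical NTK instead.
x = rng.standard_normal((n, d))
y = np.linalg.norm(x, axis=1)
K = x @ x.T                                   # stand-in Gram matrix

# Kernel ridge regression: solve (K + l * I) alpha = y,
# then predict on new points with k(x_test, x_train) @ alpha.
alpha = np.linalg.solve(K + l * np.eye(n), y)
y_hat = K @ alpha                             # predictions on the training points
```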