In both the Jupyter notebooks and the paper, I noticed that instead of Adam, the most commonly used optimizer for transformers, you used Adagrad for all of the experiments. Is there a reason behind this, or is it simply an empirical observation?
Additionally, are other newly developed optimizers (RAdam, NovoGrad, DiffGrad, etc.) compatible with the method introduced, or do they defeat its purpose?
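For context on the comparison being asked about, here is a minimal pure-Python sketch of the two update rules (scalar parameter, illustrative only; this is not the repo's training code, and the hyperparameter defaults are just the common ones):

```python
def adagrad_step(w, g, state, lr=0.01, eps=1e-10):
    """Adagrad: the per-parameter step size shrinks as squared
    gradients accumulate, so frequently-updated weights slow down."""
    state["sum_sq"] = state.get("sum_sq", 0.0) + g * g
    return w - lr * g / (state["sum_sq"] ** 0.5 + eps)

def adam_step(w, g, state, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """Adam: exponential moving averages of the gradient and its
    square, with bias correction for the first few steps."""
    t = state["t"] = state.get("t", 0) + 1
    state["m"] = b1 * state.get("m", 0.0) + (1 - b1) * g
    state["v"] = b2 * state.get("v", 0.0) + (1 - b2) * g * g
    m_hat = state["m"] / (1 - b1 ** t)
    v_hat = state["v"] / (1 - b2 ** t)
    return w - lr * m_hat / (v_hat ** 0.5 + eps)
```

Both rules only maintain per-parameter statistics, which is why optimizers of this family are usually drop-in replacements for one another in a training loop; whether swapping one in interacts with the method in the paper is exactly what I'm asking.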