
Optimizer choices? #2

@Akamight

In both the Jupyter notebooks and the paper, I noticed that you used Adagrad for all of the experiments instead of Adam, the optimizer most commonly used for transformers. Is there a reason behind this, or is it simply an empirical observation?

Additionally, are other newly developed optimizers (RAdam, NovoGrad, DiffGrad, etc.) compatible with the method you introduce, or do they defeat its purpose? I have the kind of drop-in swap sketched below in mind.
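
For concreteness, a minimal sketch of what I mean, assuming a standard PyTorch setup (the model, learning rates, and optimizer choices below are placeholders for illustration, not taken from this repo):

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the model trained in the notebooks.
model = nn.Linear(10, 10)

# What the experiments currently use:
optimizer = torch.optim.Adagrad(model.parameters(), lr=1e-2)

# The kind of swap I am asking about -- is this drop-in compatible
# with the method, or does it interfere with it?
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# optimizer = torch.optim.RAdam(model.parameters(), lr=1e-3)  # requires torch >= 1.10
```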
