Use steps instead of epochs as a basic progress unit #515

@tjhunter

Is your feature request related to a problem? Please describe.

The current codebase is organized around both epochs and steps, but there are a couple of issues:

  • cf.istep is injected into the config and advances in increments of batch_size_per_gpu. Why batch_size_per_gpu and not just +1?
  • epoch is defined in the literature as a full pass over the dataset, but we hardcode the data length as 4096
  • the metrics pipeline does not log the steps
  • restarting a run should pick up from the right step (verify that it is the case)
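The points above can be made concrete with a minimal sketch of a step-based progress tracker. Everything here is hypothetical and not taken from the existing codebase: the `TrainProgress` class, its field names, and the idea of deriving samples and epochs from the step count are assumptions for illustration only. The step counter advances by exactly 1 per optimizer step, and the state round-trips through a plain dict so a restarted run can resume from the saved step.

```python
from dataclasses import dataclass, asdict


@dataclass
class TrainProgress:
    """Hypothetical progress tracker; not part of the existing codebase."""

    batch_size_per_gpu: int
    world_size: int = 1  # total number of ranks (nodes * GPUs per node)
    step: int = 0        # number of optimizer steps taken so far

    def advance(self) -> None:
        # One optimizer step advances the counter by exactly 1,
        # independent of the batch size.
        self.step += 1

    @property
    def samples_seen(self) -> int:
        # Samples consumed across all ranks, derived from the step count.
        return self.step * self.batch_size_per_gpu * self.world_size

    def epoch(self, dataset_len: int) -> float:
        # Fractional number of full passes over a dataset of the given length.
        return self.samples_seen / dataset_len

    def state_dict(self) -> dict:
        # Plain dict that can be stored alongside a checkpoint.
        return asdict(self)

    @classmethod
    def from_state_dict(cls, state: dict) -> "TrainProgress":
        # Rebuild the tracker when a run is restarted.
        return cls(**state)
```

With this shape, the metrics pipeline could log `step` directly, and the epoch becomes a derived quantity that depends on the real dataset length rather than a hardcoded constant.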

The goal of this task is:

  • alignment on how we measure training progress: steps or epochs? Steps seem to be the natural default given the size of the dataset. Should a step be invariant to the number of GPUs and nodes? I think so, but this is not how the code is currently implemented
  • implement these decisions
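One possible reading of "steps invariant to the number of GPUs and nodes" is to fix a global batch size and derive the per-GPU batch size from it, so that a given step number always corresponds to the same number of samples regardless of hardware. The sketch below assumes that reading; the function names and the `global_batch_size` parameter are hypothetical and do not exist in the current config.

```python
def per_gpu_batch_size(global_batch_size: int, num_nodes: int, gpus_per_node: int) -> int:
    """Split a fixed global batch evenly across all ranks (hypothetical helper)."""
    world_size = num_nodes * gpus_per_node
    if global_batch_size % world_size != 0:
        raise ValueError(
            f"global_batch_size={global_batch_size} is not divisible by "
            f"world_size={world_size}"
        )
    return global_batch_size // world_size


def samples_at_step(step: int, global_batch_size: int) -> int:
    """Samples consumed after `step` optimizer steps; independent of the GPU count."""
    return step * global_batch_size
```

Under this assumption, moving a run from 1 node with 4 GPUs to 2 nodes with 4 GPUs each only changes the per-GPU batch size (for example from 16 to 8 with a global batch of 64), while step 100 still means 6400 samples in both setups.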

Metadata

Labels

infra (issues related to infrastructure), model (related to model training or definition, not generic infra)

Status

Todo
