Labels
infra: Issues related to infrastructure
model: Related to model training or definition (not generic infra)
Description
Is your feature request related to a problem? Please describe.
The current codebase tracks training progress with both epochs and steps, but this raises several issues:
- `cf.istep` is injected into the config and incremented by `batch_size_per_gpu`. Why `batch_size_per_gpu` rather than simply +1?
- An epoch is defined in the literature as a full pass over the dataset, but we hardcode the data length as 4096.
- The metrics pipeline does not log the step.
- Restarting a run should resume from the correct step (we need to verify that this is the case).
The goals of this task are:
- Align on how we measure training progress: steps or epochs? Steps seem to be the natural default given the size of the dataset. Should steps be invariant to the number of GPUs and the number of nodes? I think so, but this is not how the code is currently implemented.
- Implement these decisions.
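
To make the discussion concrete, here is a minimal sketch of one possible convention, not a description of the current code. It counts one step per optimizer update (+1) and derives a "samples seen" figure that does not depend on how many GPUs or nodes were used. `TrainingProgress`, `world_size`, `samples_seen`, and the `state_dict` helpers are hypothetical names introduced for illustration. Only `batch_size_per_gpu` comes from the existing config.

```python
from dataclasses import dataclass, asdict


@dataclass
class TrainingProgress:
    """Hypothetical progress tracker; not part of the existing codebase."""

    batch_size_per_gpu: int
    world_size: int  # assumed: total number of GPUs across all nodes
    step: int = 0    # number of optimizer updates performed so far

    @property
    def samples_seen(self) -> int:
        # Global number of samples consumed; comparable across hardware setups.
        return self.step * self.batch_size_per_gpu * self.world_size

    def advance(self) -> None:
        # One optimizer update is one step, regardless of batch size.
        self.step += 1

    def state_dict(self) -> dict:
        # Plain dict that could be stored alongside a checkpoint.
        return asdict(self)

    @classmethod
    def from_state_dict(cls, state: dict) -> "TrainingProgress":
        # Rebuild the tracker so a restarted run resumes at the saved step.
        return cls(**state)


# Same global work on two hardware setups: the step counts differ,
# the samples-seen figure does not.
single_gpu = TrainingProgress(batch_size_per_gpu=8, world_size=1)
for _ in range(16):
    single_gpu.advance()

four_gpus = TrainingProgress(batch_size_per_gpu=8, world_size=4)
for _ in range(4):
    four_gpus.advance()

assert single_gpu.samples_seen == four_gpus.samples_seen

# Simulated restart: the restored tracker continues from the saved step.
restored = TrainingProgress.from_state_dict(four_gpus.state_dict())
assert restored.step == four_gpus.step
```

Under this assumption, logging both `step` and `samples_seen` in the metrics pipeline would keep runs on different GPU counts comparable. Whether `step` itself should be redefined to be hardware-invariant is the open question of this issue.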
Describe the solution you'd like
No response
Describe alternatives you've considered
No response
Additional context
No response
Organisation
No response
Status
Todo