Use steps instead of epochs as a basic progress unit

### Is your feature request related to a problem? Please describe.

The current codebase tracks progress in both epochs and steps, which causes a few issues:
- `cf.istep` is injected into the config and incremented by `batch_size_per_gpu`. Why `batch_size_per_gpu` and not just +1?
- An epoch is defined in the literature as a full pass over the dataset, but we hardcode the data length as 4096.
- The metrics pipeline does not log the steps.
- Restarting a run should pick up from the right step (verify that this is the case).
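
To make the first and last points concrete, here is a minimal sketch of a step counter that increments by 1 per optimizer step, derives the number of samples seen instead of storing it, and round-trips through a checkpoint. `TrainingProgress` and all of its fields are hypothetical names used for illustration; nothing here reflects the current implementation around `cf.istep`.

```python
from dataclasses import asdict, dataclass


@dataclass
class TrainingProgress:
    """Hypothetical progress tracker; not part of the existing codebase."""

    batch_size_per_gpu: int
    world_size: int  # assumed: total number of GPUs across all nodes
    step: int = 0  # optimizer steps taken so far

    @property
    def samples_seen(self) -> int:
        # Derived from the step count, so the counter itself stays a plain +1.
        return self.step * self.batch_size_per_gpu * self.world_size

    def advance(self) -> None:
        # One optimizer step is one unit of progress.
        self.step += 1

    def state_dict(self) -> dict:
        # Plain dict so it can be stored next to the model checkpoint.
        return asdict(self)

    @classmethod
    def from_state_dict(cls, state: dict) -> "TrainingProgress":
        return cls(**state)


progress = TrainingProgress(batch_size_per_gpu=4, world_size=8)
for _ in range(3):
    progress.advance()

# Simulate a restart: the restored tracker continues from the saved step.
restored = TrainingProgress.from_state_dict(progress.state_dict())
restored.advance()
```

Logging `step` (and optionally `samples_seen`) from such an object in the metrics pipeline would also cover the third point, though where that hook belongs is left open here.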

The goals of this task are:
- Align on how we measure training progress: steps or epochs? Steps seem to be the natural default given the size of the dataset. Should steps be invariant to the number of GPUs and nodes? I think so, but this is not how the code is currently implemented.
- Implement these decisions.
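
As input to the invariance question, the sketch below shows how the number of optimizer steps needed to consume a fixed sample budget changes with the hardware layout, while the sample budget itself does not. The function names and the `gpus_per_node` / `num_nodes` parameters are assumptions for illustration, and 4096 is reused only because it is the hardcoded data length mentioned above.

```python
def global_batch_size(batch_size_per_gpu: int, gpus_per_node: int, num_nodes: int) -> int:
    """Samples consumed by one optimizer step across all devices."""
    return batch_size_per_gpu * gpus_per_node * num_nodes


def steps_for_sample_budget(
    total_samples: int, batch_size_per_gpu: int, gpus_per_node: int, num_nodes: int
) -> int:
    """Optimizer steps needed to see at least `total_samples` samples."""
    gbs = global_batch_size(batch_size_per_gpu, gpus_per_node, num_nodes)
    return -(-total_samples // gbs)  # ceiling division without floats


# Same sample budget, different hardware layouts.
single_gpu_steps = steps_for_sample_budget(4096, batch_size_per_gpu=4, gpus_per_node=1, num_nodes=1)
multi_node_steps = steps_for_sample_budget(4096, batch_size_per_gpu=4, gpus_per_node=4, num_nodes=2)
```

Under these assumptions a raw step count is not comparable across layouts unless the global batch size is held fixed, so one option is to define progress in samples seen and report steps alongside it.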


### Describe the solution you'd like

_No response_

### Describe alternatives you've considered

_No response_

### Additional context

_No response_

### Organisation

_No response_

Use steps instead of epochs as a basic progress unit #515
