ETA calculation is inaccurate #55

Open
@felker

@felker
Member

Example of the current per-step (iteration) diagnostic output provided by FRNN around epoch 22 of the D3D 0D model (run on 4 V100 GPUs of Traverse):

[0] step: 0 [ETA: 468568011.02s] [0.00/1789], loss: 1.05701 [1.05701] | walltime: 5.7374 | 8.47E+02 Examples/sec | 6.04E-01 sec/batch [92.3% calc., 7.7% sync.][batch = 512 = 128*4] [lr = 7.30E-05 = 1.83E-05*4]

The ETA provided in this example is clearly inaccurate (each epoch takes around 60s). Specifically, there are two types of issues:

  1. The ETA computed in the first step of any epoch is always inaccurate.
  2. For later epochs within a session, the ETA increases nearly monotonically for many steps before starting to decrease nearly monotonically.

First step

In the first epoch of a given session, the first step reports a huge ETA because MPI_Model.num_so_far is zero, so a work_so_far of 0 is passed to:

def estimate_remaining_time(self, time_so_far, work_so_far, work_total):
    eps = 1e-6
    total_time = 1.0*time_so_far*work_total/(work_so_far + eps)
    return total_time - time_so_far

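causing total_time to explode: with work_so_far = 0 the denominator reduces to eps = 1e-6, so total_time is roughly 10^6 * time_so_far * work_total.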

  • Probably should just refuse to give an ETA for the first step (or steps) of the first epoch
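A minimal sketch of what "refusing to give an ETA" could look like, written as a standalone function rather than the actual MPI_Model method. The min_work threshold, the format_eta helper, and the clamp to zero are assumptions for illustration and are not part of the existing FRNN code:

```python
from typing import Optional


def estimate_remaining_time(
    time_so_far: float,
    work_so_far: float,
    work_total: float,
    min_work: float = 1.0,  # hypothetical threshold, not in the original code
) -> Optional[float]:
    """Return estimated seconds remaining, or None if it is too early to tell."""
    # With (almost) no completed work, the linear extrapolation is meaningless,
    # so decline to estimate instead of dividing by a tiny epsilon.
    if work_so_far < min_work or time_so_far <= 0.0:
        return None
    total_time = time_so_far * work_total / work_so_far
    # Assumption: clamp at zero so an overshoot never prints a negative ETA.
    return max(0.0, total_time - time_so_far)


def format_eta(eta: Optional[float]) -> str:
    """Hypothetical formatter for the per-step diagnostic line."""
    return "ETA: n/a" if eta is None else "ETA: {:.2f}s".format(eta)
```

With this guard, the first step would print "ETA: n/a" rather than a value on the order of 10^8 seconds. Whether the threshold should be one example, one batch, or several steps is left open here.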

In later epochs within a session, the first step instead reports a minuscule ETA:

step: 0 [ETA: 0.55s] [1819.00/1789], loss: 0.98688 [0.98688] | walltime: 174.4240 | 8.93E+02 Examples/sec | 5.73E-01 sec/batch [96.1% calc., 3.9% sync.][batch = 512 = 128*4] [lr = 7.08E-05 = 1.77E-05*4]

  • I think an error was introduced when I changed the 0-based indexing of the epochs 1-2 months ago.

Later steps in later epochs

For example, here are the successive ETAs reported over the steps of one later epoch:


ETA: 0.55s
ETA: 22.14
ETA: 27.98
ETA: 31.63
ETA: 35.88
ETA: 38.45
ETA: 34.89
ETA: 36.21
ETA: 35.35
ETA: 35.56
ETA: 36.04
ETA: 35.88
ETA: 35.33
ETA: 34.49
ETA: 34.73
ETA: 34.29
ETA: 34.13
ETA: 33.51
ETA: 33.16
…
ETA: 1.35s
ETA: 1.06s
ETA: 0.67s
ETA: 0.11s
ETA: -0.45

  • Consider using the measured runtimes of the previous epochs within this session to inform the ETA in later epochs.
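One possible way to use previous epoch runtimes is sketched below. This is only an illustration under stated assumptions: the EpochETA class, its method names, and the progress-weighted blend of the historical mean with the current epoch's own extrapolation are all hypothetical and do not reflect how FRNN currently computes the ETA.

```python
from statistics import mean
from typing import List, Optional


class EpochETA:
    """Hypothetical helper that blends past epoch durations with current progress."""

    def __init__(self) -> None:
        self.epoch_durations: List[float] = []  # seconds per completed epoch

    def record_epoch(self, duration: float) -> None:
        """Store the measured wall time of an epoch that just finished."""
        self.epoch_durations.append(duration)

    def estimate(
        self, time_in_epoch: float, work_in_epoch: float, work_total: float
    ) -> Optional[float]:
        """Return estimated seconds left in the current epoch, or None."""
        if work_total <= 0:
            return None
        # Fraction of the epoch completed, clipped to [0, 1].
        frac = min(max(work_in_epoch / work_total, 0.0), 1.0)
        prior = mean(self.epoch_durations) if self.epoch_durations else None
        # Linear extrapolation from the current epoch alone, once work exists.
        current = time_in_epoch / frac if frac > 0.0 and time_in_epoch > 0.0 else None

        if prior is None and current is None:
            return None  # first step of the first epoch: nothing to go on
        if prior is None:
            total = current
        elif current is None:
            total = prior
        else:
            # Assumption: trust history early in the epoch, live data later.
            total = (1.0 - frac) * prior + frac * current
        return max(0.0, total - time_in_epoch)
```

A usage example: after record_epoch(60.0) and record_epoch(62.0), the first step of the next epoch would report about 61 s instead of 0.55 s, and the estimate would shift toward the in-epoch extrapolation as the epoch proceeds.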

          ETA calculation is inaccurate · Issue #55 · PPPLDeepLearning/plasma-python