Open
Description
Example of the current per-step (iteration) diagnostic output provided by FRNN around epoch 22 of the D3D 0D model (run on 4 V100 GPUs of Traverse):
[0] step: 0 [ETA: 468568011.02s] [0.00/1789], loss: 1.05701 [1.05701] | walltime: 5.7374 | 8.47E+02 Examples/sec | 6.04E-01 sec/batch [92.3% calc., 7.7% sync.][batch = 512 = 128*4] [lr = 7.30E-05 = 1.83E-05*4]
The ETA provided in this example is clearly inaccurate (each epoch takes around 60s). Specifically, there are two types of issues:
- The ETA computed in the first step of any epoch is always inaccurate.
- For later epochs within a session, the ETA increases nearly monotonically for many steps before starting to decrease nearly monotonically.
First step
For the first epoch in a given session, it gives a huge ETA since MPI_Model.num_so_far
is zero, resulting in work_so_far
of 0 being passed to:
plasma-python/plasma/models/mpi_runner.py
Lines 613 to 616 in c82ba61
causing
total_time
to explode.
- Probably should just refuse to give an ETA for the first step (or steps) of the first epoch
For later epochs within a session, it gives a minuscule ETA:
step: 0 [ETA: 0.55s] [1819.00/1789], loss: 0.98688 [0.98688] | walltime: 174.4240 | 8.93E+02 Examples/sec | 5.73E-01 sec/batch [96.1% calc., 3.9% sync.][batch = 512 = 128*4] [lr = 7.08E-05 = 1.77E-05*4]
- I think an error was introduced when I changed the 0-based indexing of the epochs 1-2 months ago.
Later steps in later epochs
E.g. here are the ETAs for some later epoch:
ETA: 0.55s
ETA: 22.14
ETA: 27.98
ETA: 31.63
ETA: 35.88
ETA: 38.45
ETA: 34.89
ETA: 36.21
ETA: 35.35
ETA: 35.56
ETA: 36.04
ETA: 35.88
ETA: 35.33
ETA: 34.49
ETA: 34.73
ETA: 34.29
ETA: 34.13
ETA: 33.51
ETA: 33.16
…
ETA: 1.35s
ETA: 1.06s
ETA: 0.67s
ETA: 0.11s
ETA: -0.45
- Consider using the measured runtimes of the previous epochs within this session to inform the ETA in later epochs.
Activity