Example of the current per-step (iteration) diagnostic output provided by FRNN around epoch 22 of the D3D 0D model (run on 4 V100 GPUs of Traverse):
[0] step: 0 [ETA: 468568011.02s] [0.00/1789], loss: 1.05701 [1.05701] | walltime: 5.7374 | 8.47E+02 Examples/sec | 6.04E-01 sec/batch [92.3% calc., 7.7% sync.][batch = 512 = 128*4] [lr = 7.30E-05 = 1.83E-05*4]
The ETA provided in this example is clearly inaccurate (each epoch takes around 60s). Specifically, there are two types of issues:
For the first epoch in a given session, it gives a huge ETA, since `MPI_Model.num_so_far` is zero. This results in a `work_so_far` of 0 being passed to the code at plasma-python/plasma/models/mpi_runner.py (lines 613 to 616 in c82ba61), causing `total_time` to explode.
For later epochs within a session, it gives a minuscule ETA:
step: 0 [ETA: 0.55s] [1819.00/1789], loss: 0.98688 [0.98688] | walltime: 174.4240 | 8.93E+02 Examples/sec | 5.73E-01 sec/batch [96.1% calc., 3.9% sync.][batch = 512 = 128*4] [lr = 7.08E-05 = 1.77E-05*4]
For example, here are the successive ETAs reported during one later epoch:
ETA: 0.55s ETA: 22.14 ETA: 27.98 ETA: 31.63 ETA: 35.88 ETA: 38.45 ETA: 34.89 ETA: 36.21 ETA: 35.35 ETA: 35.56 ETA: 36.04 ETA: 35.88 ETA: 35.33 ETA: 34.49 ETA: 34.73 ETA: 34.29 ETA: 34.13 ETA: 33.51 ETA: 33.16 … ETA: 1.35s ETA: 1.06s ETA: 0.67s ETA: 0.11s ETA: -0.45
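To make the two failure modes concrete, here is a minimal sketch of a rate-based ETA, assuming the estimate is a simple proportional extrapolation of elapsed time over completed work. The function names `naive_eta` and `guarded_eta` and the parameters `eps` and `min_work` are hypothetical; this is not the code at lines 613 to 616 of `mpi_runner.py`, and the numbers are illustrative rather than a reproduction of the logged values.

```python
from typing import Optional


def naive_eta(time_so_far: float, work_so_far: float, work_total: float,
              eps: float = 1e-7) -> float:
    """Hypothetical naive estimate: extrapolate total time at a constant rate."""
    # With work_so_far == 0 the denominator is only eps, so total_time explodes.
    total_time = time_so_far * work_total / (work_so_far + eps)
    # With work_so_far > work_total, total_time < time_so_far: negative ETA.
    return total_time - time_so_far


def guarded_eta(time_so_far: float, work_so_far: float, work_total: float,
                min_work: float = 1.0) -> Optional[float]:
    """Hypothetical guarded variant: no estimate until some work is done,
    and never a negative remaining time."""
    if work_so_far < min_work:
        return None  # too early to extrapolate a rate
    remaining_work = max(work_total - work_so_far, 0.0)
    return time_so_far * remaining_work / work_so_far


if __name__ == "__main__":
    # First step of a session: no work recorded yet.
    print(naive_eta(0.6, 0.0, 1789.0))        # enormous value
    print(guarded_eta(0.6, 0.0, 1789.0))      # None
    # Work counter has overshot the nominal total (cf. 1819/1789 in the log).
    print(naive_eta(60.0, 1819.0, 1789.0))    # slightly negative
    print(guarded_eta(60.0, 1819.0, 1789.0))  # 0.0
```

Whether the counters should be reset per epoch or accumulated per session is a separate question that this sketch does not address.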