ETA calculation is inaccurate

Example of the current per-step (iteration) diagnostic output provided by FRNN around epoch 22 of the D3D 0D model (run on 4 V100 GPUs of Traverse):
```
[0] step: 0 [ETA: 468568011.02s] [0.00/1789], loss: 1.05701 [1.05701] | walltime: 5.7374 | 8.47E+02 Examples/sec | 6.04E-01 sec/batch [92.3% calc., 7.7% sync.][batch = 512 = 128*4] [lr = 7.30E-05 = 1.83E-05*4]
```
The ETA provided in this example is clearly inaccurate (each epoch takes around 60s). Specifically, there are two types of issues:
1. The ETA computed in the first step of any epoch is always inaccurate. 
2. For later epochs within a session, the ETA increases nearly monotonically for many steps before starting to decrease nearly monotonically. 

### First step

For the first epoch in a given session, it gives a huge ETA since `MPI_Model.num_so_far` is zero, resulting in `work_so_far` of 0 being passed to:
https://github.com/PPPLDeepLearning/plasma-python/blob/c82ba61e339882a5af10b1052edc0348e16119f4/plasma/models/mpi_runner.py#L613-L616
causing `total_time` to explode. 
- [ ] Probably should just refuse to give an ETA for the first step (or steps) of the first epoch

For later epochs within a session, it gives a minuscule ETA:
```
step: 0 [ETA: 0.55s] [1819.00/1789], loss: 0.98688 [0.98688] | walltime: 174.4240 | 8.93E+02 Examples/sec | 5.73E-01 sec/batch [96.1% calc., 3.9% sync.][batch = 512 = 128*4] [lr = 7.08E-05 = 1.77E-05*4]
``` 
- [ ] I think an error was introduced when I changed the 0-based indexing of the epochs 1-2 months ago.

### Later steps in later epochs

E.g. here are the ETAs for some later epoch:
```

ETA: 0.55s
ETA: 22.14
ETA: 27.98
ETA: 31.63
ETA: 35.88
ETA: 38.45
ETA: 34.89
ETA: 36.21
ETA: 35.35
ETA: 35.56
ETA: 36.04
ETA: 35.88
ETA: 35.33
ETA: 34.49
ETA: 34.73
ETA: 34.29
ETA: 34.13
ETA: 33.51
ETA: 33.16
…
ETA: 1.35s
ETA: 1.06s
ETA: 0.67s
ETA: 0.11s
ETA: -0.45
```
- [ ] Consider using the measured runtimes of the previous epochs within this session to inform the ETA in later epochs. 

	def estimate_remaining_time(self, time_so_far, work_so_far, work_total):
	eps = 1e-6
	total_time = 1.0time_so_farwork_total/(work_so_far + eps)
	return total_time - time_so_far

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

ETA calculation is inaccurate #55

First step

Later steps in later epochs

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Participants

ETA calculation is inaccurate #55

Description

First step

Later steps in later epochs

Activity

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Participants

Issue actions