Observed on both TigerGPU and Traverse, on master before and after the merge of #46. If there is a serious underlying problem, it is possibly related to #55 with the changed indexing of the epochs.
The premise of this issue is that:
The number of steps (iterations) per training epoch should be roughly constant across all epochs.
However, I am not entirely sure that this premise is correct; see the section below.
Mini-batches are created by distributing the trimmed and resampled shot signals into chunks of LSTM length, typically length=128 ms when dt=0.001; this is the horizontal dimension of a mini-batch.
The other, vertical dimension of a mini-batch is the local batch_size. Ideally, each shot is uniquely "owned" by a single GPU (or model replica) for nsteps = nchunks, which depends on the particular pulse length. This varies by 1–2 orders of magnitude, with the minimum shot length = 2*length + T_min_warning = 280 ms, typically. Ignoring any nuanced trimming of the processed shots (i.e., I floored the division into chunks):
Double check the trimming of resampled shots in order to have an integer number of chunks. Is it trimmed only at the beginning of the shot? How does conf['training']['paths'] = True affect this?
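For reference, the flooring assumption above amounts to something like the following (a minimal sketch with hypothetical names; the actual trimming in the data loader may differ, which is exactly the open question):

```python
# Rough chunk count per shot under the flooring assumption (hypothetical names;
# the real loader's trimming may differ).
def nchunks(shot_length_timesteps, length=128):
    # Any partial chunk left over after the division is dropped.
    return shot_length_timesteps // length

# Example: a 280 ms shot at dt=0.001 (280 timesteps) yields 2 full chunks.
print(nchunks(280))  # -> 2
```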
From the Methods appendix of the Nature paper:
... Because there is a persistent internal state between successive chunks in time, it is not possible to use more than one chunk from a given shot in a given mini-batch (chunks that are successive in the shot must also be presented to the RNN in successive mini-batches during training such that the internal state can persist correctly).
To train batchwise with a batch size of M, we need M independent (that is, stemming from different shots) time slices of equal length to feed to the GPU.
However, if effective_batch_size = N_GPU*batch_size is greater than the number of training shots (1734 shots, which is easy to exceed with 4 GPUs and batch_size=512, e.g.), then each step must involve some shots appearing twice in the overall batch. Even for smaller effective batch sizes, the batch generator must backfill the mini-batches with repeats at later steps in the epoch as the longer pulses require many more steps to process all the chunks in the shot.
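To make the arithmetic concrete, here is a minimal sketch (not the actual FRNN batch generator) of when repeats become unavoidable, using hypothetical variable names:

```python
# Hypothetical numbers matching the discussion above (not the FRNN generator).
n_train_shots = 1734        # D3D training shots
n_gpu = 4
batch_size = 512

effective_batch_size = n_gpu * batch_size   # 2048 batch indices per step
# With more batch indices than shots, some shots are necessarily duplicated
# across batch indices from the very first step of the epoch.
repeats_at_step_0 = max(0, effective_batch_size - n_train_shots)
print(effective_batch_size, repeats_at_step_0)  # -> 2048 314
```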
Double check how open batch-indices are handled near the end of an epoch.
For a recent experiment with 1 GPU, batch_size=256, D3D 0D training, the final step of the first epoch is written to stdout as:
In this example, 1794.00 is MPIModel.num_so_far, which is always printed with fractional precision but never shows anything other than integer values. Note that 1789 shots is more than the original D3D training set contains, due to a change in signals.py that I was messing around with.
This might be because the value is actually incremented by 1 when a shot's first chunk appears in a mini-batch index. If so, either remove the fractional precision from the output, or modify the variable so that it accurately computes num_chunks_so_far/num_chunks_total_this_shot.
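If the fractional display is kept, the latter option could look roughly like this (a sketch with hypothetical names, not the current MPIModel implementation):

```python
# Sketch of fractional progress accounting (hypothetical names, not MPIModel).
class ProgressCounter:
    def __init__(self):
        self.num_so_far = 0.0

    def on_chunk_processed(self, num_chunks_total_this_shot):
        # Advance by the fraction of this shot completed by one chunk, so the
        # counter reaches ~1.0 per shot only once all of its chunks are done.
        self.num_so_far += 1.0 / num_chunks_total_this_shot
```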
By searching the stdout of FRNN with grep -nriB 1 "seconds" output.txt, I observe 143, 151, 264, 263, and 266 steps for the first 5 epochs.
For 1 GPU and batch_size=128: 416, 529, 529, 529, 531 steps.
For 1 GPU and batch_size=128 (restarted/loaded epoch 1 weights): 416, 529, 529, 528, 529, 531.
For 4 GPUs and batch_size=128: 74, 75, 133, 132 steps.
In other words, for the default PRNG seed, the initial epochs within a training session shuffle mini-batches far closer to the optimal schedule than the later epochs do. See the analysis below.
This variation had not really been noticed in earlier training by @jnkh or @ge-dong, since conf['training']['num_batches_minimum'] = 200 in their tests (as opposed to the default value of 20 in the repository's conf.yaml), which is much larger than the typical number of steps required for an epoch of 128 ms chunks of our original D3D dataset with effective_batch_size=512.
Rename num_batches_minimum to num_steps_minimum?
It is unclear if the above variable-step phenomenon was happening on older versions of FRNN and was masked by this parameter. However, I did confirm that the code has always been printing out .00 values for MPIModel.num_so_far.
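The masking effect itself is straightforward; roughly (a sketch, assuming the parameter acts as a floor on the number of steps per epoch, with hypothetical step counts):

```python
# If an epoch is only declared finished after at least num_batches_minimum
# steps, any natural variation below that floor never shows up in the logs.
num_batches_minimum = 200                        # value used in earlier tests
natural_steps_per_epoch = [143, 151, 164, 180]   # hypothetical per-epoch counts

observed = [max(s, num_batches_minimum) for s in natural_steps_per_epoch]
print(observed)  # -> [200, 200, 200, 200]; the variation is masked
```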
I am not sure if this phenomenon has affected training accuracy at all.
Multiprocessor scheduling problem
The random shuffling of shots loaded into mini-batches is effectively the List Scheduling algorithm applied to a shuffled list of jobs (shots) j_i of variable sizes = nchunks_i. Each batch index in effective_batch_size is an independent, identical "worker". The multiprocessor scheduling problem seeks the optimal assignment of the j_i to the m workers that minimizes the makespan, i.e. the earliest time at which all jobs are completed. Here, we have no inter-job precedence/dependencies, nor individual worker constraints. Still, this problem is NP-hard (in fact strongly NP-hard), since the decision variant ("Does a feasible schedule S with f(S) <= k exist?" for a given threshold k) is NP-complete.
In this heuristic, each incoming job (shot) is assigned to the worker (batch index) that becomes free soonest, given some arbitrarily ordered input (the training buffer). In the worst case for List Scheduling, the longest shot is loaded into a mini-batch last and the makespan is maximized. Hence, the algorithm returns a makespan that is within a factor of 2 - 1/m of the optimal value.
By contrast, the Longest Processing Time (LPT) rule first sorts the jobs by non-increasing processing time (largest nchunks to smallest) and returns a makespan within a factor of 4/3 - 1/(3*m) of the optimal makespan.
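For example, with m = effective_batch_size = 512 batch indices, the worst-case guarantee is 2 - 1/512 ≈ 1.998x the optimal makespan for List Scheduling versus 4/3 - 1/(3*512) ≈ 1.333x for LPT.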
Note, we are not trying to minimize the makespan or find the most efficient mini-batching strategy in FRNN, since we rely on the random shuffling to stabilize training. However, this analysis applied to our D3D chunks can give us some expectation on how much variability in steps/epoch is normal.
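As a rough illustration of how such an estimate can be computed (a minimal sketch, not the FRNN code; nchunks_per_shot stands in for the real per-shot chunk counts):

```python
import heapq
import random

def list_scheduling_makespan(job_sizes, m):
    """Greedy List Scheduling: assign each job, in the given order, to the
    worker (batch index) that becomes free soonest."""
    loads = [0] * m
    heapq.heapify(loads)
    for size in job_sizes:
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + size)
    return max(loads)

def lpt_makespan(job_sizes, m):
    """LPT: the same greedy rule, but with jobs sorted by non-increasing size."""
    return list_scheduling_makespan(sorted(job_sizes, reverse=True), m)

# nchunks_per_shot would be the per-shot chunk counts of the D3D training set;
# here a random placeholder spanning ~2 orders of magnitude, as in the text.
random.seed(0)
nchunks_per_shot = [random.randint(2, 200) for _ in range(1734)]

m = 512  # effective_batch_size
random.shuffle(nchunks_per_shot)                      # what each epoch does
print(list_scheduling_makespan(nchunks_per_shot, m))  # expected steps/epoch
print(lpt_makespan(nchunks_per_shot, m))              # near-optimal reference
```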
Here, I apply both algorithms to the D3D training set:

For effective_batch_size = 512:
For effective_batch_size = 256:
For effective_batch_size = 128:
The latter two cases are in line with my observations (although these were computed from a slightly different training set; see above comment about changes to signals.py on Traverse). Therefore, this variability of nsteps/epoch might be expected, and not a bug.