provide predefined element_length_fn for pipeline.bucket_boundaries #364
tfaip provides an option delegating to TensorFlow's `tf.data.Dataset.bucket_by_sequence_length` (also known as batch width or length bucketing). This is really helpful for optimal use of memory resources on the GPU: as one tries to increase batch size for better GPU utilization, one must also be wary of OOM, and this is aggravated by zero padding – samples with the largest length can spoil an entire batch's memory utilization. So batch bucketing groups samples of similar length into the same batch (see the sketch below).

Our CLIs even expose that (e.g. `--pipeline.bucket_boundaries 20 50 100 200 400 800` and `--pipeline.bucket_batch_sizes 256 128 64 32 16 8`), but:

1. … or basically observed the opposite effect
2. it additionally needs the `element_length_fn` parameter required by tfaip/TF – this is dependent on the data representation and can only be formulated as code, so it must be done in Calamari itself

This PR addresses the final problem 2. (Note that on the API, it was already possible to achieve that, as OCR-D/ocrd_calamari#118 showcases, but it would still be simpler if it was predefined.)
Following is a plot of measured GPU utilization for a typical `calamari-predict` run on some 2500 lines without batch bucketing:

*(plot: GPU utilization without batch bucketing)*

...and with batch bucketing (enabled by this PR):

*(plot: GPU utilization with batch bucketing)*

So obviously GPU utilization gets a little less peaky.
But it's still peaky. (I have tried varying `num_processes`, `prefetch`, `use_shared_memory`, `dist_strategy` – but the effect is mostly negligible. So if anyone has ideas, please comment.)
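For reference, a hedged sketch of the kind of invocation used for the measurement above. The checkpoint path and image glob are placeholders, and the exact flag names for checkpoint/images may differ between Calamari versions; only the bucketing flags are taken from the description above:

```sh
calamari-predict \
  --checkpoint path/to/model.ckpt.json \
  --data.images "lines/*.png" \
  --pipeline.bucket_boundaries 20 50 100 200 400 800 \
  --pipeline.bucket_batch_sizes 256 128 64 32 16 8
```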