provide predefined element_length_fn for pipeline.bucket_boundaries #364
tfaip provides an option delegating to TensorFlow's `tf.data.Dataset.bucket_by_sequence_length` (also known as batch width or length bucketing). This is really helpful for optimal use of memory resources on the GPU: as one tries to increase batch size for better GPU utilization, one must also be wary of OOM, and this is aggravated by zero padding – samples with the largest length can spoil an entire batch's memory utilization. So batch bucketing groups samples of similar length into the same batch (see the sketch below).

Our CLIs even expose that (e.g. `--pipeline.bucket_boundaries 20 50 100 200 400 800` and `--pipeline.bucket_batch_sizes 256 128 64 32 16 8`), but:

1. … or basically observed the opposite effect
2. it additionally needs the `element_length_fn` parameter required by tfaip/TF – this is dependent on the data representation and can only be formulated as code, so it must be done in Calamari itself

This PR addresses the final problem 2. (Note that on the API, it was already possible to achieve that, as OCR-D/ocrd_calamari#118 showcases, but it would still be simpler if it was predefined.)
Following is a plot of measured GPU utilization for a typical `calamari-predict` run on some 2500 lines without batch bucketing:

*(plot: GPU utilization without batch bucketing)*

...and with batch bucketing (enabled by this PR):

*(plot: GPU utilization with batch bucketing)*

So obviously GPU utilization gets a little less peaky.
But it's still peaky. (I have tried varying `num_processes`, `prefetch`, `use_shared_memory`, `dist_strategy` – but the effect is mostly negligible. So if anyone has ideas, please comment.)
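For reference, a hedged sketch of the kind of invocation used for the measurement above. The checkpoint path and image glob are placeholders, and the exact flag names for checkpoint/images may differ between Calamari versions; only the bucketing flags are taken from the description above:

```sh
calamari-predict \
  --checkpoint path/to/model.ckpt.json \
  --data.images "lines/*.png" \
  --pipeline.bucket_boundaries 20 50 100 200 400 800 \
  --pipeline.bucket_batch_sizes 256 128 64 32 16 8
```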