⚡️ Speed up function aggregate_embedded_text_by_block by 15% in PR #4169 (fix/partially-filled-inferred-layout-mark-as-not-extracted)
#4172
+66
−19
⚡️ This pull request contains optimizations for PR #4169
If you approve this dependent PR, these changes will be merged into the original PR branch fix/partially-filled-inferred-layout-mark-as-not-extracted.

📄 15% (0.15x) speedup for aggregate_embedded_text_by_block in unstructured/partition/pdf_image/pdfminer_processing.py

⏱️ Runtime: 377 microseconds → 328 microseconds (best of 8 runs)

📝 Explanation and details
The optimized code achieves a roughly 15% speedup by eliminating redundant calls to source_regions.slice(mask).

Key optimization:
The original code called source_regions.slice(mask) three separate times, once per attribute read:
- .texts for text aggregation
- .element_coords for IoU calculation
- .is_extracted_array for extraction verification

The optimized version caches this result in masked_source = source_regions.slice(mask) and reuses it. This single change reduces execution time from 377μs to 328μs.

Why this matters:
According to the line profiler, each slice() call has non-trivial overhead (~70-90μs). By caching the result, we eliminate two redundant slice operations when sum(mask) > 0. The profiler shows the cached access to masked_source.element_coords takes only 1.4μs vs 71μs for the repeated source_regions.slice(mask).element_coords call, and accessing masked_source.is_extracted_array takes 15.5μs vs 81.8μs.

Impact context:
Based on the function reference, aggregate_embedded_text_by_block is called within a loop in merge_out_layout_with_ocr_layout, iterating over invalid_text_indices. For PDF documents with many text regions requiring OCR aggregation, this optimization compounds significantly. Each iteration saves ~100μs (from avoided redundant slicing), which adds up when processing documents with dozens or hundreds of text blocks.

Performance characteristics:
The optimization is most effective when:
- mask contains True values (common case where text regions overlap)

✅ Correctness verification report:
⚙️ Existing Unit Tests
- partition/pdf_image/test_pdfminer_processing.py::test_aggregate_by_block
- partition/pdf_image/test_pdfminer_processing.py::test_aggregate_only_partially_fill_target

To edit these changes, git checkout codeflash/optimize-pr4169-2026-01-07T19.05.25 and push.
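
For reviewers who want to see the shape of the change without opening the diff, the caching described under "Key optimization" can be sketched roughly as follows. This is an illustrative toy, not the code in pdfminer_processing.py: ToyRegions, aggregate_uncached, and aggregate_cached are hypothetical names, plain lists stand in for the real arrays, and only the slice(mask) call pattern and the masked_source variable mirror the description above.

```python
from dataclasses import dataclass


@dataclass
class ToyRegions:
    """Hypothetical stand-in for the regions container; not the unstructured API."""

    texts: list
    element_coords: list
    is_extracted_array: list
    slice_calls: int = 0  # counts how many times slice() has run on this object

    def slice(self, mask):
        # Build a filtered copy of every field; this copying is the per-call cost.
        self.slice_calls += 1
        keep = [i for i, flag in enumerate(mask) if flag]
        return ToyRegions(
            texts=[self.texts[i] for i in keep],
            element_coords=[self.element_coords[i] for i in keep],
            is_extracted_array=[self.is_extracted_array[i] for i in keep],
        )


def aggregate_uncached(source_regions, mask):
    # Before: three separate slice() calls, one per attribute read.
    text = " ".join(source_regions.slice(mask).texts)
    coords = source_regions.slice(mask).element_coords
    extracted = all(source_regions.slice(mask).is_extracted_array)
    return text, coords, extracted


def aggregate_cached(source_regions, mask):
    # After: slice once and reuse the result for all three reads.
    masked_source = source_regions.slice(mask)
    text = " ".join(masked_source.texts)
    coords = masked_source.element_coords
    extracted = all(masked_source.is_extracted_array)
    return text, coords, extracted
```

Both functions return the same values for the same inputs; the only difference is that the cached variant performs one slice instead of three, which is where the saving reported above comes from.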