⚡️ Speed up function merge_out_layout_with_ocr_layout by 142% in PR #4169 (fix/partially-filled-inferred-layout-mark-as-not-extracted)
#4170
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
⚡️ This pull request contains optimizations for PR #4169
If you approve this dependent PR, these changes will be merged into the original PR branch
fix/partially-filled-inferred-layout-mark-as-not-extracted.📄 142% (1.42x) speedup for
merge_out_layout_with_ocr_layoutinunstructured/partition/pdf_image/ocr.py⏱️ Runtime :
4.05 milliseconds→1.67 milliseconds(best of5runs)📝 Explanation and details
The optimized code achieves a 142% speedup by replacing repeated individual function calls with a single batched operation, eliminating redundant computation overhead.
Key Optimization: Batch Processing
What Changed:
aggregate_embedded_text_by_block()separately for each invalid text element in a loop (up to N times)aggregate_embedded_text_batch()that processes all invalid text indices in a single operationWhy This Is Faster:
Eliminates Repeated Geometric Computations: The original code called
bboxes1_is_almost_subregion_of_bboxes2()N times (once per invalid element). The optimized version calls it once with all target coordinates, computing a 2D mask(sources × targets)in a vectorized NumPy operation. This exploits NumPy's highly optimized C implementation.Reduces Function Call Overhead: Python function calls have significant overhead (~500-1000ns each). The loop in
merge_out_layout_with_ocr_layoutwas callingaggregate_embedded_text_by_block()+out_layout.slice([idx])repeatedly. Batching eliminates most of these calls.Defers Unnecessary Work: The original code performed type conversion
out_layout.texts.astype(object)unconditionally. The optimized version only does this if there are actually invalid text indices to process.Minor Simplification:
valid_text()was refactored from an if-statement to a single boolean expression (return text and "(cid:" not in text), reducing interpreter overhead slightly.Performance Evidence:
merge_out_layout_with_ocr_layoutdropped from 18.1ms → 10.5ms (42% faster)valid_text()improved from 795μs → 428μs (46% faster) due to the simplified boolean expressionImpact on Real Workloads:
Based on
function_references, this optimization directly benefits OCR processing pipelines wheremerge_out_layout_with_ocr_layoutis called fromsupplement_page_layout_with_ocr()inOCRMode.FULL_PAGEmode. When processing documents with multiple pages or elements requiring OCR text aggregation, the batched approach scales linearly instead of quadratically with the number of invalid text regions.Test Case Performance:
The annotated tests show 6-16% speedup on edge cases (empty layouts), confirming the optimization doesn't degrade performance in boundary conditions while delivering substantial gains when processing multiple invalid text elements.
✅ Correctness verification report:
⚙️ Click to see Existing Unit Tests
partition/pdf_image/test_ocr.py::test_merge_out_layout_with_cid_codepartition/pdf_image/test_ocr.py::test_merge_out_layout_with_ocr_layout🌀 Click to see Generated Regression Tests
To edit these changes
git checkout codeflash/optimize-pr4169-2026-01-07T19.02.00and push.