⚡️ Speed up function merge_out_layout_with_ocr_layout by 37% in PR #4169 (fix/partially-filled-inferred-layout-mark-as-not-extracted)
#4171
+5
−4
⚡️ This pull request contains optimizations for PR #4169
If you approve this dependent PR, these changes will be merged into the original PR branch fix/partially-filled-inferred-layout-mark-as-not-extracted.
📄 37% (0.37x) speedup for merge_out_layout_with_ocr_layout in unstructured/partition/pdf_image/ocr.py
⏱️ Runtime: 229 milliseconds → 167 milliseconds (best of 6 runs)
📝 Explanation and details
The optimized code achieves a 37% speedup primarily by eliminating redundant slice operations in the aggregate_embedded_text_by_block function, which is called repeatedly in hot paths during PDF/image OCR processing.
Key Optimization
Caching the sliced source regions (sliced_source = source_regions.slice(mask)):
- The original code called source_regions.slice(mask) three separate times: once for text aggregation, once for coordinates, and once for checking extraction flags.
Performance Impact:
- The all() check (line profiler) dropped from ~42ms to ~10.5ms (75% faster)
- aggregate_embedded_text_by_block improved from 428ms to 335ms (22% faster)
Why This Matters
Based on function_references, merge_out_layout_with_ocr_layout is called from supplement_page_layout_with_ocr in a critical OCR processing path. The function processes each page layout element, and when OCR mode is FULL_PAGE, it calls aggregate_embedded_text_by_block for every element with invalid text (typically elements with "(cid:" placeholders).
Test Results Show:
Technical Explanation
The slice operation source_regions.slice(mask) creates a new TextRegions object with filtered arrays. When mask has many True values (common when OCR finds text), this involves filtering each of the underlying arrays (texts, element_coords, is_extracted_array).
Repeating this 3x per function call, across 1000+ invocations (500 invalid elements × 2 calls), compounds the waste significantly. The optimization relies on Python's reference semantics: storing the slice result once and reusing it eliminates 2/3 of this work.
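The caching pattern described above can be sketched as follows. This is a minimal illustration, not the code from the PR: FakeTextRegions is a hypothetical stand-in for TextRegions that uses plain lists instead of arrays, and aggregate_repeated / aggregate_cached are hypothetical names whose aggregation logic (joining texts, reading coordinates, checking all()) is assumed for demonstration only. A class-level counter makes the number of slice calls visible.

```python
from dataclasses import dataclass
from typing import ClassVar, List, Tuple


@dataclass
class FakeTextRegions:
    """Hypothetical stand-in for TextRegions; not the library's class."""

    texts: List[str]
    element_coords: List[Tuple[float, float, float, float]]
    is_extracted_array: List[bool]
    slice_calls: ClassVar[int] = 0  # counts how often slice() runs

    def slice(self, mask: List[bool]) -> "FakeTextRegions":
        # Each call filters all three fields and builds a new object.
        FakeTextRegions.slice_calls += 1
        return FakeTextRegions(
            texts=[t for t, m in zip(self.texts, mask) if m],
            element_coords=[c for c, m in zip(self.element_coords, mask) if m],
            is_extracted_array=[e for e, m in zip(self.is_extracted_array, mask) if m],
        )


def aggregate_repeated(source: FakeTextRegions, mask: List[bool]):
    """Before: slice again for every field that is read (three slices)."""
    text = " ".join(source.slice(mask).texts)
    coords = source.slice(mask).element_coords
    all_extracted = all(source.slice(mask).is_extracted_array)
    return text, coords, all_extracted


def aggregate_cached(source: FakeTextRegions, mask: List[bool]):
    """After: slice once and reuse the same object for all three reads."""
    sliced_source = source.slice(mask)
    text = " ".join(sliced_source.texts)
    coords = sliced_source.element_coords
    all_extracted = all(sliced_source.is_extracted_array)
    return text, coords, all_extracted


regions = FakeTextRegions(
    texts=["Hello", "ignored", "world"],
    element_coords=[(0, 0, 1, 1), (1, 1, 2, 2), (2, 2, 3, 3)],
    is_extracted_array=[True, False, True],
)
mask = [True, False, True]

FakeTextRegions.slice_calls = 0
before = aggregate_repeated(regions, mask)
calls_before = FakeTextRegions.slice_calls

FakeTextRegions.slice_calls = 0
after = aggregate_cached(regions, mask)
calls_after = FakeTextRegions.slice_calls

print(before == after, calls_before, calls_after)
```

Both variants return the same result; only the number of slice calls differs (three versus one per invocation), which is where the saved work comes from.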
Workload Impact
This optimization is particularly effective for:
- Workloads where supplement_page_layout_with_ocr is called repeatedly
The speedup scales with the number of invalid text elements that need OCR aggregation, making it especially valuable in production OCR workflows.
✅ Correctness verification report:
⚙️ Existing Unit Tests
partition/pdf_image/test_ocr.py::test_merge_out_layout_with_cid_code
partition/pdf_image/test_ocr.py::test_merge_out_layout_with_ocr_layout
🌀 Generated Regression Tests
To edit these changes, git checkout codeflash/optimize-pr4169-2026-01-07T19.04.59 and push.