
Conversation


@codeflash-ai codeflash-ai bot commented Jan 7, 2026

⚡️ This pull request contains optimizations for PR #4169

If you approve this dependent PR, these changes will be merged into the original PR branch fix/partially-filled-inferred-layout-mark-as-not-extracted.

This PR will be automatically closed if the original PR is merged.


📄 142% (2.42x) speedup for merge_out_layout_with_ocr_layout in unstructured/partition/pdf_image/ocr.py

⏱️ Runtime : 4.05 milliseconds → 1.67 milliseconds (best of 5 runs)

📝 Explanation and details

The optimized code achieves a 142% speedup by replacing repeated individual function calls with a single batched operation, eliminating redundant computation overhead.

Key Optimization: Batch Processing

What Changed:

  • Original: Called aggregate_embedded_text_by_block() separately for each invalid text element in a loop (up to N times)
  • Optimized: Introduced aggregate_embedded_text_batch() that processes all invalid text indices in a single operation
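
The before/after shapes can be sketched with plain NumPy. This is an illustration, not the project's code: `boxes_contained_in`, `aggregate_per_element`, and `aggregate_batch` are hypothetical names standing in for the helpers described above.

```python
import numpy as np

def boxes_contained_in(boxes1, boxes2):
    """mask[i, j] is True when boxes1[i] lies inside boxes2[j]; boxes are [x1, y1, x2, y2]."""
    b1, b2 = boxes1[:, None, :], boxes2[None, :, :]
    return (
        (b1[..., 0] >= b2[..., 0]) & (b1[..., 1] >= b2[..., 1])
        & (b1[..., 2] <= b2[..., 2]) & (b1[..., 3] <= b2[..., 3])
    )

def aggregate_per_element(texts, targets, sources, source_texts, invalid):
    # original pattern: one geometric query per invalid element
    for idx in invalid:
        mask = boxes_contained_in(sources, targets[idx : idx + 1])[:, 0]
        texts[idx] = " ".join(source_texts[mask])
    return texts

def aggregate_batch(texts, targets, sources, source_texts, invalid):
    # optimized pattern: a single (sources x targets) query covering all invalid elements
    if len(invalid) == 0:
        return texts
    mask = boxes_contained_in(sources, targets[invalid])
    for col, idx in enumerate(invalid):
        texts[idx] = " ".join(source_texts[mask[:, col]])
    return texts
```

Both produce the same texts; the batched form replaces N containment calls with one broadcasted comparison.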

Why This Is Faster:

  1. Eliminates Repeated Geometric Computations: The original code called bboxes1_is_almost_subregion_of_bboxes2() N times (once per invalid element). The optimized version calls it once with all target coordinates, computing a 2D mask (sources × targets) in a vectorized NumPy operation. This exploits NumPy's highly optimized C implementation.

  2. Reduces Function Call Overhead: Python function calls have significant overhead (~500-1000ns each). The loop in merge_out_layout_with_ocr_layout was calling aggregate_embedded_text_by_block() + out_layout.slice([idx]) repeatedly. Batching eliminates most of these calls.

  3. Defers Unnecessary Work: The original code performed type conversion out_layout.texts.astype(object) unconditionally. The optimized version only does this if there are actually invalid text indices to process.

  4. Minor Simplification: valid_text() was refactored from an if-statement to a single boolean expression (return text and "(cid:" not in text), reducing interpreter overhead slightly.
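
Points 3 and 4 are small enough to show directly. This is a sketch: `valid_text` uses the expression quoted above, while `valid_text_original` and `repair_texts` are illustrative reconstructions, not the project's actual functions.

```python
import numpy as np

def valid_text_original(text):
    # if-statement form
    if not text:
        return False
    return "(cid:" not in text

def valid_text(text):
    # single boolean expression, as in the optimized code
    return text and "(cid:" not in text

def repair_texts(texts):
    # defer the object-dtype conversion until there is actual work to do
    invalid = np.array([not valid_text(t) for t in texts])
    if invalid.any():  # only pay for astype(object) when some text needs repair
        texts = texts.astype(object)
        # ... batched OCR aggregation would fill texts[invalid] here ...
    return texts
```

Note that the boolean-expression form returns the falsy `text` itself (e.g. `""`) rather than `False`, which is equivalent in any truthiness check.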

Performance Evidence:

  • Line profiler shows merge_out_layout_with_ocr_layout dropped from 18.1ms to 10.5ms (a 42% reduction)

  • The loop processing invalid indices went from 36.5% of total time (6.61ms across 58 hits) to 14.1% (1.48ms across 22 hits)

  • valid_text() improved from 795μs to 428μs (a 46% reduction) thanks to the simplified boolean expression
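
The shape of that improvement — N small NumPy calls versus one broadcast call — can be reproduced in isolation. This is a standalone micro-benchmark sketch with made-up data, unrelated to the project's code:

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
sources = rng.random((200, 4))
targets = rng.random((50, 4))

def per_target():
    # one comparison call per target, Python-level loop overhead included
    return [np.all(sources <= targets[j], axis=1) for j in range(len(targets))]

def batched():
    # a single broadcasted comparison covering every (source, target) pair
    return np.all(sources[:, None, :] <= targets[None, :, :], axis=-1)

# identical results; the batched form just pays the call overhead once
assert np.array_equal(np.stack(per_target(), axis=1), batched())
print("per-target:", timeit.timeit(per_target, number=200))
print("batched:   ", timeit.timeit(batched, number=200))
```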

Impact on Real Workloads:
Based on function_references, this optimization directly benefits OCR processing pipelines where merge_out_layout_with_ocr_layout is called from supplement_page_layout_with_ocr() in OCRMode.FULL_PAGE mode. When processing documents with multiple pages or elements requiring OCR text aggregation, the batched approach scales linearly instead of quadratically with the number of invalid text regions.

Test Case Performance:
The annotated tests show 6-16% speedup on edge cases (empty layouts), confirming the optimization doesn't degrade performance in boundary conditions while delivering substantial gains when processing multiple invalid text elements.

Correctness verification report:

| Test | Status |
|---|---|
| ⏪ Replay Tests | 🔘 None Found |
| ⚙️ Existing Unit Tests | ✅ 24 Passed |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | ✅ 2 Passed |
| 📊 Tests Coverage | 100.0% |
⚙️ Click to see Existing Unit Tests
| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|---|---|---|---|
| partition/pdf_image/test_ocr.py::test_merge_out_layout_with_cid_code | 2.11ms | 915μs | 130% ✅ |
| partition/pdf_image/test_ocr.py::test_merge_out_layout_with_ocr_layout | 1.94ms | 755μs | 157% ✅ |
🌀 Click to see Generated Regression Tests
import numpy as np

# imports
from unstructured_inference.inference.elements import TextRegions
from unstructured_inference.inference.layoutelement import LayoutElements

from unstructured.partition.pdf_image.ocr import merge_out_layout_with_ocr_layout


# Helper function to create dummy LayoutElements and TextRegions
def make_layout_elements(texts, coords=None, sources=None, class_ids=None):
    if coords is None:
        coords = np.array([[[0, 0, 1, 1]] for _ in texts], dtype=float)
    if sources is None:
        sources = np.array(["out"] * len(texts))
    if class_ids is None:
        class_ids = np.zeros(len(texts))
    return LayoutElements(
        element_coords=np.array(coords, dtype=float),
        texts=np.array(texts, dtype=object),
        sources=np.array(sources, dtype=object),
        element_class_ids=np.array(class_ids, dtype=int),
        element_class_id_map={0: "UNCATEGORIZED_TEXT"},
    )


def make_text_regions(texts, coords=None, sources=None, is_extracted=None):
    if coords is None:
        coords = np.array([[[0, 0, 1, 1]] for _ in texts], dtype=float)
    if sources is None:
        sources = np.array(["ocr"] * len(texts))
    if is_extracted is None:
        is_extracted = np.array([True] * len(texts))
    return TextRegions(
        element_coords=np.array(coords, dtype=float),
        texts=np.array(texts, dtype=object),
        sources=np.array(sources, dtype=object),
        is_extracted_array=np.array(is_extracted, dtype=bool),
    )


# ------------------------------
# 1. Basic Test Cases
# ------------------------------


def test_merge_empty_out_layout():
    # out_layout is empty, so the merge should return it unchanged
    out_layout = make_layout_elements([], coords=[], sources=[], class_ids=[])
    ocr_layout = make_text_regions(["OCR text"])
    result = merge_out_layout_with_ocr_layout(out_layout, ocr_layout)  # 2.03μs -> 1.90μs (6.88% faster)
    assert len(result.texts) == 0


def test_merge_empty_ocr_layout():
    # ocr_layout is empty, so out_layout's texts should be unchanged
    out_layout = make_layout_elements(["Text"])
    ocr_layout = make_text_regions([], coords=[], sources=[], is_extracted=[])
    result = merge_out_layout_with_ocr_layout(out_layout, ocr_layout)  # 1.96μs -> 1.69μs (16.0% faster)
    assert list(result.texts) == ["Text"]

To edit these changes, run `git checkout codeflash/optimize-pr4169-2026-01-07T19.02.00` and push.


badGarnet and others added 8 commits January 6, 2026 20:05
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to codeflash labels Jan 7, 2026
Base automatically changed from fix/partially-filled-inferred-layout-mark-as-not-extracted to main January 7, 2026 22:17
@codeflash-ai codeflash-ai bot closed this Jan 7, 2026

codeflash-ai bot commented Jan 7, 2026

This PR has been automatically closed because the original PR #4169 by badGarnet was closed.

@codeflash-ai codeflash-ai bot deleted the codeflash/optimize-pr4169-2026-01-07T19.02.00 branch January 7, 2026 22:17
