@codeflash-ai codeflash-ai bot commented Jan 7, 2026

⚡️ This pull request contains optimizations for PR #4169

If you approve this dependent PR, these changes will be merged into the original PR branch fix/partially-filled-inferred-layout-mark-as-not-extracted.

This PR will be automatically closed if the original PR is merged.


📄 37% (0.37x) speedup for merge_out_layout_with_ocr_layout in unstructured/partition/pdf_image/ocr.py

⏱️ Runtime: 229 milliseconds → 167 milliseconds (best of 6 runs)

📝 Explanation and details

The optimized code achieves a 37% speedup primarily through eliminating redundant slice operations in the aggregate_embedded_text_by_block function, which is called repeatedly in hot paths during PDF/image OCR processing.

Key Optimization

Caching the sliced source regions (sliced_source = source_regions.slice(mask)):

  • Original: Called source_regions.slice(mask) three separate times - once for text aggregation, once for coordinates, and once for checking extraction flags
  • Optimized: Computes the slice once and reuses it, avoiding two redundant slice operations per call
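The change described above can be sketched as follows. This is a minimal illustration, not the actual code from `unstructured/partition/pdf_image/ocr.py`: `ToyRegions`, `aggregate_repeated`, and `aggregate_cached` are hypothetical names, and the toy class uses plain lists where the real `TextRegions` holds arrays.

```python
class ToyRegions:
    """Hypothetical stand-in for TextRegions; not the library class."""

    slice_calls = 0  # counts every slice() call, for illustration only

    def __init__(self, texts, coords, is_extracted):
        self.texts = list(texts)
        self.coords = list(coords)
        self.is_extracted = list(is_extracted)

    def slice(self, mask):
        # Filtering builds three new lists and a new object on every call.
        ToyRegions.slice_calls += 1
        keep = [i for i, flag in enumerate(mask) if flag]
        return ToyRegions(
            [self.texts[i] for i in keep],
            [self.coords[i] for i in keep],
            [self.is_extracted[i] for i in keep],
        )


def aggregate_repeated(source, mask):
    """Pattern described as the original: one slice per use."""
    text = " ".join(source.slice(mask).texts)
    coords = source.slice(mask).coords
    all_extracted = all(source.slice(mask).is_extracted)
    return text, coords, all_extracted


def aggregate_cached(source, mask):
    """Pattern described as the optimization: slice once, reuse the result."""
    sliced_source = source.slice(mask)
    text = " ".join(sliced_source.texts)
    return text, sliced_source.coords, all(sliced_source.is_extracted)
```

Both functions return the same values; only the number of `slice()` calls differs (three versus one), which is the saving the PR describes.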

Performance Impact:

  • The all() check (line profiler) dropped from ~42ms to ~10.5ms (75% faster)
  • Text aggregation improved from ~41ms to ~4.3ms (90% faster)
  • Overall aggregate_embedded_text_by_block improved from 428ms to 335ms (22% faster)
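The percentages in this report appear to follow two different conventions: the per-line profiler figures match a reduction in time, while the headline 37% matches a speedup ratio (old / new − 1). The short check below, using only the timings quoted above, is a sketch of that reading; the helper names are illustrative.

```python
def percent_reduction(before, after):
    """Share of the original time that was removed."""
    return (before - after) / before * 100


def percent_speedup(before, after):
    """How much faster the new code is relative to the old (old / new - 1)."""
    return (before / after - 1) * 100


# Line-profiler figures (milliseconds), read as time reductions.
all_check = percent_reduction(42, 10.5)
text_aggregation = percent_reduction(41, 4.3)
whole_function = percent_reduction(428, 335)

# Headline runtime (milliseconds), read as a speedup ratio.
headline_speedup = percent_speedup(229, 167)
headline_reduction = percent_reduction(229, 167)
```

Under this reading, the same 229 ms → 167 ms change is a 37% speedup but only about a 27% reduction in wall-clock time, so the two kinds of figures should not be compared directly.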

Why This Matters

Based on function_references, merge_out_layout_with_ocr_layout is called from supplement_page_layout_with_ocr in a critical OCR processing path. The function processes each page layout element, and when OCR mode is FULL_PAGE, it calls aggregate_embedded_text_by_block for every element with invalid text (typically elements with "(cid:" placeholders).

Test Results Show:

  • Large-scale scenarios benefit most: 39.8% faster with 500 elements (108ms → 78ms)
  • Mixed valid/invalid texts: 39.1% faster (55.8ms → 40.1ms)
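  • Small cases that reach the aggregation path show roughly 5-12% improvements due to reduced overhead; early-return cases (empty layouts, all-valid text) never call the optimized code and vary within noise in both directions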

Technical Explanation

The slice operation source_regions.slice(mask) creates a new TextRegions object with filtered arrays. When mask has many True values (common when OCR finds text), this involves:

  1. Array indexing operations on multiple internal arrays (texts, element_coords, is_extracted_array)
  2. Object construction overhead

Repeating this three times per function call, across 1000+ invocations (500 invalid elements × 2 calls), compounds the waste. Binding the slice result to a local name once and reusing that reference eliminates two of every three slice operations.
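As a back-of-the-envelope check of the scale involved, the counts can be worked out from the figures quoted above (500 invalid elements, 2 calls each, 3 slices per call before and 1 after). This is arithmetic on the stated numbers only, not a measurement.

```python
def slice_operations(invocations, slices_per_call):
    """Total slice() calls for a given number of function invocations."""
    return invocations * slices_per_call


invocations = 500 * 2  # 500 invalid elements x 2 calls, as stated in the report
before = slice_operations(invocations, 3)  # three slices per call originally
after = slice_operations(invocations, 1)   # one cached slice per call
saved_fraction = (before - after) / before
```

For the 500-element test this works out to 3000 slice operations before and 1000 after, a saving of two thirds.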

Workload Impact

This optimization is particularly effective for:

  • Documents with many OCR-detected regions (typical PDFs with scanned content)
  • Pages with numerous invalid text elements requiring OCR supplementation
  • Batch processing pipelines where supplement_page_layout_with_ocr is called repeatedly

The speedup scales with the number of invalid text elements that need OCR aggregation, making it especially valuable in production OCR workflows.

Correctness verification report:

| Test | Status |
| --- | --- |
| ⏪ Replay Tests | 🔘 None Found |
| ⚙️ Existing Unit Tests | 24 Passed |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 20 Passed |
| 📊 Tests Coverage | 100.0% |

⚙️ Existing Unit Tests

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
| --- | --- | --- | --- |
| partition/pdf_image/test_ocr.py::test_merge_out_layout_with_cid_code | 2.09ms | 2.06ms | 1.66%✅ |
| partition/pdf_image/test_ocr.py::test_merge_out_layout_with_ocr_layout | 1.95ms | 1.91ms | 2.24%✅ |

🌀 Generated Regression Tests
import numpy as np

# imports
from unstructured_inference.inference.elements import TextRegions
from unstructured_inference.inference.layoutelement import LayoutElements

from unstructured.partition.pdf_image.ocr import merge_out_layout_with_ocr_layout


# Helper to build LayoutElements and TextRegions for tests
def make_layout_elements(coords, texts, sources=None):
    # coords: list of [x1, y1, x2, y2]
    # texts: list of str
    # sources: list of str or None
    arr_coords = np.array(coords, dtype=float)
    arr_texts = np.array(texts, dtype=object)
    arr_sources = np.array(sources if sources is not None else ["model"] * len(texts), dtype=object)
    return LayoutElements(
        element_coords=arr_coords,
        texts=arr_texts,
        sources=arr_sources,
        element_class_ids=np.zeros(arr_texts.shape),
        element_class_id_map={0: "UNCATEGORIZED_TEXT"},
    )


def make_text_regions(coords, texts, sources=None):
    arr_coords = np.array(coords, dtype=float)
    arr_texts = np.array(texts, dtype=object)
    arr_sources = np.array(sources if sources is not None else ["ocr"] * len(texts), dtype=object)
    return TextRegions(
        element_coords=arr_coords,
        texts=arr_texts,
        sources=arr_sources,
        is_extracted_array=np.array([True] * len(texts)),
    )


# --------------------------
# 1. Basic Test Cases
# --------------------------


def test_merge_no_invalid_text_returns_original():
    # All texts valid, nothing should change
    coords = [[0, 0, 1, 1], [2, 2, 3, 3]]
    texts = ["Hello", "World"]
    out_layout = make_layout_elements(coords, texts)
    ocr_layout = make_text_regions(coords, ["OCR1", "OCR2"])
    codeflash_output = merge_out_layout_with_ocr_layout(out_layout, ocr_layout)
    result = codeflash_output  # 149μs -> 151μs (1.26% slower)


def test_merge_invalid_text_replaced_by_ocr():
    # One invalid text, should be replaced by OCR if overlapping
    coords = [[0, 0, 1, 1], [2, 2, 3, 3]]
    texts = ["(cid:123)", "World"]
    out_layout = make_layout_elements(coords, texts)
    ocr_layout = make_text_regions(coords, ["OCR1", "OCR2"])
    codeflash_output = merge_out_layout_with_ocr_layout(out_layout, ocr_layout)
    result = codeflash_output  # 342μs -> 315μs (8.81% faster)


def test_merge_invalid_text_no_ocr_overlap():
    # Invalid text, but no OCR region overlaps, should become empty
    coords = [[0, 0, 1, 1]]
    texts = ["(cid:999)"]
    out_layout = make_layout_elements(coords, texts)
    ocr_coords = [[10, 10, 11, 11]]  # Far away
    ocr_layout = make_text_regions(ocr_coords, ["FarAway"])
    codeflash_output = merge_out_layout_with_ocr_layout(out_layout, ocr_layout)
    result = codeflash_output  # 790μs -> 797μs (0.813% slower)


def test_merge_supplement_with_ocr_elements_false():
    # supplement_with_ocr_elements disables supplementing
    coords = [[0, 0, 1, 1]]
    texts = ["(cid:123)"]
    out_layout = make_layout_elements(coords, texts)
    ocr_layout = make_text_regions(coords, ["OCR1"])
    codeflash_output = merge_out_layout_with_ocr_layout(
        out_layout, ocr_layout, supplement_with_ocr_elements=False
    )
    result = codeflash_output  # 208μs -> 186μs (11.8% faster)


def test_merge_supplement_with_ocr_elements_true_adds_extra():
    # If OCR has regions not covered by out_layout, they are added
    coords = [[0, 0, 1, 1]]
    texts = ["Hello"]
    out_layout = make_layout_elements(coords, texts)
    ocr_coords = [[0, 0, 1, 1], [2, 2, 3, 3]]  # Second region not covered
    ocr_layout = make_text_regions(ocr_coords, ["OCR1", "ExtraOCR"])
    codeflash_output = merge_out_layout_with_ocr_layout(
        out_layout, ocr_layout, supplement_with_ocr_elements=True
    )
    result = codeflash_output  # 594μs -> 602μs (1.26% slower)


# --------------------------
# 2. Edge Test Cases
# --------------------------


def test_merge_empty_out_layout_returns_out_layout():
    # If out_layout is empty, should return it unchanged
    out_layout = make_layout_elements([], [])
    ocr_layout = make_text_regions([[0, 0, 1, 1]], ["OCR1"])
    codeflash_output = merge_out_layout_with_ocr_layout(out_layout, ocr_layout)
    result = codeflash_output  # 1.25μs -> 1.31μs (4.57% slower)


def test_merge_empty_ocr_layout_returns_out_layout():
    # If ocr_layout is empty, should return out_layout unchanged
    coords = [[0, 0, 1, 1]]
    texts = ["Hello"]
    out_layout = make_layout_elements(coords, texts)
    ocr_layout = make_text_regions([], [])
    codeflash_output = merge_out_layout_with_ocr_layout(out_layout, ocr_layout)
    result = codeflash_output  # 1.60μs -> 1.54μs (3.89% faster)


def test_merge_all_invalid_texts_and_no_ocr():
    # All texts invalid, but ocr_layout is empty, so all replaced with empty
    coords = [[0, 0, 1, 1], [2, 2, 3, 3]]
    texts = ["(cid:1)", "(cid:2)"]
    out_layout = make_layout_elements(coords, texts)
    ocr_layout = make_text_regions([], [])
    codeflash_output = merge_out_layout_with_ocr_layout(out_layout, ocr_layout)
    result = codeflash_output  # 1.45μs -> 1.63μs (11.0% slower)


def test_merge_non_ascii_valid_text():
    # Non-ascii but not containing (cid: should be valid and not replaced
    coords = [[0, 0, 1, 1]]
    texts = ["你好"]  # Chinese, valid as per valid_text
    out_layout = make_layout_elements(coords, texts)
    ocr_layout = make_text_regions(coords, ["OCR1"])
    codeflash_output = merge_out_layout_with_ocr_layout(out_layout, ocr_layout)
    result = codeflash_output  # 138μs -> 137μs (0.226% faster)


def test_merge_invalid_text_multiple_ocr_overlap():
    # Multiple OCR regions overlap, should aggregate text
    coords = [[0, 0, 2, 2]]
    texts = ["(cid:bad)"]
    out_layout = make_layout_elements(coords, texts)
    ocr_coords = [[0, 0, 1, 1], [1, 1, 2, 2]]
    ocr_layout = make_text_regions(ocr_coords, ["A", "B"])
    codeflash_output = merge_out_layout_with_ocr_layout(out_layout, ocr_layout)
    result = codeflash_output  # 339μs -> 312μs (8.78% faster)


def test_merge_out_layout_and_ocr_layout_with_different_sources():
    # Out layout and OCR layout with different sources, should not affect merge
    coords = [[0, 0, 1, 1]]
    texts = ["(cid:bad)"]
    out_layout = make_layout_elements(coords, texts, sources=["custom"])
    ocr_layout = make_text_regions(coords, ["OCR1"], sources=["ocr"])
    codeflash_output = merge_out_layout_with_ocr_layout(out_layout, ocr_layout)
    result = codeflash_output  # 321μs -> 292μs (9.83% faster)


# --------------------------
# 3. Large Scale Test Cases
# --------------------------


def test_merge_large_number_of_elements():
    # Test with 500 elements
    n = 500
    coords = [[i, i, i + 1, i + 1] for i in range(n)]
    texts = [f"(cid:{i})" for i in range(n)]
    out_layout = make_layout_elements(coords, texts)
    # OCR layout: every element overlaps exactly
    ocr_layout = make_text_regions(coords, [f"OCR{i}" for i in range(n)])
    codeflash_output = merge_out_layout_with_ocr_layout(out_layout, ocr_layout)
    result = codeflash_output  # 108ms -> 78.0ms (39.8% faster)


def test_merge_large_number_of_elements_with_some_valid():
    # Mix of valid and invalid texts
    n = 500
    coords = [[i, i, i + 1, i + 1] for i in range(n)]
    texts = [f"Valid{i}" if i % 2 == 0 else f"(cid:{i})" for i in range(n)]
    out_layout = make_layout_elements(coords, texts)
    ocr_layout = make_text_regions(coords, [f"OCR{i}" for i in range(n)])
    codeflash_output = merge_out_layout_with_ocr_layout(out_layout, ocr_layout)
    result = codeflash_output  # 55.8ms -> 40.1ms (39.1% faster)
    # Odd indices replaced, even indices unchanged
    for i in range(n):
        if i % 2 == 0:
            pass
        else:
            pass


def test_merge_large_ocr_layout_supplement():
    # Out layout covers only half, OCR layout has extra regions
    n = 500
    out_coords = [[i, i, i + 1, i + 1] for i in range(n // 2)]
    out_texts = [f"(cid:{i})" for i in range(n // 2)]
    out_layout = make_layout_elements(out_coords, out_texts)
    ocr_coords = [[i, i, i + 1, i + 1] for i in range(n)]
    ocr_texts = [f"OCR{i}" for i in range(n)]
    ocr_layout = make_text_regions(ocr_coords, ocr_texts)
    codeflash_output = merge_out_layout_with_ocr_layout(
        out_layout, ocr_layout, supplement_with_ocr_elements=True
    )
    result = codeflash_output  # 55.9ms -> 40.6ms (37.8% faster)
    # First n//2 should be replaced by corresponding OCR
    for i in range(n // 2):
        pass
    # Remaining should be OCR-only
    for i in range(n // 2, n):
        pass


def test_merge_large_empty_ocr_layout():
    # Large out_layout, empty ocr_layout
    n = 500
    coords = [[i, i, i + 1, i + 1] for i in range(n)]
    texts = [f"(cid:{i})" for i in range(n)]
    out_layout = make_layout_elements(coords, texts)
    ocr_layout = make_text_regions([], [])
    codeflash_output = merge_out_layout_with_ocr_layout(out_layout, ocr_layout)
    result = codeflash_output  # 2.04μs -> 1.81μs (12.7% faster)


def test_merge_large_empty_out_layout():
    # Large ocr_layout, empty out_layout
    n = 500
    out_layout = make_layout_elements([], [])
    ocr_coords = [[i, i, i + 1, i + 1] for i in range(n)]
    ocr_texts = [f"OCR{i}" for i in range(n)]
    ocr_layout = make_text_regions(ocr_coords, ocr_texts)
    codeflash_output = merge_out_layout_with_ocr_layout(out_layout, ocr_layout)
    result = codeflash_output  # 1.34μs -> 1.44μs (6.87% slower)


# --------------------------
# Miscellaneous/Regression
# --------------------------


def test_merge_preserves_order_and_types():
    # Ensure order and types are preserved after merge
    coords = [[0, 0, 1, 1], [2, 2, 3, 3]]
    texts = ["(cid:bad)", "Good"]
    out_layout = make_layout_elements(coords, texts)
    ocr_layout = make_text_regions(coords, ["OCR1", "OCR2"])
    codeflash_output = merge_out_layout_with_ocr_layout(out_layout, ocr_layout)
    result = codeflash_output  # 340μs -> 315μs (7.83% faster)


def test_merge_out_layout_with_no_text_attribute():
    # LayoutElements with no texts should not fail
    coords = [[0, 0, 1, 1]]
    out_layout = make_layout_elements(coords, [""])
    ocr_layout = make_text_regions(coords, ["OCR1"])
    codeflash_output = merge_out_layout_with_ocr_layout(out_layout, ocr_layout)
    result = codeflash_output  # 315μs -> 292μs (7.63% faster)


def test_merge_ocr_layout_with_empty_text():
    # OCR region with empty text should not replace invalid out_layout text
    coords = [[0, 0, 1, 1]]
    texts = ["(cid:bad)"]
    out_layout = make_layout_elements(coords, texts)
    ocr_layout = make_text_regions(coords, [""])
    codeflash_output = merge_out_layout_with_ocr_layout(out_layout, ocr_layout)
    result = codeflash_output  # 307μs -> 291μs (5.49% faster)


def test_merge_out_layout_with_multiple_invalid_texts_and_partial_ocr_overlap():
    # Some invalid texts overlap with OCR, some do not
    coords = [[0, 0, 1, 1], [10, 10, 11, 11]]
    texts = ["(cid:bad)", "(cid:bad2)"]
    out_layout = make_layout_elements(coords, texts)
    ocr_coords = [[0, 0, 1, 1]]
    ocr_layout = make_text_regions(ocr_coords, ["OCR1"])
    codeflash_output = merge_out_layout_with_ocr_layout(out_layout, ocr_layout)
    result = codeflash_output  # 440μs -> 421μs (4.62% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, run git checkout codeflash/optimize-pr4169-2026-01-07T19.04.59, commit, and push.

@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to codeflash labels Jan 7, 2026
@qued qued closed this Jan 7, 2026
@codeflash-ai codeflash-ai bot deleted the codeflash/optimize-pr4169-2026-01-07T19.04.59 branch January 7, 2026 20:00