
Conversation


@codeflash-ai codeflash-ai bot commented Jan 7, 2026

⚡️ This pull request contains optimizations for PR #4169

If you approve this dependent PR, these changes will be merged into the original PR branch fix/partially-filled-inferred-layout-mark-as-not-extracted.

This PR will be automatically closed if the original PR is merged.


📄 15% (0.15x) speedup for aggregate_embedded_text_by_block in unstructured/partition/pdf_image/pdfminer_processing.py

⏱️ Runtime: 377 microseconds → 328 microseconds (best of 8 runs)

📝 Explanation and details

The optimized code achieves a **15% speedup** by eliminating redundant calls to `source_regions.slice(mask)`.

**Key optimization:**
The original code called `source_regions.slice(mask)` three separate times:

1. To extract `.texts` for text aggregation
2. To get `.element_coords` for IoU calculation
3. To access `.is_extracted_array` for extraction verification

The optimized version caches this result in `masked_source = source_regions.slice(mask)` and reuses it. This single change reduces execution time from 377μs to 328μs.
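In pattern form, the change looks like the sketch below. `Regions` here is a hypothetical stand-in for the library's `TextRegions`-style container, not the actual class from `pdfminer_processing.py`; only the attribute names `texts`, `element_coords`, and `is_extracted_array` are taken from the PR description.

```python
import numpy as np

class Regions:
    """Hypothetical simplification of a TextRegions-style container."""

    def __init__(self, texts, element_coords, is_extracted_array):
        self.texts = texts
        self.element_coords = element_coords
        self.is_extracted_array = is_extracted_array

    def slice(self, mask):
        # Each call builds a new container, so repeated calls repeat the work.
        return Regions(
            [t for t, m in zip(self.texts, mask) if m],
            self.element_coords[mask],
            self.is_extracted_array[mask],
        )

def aggregate_before(source_regions, mask):
    # Original shape: three independent slice() calls over the same mask.
    texts = source_regions.slice(mask).texts
    coords = source_regions.slice(mask).element_coords
    extracted = source_regions.slice(mask).is_extracted_array
    return texts, coords, extracted

def aggregate_after(source_regions, mask):
    # Optimized shape: slice once, then reuse the cached result.
    masked_source = source_regions.slice(mask)
    return (
        masked_source.texts,
        masked_source.element_coords,
        masked_source.is_extracted_array,
    )
```

Both versions return the same values; only the number of `slice()` calls changes.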

**Why this matters:**
According to the line profiler, each `slice()` call has non-trivial overhead (~70–90μs). By caching the result, we eliminate two redundant slice operations when `sum(mask) > 0`. The profiler shows the cached access `masked_source.element_coords` takes only 1.4μs versus 71μs for the repeated `source_regions.slice(mask).element_coords` call, and accessing `masked_source.is_extracted_array` takes 15.5μs versus 81.8μs.

**Impact context:**
Based on the function reference, `aggregate_embedded_text_by_block` is called within a loop in `merge_out_layout_with_ocr_layout`, iterating over `invalid_text_indices`. For PDF documents with many text regions requiring OCR aggregation, this optimization compounds: each iteration saves roughly 100μs of redundant slicing, which adds up when processing documents with dozens or hundreds of text blocks.

**Performance characteristics:**
The optimization is most effective when:

- `mask` contains True values (the common case, where text regions overlap)
- `source_regions` contains multiple elements to slice
- the function is called repeatedly in hot paths (as evidenced by the OCR merge workflow)
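A self-contained micro-benchmark can reproduce the effect under these conditions. Everything below is illustrative: `Regions` is a hypothetical numpy-backed container whose `slice()` copies data, not the library's actual class.

```python
import timeit
import numpy as np

class Regions:
    """Hypothetical container; slice() copies data, so each call costs time."""

    def __init__(self, coords):
        self.coords = coords

    def slice(self, mask):
        return Regions(self.coords[mask].copy())

regions = Regions(np.random.rand(10_000, 4))
mask = np.random.rand(10_000) > 0.5

def repeated():
    # Three slices over the same mask, as in the original code path.
    a = regions.slice(mask).coords
    b = regions.slice(mask).coords
    c = regions.slice(mask).coords
    return a, b, c

def cached():
    # One slice, reused, as in the optimized code path.
    m = regions.slice(mask)
    return m.coords, m.coords, m.coords

t_rep = timeit.timeit(repeated, number=1_000)
t_cache = timeit.timeit(cached, number=1_000)
print(f"repeated: {t_rep:.3f}s  cached: {t_cache:.3f}s")
```

The cached variant does one third of the slicing work per call while returning identical data, mirroring the shape of the speedup reported above.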

Correctness verification report:

| Test | Status |
| --- | --- |
| ⏪ Replay Tests | 🔘 None Found |
| ⚙️ Existing Unit Tests | 12 Passed |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 🔘 None Found |
| 📊 Tests Coverage | 81.8% |
⚙️ Existing Unit Tests

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
| --- | --- | --- | --- |
| partition/pdf_image/test_pdfminer_processing.py::test_aggregate_by_block | 199μs | 175μs | 13.4% ✅ |
| partition/pdf_image/test_pdfminer_processing.py::test_aggregate_only_partially_fill_target | 177μs | 152μs | 16.7% ✅ |

To edit these changes, run `git checkout codeflash/optimize-pr4169-2026-01-07T19.05.25` and push.


badGarnet and others added 8 commits January 6, 2026 20:05
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to codeflash labels Jan 7, 2026
@codeflash-ai codeflash-ai bot closed this Jan 7, 2026

codeflash-ai bot commented Jan 7, 2026

This PR has been automatically closed because the original PR #4169 by badGarnet was closed.

Base automatically changed from fix/partially-filled-inferred-layout-mark-as-not-extracted to main January 7, 2026 22:17
@codeflash-ai codeflash-ai bot deleted the codeflash/optimize-pr4169-2026-01-07T19.05.25 branch January 7, 2026 22:17