
Conversation


@codeflash-ai codeflash-ai bot commented Jan 7, 2026

⚡️ This pull request contains optimizations for PR #4169

If you approve this dependent PR, these changes will be merged into the original PR branch fix/partially-filled-inferred-layout-mark-as-not-extracted.

This PR will be automatically closed if the original PR is merged.


📄 15% (0.15x) speedup for aggregate_embedded_text_by_block in unstructured/partition/pdf_image/pdfminer_processing.py

⏱️ Runtime: 377 microseconds → 328 microseconds (best of 8 runs)

📝 Explanation and details

The optimized code achieves a **15% speedup** by eliminating redundant calls to `source_regions.slice(mask)`.

**Key optimization:**
The original code called `source_regions.slice(mask)` three separate times:

1. To extract `.texts` for text aggregation
2. To get `.element_coords` for IoU calculation
3. To access `.is_extracted_array` for extraction verification

The optimized version caches this result in `masked_source = source_regions.slice(mask)` and reuses it. This single change reduces execution time from 377μs to 328μs.
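In pattern form, the change looks like the sketch below. `Regions` here is a hypothetical stand-in for the library's `TextRegions`-style container, not the actual class from `pdfminer_processing.py`; only the attribute names `texts`, `element_coords`, and `is_extracted_array` are taken from the PR description.

```python
import numpy as np

class Regions:
    """Hypothetical simplification of a TextRegions-style container."""

    def __init__(self, texts, element_coords, is_extracted_array):
        self.texts = texts
        self.element_coords = element_coords
        self.is_extracted_array = is_extracted_array

    def slice(self, mask):
        # Each call builds a new container, so repeated calls repeat the work.
        return Regions(
            [t for t, m in zip(self.texts, mask) if m],
            self.element_coords[mask],
            self.is_extracted_array[mask],
        )

def aggregate_before(source_regions, mask):
    # Original shape: three independent slice() calls over the same mask.
    texts = source_regions.slice(mask).texts
    coords = source_regions.slice(mask).element_coords
    extracted = source_regions.slice(mask).is_extracted_array
    return texts, coords, extracted

def aggregate_after(source_regions, mask):
    # Optimized shape: slice once, then reuse the cached result.
    masked_source = source_regions.slice(mask)
    return (
        masked_source.texts,
        masked_source.element_coords,
        masked_source.is_extracted_array,
    )
```

Both versions return the same values; only the number of `slice()` calls changes.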

**Why this matters:**
According to the line profiler, each `slice()` call has non-trivial overhead (~70–90μs). By caching the result, we eliminate two redundant slice operations when `sum(mask) > 0`. The profiler shows the cached access `masked_source.element_coords` takes only 1.4μs versus 71μs for the repeated `source_regions.slice(mask).element_coords` call, and accessing `masked_source.is_extracted_array` takes 15.5μs versus 81.8μs.

**Impact context:**
Based on the function reference, `aggregate_embedded_text_by_block` is called within a loop in `merge_out_layout_with_ocr_layout`, iterating over `invalid_text_indices`. For PDF documents with many text regions requiring OCR aggregation, this optimization compounds: each iteration saves roughly 100μs of redundant slicing, which adds up when processing documents with dozens or hundreds of text blocks.

**Performance characteristics:**
The optimization is most effective when:

- `mask` contains True values (the common case, where text regions overlap)
- `source_regions` contains multiple elements to slice
- the function is called repeatedly in hot paths (as evidenced by the OCR merge workflow)
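A self-contained micro-benchmark can reproduce the effect under these conditions. Everything below is illustrative: `Regions` is a hypothetical numpy-backed container whose `slice()` copies data, not the library's actual class.

```python
import timeit
import numpy as np

class Regions:
    """Hypothetical container; slice() copies data, so each call costs time."""

    def __init__(self, coords):
        self.coords = coords

    def slice(self, mask):
        return Regions(self.coords[mask].copy())

regions = Regions(np.random.rand(10_000, 4))
mask = np.random.rand(10_000) > 0.5

def repeated():
    # Three slices over the same mask, as in the original code path.
    a = regions.slice(mask).coords
    b = regions.slice(mask).coords
    c = regions.slice(mask).coords
    return a, b, c

def cached():
    # One slice, reused, as in the optimized code path.
    m = regions.slice(mask)
    return m.coords, m.coords, m.coords

t_rep = timeit.timeit(repeated, number=1_000)
t_cache = timeit.timeit(cached, number=1_000)
print(f"repeated: {t_rep:.3f}s  cached: {t_cache:.3f}s")
```

The cached variant does one third of the slicing work per call while returning identical data, mirroring the shape of the speedup reported above.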

Correctness verification report:

| Test | Status |
| --- | --- |
| ⏪ Replay Tests | 🔘 None Found |
| ⚙️ Existing Unit Tests | 12 Passed |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 🔘 None Found |
| 📊 Tests Coverage | 81.8% |
⚙️ Existing Unit Tests

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
| --- | --- | --- | --- |
| partition/pdf_image/test_pdfminer_processing.py::test_aggregate_by_block | 199μs | 175μs | 13.4% ✅ |
| partition/pdf_image/test_pdfminer_processing.py::test_aggregate_only_partially_fill_target | 177μs | 152μs | 16.7% ✅ |

To edit these changes, run `git checkout codeflash/optimize-pr4169-2026-01-07T19.05.25` and push.


badGarnet and others added 8 commits January 6, 2026 20:05
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to codeflash labels Jan 7, 2026
@codeflash-ai codeflash-ai bot closed this Jan 7, 2026

codeflash-ai bot commented Jan 7, 2026

This PR has been automatically closed because the original PR #4169 by badGarnet was closed.

Base automatically changed from fix/partially-filled-inferred-layout-mark-as-not-extracted to main January 7, 2026 22:17
@codeflash-ai codeflash-ai bot deleted the codeflash/optimize-pr4169-2026-01-07T19.05.25 branch January 7, 2026 22:17