Fix: partially filled inferred layout mark as not extracted #4169

badGarnet · 2026-01-07T16:27:08Z

This PR fixes an issue where elements whose region is only partially filled by extracted text are marked as extracted.

Bug scenario

This PR adds a new unit test to showcase the scenario:

When merging the inferred and extracted layouts, the function aggregate_embedded_text_by_block aggregates the extracted text that falls into an inferred element; if all of that text has the flag is_extracted set to "true", the inferred element is marked as extracted as well.
However, the extracted text may only partially fill the inferred element. The inferred element's region may contain text that is not present as extracted text (i.e., it requires OCR), yet the current logic still marks the inferred element as is_extracted = "true".

Fix

The fix adds another check in the function aggregate_embedded_text_by_block: whether the intersection over union (IoU) between the source regions and the target region exceeds a given threshold. This new check correctly identifies the case in the unit test, where the inferred element should be marked as is_extracted = "false".


badGarnet · 2026-01-07T18:16:44Z

@claude review

claude · 2026-01-07T18:17:04Z

Claude finished @badGarnet's task

Code Review for Partially Filled Inferred Layout Fix

Todo List

Read repository guidelines (CLAUDE.md)
Analyze the main logic changes in pdfminer_processing.py
Review the new test case implementation
Examine supporting changes in other files
Provide comprehensive review feedback

Review Summary

This PR effectively addresses a critical bug in the text extraction logic where inferred layout elements were incorrectly marked as is_extracted="true" when they only contained partial extracted text coverage. The solution adds an IoU (Intersection over Union) threshold check to ensure adequate coverage before marking elements as extracted.

Detailed Review

✅ Strengths

Clear Problem Definition: The bug scenario is well-explained with concrete test cases demonstrating the issue.
Sound Technical Solution: The addition of IoU threshold (embed_region_threshold=0.25) provides a quantitative measure of coverage adequacy.
Good Test Coverage: The new test test_aggregate_only_partially_fill_target() effectively demonstrates the fix with a small extracted region (20x20) within a large target region (300x300).
Proper API Evolution: The parameter renaming (threshold → subregion_threshold) and addition of embed_region_threshold improves API clarity.

🔍 Technical Analysis

Core Logic Change (unstructured/partition/pdf_image/pdfminer_processing.py:816-825):

if sum(mask):
    source_bboxes = source_regions.slice(mask).element_coords
    target_bboxes = target_region.element_coords
    
    iou = _aggregated_iou(source_bboxes, target_bboxes[0, :])
    
    is_extracted = (
        all(flag == IsExtracted.TRUE for flag in source_regions.slice(mask).is_extracted_array)
        and iou > embed_region_threshold  # New IoU check
    )

The logic now requires both conditions:

All source regions are marked as extracted (IsExtracted.TRUE)
IoU between aggregated source regions and target region exceeds 25%
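
The two-part check can be sketched in isolation as follows. This is a minimal illustration rather than the library's code: `is_region_extracted` and its plain boolean flags are hypothetical stand-ins for the `IsExtracted` enum and the region arrays, and the 0.25 default simply mirrors the `embed_region_threshold` value quoted above.

```python
def is_region_extracted(source_flags, iou, embed_region_threshold=0.25):
    """Hypothetical helper: decide whether an inferred region counts as extracted.

    source_flags: one boolean per overlapping source region (True = extracted).
    iou: precomputed coverage of the target region by those source regions.
    """
    # Both conditions must hold; the comparison is a strict ">" as in the
    # snippet above, so a value exactly at the threshold does not pass.
    return all(source_flags) and iou > embed_region_threshold


print(is_region_extracted([True, True], 0.6))    # full flags, good coverage
print(is_region_extracted([True], 0.004))        # full flags, poor coverage
print(is_region_extracted([True, False], 0.9))   # one source not extracted
```

Under these assumptions only the first call reports the region as extracted; the second corresponds to the partially filled case this PR targets.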

IoU Calculation (_aggregated_iou function at lines 777-789):

Computes union of multiple source bboxes against single target bbox
Handles edge case where union=0 by returning 1.0
Uses proper IoU formula: intersection / union
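
The calculation described above can be approximated with the sketch below. It is reconstructed from the arithmetic quoted in this thread (summed intersections, summed areas, union as their difference, 1.0 when the union is zero) and is not the library function itself; the name `aggregated_iou_sketch` is made up for illustration, and boxes are assumed to be `(x1, y1, x2, y2)`.

```python
import numpy as np


def aggregated_iou_sketch(box1s, box2):
    """Sum-based IoU of several source boxes (N x 4 array) against one target box."""
    x1_t, y1_t, x2_t, y2_t = box2
    intersection = 0.0
    sum_areas = (x2_t - x1_t) * (y2_t - y1_t)  # start with the target area
    for x1, y1, x2, y2 in np.asarray(box1s, dtype=float):
        sum_areas += (x2 - x1) * (y2 - y1)
        # Overlap of this source box with the target; skipped when empty
        w = min(x2, x2_t) - max(x1, x1_t)
        h = min(y2, y2_t) - max(y1, y1_t)
        if w > 0 and h > 0:
            intersection += w * h
    union = sum_areas - intersection
    # Degenerate case: every box has zero area
    return 1.0 if union == 0 else float(intersection / union)


# A 20x20 source inside a 300x300 target, as in the new unit test
small = aggregated_iou_sketch(np.array([[10.0, 10.0, 30.0, 30.0]]), (0.0, 0.0, 300.0, 300.0))
```

Here `small` works out to 400 / 90000, far below the 0.25 threshold, which is why the target is no longer marked as extracted.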

⚠️ Concerns & Questions

Magic Number: The embed_region_threshold=0.25 appears to be empirically chosen. Consider:
- Adding documentation explaining why 25% was selected
- Making this configurable via environment config for future tuning
IoU Calculation Edge Case: In _aggregated_iou(), when union == 0, the function returns 1.0. While mathematically reasonable, this could mask potential issues where bboxes have zero area. Consider logging this case.
Performance: The new IoU calculation adds computational overhead. For large documents, this could impact performance. Consider profiling if this becomes a bottleneck.
Parameter Naming: The embed_region_threshold parameter name could be more descriptive (e.g., min_coverage_ratio or text_coverage_threshold).
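
One way to make the threshold tunable is to read it from the environment with a documented default. The sketch below is an assumption-laden illustration only: the variable name `EMBED_REGION_THRESHOLD` and the helper `get_embed_region_threshold` are invented here. The commit list in this thread includes "use config to set threshold", but how that commit implements it is not shown.

```python
import os

DEFAULT_EMBED_REGION_THRESHOLD = 0.25  # value quoted in the review above


def get_embed_region_threshold(env=os.environ):
    """Read the coverage threshold from the environment, with a fallback default.

    The variable name EMBED_REGION_THRESHOLD is hypothetical.
    """
    raw = env.get("EMBED_REGION_THRESHOLD")
    if raw is None:
        return DEFAULT_EMBED_REGION_THRESHOLD
    value = float(raw)
    # Reject values that cannot be a meaningful coverage ratio
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"threshold must be within [0, 1], got {value}")
    return value
```

Passing the mapping as an argument keeps the helper easy to test without mutating the real process environment.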

🧪 Test Quality

The test case effectively demonstrates the fix:

def test_aggregate_only_partially_fill_target():
    # Small extracted region (20x20) in large target (300x300)  
    # IoU ≈ 400/(400+90000-400) ≈ 0.004 < 0.25 threshold
    # Should result in is_extracted="false"

Suggestions for additional testing:

Test with IoU exactly at threshold boundary (0.25)
Test with multiple small regions that together exceed threshold
Test with overlapping source regions
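
The suggested cases can be explored with a small stand-in before writing tests against the real function. The pure-Python helper below, `sum_based_iou`, is hypothetical and only mirrors the summed-intersection arithmetic quoted in this thread; boxes are assumed to be `(x1, y1, x2, y2)`.

```python
def sum_based_iou(sources, target):
    """Hypothetical stand-in for the aggregated IoU computation."""
    def area(b):
        return (b[2] - b[0]) * (b[3] - b[1])

    def overlap(a, b):
        w = min(a[2], b[2]) - max(a[0], b[0])
        h = min(a[3], b[3]) - max(a[1], b[1])
        return w * h if w > 0 and h > 0 else 0.0

    intersection = sum(overlap(s, target) for s in sources)
    union = area(target) + sum(area(s) for s in sources) - intersection
    return 1.0 if union == 0 else intersection / union


THRESHOLD = 0.25
target = (0.0, 0.0, 10.0, 10.0)

# Exactly at the boundary: a strict ">" comparison does not pass
boundary_iou = sum_based_iou([(0.0, 0.0, 10.0, 2.5)], target)  # 25 / 100

# Several small regions that exceed the threshold only together
tiles = [(0.0, 0.0, 10.0, 1.0), (0.0, 1.0, 10.0, 2.0), (0.0, 2.0, 10.0, 3.0)]
single_tile_iou = sum_based_iou(tiles[:1], target)  # 10 / 100
combined_iou = sum_based_iou(tiles, target)  # 30 / 100

# Overlapping (here identical) sources are each counted in full
duplicated_iou = sum_based_iou([target, target, target], target)  # 300 / 100
```

Because intersections are summed per source box, overlapping sources can push the value above 1.0. That matches the expected values of 3.0 and 100.0 in the generated regression tests further down, so a test for overlapping sources should probably pin down whether that behavior is intended.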

📝 Minor Issues

Docstring Update: The docstring for aggregate_embedded_text_by_block should document the new embed_region_threshold parameter.
Import Consistency: Verify that all imports are still used after the parameter renaming.

Recommendation: ✅ APPROVE

This is a well-implemented fix for a legitimate bug. The IoU threshold approach is sound and the test coverage adequately demonstrates the fix. The minor concerns raised are suggestions for future improvements rather than blocking issues.

The fix correctly identifies cases where extracted text only partially fills an inferred region and appropriately marks such elements as requiring OCR (is_extracted="false").


codeflash-ai · 2026-01-07T19:01:26Z

unstructured/partition/pdf_image/pdfminer_processing.py

+    intersection = 0.0
+    sum_areas = calculate_bbox_area(box2)
+
+    for i in range(box1s.shape[0]):
+        intersection += calculate_intersection_area(box1s[i, :], box2)
+        sum_areas += calculate_bbox_area(box1s[i, :])
+


⚡️Codeflash found 129% (1.29x) speedup for _aggregated_iou in unstructured/partition/pdf_image/pdfminer_processing.py

⏱️ Runtime : 4.05 milliseconds → 1.77 milliseconds (best of 15 runs)

📝 Explanation and details

The optimized code achieves a 129% speedup (4.05ms → 1.77ms) by eliminating expensive function call overhead within a tight loop. The key optimizations are:

What Changed

Inlined function calls: Instead of calling calculate_bbox_area() and calculate_intersection_area() repeatedly in the loop (1,032+ times per execution), all area and intersection calculations are performed inline using direct arithmetic operations.

Hoisted box2 unpacking: The box2 tuple is unpacked once before the loop instead of being unpacked on every iteration inside calculate_intersection_area().

Direct array indexing: Changed from slice notation box1s[i, :] to individual element access box1s[i, 0], box1s[i, 1], etc., which avoids creating intermediate array slices.

Why It's Faster

Function call overhead dominates: The line profiler shows that in the original code, 65.2% of time was spent in calculate_intersection_area() calls and 30.2% in calculate_bbox_area() calls. In Python, function calls are expensive due to:

Stack frame creation/destruction

Argument passing and tuple unpacking

Name resolution in function scope

By inlining these operations, the optimized version spends only 5.3% of time on intersection area calculation (now a simple inline multiplication) and eliminates the separate calculate_bbox_area() calls entirely.

Reduced tuple operations: The original code unpacked bbox2 1,032 times inside calculate_intersection_area(). The optimized version does this once, saving thousands of tuple unpacking operations.

Impact on Workloads

Based on the function_references, _aggregated_iou() is called from aggregate_embedded_text_by_block() during PDF text extraction - once per target region to compute IoU between embedded text bounding boxes and layout blocks. Given that:

PDFs commonly have dozens to hundreds of layout blocks

Each call processes all overlapping text regions (the box1s parameter)

This happens during document parsing, a core operation

The optimization provides meaningful speedup in document-heavy workloads. The annotated tests confirm the optimization scales well:

Small inputs (1-3 boxes): 60-104% faster

Medium inputs (100 boxes): 135-145% faster

Large inputs (500 boxes): 128% faster

The speedup is consistent across all test scenarios because the bottleneck (function call overhead) is eliminated regardless of input size, making this particularly valuable for PDFs with many text regions.

✅ Correctness verification report:

Test	Status
⏪ Replay Tests	🔘 None Found
⚙️ Existing Unit Tests	🔘 None Found
🔎 Concolic Coverage Tests	🔘 None Found
🌀 Generated Regression Tests	✅ 29 Passed
📊 Tests Coverage	100.0%

🌀 Generated Regression Tests

# imports
from unstructured.partition.pdf_image.pdfminer_processing import _aggregated_iou


# Helper to create a "box1s" array for tests, since the function expects .shape and [i, :]
class DummyArray:
    def __init__(self, boxes):
        self._boxes = boxes
        self.shape = (len(boxes), 4)

    def __getitem__(self, idx):
        return self._boxes[idx]


# -------------------------
# Basic Test Cases
# -------------------------


def test_multiple_boxes_all_perfect_overlap():
    # Multiple boxes, all perfectly overlap with box2
    box1 = (0, 0, 2, 2)
    box2 = (0, 0, 2, 2)
    box3 = (0, 0, 2, 2)
    box2_main = (0, 0, 2, 2)
    box1s = DummyArray([box1, box2, box3])


# -------------------------
# Edge Test Cases
# -------------------------


def test_empty_box1s():
    # Edge: box1s is empty
    box1s = DummyArray([])
    box2 = (0, 0, 2, 2)
    codeflash_output = _aggregated_iou(box1s, box2)
    result = codeflash_output  # 2.71μs -> 2.42μs (11.6% faster)

import numpy as np  # needed for array operations

# imports
# function to test
from unstructured.partition.pdf_image.pdfminer_processing import (
    _aggregated_iou,
)

# unit tests

# ============================================================================
# BASIC FUNCTIONALITY TEST CASES
# ============================================================================


def test_single_box_complete_overlap():
    """Test IoU when box1s contains one box identical to box2 (should return 1.0)"""
    # Create a single box in box1s that is identical to box2
    box1s = np.array([[0.0, 0.0, 10.0, 10.0]])
    box2 = (0.0, 0.0, 10.0, 10.0)

    # Calculate IoU
    codeflash_output = _aggregated_iou(box1s, box2)
    result = codeflash_output  # 19.6μs -> 12.2μs (60.5% faster)


def test_multiple_boxes_varying_overlaps():
    """Test IoU with multiple boxes having different overlap amounts"""
    # Create multiple boxes with different overlaps
    box1s = np.array(
        [
            [0.0, 0.0, 10.0, 10.0],  # complete overlap
            [5.0, 0.0, 15.0, 10.0],  # partial overlap
            [20.0, 20.0, 30.0, 30.0],  # no overlap
        ]
    )
    box2 = (0.0, 0.0, 10.0, 10.0)

    # Calculate IoU
    codeflash_output = _aggregated_iou(box1s, box2)
    result = codeflash_output  # 21.2μs -> 10.4μs (104% faster)

    # Intersection: 100 (first) + 50 (second) + 0 (third) = 150
    # Sum areas: 100 (box2) + 100 + 100 + 100 = 400
    # Union: 400 - 150 = 250
    # IoU: 150/250 = 0.6
    expected = 150.0 / 250.0


# ============================================================================
# EDGE CASES
# ============================================================================


def test_empty_box1s_array():
    """Test IoU when box1s is an empty array with shape (0, 4)"""
    # Create empty array
    box1s = np.empty((0, 4))
    box2 = (0.0, 0.0, 10.0, 10.0)

    # Calculate IoU
    codeflash_output = _aggregated_iou(box1s, box2)
    result = codeflash_output  # 2.35μs -> 2.33μs (0.900% faster)


def test_zero_area_box_in_box1s():
    """Test IoU when a box in box1s has zero area (x1==x2)"""
    # Create box1s with a zero-area box
    box1s = np.array([[5.0, 5.0, 5.0, 10.0]])  # width = 0
    box2 = (0.0, 0.0, 10.0, 10.0)

    # Calculate IoU
    codeflash_output = _aggregated_iou(box1s, box2)
    result = codeflash_output  # 11.9μs -> 6.19μs (91.7% faster)


def test_zero_area_box2():
    """Test IoU when box2 has zero area"""
    # Create normal box1s
    box1s = np.array([[0.0, 0.0, 10.0, 10.0]])
    box2 = (5.0, 5.0, 5.0, 10.0)  # width = 0

    # Calculate IoU
    codeflash_output = _aggregated_iou(box1s, box2)
    result = codeflash_output  # 11.5μs -> 6.30μs (82.5% faster)


def test_both_boxes_zero_area():
    """Test IoU when both box1s and box2 have zero area"""
    # Create zero-area boxes
    box1s = np.array([[5.0, 5.0, 5.0, 5.0]])
    box2 = (3.0, 3.0, 3.0, 3.0)

    # Calculate IoU
    codeflash_output = _aggregated_iou(box1s, box2)
    result = codeflash_output  # 11.6μs -> 6.03μs (91.8% faster)


def test_touching_boxes_no_overlap():
    """Test IoU when boxes touch at an edge but don't overlap"""
    # Create boxes that share an edge
    box1s = np.array([[10.0, 0.0, 20.0, 10.0]])  # touches right edge of box2
    box2 = (0.0, 0.0, 10.0, 10.0)

    # Calculate IoU
    codeflash_output = _aggregated_iou(box1s, box2)
    result = codeflash_output  # 11.4μs -> 6.11μs (86.9% faster)


def test_negative_coordinates():
    """Test IoU with boxes having negative coordinates"""
    # Create boxes with negative coordinates
    box1s = np.array([[-10.0, -10.0, 0.0, 0.0]])
    box2 = (-5.0, -5.0, 5.0, 5.0)

    # Calculate IoU
    codeflash_output = _aggregated_iou(box1s, box2)
    result = codeflash_output  # 11.9μs -> 6.28μs (89.3% faster)

    # Intersection area: overlap from (-5,-5) to (0,0) = 25
    # Union: 100 (box1s) + 100 (box2) - 25 = 175
    # IoU = 25/175 = 1/7
    expected = 25.0 / 175.0


def test_very_small_boxes():
    """Test IoU with very small boxes (minimal dimensions)"""
    # Create very small boxes
    box1s = np.array([[0.0, 0.0, 0.1, 0.1]])
    box2 = (0.05, 0.05, 0.15, 0.15)

    # Calculate IoU
    codeflash_output = _aggregated_iou(box1s, box2)
    result = codeflash_output  # 11.8μs -> 6.40μs (83.7% faster)

    # Intersection: overlap from (0.05, 0.05) to (0.1, 0.1) = 0.0025
    # Union: 0.01 (box1s) + 0.01 (box2) - 0.0025 = 0.0175
    # IoU = 0.0025/0.0175 = 1/7
    expected = 0.0025 / 0.0175


# ============================================================================
# MATHEMATICAL EDGE CASES
# ============================================================================


def test_perfect_union_iou_equals_one():
    """Test that identical boxes give IoU = 1.0"""
    # Create identical boxes
    box1s = np.array([[5.0, 5.0, 15.0, 15.0]])
    box2 = (5.0, 5.0, 15.0, 15.0)

    # Calculate IoU
    codeflash_output = _aggregated_iou(box1s, box2)
    result = codeflash_output  # 11.6μs -> 6.17μs (87.2% faster)


def test_zero_intersection_iou_equals_zero():
    """Test that completely separate boxes give IoU = 0.0"""
    # Create completely separate boxes
    box1s = np.array([[100.0, 100.0, 200.0, 200.0]])
    box2 = (0.0, 0.0, 10.0, 10.0)

    # Calculate IoU
    codeflash_output = _aggregated_iou(box1s, box2)
    result = codeflash_output  # 11.3μs -> 5.91μs (90.7% faster)


def test_multiple_boxes_covering_box2_completely():
    """Test IoU when multiple boxes in box1s together cover all of box2"""
    # Create boxes that together cover box2
    box1s = np.array([[0.0, 0.0, 5.0, 10.0], [5.0, 0.0, 10.0, 10.0]])  # left half  # right half
    box2 = (0.0, 0.0, 10.0, 10.0)

    # Calculate IoU
    codeflash_output = _aggregated_iou(box1s, box2)
    result = codeflash_output  # 16.2μs -> 8.13μs (98.9% faster)


def test_overlapping_boxes_in_box1s():
    """Test IoU when boxes in box1s overlap with each other"""
    # Create overlapping boxes in box1s
    box1s = np.array([[0.0, 0.0, 10.0, 10.0], [5.0, 5.0, 15.0, 15.0]])
    box2 = (0.0, 0.0, 10.0, 10.0)

    # Calculate IoU
    codeflash_output = _aggregated_iou(box1s, box2)
    result = codeflash_output  # 16.2μs -> 8.46μs (92.1% faster)

    # Intersection with box2: 100 (first) + 25 (second) = 125
    # Sum areas: 100 (box2) + 100 (first) + 100 (second) = 300
    # Union: 300 - 125 = 175
    # IoU = 125/175 = 5/7
    expected = 125.0 / 175.0


def test_multiple_identical_boxes_in_box1s():
    """Test IoU when box1s contains multiple identical boxes"""
    # Create multiple identical boxes
    box1s = np.array([[0.0, 0.0, 10.0, 10.0], [0.0, 0.0, 10.0, 10.0], [0.0, 0.0, 10.0, 10.0]])
    box2 = (0.0, 0.0, 10.0, 10.0)

    # Calculate IoU
    codeflash_output = _aggregated_iou(box1s, box2)
    result = codeflash_output  # 19.9μs -> 9.91μs (101% faster)

    # Intersection: 100 + 100 + 100 = 300
    # Sum areas: 100 (box2) + 100 + 100 + 100 = 400
    # Union: 400 - 300 = 100
    # IoU = 300/100 = 3.0
    expected = 300.0 / 100.0


# ============================================================================
# LARGE SCALE TEST CASES
# ============================================================================


def test_many_boxes_no_overlap():
    """Test IoU with 100 boxes that don't overlap box2"""
    # Create 100 non-overlapping boxes
    box1s = np.array(
        [[i * 20.0, i * 20.0, i * 20.0 + 10.0, i * 20.0 + 10.0] for i in range(1, 101)]
    )
    box2 = (0.0, 0.0, 10.0, 10.0)

    # Calculate IoU
    codeflash_output = _aggregated_iou(box1s, box2)
    result = codeflash_output  # 351μs -> 146μs (141% faster)


def test_many_boxes_all_overlapping():
    """Test IoU with 100 boxes all overlapping box2"""
    # Create 100 boxes all overlapping with box2
    box1s = np.array([[0.0, 0.0, 10.0, 10.0] for _ in range(100)])
    box2 = (0.0, 0.0, 10.0, 10.0)

    # Calculate IoU
    codeflash_output = _aggregated_iou(box1s, box2)
    result = codeflash_output  # 384μs -> 163μs (135% faster)

    # Intersection: 100 * 100 = 10000
    # Sum areas: 100 + 100*100 = 10100
    # Union: 10100 - 10000 = 100
    # IoU = 10000/100 = 100.0
    expected = 10000.0 / 100.0


def test_many_boxes_partial_overlap():
    """Test IoU with 200 boxes having varying partial overlaps"""
    # Create 200 boxes with different overlaps
    box1s_list = []
    for i in range(200):
        # Create boxes that shift slightly each time
        x_offset = i * 0.1
        box1s_list.append([x_offset, 0.0, x_offset + 10.0, 10.0])
    box1s = np.array(box1s_list)
    box2 = (0.0, 0.0, 10.0, 10.0)

    # Calculate IoU
    codeflash_output = _aggregated_iou(box1s, box2)
    result = codeflash_output  # 736μs -> 307μs (139% faster)


def test_large_coordinate_values():
    """Test IoU with very large coordinate values"""
    # Create boxes with large coordinates
    box1s = np.array([[1e6, 1e6, 1e6 + 100, 1e6 + 100]])
    box2 = (1e6 + 50, 1e6 + 50, 1e6 + 150, 1e6 + 150)

    # Calculate IoU
    codeflash_output = _aggregated_iou(box1s, box2)
    result = codeflash_output  # 12.0μs -> 6.53μs (83.3% faster)

    # Intersection: overlap from (1e6+50, 1e6+50) to (1e6+100, 1e6+100) = 2500
    # Union: 10000 + 10000 - 2500 = 17500
    # IoU = 2500/17500 = 1/7
    expected = 2500.0 / 17500.0


def test_high_precision_coordinates():
    """Test IoU with high precision decimal coordinates"""
    # Create boxes with many decimal places
    box1s = np.array([[0.123456789, 0.987654321, 10.123456789, 10.987654321]])
    box2 = (5.111111111, 5.222222222, 15.333333333, 15.444444444)

    # Calculate IoU
    codeflash_output = _aggregated_iou(box1s, box2)
    result = codeflash_output  # 11.6μs -> 6.35μs (83.0% faster)


def test_mixed_large_and_small_boxes():
    """Test IoU with a mix of very large and very small boxes"""
    # Create mix of large and small boxes
    box1s = np.array(
        [
            [0.0, 0.0, 1000.0, 1000.0],  # large box
            [5.0, 5.0, 5.01, 5.01],  # tiny box
            [100.0, 100.0, 200.0, 200.0],  # medium box
        ]
    )
    box2 = (0.0, 0.0, 10.0, 10.0)

    # Calculate IoU
    codeflash_output = _aggregated_iou(box1s, box2)
    result = codeflash_output  # 20.3μs -> 10.1μs (101% faster)


def test_many_boxes_grid_pattern():
    """Test IoU with boxes arranged in a grid pattern"""
    # Create a 10x10 grid of boxes
    box1s_list = []
    for i in range(10):
        for j in range(10):
            x1 = i * 5.0
            y1 = j * 5.0
            x2 = x1 + 3.0
            y2 = y1 + 3.0
            box1s_list.append([x1, y1, x2, y2])
    box1s = np.array(box1s_list)
    box2 = (0.0, 0.0, 10.0, 10.0)

    # Calculate IoU
    codeflash_output = _aggregated_iou(box1s, box2)
    result = codeflash_output  # 357μs -> 145μs (145% faster)


def test_boxes_with_fractional_overlap():
    """Test IoU with boxes having various fractional overlaps"""
    # Create boxes with specific fractional overlaps
    box1s = np.array(
        [
            [0.0, 0.0, 5.0, 10.0],  # 50% overlap
            [0.0, 0.0, 2.5, 10.0],  # 25% overlap
            [0.0, 0.0, 7.5, 10.0],  # 75% overlap
            [0.0, 0.0, 1.0, 10.0],  # 10% overlap
        ]
    )
    box2 = (0.0, 0.0, 10.0, 10.0)

    # Calculate IoU
    codeflash_output = _aggregated_iou(box1s, box2)
    result = codeflash_output  # 23.7μs -> 11.6μs (104% faster)

    # Intersection: 50 + 25 + 75 + 10 = 160
    # Sum areas: 100 + 50 + 25 + 75 + 10 = 260
    # Union: 260 - 160 = 100
    # IoU = 160/100 = 1.6
    expected = 160.0 / 100.0


def test_performance_with_500_boxes():
    """Test performance and correctness with 500 boxes"""
    # Create 500 boxes with varying positions
    np.random.seed(42)  # for reproducibility
    box1s = np.random.rand(500, 4) * 100
    # Ensure x1 < x2 and y1 < y2
    for i in range(500):
        if box1s[i, 0] > box1s[i, 2]:
            box1s[i, 0], box1s[i, 2] = box1s[i, 2], box1s[i, 0]
        if box1s[i, 1] > box1s[i, 3]:
            box1s[i, 1], box1s[i, 3] = box1s[i, 3], box1s[i, 1]
    box2 = (25.0, 25.0, 75.0, 75.0)

    # Calculate IoU
    codeflash_output = _aggregated_iou(box1s, box2)
    result = codeflash_output  # 1.91ms -> 837μs (128% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To test or edit this optimization locally, run: git merge codeflash/optimize-pr4169-2026-01-07T19.01.25

Suggested change

-    intersection = 0.0
-    sum_areas = calculate_bbox_area(box2)
-    for i in range(box1s.shape[0]):
-        intersection += calculate_intersection_area(box1s[i, :], box2)
-        sum_areas += calculate_bbox_area(box1s[i, :])
+    # Extract box2 coordinates once
+    x1_2, y1_2, x2_2, y2_2 = box2
+    box2_area = (x2_2 - x1_2) * (y2_2 - y1_2)
+    intersection = 0.0
+    sum_areas = box2_area
+    n = box1s.shape[0]
+    for i in range(n):
+        # Direct indexing instead of slicing
+        x1_1 = box1s[i, 0]
+        y1_1 = box1s[i, 1]
+        x2_1 = box1s[i, 2]
+        y2_1 = box1s[i, 3]
+        # Inline bbox area calculation
+        sum_areas += (x2_1 - x1_1) * (y2_1 - y1_1)
+        # Inline intersection calculation
+        x_intersection = max(x1_1, x1_2)
+        y_intersection = max(y1_1, y1_2)
+        x2_intersection = min(x2_1, x2_2)
+        y2_intersection = min(y2_1, y2_2)
+        if x_intersection < x2_intersection and y_intersection < y2_intersection:
+            intersection += (x2_intersection - x_intersection) * (y2_intersection - y_intersection)

codeflash-ai · 2026-01-07T19:02:11Z

⚡️ Codeflash found optimizations for this PR

📄 142% (1.42x) speedup for `merge_out_layout_with_ocr_layout` in `unstructured/partition/pdf_image/ocr.py`

⏱️ Runtime : 4.05 milliseconds → 1.67 milliseconds (best of 5 runs)

A dependent PR with the suggested changes has been created. Please review:

⚡️ Speed up function merge_out_layout_with_ocr_layout by 142% in PR #4169 (fix/partially-filled-inferred-layout-mark-as-not-extracted) #4170

If you approve, it will be merged into this PR (branch fix/partially-filled-inferred-layout-mark-as-not-extracted).

codeflash-ai · 2026-01-07T19:05:08Z

⚡️ Codeflash found optimizations for this PR

📄 37% (0.37x) speedup for `merge_out_layout_with_ocr_layout` in `unstructured/partition/pdf_image/ocr.py`

⏱️ Runtime : 229 milliseconds → 167 milliseconds (best of 6 runs)

A dependent PR with the suggested changes has been created. Please review:

⚡️ Speed up function merge_out_layout_with_ocr_layout by 37% in PR #4169 (fix/partially-filled-inferred-layout-mark-as-not-extracted) #4171

If you approve, it will be merged into this PR (branch fix/partially-filled-inferred-layout-mark-as-not-extracted).

codeflash-ai · 2026-01-07T19:05:34Z

⚡️ Codeflash found optimizations for this PR

📄 15% (0.15x) speedup for `aggregate_embedded_text_by_block` in `unstructured/partition/pdf_image/pdfminer_processing.py`

⏱️ Runtime : 377 microseconds → 328 microseconds (best of 8 runs)

A dependent PR with the suggested changes has been created. Please review:

⚡️ Speed up function aggregate_embedded_text_by_block by 15% in PR #4169 (fix/partially-filled-inferred-layout-mark-as-not-extracted) #4172

If you approve, it will be merged into this PR (branch fix/partially-filled-inferred-layout-mark-as-not-extracted).


badGarnet added 3 commits January 6, 2026 20:05

feat: use text coverage for an inferred region to set is_extracted

e86d0c4

fix aggregate iou computation

db2dc9c

remove debug print

e3d3894

badGarnet marked this pull request as ready for review January 7, 2026 17:21

Merge remote-tracking branch 'origin/main' into fix/partially-filled-inferred-layout-mark-as-not-extracted

b177721

use config to set threshold

819956c

cragwolfe approved these changes Jan 7, 2026


badGarnet added 2 commits January 7, 2026 12:45

Merge remote-tracking branch 'origin/main' into fix/partially-filled-inferred-layout-mark-as-not-extracted

c20ac5e

use partial

23c1451

codeflash-ai bot reviewed Jan 7, 2026


codeflash-ai bot mentioned this pull request Jan 7, 2026

⚡️ Speed up function merge_out_layout_with_ocr_layout by 142% in PR #4169 (fix/partially-filled-inferred-layout-mark-as-not-extracted) #4170

Closed

codeflash-ai bot mentioned this pull request Jan 7, 2026

⚡️ Speed up function merge_out_layout_with_ocr_layout by 37% in PR #4169 (fix/partially-filled-inferred-layout-mark-as-not-extracted) #4171

Closed

codeflash-ai bot mentioned this pull request Jan 7, 2026

⚡️ Speed up function aggregate_embedded_text_by_block by 15% in PR #4169 (fix/partially-filled-inferred-layout-mark-as-not-extracted) #4172

Closed

badGarnet added 2 commits January 7, 2026 15:11

fix: fix test

caaad7f

Merge remote-tracking branch 'origin/main' into fix/partially-filled-inferred-layout-mark-as-not-extracted

d1541dd

badGarnet merged commit e3c4b52 into main Jan 7, 2026
38 of 39 checks passed

badGarnet deleted the fix/partially-filled-inferred-layout-mark-as-not-extracted branch January 7, 2026 22:17


3 participants


