Fix: partially filled inferred layout mark as not extracted #4169
Merged
badGarnet merged 9 commits into main from fix/partially-filled-inferred-layout-mark-as-not-extracted
Jan 7, 2026
+79 −14
Commits
e86d0c4 feat: use text coverage for an inferred region to set is_extracted (badGarnet)
db2dc9c fix aggregate iou computation (badGarnet)
e3d3894 remove debug print (badGarnet)
b177721 Merge remote-tracking branch 'origin/main' into fix/partially-filled-… (badGarnet)
819956c use config to set threshold (badGarnet)
c20ac5e Merge remote-tracking branch 'origin/main' into fix/partially-filled-… (badGarnet)
23c1451 use partial (badGarnet)
caaad7f fix: fix test (badGarnet)
d1541dd Merge remote-tracking branch 'origin/main' into fix/partially-filled-… (badGarnet)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | @@ -1 +1 @@ |
| 1 | | -__version__ = "0.18.27-dev6" # pragma: no cover |
| | 1 | +__version__ = "0.18.27" # pragma: no cover |
⚡️ Codeflash found 129% (1.29x) speedup for _aggregated_iou in unstructured/partition/pdf_image/pdfminer_processing.py
⏱️ Runtime: 4.05 milliseconds → 1.77 milliseconds (best of 15 runs)
📝 Explanation and details
The optimized code achieves a 129% speedup (4.05ms → 1.77ms) by eliminating expensive function call overhead within a tight loop. The key optimizations are:
What Changed
Inlined function calls: Instead of calling calculate_bbox_area() and calculate_intersection_area() repeatedly in the loop (1,032+ times per execution), all area and intersection calculations are performed inline using direct arithmetic operations.
Hoisted box2 unpacking: The box2 tuple is unpacked once before the loop instead of being unpacked on every iteration inside calculate_intersection_area().
Direct array indexing: Changed from slice notation box1s[i, :] to individual element access box1s[i, 0], box1s[i, 1], etc., which avoids creating intermediate array slices.
Why It's Faster
Function call overhead dominates: The line profiler shows that in the original code, 65.2% of time was spent in calculate_intersection_area() calls and 30.2% in calculate_bbox_area() calls. In Python, function calls are comparatively expensive. By inlining these operations, the optimized version spends only 5.3% of time on intersection area calculation (now a simple inline multiplication) and eliminates the separate calculate_bbox_area() calls entirely.
Reduced tuple operations: The original code unpacked bbox2 1,032 times inside calculate_intersection_area(). The optimized version does this once, saving the repeated tuple unpacking operations.
Impact on Workloads
Based on the function_references, _aggregated_iou() is called from aggregate_embedded_text_by_block() during PDF text extraction, once per target region, to compute IoU between embedded text bounding boxes and layout blocks.
Given that: … (box1s parameter)
The optimization provides meaningful speedup in document-heavy workloads. The annotated tests confirm the optimization scales well.
The speedup is consistent across all test scenarios because the bottleneck (function call overhead) is eliminated regardless of input size, making this particularly valuable for PDFs with many text regions.
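The inlining, hoisting, and element-wise access described under "What Changed" can be sketched as follows. This is a minimal illustration, not the code from the PR: the function names, the plain-tuple box representation (x1, y1, x2, y2), and the exact aggregation formula (summed intersections divided by summed areas minus summed intersections) are all assumptions made for the example.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2)


def bbox_area(box: Box) -> float:
    """Area of one box (hypothetical stand-in for calculate_bbox_area)."""
    x1, y1, x2, y2 = box
    return (x2 - x1) * (y2 - y1)


def intersection_area(box_a: Box, box_b: Box) -> float:
    """Overlap area of two boxes (hypothetical stand-in for calculate_intersection_area)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b  # box_b is unpacked again on every call
    width = min(ax2, bx2) - max(ax1, bx1)
    height = min(ay2, by2) - max(ay1, by1)
    return width * height if width > 0 and height > 0 else 0.0


def aggregated_iou_reference(box1s: Sequence[Box], box2: Box) -> float:
    """Helper-based version: two function calls per box inside the loop."""
    intersection = 0.0
    sum_areas = bbox_area(box2)
    for box1 in box1s:
        intersection += intersection_area(box1, box2)
        sum_areas += bbox_area(box1)
    union = sum_areas - intersection
    return intersection / union if union > 0 else 0.0


def aggregated_iou_inlined(box1s: Sequence[Box], box2: Box) -> float:
    """Same arithmetic with the helpers inlined and box2 unpacked once."""
    bx1, by1, bx2, by2 = box2  # hoisted out of the loop
    intersection = 0.0
    sum_areas = (bx2 - bx1) * (by2 - by1)
    for i in range(len(box1s)):
        # Element-wise access instead of handing a whole row to a helper.
        ax1, ay1, ax2, ay2 = box1s[i][0], box1s[i][1], box1s[i][2], box1s[i][3]
        sum_areas += (ax2 - ax1) * (ay2 - ay1)
        width = min(ax2, bx2) - max(ax1, bx1)
        height = min(ay2, by2) - max(ay1, by1)
        if width > 0 and height > 0:
            intersection += width * height
    union = sum_areas - intersection
    return intersection / union if union > 0 else 0.0


# Two text boxes that tile a layout region exactly.
text_boxes = [(0.0, 0.0, 1.0, 1.0), (1.0, 0.0, 2.0, 1.0)]
region = (0.0, 0.0, 2.0, 1.0)
assert aggregated_iou_reference(text_boxes, region) == aggregated_iou_inlined(text_boxes, region)
```

Both functions perform the same arithmetic, so they should agree on any input; the only difference is how many Python-level calls and tuple unpackings happen per iteration. Timing the two with the standard library's timeit module on a few thousand boxes is one way to check whether the inlined form pays off on a given machine.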
To test or edit this optimization locally
git merge codeflash/optimize-pr4169-2026-01-07T19.01.25