Is this a new feature, an improvement, or a change to existing functionality?
New Feature
How would you describe the priority of this feature request
Significant improvement
Please provide a clear description of problem this feature solves
Description
During the extraction phase for PDF, PPTX, and DOCX files, we often find several small images that together make up a single compound image. Currently, we treat each of these images independently and return each one as its own primitive in the extraction results.
We want to improve this behavior and make it configurable, so that it is possible to preserve the existing behavior, or to instruct the extraction phase to attempt to identify a single bounding box for a set of connected images.
Note
We do not know a priori whether a collection of images is part of a single compound image.
There may be multiple distinct collections on a page.
Describe the feature, and optionally a solution or implementation and any alternatives
This issue aims to define an approach to:
Identify clusters of bounding boxes that belong to the same compound image.
Compute the overall bounding box for each identified compound image.
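The second goal reduces to a min/max over the corners of the grouped boxes. A minimal sketch, assuming boxes are `(x0, y0, x1, y1)` tuples with `x0 <= x1` and `y0 <= y1`; the `Box` alias and the `union_bbox` name are hypothetical and not part of the existing codebase:

```python
from typing import Iterable, Tuple

# Assumed convention: (x0, y0, x1, y1) with x0 <= x1 and y0 <= y1.
Box = Tuple[float, float, float, float]


def union_bbox(boxes: Iterable[Box]) -> Box:
    """Return the smallest axis-aligned box that contains every input box."""
    boxes = list(boxes)
    if not boxes:
        raise ValueError("cannot compute the union of zero boxes")
    x0s, y0s, x1s, y1s = zip(*boxes)
    return (min(x0s), min(y0s), max(x1s), max(y1s))
```

For example, `union_bbox([(0, 0, 10, 10), (12, 2, 20, 8)])` returns `(0, 0, 20, 10)`.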
Required Behavior
Group bounding boxes that belong to the same compound image.
Handle scenarios where bounding boxes may be close but not necessarily overlapping.
Compute a minimal bounding box that encapsulates all grouped bounding boxes.
Support configurable proximity thresholds for clustering.
Ensure that detected table/chart images are not included in connected component bounding boxes.
Output detected compound image bounding boxes.
Provide visualization for debugging grouped bounding boxes.
Perform efficiently on large collections of bounding boxes.
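One possible shape for the configuration and the "close but not overlapping" test is sketched below. Everything here is an assumption made for illustration: the `CompoundImageConfig` class, its field names and defaults, and the choice to call two boxes connected when their gap along each axis is at most `max_gap`.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

Box = Tuple[float, float, float, float]  # assumed (x0, y0, x1, y1)


@dataclass(frozen=True)
class CompoundImageConfig:
    """Hypothetical options; names and defaults are illustrative only."""
    enabled: bool = False  # False keeps the existing per-image behavior
    max_gap: float = 5.0   # proximity threshold, in page units
    excluded_types: FrozenSet[str] = frozenset({"table", "chart"})


def axis_gap(a0: float, a1: float, b0: float, b1: float) -> float:
    """Distance between intervals [a0, a1] and [b0, b1]; 0 if they overlap."""
    return max(0.0, max(a0, b0) - min(a1, b1))


def are_connected(a: Box, b: Box, max_gap: float) -> bool:
    """True when the boxes overlap or sit within max_gap on both axes."""
    dx = axis_gap(a[0], a[2], b[0], b[2])
    dy = axis_gap(a[1], a[3], b[1], b[3])
    return dx <= max_gap and dy <= max_gap
```

With the default threshold, `are_connected((0, 0, 10, 10), (12, 0, 20, 10), 5.0)` is true (horizontal gap of 2), while a box starting at `x = 30` is not connected. Setting `max_gap` to 0 restricts grouping to boxes that overlap or touch.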
Example Approach: Bounding Box Expansion
Steps:
Initialization:
Identify all bounding boxes on a page.
Exclude table/chart bounding boxes from processing.
Expansion:
Expand each bounding box by a configurable margin.
Merge overlapping or adjacent bounding boxes iteratively.
Refinement:
Apply post-processing to fine-tune the final bounding box.
Remove potential over-grouping by applying size and aspect ratio constraints.
Output:
Store and visualize final compound bounding boxes.
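The steps above can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: elements are modeled as hypothetical `(kind, box)` pairs, boxes are `(x0, y0, x1, y1)` tuples, and the iterative merge is replaced by a union-find over pairwise overlaps of the expanded boxes, which yields the connected components in a single pass. The refinement step (size and aspect ratio constraints) is left out.

```python
from typing import Dict, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed (x0, y0, x1, y1)


def expand(box: Box, margin: float) -> Box:
    """Grow a box by `margin` on every side."""
    x0, y0, x1, y1 = box
    return (x0 - margin, y0 - margin, x1 + margin, y1 + margin)


def overlaps(a: Box, b: Box) -> bool:
    """True if the boxes intersect; touching edges count as overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def group_boxes(boxes: Sequence[Box], margin: float) -> List[List[int]]:
    """Return groups of indices whose expanded boxes are transitively linked."""
    parent = list(range(len(boxes)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    expanded = [expand(b, margin) for b in boxes]
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if overlaps(expanded[i], expanded[j]):
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[max(ri, rj)] = min(ri, rj)

    groups: Dict[int, List[int]] = {}
    for i in range(len(boxes)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def compound_boxes(
    elements: Sequence[Tuple[str, Box]],
    margin: float,
    excluded: Tuple[str, ...] = ("table", "chart"),
) -> List[Box]:
    """Drop excluded kinds, group the rest, and return one box per group."""
    boxes = [box for kind, box in elements if kind not in excluded]
    result = []
    for group in group_boxes(boxes, margin):
        x0s, y0s, x1s, y1s = zip(*(boxes[i] for i in group))
        result.append((min(x0s), min(y0s), max(x1s), max(y1s)))
    return result
```

Note that the returned boxes are the union of the original (unexpanded) boxes, so the margin affects grouping but not the final extent. Expanding by `margin` on each side is equivalent to linking boxes whose gap is at most `2 * margin` on both axes. The pairwise loop is quadratic in the number of boxes.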
Example Scenarios
Scenario 1: Adjacent Boxes Forming a Compound Image
graph TD
subgraph Compound Image 1
A[Box 1] -->|Close to| B[Box 2]
B -->|Close to| C[Box 3]
end
subgraph Compound Image 2
D[Box 4] -- No Connection --> E[Box 5]
end
Expected output:
Compound Image 1: { Box 1, Box 2, Box 3 }
Compound Image 2: { Box 4, Box 5 }
Scenario 2: Distant Boxes Forming Separate Images
graph TD
A[Box 1] -- Far --> B[Box 2]
subgraph Compound Image 1
C[Box 3] -->|Close to| D[Box 4]
end
Expected output:
Compound Image 1: { Box 3, Box 4 }
Compound Image 2: { Box 1 }
Compound Image 3: { Box 2 }
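To make Scenario 2 concrete, the sketch below assigns hypothetical coordinates to the four boxes and clusters them with a simple flood fill over a gap threshold. The coordinates, the `gap` and `cluster` helpers, and the `max_gap=10` value are all assumptions chosen for illustration; the issue does not prescribe them.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # assumed (x0, y0, x1, y1)

# Hypothetical coordinates chosen to mirror Scenario 2.
boxes: Dict[str, Box] = {
    "Box 1": (0, 0, 50, 50),
    "Box 2": (400, 0, 450, 50),     # far from Box 1
    "Box 3": (0, 300, 50, 350),
    "Box 4": (55, 300, 105, 350),   # 5 units to the right of Box 3
}


def gap(a: Box, b: Box) -> float:
    """Largest per-axis separation between two boxes; 0 if they overlap."""
    dx = max(0.0, max(a[0], b[0]) - min(a[2], b[2]))
    dy = max(0.0, max(a[1], b[1]) - min(a[3], b[3]))
    return max(dx, dy)


def cluster(named: Dict[str, Box], max_gap: float) -> List[List[str]]:
    """Flood-fill boxes into clusters linked by gaps of at most max_gap."""
    unvisited = set(named)
    clusters = []
    for name in named:
        if name not in unvisited:
            continue
        unvisited.discard(name)
        stack, members = [name], []
        while stack:
            current = stack.pop()
            members.append(current)
            near = [o for o in unvisited if gap(named[current], named[o]) <= max_gap]
            for other in near:
                unvisited.discard(other)
                stack.append(other)
        clusters.append(sorted(members))
    # Largest clusters first, to match the numbering used in this issue.
    clusters.sort(key=lambda c: (-len(c), c))
    return clusters


clusters = cluster(boxes, max_gap=10)
# clusters == [["Box 3", "Box 4"], ["Box 1"], ["Box 2"]]
```

Each isolated box becomes its own single-member cluster, which matches the expected output above.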
Scenario 3: Overlapping Boxes
graph TD
subgraph Compound Image 1
A[Box 1] -->|Overlapping| B[Box 2]
B -->|Overlapping| C[Box 3]
end
Expected output:
Compound Image 1: { Box 1, Box 2, Box 3 }
Acceptance Criteria
Bounding boxes are correctly grouped based on proximity and overlap.
Correct minimal bounding box is computed for each detected cluster.
Performance remains efficient with increasing numbers of bounding boxes.
Clustering and visualization options are configurable.
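On the performance criterion, comparing every pair of boxes costs quadratic time. One common way to reduce that, sketched here as an assumption rather than a requirement of this issue, is to bucket boxes into a uniform grid and only compare boxes that share a cell. The `close_pairs` name and the `cell_size` parameter are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from math import floor
from typing import Dict, List, Sequence, Set, Tuple

Box = Tuple[float, float, float, float]  # assumed (x0, y0, x1, y1)


def within_gap(a: Box, b: Box, max_gap: float) -> bool:
    """Exact check: per-axis gap between the boxes is at most max_gap."""
    dx = max(0.0, max(a[0], b[0]) - min(a[2], b[2]))
    dy = max(0.0, max(a[1], b[1]) - min(a[3], b[3]))
    return dx <= max_gap and dy <= max_gap


def close_pairs(
    boxes: Sequence[Box], max_gap: float, cell_size: float
) -> Set[Tuple[int, int]]:
    """Return index pairs (i, j), i < j, whose boxes are within max_gap."""
    if cell_size <= 0:
        raise ValueError("cell_size must be positive")
    grid: Dict[Tuple[int, int], List[int]] = defaultdict(list)
    for idx, (x0, y0, x1, y1) in enumerate(boxes):
        # Register the box, padded by max_gap, in every cell it covers.
        cx0, cx1 = floor((x0 - max_gap) / cell_size), floor((x1 + max_gap) / cell_size)
        cy0, cy1 = floor((y0 - max_gap) / cell_size), floor((y1 + max_gap) / cell_size)
        for cx in range(cx0, cx1 + 1):
            for cy in range(cy0, cy1 + 1):
                grid[(cx, cy)].append(idx)

    pairs: Set[Tuple[int, int]] = set()
    for members in grid.values():
        # Indices were appended in increasing order, so i < j here.
        for i, j in combinations(members, 2):
            if (i, j) not in pairs and within_gap(boxes[i], boxes[j], max_gap):
                pairs.add((i, j))
    return pairs
```

The grid only prunes candidates; the exact `within_gap` check still decides each pair, so the result matches a brute-force comparison. The benefit depends on `cell_size` relative to typical box sizes: very large boxes land in many cells, and pages with dense clusters still approach quadratic cost.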
Additional context