[FEA]: Add ability to detect bounding boxes for compound images during extraction #353

Open
12 tasks
drobison00 opened this issue Jan 20, 2025 · 0 comments
Assignees
Labels
feature request New feature or request

Comments

Collaborator

drobison00 commented Jan 20, 2025

Is this a new feature, an improvement, or a change to existing functionality?

New Feature

How would you describe the priority of this feature request

Significant improvement

Please provide a clear description of problem this feature solves

Description

During the extraction phase for PDF, PPTX, and DOCX files, we often encounter a number of small images that collectively make up a single compound image. Currently, we treat each of these images independently and return each one as its own primitive in the extraction results.

We want to improve this behavior and make it configurable, so that it is possible to preserve the existing behavior, or to instruct the extraction phase to attempt to identify a single bounding box for a set of connected images.

Note

  • We do not know a priori whether a collection of images forms a single compound image.
  • There may be multiple distinct collections on a page.

Describe the feature, and optionally a solution or implementation and any alternatives

This issue aims to define an approach to:

  1. Identify clusters of bounding boxes that belong to the same compound image.
  2. Compute the overall bounding box for each identified compound image.
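
As a minimal sketch of step 2, assuming boxes are axis-aligned and stored as (x0, y0, x1, y1) tuples (a convention chosen here for illustration, not taken from the extraction code), the overall bounding box of a cluster is the min/max over its members. The name `enclosing_box` is hypothetical.

```python
from typing import Iterable, Tuple

# Assumed convention: (x0, y0, x1, y1) with x0 <= x1 and y0 <= y1.
Box = Tuple[float, float, float, float]


def enclosing_box(boxes: Iterable[Box]) -> Box:
    """Return the smallest axis-aligned box that contains every input box."""
    boxes = list(boxes)
    if not boxes:
        raise ValueError("cannot enclose an empty cluster")
    x0 = min(b[0] for b in boxes)
    y0 = min(b[1] for b in boxes)
    x1 = max(b[2] for b in boxes)
    y1 = max(b[3] for b in boxes)
    return (x0, y0, x1, y1)


if __name__ == "__main__":
    cluster = [(0, 0, 10, 10), (12, 2, 20, 8)]
    print(enclosing_box(cluster))
```

Raising on an empty cluster is a design choice in this sketch; a real implementation might prefer to skip empty clusters instead.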

Required Behavior

  • Group bounding boxes that belong to the same compound image.
  • Handle scenarios where bounding boxes may be close but not necessarily overlapping.
  • Compute a minimal bounding box that encapsulates all grouped bounding boxes.
  • Support configurable proximity thresholds for clustering.
  • Ensure that detected table/chart images are not included in connected component bounding boxes.
  • Output detected compound image bounding boxes.
  • Provide visualization for debugging grouped bounding boxes.
  • Perform efficiently on large collections of bounding boxes.
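
One possible way to handle boxes that are close but not overlapping is to measure the gap between two axis-aligned boxes and compare it against a configurable threshold. The sketch below assumes (x0, y0, x1, y1) tuples and Euclidean gap distance; `box_gap`, `is_close`, and `threshold` are hypothetical names, and other distance definitions (for example per-axis gaps) would be equally valid.

```python
import math
from typing import Tuple

# Assumed convention: (x0, y0, x1, y1) with x0 <= x1 and y0 <= y1.
Box = Tuple[float, float, float, float]


def box_gap(a: Box, b: Box) -> float:
    """Euclidean distance between the nearest edges of two boxes.

    Returns 0.0 when the boxes overlap or touch.
    """
    # Horizontal and vertical separation; zero if the projections overlap.
    dx = max(a[0] - b[2], b[0] - a[2], 0.0)
    dy = max(a[1] - b[3], b[1] - a[3], 0.0)
    return math.hypot(dx, dy)


def is_close(a: Box, b: Box, threshold: float) -> bool:
    """True if the gap between the boxes is within the proximity threshold."""
    return box_gap(a, b) <= threshold


if __name__ == "__main__":
    left = (0, 0, 10, 10)
    right = (13, 0, 20, 10)
    print(box_gap(left, right), is_close(left, right, threshold=5.0))
```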

Example Approach: Bounding Box Expansion

Steps:

  1. Initialization:

    • Identify all bounding boxes on a page.
    • Exclude table/chart bounding boxes from processing.
  2. Expansion:

    • Expand each bounding box by a configurable margin.
    • Merge overlapping or adjacent bounding boxes iteratively.
  3. Refinement:

    • Apply post-processing to fine-tune the final bounding box.
    • Remove potential over-grouping by applying size and aspect ratio constraints.
  4. Output:

    • Store and visualize final compound bounding boxes.
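
The expansion and merge steps above could be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: boxes are (x0, y0, x1, y1) tuples, table/chart boxes have already been removed by the caller, and grouping uses union-find over pairwise overlaps of the expanded boxes in a single pass rather than re-testing merged boxes iteratively. The names `group_boxes` and `margin` are hypothetical, and the refinement step is omitted.

```python
from typing import Dict, List, Tuple

# Assumed convention: (x0, y0, x1, y1) with x0 <= x1 and y0 <= y1.
Box = Tuple[float, float, float, float]


def expand(box: Box, margin: float) -> Box:
    """Grow a box by `margin` on every side."""
    x0, y0, x1, y1 = box
    return (x0 - margin, y0 - margin, x1 + margin, y1 + margin)


def overlaps(a: Box, b: Box) -> bool:
    """True if two boxes intersect; touching edges count as adjacent."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def group_boxes(boxes: List[Box], margin: float = 5.0) -> List[Tuple[List[int], Box]]:
    """Group boxes whose expanded versions overlap.

    Returns (member indices, enclosing box of the original members) per group,
    ordered by the smallest member index.
    """
    expanded = [expand(b, margin) for b in boxes]
    parent = list(range(len(boxes)))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Pairwise comparison is O(n^2); a spatial index would be needed at scale.
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if overlaps(expanded[i], expanded[j]):
                parent[find(i)] = find(j)

    clusters: Dict[int, List[int]] = {}
    for i in range(len(boxes)):
        clusters.setdefault(find(i), []).append(i)

    results = []
    for members in clusters.values():
        # The enclosing box uses the original (unexpanded) coordinates.
        x0 = min(boxes[m][0] for m in members)
        y0 = min(boxes[m][1] for m in members)
        x1 = max(boxes[m][2] for m in members)
        y1 = max(boxes[m][3] for m in members)
        results.append((members, (x0, y0, x1, y1)))
    return sorted(results, key=lambda r: r[0][0])


if __name__ == "__main__":
    page = [(0, 0, 10, 10), (12, 0, 22, 10), (24, 0, 34, 10), (100, 100, 110, 110)]
    for members, bbox in group_boxes(page, margin=2.0):
        print(members, bbox)
```

Because each box is expanded by `margin`, two boxes are grouped when the gap between them is at most twice the margin on both axes. Grouping is transitive, so a chain of nearby boxes ends up in one compound image even if its endpoints are far apart, which is where the size and aspect ratio constraints from the refinement step would come in.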

Example Scenarios

Scenario 1: Adjacent Boxes Forming a Compound Image

graph TD
    subgraph Compound Image 1
        A[Box 1] -->|Close to| B[Box 2]
        B -->|Close to| C[Box 3]
    end
    subgraph Compound Image 2
        D[Box 4] -- No Connection --> E[Box 5]
    end

Expected output:

  • Compound Image 1: { Box 1, Box 2, Box 3 }
  • Compound Image 2: { Box 4, Box 5 }

Scenario 2: Distant Boxes Forming Separate Images

graph TD
    A[Box 1] -- Far --> B[Box 2]
    subgraph Compound Image 1
        C[Box 3] -->|Close to| D[Box 4]
    end

Expected output:

  • Compound Image 1: { Box 3, Box 4 }
  • Compound Image 2: { Box 1 }
  • Compound Image 3: { Box 2 }

Scenario 3: Overlapping Boxes

graph TD
    subgraph Compound Image 1
        A[Box 1] -->|Overlapping| B[Box 2]
        B -->|Overlapping| C[Box 3]
    end

Expected output:

  • Compound Image 1: { Box 1, Box 2, Box 3 }

Acceptance Criteria

  • Bounding boxes are correctly grouped based on proximity and overlap.
  • Correct minimal bounding box is computed for each detected cluster.
  • Performance remains efficient with increasing numbers of bounding boxes.
  • Clustering and visualization options are configurable.

Additional context
