Images do not get into the chunks #1094

Open
iLeonidze opened this issue Mar 2, 2025 · 1 comment
Labels
bug Something isn't working

Comments

@iLeonidze

Bug

Hierarchical Chunker does not provide any option to enable image traversal in the document iterator

Steps to reproduce

  1. Parse a document that contains images, with OCR enabled
  2. Try to create chunks with the hybrid chunker
    from docling.chunking import HybridChunker

    chunker = HybridChunker(
        merge_peers=True
    )
    chunks = chunker.chunk(doc)  # doc: the document parsed in step 1

ER: image (OCR) text is present in the chunks
AR: no image text appears in any chunk
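
To make the ER/AR check repeatable, a small helper along these lines could be used. This is only a sketch: `chunks_containing` is a hypothetical name, and it assumes each chunk exposes its content as a `.text` attribute. The sample chunks below are stand-ins so the snippet runs without docling installed.

```python
from types import SimpleNamespace
from typing import Iterable, List


def chunks_containing(chunks: Iterable, needle: str) -> List[int]:
    """Return indices of chunks whose .text contains needle (case-insensitive)."""
    needle = needle.lower()
    return [
        i
        for i, chunk in enumerate(chunks)
        if needle in getattr(chunk, "text", "").lower()
    ]


# Stand-in chunks for illustration only; these are not docling objects.
sample = [
    SimpleNamespace(text="Intro paragraph"),
    SimpleNamespace(text="Figure: TOTAL 42"),
]
hits = chunks_containing(sample, "total 42")
```

With a real document, the result of `chunker.chunk(doc)` would be passed in place of `sample`, and `needle` would be a string known to appear only inside an image. An empty result corresponds to the AR above.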

The root cause of this issue is in docling_core.transforms.chunker.hierarchical_chunker.HierarchicalChunker.chunk:

for item, level in dl_doc.iterate_items():

By default, iterate_items() does not traverse images; traversal has to be enabled explicitly, but the chunker offers no way to pass that setting through to iterate_items().
Editing this line as follows:

for item, level in dl_doc.iterate_items(traverse_pictures=True):

solves the issue, and the images' OCR text is present in the chunks.
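
The behavior described above, and the kind of pass-through option being requested, can be illustrated with a toy model. This is not docling code: `Node`, the standalone `iterate_items` function, and `ToyChunker` are all hypothetical stand-ins that only mirror the behavior reported in this issue (children of a picture are skipped unless `traverse_pictures=True`).

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Node:
    """Toy stand-in for a document item (not a docling class)."""
    label: str  # e.g. "body", "text" or "picture"
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def iterate_items(
    root: Node, traverse_pictures: bool = False, level: int = 0
) -> Iterator[Tuple[Node, int]]:
    """Depth-first walk that skips a picture's children unless asked to visit them."""
    yield root, level
    if root.label == "picture" and not traverse_pictures:
        return
    for child in root.children:
        yield from iterate_items(child, traverse_pictures, level + 1)


class ToyChunker:
    """Hypothetical chunker that forwards the flag to the iterator."""

    def __init__(self, traverse_pictures: bool = False):
        self.traverse_pictures = traverse_pictures

    def chunk(self, root: Node) -> List[str]:
        # One "chunk" per non-empty text item, to keep the example small.
        return [
            item.text
            for item, _level in iterate_items(root, self.traverse_pictures)
            if item.label == "text" and item.text
        ]


doc = Node("body", children=[
    Node("text", "Intro paragraph"),
    Node("picture", children=[Node("text", "OCR text inside image")]),
])

default_chunks = ToyChunker().chunk(doc)
picture_chunks = ToyChunker(traverse_pictures=True).chunk(doc)
```

In this toy model, `default_chunks` contains only the intro paragraph, while `picture_chunks` also contains the text nested under the picture. Keeping the flag off by default would preserve current behavior for existing users while still letting callers opt in.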

Please provide a chunker option that is passed through to iterate_items(), or enable image traversal by default.

Docling version

Docling version: 2.25.0
Docling Core version: 2.21.1
Docling IBM Models version: 3.4.1
Docling Parse version: 3.3.1
Python: cpython-312 (3.12.9)
Platform: Windows-11-10.0.22621-SP0

Python version

Python 3.12.9
@iLeonidze iLeonidze added the bug Something isn't working label Mar 2, 2025
@tgillam

tgillam commented Mar 7, 2025

I also would like to see this.
