Docling Produces Unreadable Text Output for PDFs #960
Comments
TLDR
Hello! We had exactly the same problem, and I can't say that I understand what goes wrong on the library side. It all started when we noticed some incorrect content in our docling documents, which broke the main flow in our app. After some tests we realised that the issue does not always reproduce, depending on the hardware. |
@Fan4ik20 Thanks for the suggestion. I tried it but the results were the same. I also tried the latest version. In short, I reproduced the problem on my end also with the following versions
and
|
@josk0 This might be a problem with the docling-parse. I will investigate. PS: for some reason, when I click on your link, I am not able to download the files. Would be of great help if you could just upload them straight into the issue. |
@josk0 No need to apologize, thanks so much for the issue and examples so we can fix these issues! |
@josk0 First observations
I think this is resolved in this PR (DS4SD/docling-parse#101). If you run,
you get the following output,
|
Thanks. I haven't used poetry before. If you'd prefer me to confirm that I get this output on my end as well, let me know. Happy to figure it out. On two.pdf:
|
The more the better, but for now, let me see if there is an "easy" fix. I will be out next week, but feel free to bug my colleagues for updates on when this PR (DS4SD/docling-parse#101) will be merged (at least it solves 1 problem). |
Even with that PR, I still have problems with the one.pdf. I tried it again with the latest version (your PR was merged for the latest docling-parse release) and the error persists. Docling Version
As before, the abstract extracts well, but then comes a wall of GLYPHS. Fun fact: when I use the macOS Preview app to remove pages from one.pdf, the resulting "shortened" version converts OK, on both the old and the new version of docling(-parse)... About more PDFs for testing |
I tried it and it seems to work well with the PyPdfiumDocumentBackend. Did you try it with that?
from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
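For anyone who wants to try the same thing, here is a minimal sketch of wiring that backend into a converter. The helper names `build_pdfium_converter` and `convert_to_markdown` are made up for illustration, and the sketch assumes a docling 2.x install where `PdfFormatOption` accepts a `backend` argument; treat it as a starting point, not as the maintainers' recommended setup.

```python
def build_pdfium_converter():
    """Return a DocumentConverter that parses PDFs with the pypdfium2 backend."""
    # Imported lazily so this module still loads where docling is not installed.
    from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
    from docling.datamodel.base_models import InputFormat
    from docling.document_converter import DocumentConverter, PdfFormatOption

    return DocumentConverter(
        format_options={
            # Only the PDF backend is swapped; everything else stays at defaults.
            InputFormat.PDF: PdfFormatOption(backend=PyPdfiumDocumentBackend)
        }
    )


def convert_to_markdown(pdf_path: str) -> str:
    """Convert one PDF (hypothetical helper) and export the result as markdown."""
    result = build_pdfium_converter().convert(pdf_path)
    return result.document.export_to_markdown()
```

Calling `convert_to_markdown("one.pdf")` should then give the markdown produced by the pypdfium2 path, which can be compared against the default backend's output.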
|
Thanks. Yes with the PyPdfiumDocumentBackend some files, including the one.pdf from the sample, are converted correctly! Others are still problematic. The output is different but similarly unreadable. |
Ok. I am also working on a platform and planning to use docling as the backend for document parsing. Wondering how such variety of files and content can be handled in the best way possible. |
How long does it take you to convert and export to markdown? Do you have any strategy? On my system it takes around 2 minutes to convert a 9-page PDF, while other OCR packages take only a few milliseconds to do this. |
@vishaldasnewtide You can find some reference numbers here https://arxiv.org/pdf/2501.17887. We also break down the different (optional) steps in the pipeline. For example, when it is not needed, we suggest deactivating OCR, since it often costs a 3x factor in runtime.
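A rough sketch of what deactivating OCR and measuring the effect could look like. The helpers `build_converter_without_ocr` and `timed` are illustrative names, not part of docling; the sketch assumes a docling 2.x install where `PdfPipelineOptions` exposes `do_ocr`. Actual timings depend on hardware and on the document.

```python
import time


def build_converter_without_ocr():
    """Return a DocumentConverter whose PDF pipeline skips the OCR step."""
    # Imported lazily so this module still loads where docling is not installed.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = False  # rely on the embedded text layer only
    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```

For example, `result, seconds = timed(build_converter_without_ocr().convert, "paper.pdf")` gives a wall-clock number to compare against a run with OCR enabled (`paper.pdf` is a placeholder path).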
@josk0 I confirm "Docling Parse version: 3.4.0" is the latest one with quite some fixes. If this is not yet solving your issues, we will try to address in the next round. Feel free to provide more problematic docs.
@nikhildigde just making you aware of the docling-serve project, where we are aggregating the multiple approaches exposing Docling as a service. It might be useful for your use case. More features like async processing are coming soon. |
@dolfim-ibm yes we are already using docling-serve as a http layer for docling. |
@dolfim-ibm Thanks for the update. Currently, I am getting some pydantic errors for CSV processing. I have installed the latest version of docling which is V2.23.
|
Note: issue here is similar to #185
Bug
I am trying to convert several PDFs of academic papers, books, etc.
For some PDFs, docling produces gibberish when converting them to markdown. You can find two samples here
Short of the conversion working successfully, is there a way to identify PDFs that are problematic? This would allow me to skip them, set them aside, or run them with OcrOptions.force_full_page_ocr if that helps.
Steps to reproduce
Using the example code from GitHub
for one.pdf the output looks like this
GLYPH<28>GLYPH<27>GLYPH<26> GLYPH<25>GLYPH<24>GLYPH<28>GLYPH<23>GLYPH<22>GLYPH<21>GLYPH<20> GLYPH<25>GLYPH<19>GLYPH<20>
for two.pdf the output looks like this
2-8[ 5O@QQ[=LLGQ[J<Z[=@[MTO>D<Q@?[<R[QM@>F<H[NT<KRERZ[?FQ>LTKRQ[BLP[=TQEK@QQ[LO[Q<I@Q[ MOLJLRELK<I[ TQ@glyph<c=19,font=/AAAAAH+Fd3270>[ *LP[ FKBLOJ<RELKglyph<c=9,font=/AAAAAH+Fd3270>[ MH@<Q@[ @J<EH[ QM@>E<I;Q<H@Q%JERMO@QQglyph<c=21,font=/AAAAAH+Fd3270>JERglyph<c=19,font=/AAAAAH+Fd3270>@?T[ LO[ XOER@[ RL[ 7M@>F<H[ 7<H@Q[ )@M<PRJ@KRglyph<c=9,font=/AAAAAH+Fd3270>[ 8D@[ 2-8[ 5O@QQglyph<c=9,font=/AAAAAH+Fd3270>[ glyph<c=29,font=/AAAAAH+Fd3270>glyph<c=29,font=/AAAAAH+Fd3270>[ ,<ZX<O?[ 7RO@@Rglyph<c=8,font=/AAAAAH+Fd3270>[ (<J=PF?C@glyph<c=10,font=/AAAAAH+Fd3270>[ 2&[glyph<c=23,font=/AAAAAH+Fd3270>glyph<c=26,font=/AAAAAH+Fd3270>glyph<c=24,font=/AAAAAH+Fd3270>glyph<c=28,font=/AAAAAH+Fd3270>glyph<c=26,font=/AAAAAH+Fd3270>glyph<c=22,font=/AAAAAH+Fd3270>[
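One possible way to flag such output after conversion is to measure how much of the text consists of glyph placeholder tokens like the ones in the two samples above. This is only a heuristic sketch: the function names and the 5% threshold are assumptions chosen for illustration, and it will not catch garbling that contains no `GLYPH<...>` or `glyph<...>` tokens at all.

```python
import re
from typing import Callable, Iterable, Optional, Tuple

# Matches placeholders such as "GLYPH<28>" and "glyph<c=19,font=/AAAAAH+Fd3270>".
GLYPH_TOKEN = re.compile(r"glyph<[^>]*>", re.IGNORECASE)


def glyph_ratio(text: str) -> float:
    """Return the fraction of characters that belong to glyph placeholder tokens."""
    if not text:
        return 0.0
    glyph_chars = sum(len(m.group(0)) for m in GLYPH_TOKEN.finditer(text))
    return glyph_chars / len(text)


def looks_garbled(text: str, threshold: float = 0.05) -> bool:
    """Flag text whose glyph-token share reaches the (assumed) threshold."""
    return glyph_ratio(text) >= threshold


def first_clean_result(
    pdf_path: str,
    converters: Iterable[Tuple[str, Callable[[str], str]]],
    threshold: float = 0.05,
) -> Optional[Tuple[str, str]]:
    """Try each (name, convert) pair in order; return the first non-garbled output.

    Each convert callable takes a path and returns markdown. Returns None when
    every attempt looks garbled, so the file can be set aside.
    """
    for name, convert in converters:
        markdown = convert(pdf_path)
        if not looks_garbled(markdown, threshold):
            return name, markdown
    return None
```

The `converters` list could hold, in order, the default conversion, a run with a different backend, and a run with full-page OCR forced, so that the slower options are only used for files that fail the check.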
Docling version
Docling version: 2.20.0
Docling Core version: 2.17.2
Docling IBM Models version: 3.3.1
Docling Parse version: 3.3.0
Python: cpython-313 (3.13.1)
Platform: macOS-15.3.1-arm64-arm-64bit-Mach-O
Python version
Python 3.13.1