Docling Produces Unreadable Text Output for PDFs #960

Open
josk0 opened this issue Feb 13, 2025 · 16 comments
Labels: bug (Something isn't working), pdf parsing (PDF issue related to docling-parse)

Comments

@josk0

josk0 commented Feb 13, 2025

Note: issue here is similar to #185

Bug

I am trying to convert several PDFs of academic papers, books, etc.
For some PDFs, docling produces gibberish when converting them to markdown. You can find two samples here

Short of the conversion working successfully, is there a way to identify problematic PDFs? This would allow me to skip them, set them aside, or set OcrOptions.force_full_page_ocr if that helps.

Steps to reproduce

Using the example code from GitHub

from docling.document_converter import DocumentConverter

source = "one.pdf"  # document per local path or URL
converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown())

For one.pdf, the output looks like this:
GLYPH<28>GLYPH<27>GLYPH<26> GLYPH<25>GLYPH<24>GLYPH<28>GLYPH<23>GLYPH<22>GLYPH<21>GLYPH<20> GLYPH<25>GLYPH<19>GLYPH<20>

For two.pdf, the output looks like this:
2-8[ 5O@QQ[=LLGQ[J<Z[=@[MTO>D<Q@?[<R[QM@>F<H[NT<KRERZ[?FQ>LTKRQ[BLP[=TQEK@QQ[LO[Q<I@Q[ MOLJLRELK<I[ TQ@glyph<c=19,font=/AAAAAH+Fd3270>[ *LP[ FKBLOJ<RELKglyph<c=9,font=/AAAAAH+Fd3270>[ MH@<Q@[ @J<EH[ QM@>E<I;Q<H@Q%JERMO@QQglyph<c=21,font=/AAAAAH+Fd3270>JERglyph<c=19,font=/AAAAAH+Fd3270>@?T[ LO[ XOER@[ RL[ 7M@>F<H[ 7<H@Q[ )@M<PRJ@KRglyph<c=9,font=/AAAAAH+Fd3270>[ 8D@[ 2-8[ 5O@QQglyph<c=9,font=/AAAAAH+Fd3270>[ glyph<c=29,font=/AAAAAH+Fd3270>glyph<c=29,font=/AAAAAH+Fd3270>[ ,<ZX<O?[ 7RO@@Rglyph<c=8,font=/AAAAAH+Fd3270>[ (<J=PF?C@glyph<c=10,font=/AAAAAH+Fd3270>[ 2&[glyph<c=23,font=/AAAAAH+Fd3270>glyph<c=26,font=/AAAAAH+Fd3270>glyph<c=24,font=/AAAAAH+Fd3270>glyph<c=28,font=/AAAAAH+Fd3270>glyph<c=26,font=/AAAAAH+Fd3270>glyph<c=22,font=/AAAAAH+Fd3270>[
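Both sample outputs contain literal glyph placeholders (GLYPH<28> in one.pdf, glyph<c=19,font=...> in two.pdf), which suggests a simple way to flag affected documents after conversion. The following is a minimal sketch, not part of docling: the names garbled_ratio and looks_garbled and the 5% threshold are assumptions chosen for illustration, and the patterns are taken only from the two samples above.

```python
import re

# Patterns copied from the two sample outputs in this issue. Other failure
# modes may look different, so treat this as a heuristic, not a guarantee.
GLYPH_MARKER = re.compile(r"GLYPH<\d+>|glyph<c=\d+,font=[^>]*>")


def garbled_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that belong to glyph markers."""
    visible = sum(1 for ch in text if not ch.isspace())
    if visible == 0:
        return 0.0
    marked = sum(len(m.group(0)) for m in GLYPH_MARKER.finditer(text))
    return marked / visible


def looks_garbled(text: str, threshold: float = 0.05) -> bool:
    """Flag text whose glyph-marker share reaches the (assumed) threshold."""
    return garbled_ratio(text) >= threshold
```

Applied to the string returned by export_to_markdown(), this would let a batch job set flagged files aside. It only catches output that contains the placeholder tokens; text that is wrongly decoded without any placeholder would need a different check.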

Docling version

Docling version: 2.20.0
Docling Core version: 2.17.2
Docling IBM Models version: 3.3.1
Docling Parse version: 3.3.0
Python: cpython-313 (3.13.1)
Platform: macOS-15.3.1-arm64-arm-64bit-Mach-O

Python version

Python 3.13.1

@josk0 josk0 added the bug Something isn't working label Feb 13, 2025
@Fan4ik20

Fan4ik20 commented Feb 13, 2025

@josk0

TL;DR
Try installing the following docling dependencies:

docling==2.16.0
docling-core==2.15.1
docling-ibm-models==3.3.0
docling-parse==3.1.2

Hello! We had exactly the same problem, though I can't say that I understood what is going wrong on the library side.

It all started when we noticed some incorrect content in our docling documents, which broke the main flow in our app. After some tests we realised that the issue does not always reproduce, depending on the hardware.
However, when testing docling-serve, I found that the service gave correct results. After several hours of testing and trying to understand what was wrong with the converter configuration, I tried downgrading the versions used in our project to those installed in docling-serve, and it helped.
So I hope this helps you resolve your problem, at least temporarily.
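When comparing environments like this, it can help to print the installed versions of the four packages and diff them against a pinned set. This is a minimal sketch using only the standard library; the helper names installed_versions and mismatches are made up for illustration, and the package names are the ones listed above.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as they appear in the pinned list above.
PACKAGES = ("docling", "docling-core", "docling-ibm-models", "docling-parse")


def installed_versions(packages=PACKAGES):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


def mismatches(installed, pinned):
    """Return {name: (installed, wanted)} for every package that differs."""
    return {
        name: (installed.get(name), wanted)
        for name, wanted in pinned.items()
        if installed.get(name) != wanted
    }
```

For example, mismatches(installed_versions(), {"docling": "2.16.0", "docling-parse": "3.1.2"}) lists the packages whose installed version differs from the pins.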

@josk0
Author

josk0 commented Feb 14, 2025

@Fan4ik20 Thanks for the suggestion. I tried it but the results were the same. I also tried the latest version. In short, I reproduced the problem on my end also with the following versions

Docling version: 2.21.0
Docling Core version: 2.18.1
Docling IBM Models version: 3.3.2
Docling Parse version: 3.3.1

and

Docling version: 2.16.0
Docling Core version: 2.15.1
Docling IBM Models version: 3.3.0
Docling Parse version: 3.1.2

@PeterStaar-IBM PeterStaar-IBM added the pdf parsing PDF issue related to docling-parse label Feb 14, 2025
@PeterStaar-IBM PeterStaar-IBM self-assigned this Feb 14, 2025
@PeterStaar-IBM
Contributor

PeterStaar-IBM commented Feb 14, 2025

@josk0 This might be a problem with docling-parse. I will investigate.

PS: for some reason, when I click on your link, I am not able to download the files. Would be of great help if you could just upload them straight into the issue.

@josk0
Author

josk0 commented Feb 14, 2025

Sorry, had no idea that's a possibility
one.pdf
two.pdf

@PeterStaar-IBM
Contributor

@josk0 No need to apologize, thanks so much for the issue and examples so we can fix these issues!

@PeterStaar-IBM
Contributor

@josk0 First observations

  1. one.pdf:

I think this is resolved in this PR (DS4SD/docling-parse#101). If you run,

poetry run python ./docling_parse/visualize.py -i /Users/taa/Downloads/one.pdf -p 1  -l error -c line --interactive --log-text

you get the following output,

(433.17, 019.59) (444.04, 019.59) (444.04, 027.06) (433.17, 027.06)      /T1_0 331
(040.15, 020.83) (108.28, 020.83) (108.28, 026.77) (040.15, 026.77)      /T1_1 Philos Phenomenol Res.
(108.28, 019.99) (168.17, 019.99) (168.17, 026.83) (108.28, 026.83)      /T1_2  2022;105:331-361.
(309.82, 019.99) (411.40, 019.99) (411.40, 026.83) (309.82, 026.83)      /T1_2 wileyonlinelibrary.com/journal/phpr
(040.54, 027.71) (110.42, 027.71) (110.42, 033.65) (040.54, 033.65)      /T1_3 Philos Phenomenol Res.
(110.42, 026.88) (152.81, 026.88) (152.81, 033.71) (110.42, 033.71)      /T1_4 2021;00:1-31.
(242.39, 026.88) (245.89, 026.88) (245.89, 033.71) (242.39, 033.71)      /T1_4
(422.53, 026.88) (426.03, 026.88) (426.03, 033.71) (422.53, 033.71)      /T1_4
(431.29, 022.98) (432.69, 022.98) (432.69, 038.02) (431.29, 038.02)      /T1_4 |
(432.69, 026.88) (434.44, 026.88) (434.44, 033.71) (432.69, 033.71)      /T1_4
(440.04, 026.60) (443.54, 026.60) (443.54, 033.81) (440.04, 033.81)      /T1_5 1
(322.19, 026.88) (423.77, 026.88) (423.77, 033.71) (322.19, 033.71)      /T1_4 wileyonlinelibrary.com/journal/phpr
(040.53, 654.79) (116.61, 654.79) (116.61, 661.62) (040.53, 661.62)      /T1_4 DOI: 10.1111/phpr.12823
(040.54, 626.86) (155.15, 626.86) (155.15, 636.12) (040.54, 636.12)      /T1_5 ORIGINAL ARTICLE
(040.54, 576.67) (263.03, 576.67) (263.03, 595.21) (040.54, 595.21)      /T1_5 Transparency is Surveillance
(040.54, 539.80) (115.26, 539.80) (115.26, 552.15) (040.54, 552.15)      /T1_5 C. Thi Nguyen
(040.54, 044.37) (200.07, 044.37) (200.07, 051.20) (040.54, 051.20)      /T1_4 © 2021 Philosophy and Phenomenological Research, Inc
(040.54, 506.52) (100.25, 506.52) (100.25, 514.34) (040.54, 514.34)      /T1_4 University of Utah
(040.54, 484.21) (096.45, 484.21) (096.45, 492.45) (040.54, 492.45)      /T1_5 Correspondence
(040.54, 473.52) (153.85, 473.52) (153.85, 481.34) (040.54, 481.34)      /T1_4 C. Thi Nguyen, University of Utah.
(040.54, 462.52) (138.08, 462.52) (138.08, 470.34) (040.54, 470.34)      /T1_4 Email: c.thi.nguyen@utah.edu
(201.16, 499.00) (238.37, 499.00) (238.37, 509.30) (201.16, 509.30)      /T1_5 Abstract
(201.16, 484.39) (228.09, 484.39) (228.09, 494.16) (201.16, 494.16)      /T1_4 In her
(228.93, 485.59) (347.83, 485.59) (347.83, 494.07) (228.93, 494.07)      /T1_3 BBC Reith Lectures on Trust
(347.83, 484.39) (441.69, 484.39) (441.69, 494.16) (347.83, 494.16)      /T1_4 ,  Onora O'Neill offers
(201.16, 469.39) (441.69, 469.39) (441.69, 479.16) (201.16, 479.16)      /T1_4 a short, but biting, criticism of transparency. People think
(201.16, 454.39) (441.71, 454.39) (441.71, 464.16) (201.16, 464.16)      /T1_4 that trust and transparency go together but in reality, says
(201.16, 439.39) (441.71, 439.39) (441.71, 449.16) (201.16, 449.16)      /T1_4 O'Neill,  they  are  deeply  opposed.  Transparency  forces
(201.16, 424.39) (439.22, 424.39) (439.22, 434.16) (201.16, 434.16)      /T1_4 people  to  conceal  their  actual  reasons  for  action  and  in-
(201.16, 409.39) (441.69, 409.39) (441.69, 419.16) (201.16, 419.16)      /T1_4 vent different  ones  for  public  consumption.  Transparency
(201.16, 394.39) (441.67, 394.39) (441.67, 404.16) (201.16, 404.16)      /T1_4 forces deception. I work out the details of her argument and
(201.16, 379.39) (441.69, 379.39) (441.69, 389.16) (201.16, 389.16)      /T1_4 worsen her conclusion. I focus on public transparency - that
(201.16, 364.39) (441.70, 364.39) (441.70, 374.16) (201.16, 374.16)      /T1_4 is, transparency to the public over expert domains. I offer
(201.16, 349.39) (361.58, 349.39) (361.58, 359.16) (201.16, 359.16)      /T1_4 two versions of the criticism. First, the
(362.16, 350.59) (439.17, 350.59) (439.17, 359.07) (362.16, 359.07)      /T1_3 epistemic intrusion
(439.17, 349.39) (441.67, 349.39) (441.67, 359.16) (439.17, 359.16)      /T1_4
(201.16, 334.39) (439.18, 334.39) (439.18, 344.16) (201.16, 344.16)      /T1_4 argument: The drive to transparency forces experts to ex-
(201.16, 319.39) (441.70, 319.39) (441.70, 329.16) (201.16, 329.16)      /T1_4 plain their reasoning to non-  experts. But expert reasons are,
(201.16, 304.39) (441.68, 304.39) (441.68, 314.16) (201.16, 314.16)      /T1_4 by  their  nature,  often  inaccessible  to  non-  experts.  So  the
(201.16, 289.39) (441.69, 289.39) (441.69, 299.16) (201.16, 299.16)      /T1_4 demand for transparency can pressure experts to act only
(201.16, 274.39) (441.69, 274.39) (441.69, 284.16) (201.16, 284.16)      /T1_4 in those ways for which they can offer public justification.
(201.16, 259.39) (251.50, 259.39) (251.50, 269.16) (201.16, 269.16)      /T1_4 Second, the
(252.68, 260.59) (320.27, 260.59) (320.27, 269.07) (252.68, 269.07)      /T1_3 intimate reasons
(320.27, 259.39) (441.70, 259.39) (441.70, 269.16) (320.27, 269.16)      /T1_4   argument: In many cases of
(201.16, 244.40) (441.69, 244.40) (441.69, 254.16) (201.16, 254.16)      /T1_4 practical  deliberation,  the  relevant  reasons  are  intimate  to
(201.16, 229.40) (441.69, 229.40) (441.69, 239.16) (201.16, 239.16)      /T1_4 a  community and not easily explicable to those who lack
(201.16, 214.40) (439.20, 214.40) (439.20, 224.16) (201.16, 224.16)      /T1_4 a particular shared background. The demand for transpar-
(201.16, 199.40) (441.69, 199.40) (441.69, 209.16) (201.16, 209.16)      /T1_4 ency, then, pressures community members to abandon the
(201.16, 184.40) (441.69, 184.40) (441.69, 194.16) (201.16, 194.16)      /T1_4 special understanding and sensitivity that arises from their
(201.16, 169.40) (441.67, 169.40) (441.67, 179.16) (201.16, 179.16)      /T1_4 particular experiences. Transparency, it turns out, is a form
(201.16, 154.40) (441.66, 154.40) (441.66, 164.16) (201.16, 164.16)      /T1_4 of surveillance. By forcing reasoning into the explicit and
(201.16, 139.40) (441.69, 139.40) (441.69, 149.16) (201.16, 149.16)      /T1_4 public  sphere,  transparency  roots  out  corruption  -  but  it
(201.16, 124.39) (439.17, 124.39) (439.17, 134.16) (201.16, 134.16)      /T1_4 also  inhibits  the  full  application  of  expert  skill,  sensitiv-
(201.16, 109.39) (441.69, 109.39) (441.69, 119.16) (201.16, 119.16)      /T1_4 ity,  and  subtle  shared  understandings.  The  difficulty  here
(201.16, 094.39) (439.19, 094.39) (439.19, 104.16) (201.16, 104.16)      /T1_4 arises from the basic fact that human knowledge vastly out-
(201.16, 079.39) (441.70, 079.39) (441.70, 089.16) (201.16, 089.16)      /T1_4 strips any individual's capacities. We all depend on experts,
  2. For two.pdf, we do indeed have a nasty problem. I hope we can resolve it soon, but my suspicion is that it comes from the built-in OCR of the scanner. I will keep you posted.

@josk0
Author

josk0 commented Feb 14, 2025

Thanks. I haven't used poetry before. If you'd prefer me to confirm that I get this output on my end as well, let me know. Happy to figure it out.

On two.pdf:

  • I may have more PDFs from the same source (and potentially other problematic PDFs). Let me know if it would help if I sampled more.
  • If there is a workaround for identifying the problematic PDFs so that I can set them aside, let me know.

@PeterStaar-IBM
Contributor

PeterStaar-IBM commented Feb 14, 2025

The more the better, but for now, let me see if there is an "easy" fix. I will be out next week, but feel free to bug my colleagues for updates on when this PR (DS4SD/docling-parse#101) will be merged (at least it solves 1 problem).

@josk0
Author

josk0 commented Feb 18, 2025

Even with that PR, I still have problems with one.pdf. I tried it again with the latest version (your PR was merged for the latest docling-parse release) and the error persists.

Docling Version

Docling version: 2.23.0
Docling Core version: 2.19.1
Docling IBM Models version: 3.3.2
Docling Parse version: 3.4.0
Python: cpython-313 (3.13.1)

As before, the abstract extracts well, but it is followed by a wall of GLYPH tokens.

Fun fact? When I use the macOS Preview app to remove pages from one.pdf, the resulting "shortened" version converts OK, on both the old and the new version of docling(-parse)...

About more PDFs for testing
I have about seven PDFs of different origins (and maybe 20-30 more that I need to look into) that all cause some kind of problem. Happy to share them directly (but because of copyright issues, would prefer not to upload them publicly)

@nikhildigde

nikhildigde commented Feb 18, 2025

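I tried it and it seems to work well with the PyPdfiumDocumentBackend. Did you try it with that?

from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions()  # defaults; adjust as needed

doc_converter = DocumentConverter(
    format_options={
        # Swap the default PDF backend for the pypdfium2-based one
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options, backend=PyPdfiumDocumentBackend)
    }
)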

@josk0
Author

josk0 commented Feb 18, 2025

Thanks. Yes, with the PyPdfiumDocumentBackend some files, including one.pdf from the sample, are converted correctly!

Others are still problematic. The output is different but similarly unreadable.
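Since different backends succeed on different files, one option is to try them in order and keep the first readable result, setting the file aside if none works. The sketch below is backend-agnostic and hypothetical: first_readable is not a docling function, each convert callable is assumed to take a path and return markdown text, and is_garbled is whatever check you choose.

```python
from typing import Callable, Iterable, Optional, Tuple


def first_readable(
    source: str,
    converters: Iterable[Tuple[str, Callable[[str], str]]],
    is_garbled: Callable[[str], bool],
) -> Tuple[Optional[str], Optional[str]]:
    """Try each (name, convert) pair in order.

    Returns (name, text) for the first non-empty, non-garbled result,
    or (None, None) if every converter fails or produces garbled text.
    """
    for name, convert in converters:
        try:
            text = convert(source)
        except Exception:
            # A failing backend should not stop the remaining attempts.
            continue
        if text.strip() and not is_garbled(text):
            return name, text
    return None, None
```

In practice each callable could wrap a configured converter, for example lambda p: converter.convert(p).document.export_to_markdown(), with the default backend first and the PyPdfiumDocumentBackend variant second. A (None, None) result marks the file for manual review.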

@nikhildigde

OK. I am also working on a platform and planning to use docling as the backend for document parsing. I am wondering how best to handle such a variety of files and content.

@vishaldasnewtide

How long does it take to convert and export to markdown? Do you have any strategy for speeding it up? On my system it takes around 2 minutes to convert a 9-page PDF, while other OCR packages take only a few milliseconds to do this.

@dolfim-ibm
Contributor

> How long does it take to convert and export to markdown? Do you have any strategy for speeding it up? On my system it takes around 2 minutes to convert a 9-page PDF, while other OCR packages take only a few milliseconds to do this.

@vishaldasnewtide You can find some reference numbers here: https://arxiv.org/pdf/2501.17887. We also break down the different (optional) steps in the pipeline. For example, when OCR is not needed, we suggest deactivating it, since it often slows conversion by about a factor of 3.
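As a rough illustration of the suggestion above, the sketch below builds a converter with OCR turned off through PdfPipelineOptions.do_ocr and times a call. The helper names build_converter and timed are made up for this example, the docling imports are deferred so the snippet loads without docling installed, and actual speedups will depend on the document and hardware.

```python
import time


def build_converter(do_ocr: bool = False):
    """Return a DocumentConverter for PDFs with OCR enabled or disabled."""
    # Imported lazily so this sketch can be loaded without docling installed.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = do_ocr  # skip the OCR stage when False
    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```

A comparison could then look like result, seconds = timed(build_converter(do_ocr=False).convert, "one.pdf"), repeated with do_ocr=True on the same file.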

> Even with that PR, I still have problems with the one.pdf. I tried it again with the latest version (your PR was merged for the latest docling-parse release) and the error persists.

@josk0 I confirm "Docling Parse version: 3.4.0" is the latest one with quite some fixes. If this is not yet solving your issues, we will try to address in the next round. Feel free to provide more problematic docs.

> Ok. I am also working on a platform and planning to use docling as the backend for document parsing. Wondering how such variety of files and content can be handled in the best way possible.

@nikhildigde just making you aware of the docling-serve project, where we are aggregating the multiple approaches exposing Docling as a service. It might be useful for your use case. More features like async processing are coming soon.

@nikhildigde

@dolfim-ibm Yes, we are already using docling-serve as an HTTP layer for docling.

@vishaldasnewtide

@dolfim-ibm Thanks for the update. Currently, I am getting pydantic errors when processing CSV files. I have installed the latest version of docling, which is v2.23.

Value error, 'text/csv' is not a valid MIME type [type=value_error, input_value='text/csv', input_type=str]
    For further information visit https://errors.pydantic.dev/2.10/v/value_error

6 participants