Double characters/strange values in table #626

jakobdo · 2022-03-16T08:15:41Z

Hello again, in this pdf: https://www.taggmbh.at/fileadmin/content/TAG-Website-Content-SM/2016_Maintenance_list_PROD_PDF.pdf
When using extract_tables on page 2, I get the following values from one of the rows:

00:'TT..0116'
01:'SStaytsiotenm w toerskts'
02:'MCSS AWrneoiteldnsdteoirnf'
03:'0210..0037..22001166'
04:'0068::0000 hh'
05:'299'
06:'0280..0057..22001166'
07:'0165::0000 hh'
08:'1289'
09:'678 h doauyrss'

etc...

In the PDF it looks like this:

Can I fix this issue somehow or is this pdf just buggy?

It seems like line 1 from page 1 and line 1 from page 2 are mixed together.

Page 1 line 1:

Page 2 line 1:

jakobdo · 2022-03-16T19:17:52Z

Some of the values can be corrected with row[0][1::2], but this does not work perfectly for all columns. So I guess this is a "buggy" PDF, but I really hope one of you PDF experts has a bulletproof solution for this issue. :)
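To illustrate the slicing idea, here is a minimal sketch that splits a doubled cell into its even-indexed and odd-indexed characters. The helper name `split_interleaved` is hypothetical, and the input strings are copied from the extracted row above. It only works when the two overlapping strings have the same length and spacing, which is why some columns still come out wrong.

```python
def split_interleaved(value):
    """Split a string built from two character-interleaved strings.

    Returns (even-indexed characters, odd-indexed characters).
    """
    return value[0::2], value[1::2]


# Cell values copied from the extracted row in the question.
print(split_interleaved("TT..0116"))              # ('T.01', 'T.16')
print(split_interleaved("0210..0037..22001166"))  # ('01.03.2016', '20.07.2016')

# When the two underlying strings differ in length or spacing,
# the characters no longer alternate cleanly and the result stays garbled.
print(split_interleaved("678 h doauyrss"))
```

The first two cells separate cleanly into one value per overlapping row, which fits the observation that two rows are mixed together.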

2 replies

jsvine Mar 16, 2022
Maintainer

Ah, yes, unfortunately it does seem that the PDF is a little buggy or, perhaps more accurately, poorly designed. Digging in a bit, you can see that there are actually two sets of characters written on top of each other in the same line (though one is apparently invisible, possibly because a rectangle was drawn over it):

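import pdfplumber

# Assumption: a local copy of the PDF linked in the question; page 2 is index 1.
page = pdfplumber.open("2016_Maintenance_list_PROD_PDF.pdf").pages[1]
im = page.to_image()
start = 1541
count = 31
# Print the text of one run of characters, then outline those characters.
print(pdfplumber.utils.extract_text(page.chars[start:start+count]))
im.reset().draw_rects(page.chars[start:start+count])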

Printed: T.01 Station works MS Arnoldstein
Image:

... versus:

start = 2645
count = 28
print(pdfplumber.utils.extract_text(page.chars[start:start+count]))
im.reset().draw_rects(page.chars[start:start+count])

Printed: T.01 Station works MS Arnoldstein

Image:

There might be ways of identifying and removing the errant/extra text, but it would probably require some logic fairly specific to this PDF and the objects within it.
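As one possible starting point for such logic, the sketch below separates characters into non-overlapping "layers" based on their bounding boxes. It is an illustration under assumptions, not a tested fix for this PDF: the function names `overlaps` and `split_into_layers` and the `min_ratio` threshold are hypothetical, the input is assumed to be a list of character dicts with `x0`, `x1`, `top`, and `bottom` keys (as in `page.chars`), and it does not decide which layer is the visible one.

```python
def overlaps(a, b, min_ratio=0.5):
    """Return True if boxes a and b overlap by at least min_ratio
    of the area of the smaller box."""
    width = min(a["x1"], b["x1"]) - max(a["x0"], b["x0"])
    height = min(a["bottom"], b["bottom"]) - max(a["top"], b["top"])
    if width <= 0 or height <= 0:
        return False
    area_a = (a["x1"] - a["x0"]) * (a["bottom"] - a["top"])
    area_b = (b["x1"] - b["x0"]) * (b["bottom"] - b["top"])
    smaller = min(area_a, area_b)
    return smaller > 0 and (width * height) / smaller >= min_ratio


def split_into_layers(chars):
    """Assign each char to the first layer in which it overlaps nothing.

    Chars are processed in the order given, so text written later on top
    of earlier text ends up in a later layer. Quadratic in the number of
    chars, which is acceptable for a single page in a sketch like this.
    """
    layers = []
    for char in chars:
        for layer in layers:
            if not any(overlaps(char, other) for other in layer):
                layer.append(char)
                break
        else:
            # The char overlaps something in every existing layer.
            layers.append([char])
    return layers
```

With pdfplumber, this could be tried as `split_into_layers(page.chars)`, passing each resulting layer to `pdfplumber.utils.extract_text` to see which one holds the visible text. Choosing the layer to keep would still need to be checked by hand against the rendered page.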

jakobdo Mar 17, 2022
Author

Thanks again, @jsvine. I'm not sure how to add logic that will fix this, but thanks again for your time and for this nice library.
