Cutting off first character of last column #361

arenasc · 2023-01-03T20:01:11Z

arenasc
Jan 3, 2023

Summary of your issue

When extracting tables from a PDF, the first character of the last column is cut off. All other columns are fine. The PDF only has horizontal lines separating the rows of data, and rows contain line breaks. There are no vertical lines separating the columns. Example:
PDF => Col1 | Col2 | Col3
CSV => Col1 | Col2 | ol3

I also tried lattice=True, which made the results worse.

Check list before submit

Did you read FAQ?
(Optional, but really helpful) Your PDF URL: n/a
Paste the output of import tabula; tabula.environment_info() on Python REPL: ?

If not possible to execute tabula.environment_info(), please answer following questions manually.

Paste the output of python --version command on your terminal: Python 3.10.5
Paste the output of java -version command on your terminal: java 19.0.1 2022-10-18
Does java -h command work well?; Ensure your java command is included in PATH
Write your OS and its version: macOS Monterey v12.6

What did you do when you faced the problem?

I checked the FAQs. Tried limiting the page number to 1 page. Same results.

Code:

import tabula
tabula.convert_into("test3.pdf", "test3.csv", output_format="csv", pages='all')

Expected behavior:

See above

Actual behavior:

See above

Related Issues:

n/a

chezou · 2023-01-06T00:08:27Z

chezou
Jan 6, 2023
Maintainer

Thanks for reporting. I saw similar stuff when the vertical line was too close to the character.

One thing I can think of is to use the columns option. Can you try it? See details in: https://tabula-py.readthedocs.io/en/latest/tabula.html#tabula.io.read_pdf
Without having the PDF, this is what I can suggest.
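
For readers following along, here is a minimal sketch of how the columns option might be added to the convert_into call from the original post. The helper names (check_boundaries, convert_with_columns) are hypothetical, and the left-to-right ordering check is an assumption made for illustration, not something the tabula-py documentation is known to require. The boundary values are the placeholder numbers from the documentation, not measurements from a real PDF.

```python
from typing import Sequence


def check_boundaries(columns: Sequence[float]) -> list[float]:
    """Return the boundaries as floats, insisting on left-to-right order.

    The ordering check is an assumption for this sketch, not a documented
    tabula-py requirement.
    """
    values = [float(x) for x in columns]
    if values != sorted(values):
        raise ValueError("column boundaries should increase from left to right")
    return values


def convert_with_columns(pdf_path: str, csv_path: str, columns: Sequence[float]) -> None:
    """Same call as in the original post, with the columns option added."""
    # Imported here so the helper above works without tabula-py or Java installed.
    import tabula

    tabula.convert_into(
        pdf_path,
        csv_path,
        output_format="csv",
        pages="all",
        columns=check_boundaries(columns),  # X coordinates of column boundaries
    )
```

A call would then look like convert_with_columns("test3.pdf", "test3.csv", [10.1, 20.2, 30.3]), with the boundaries replaced by values measured for the table in question.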

arenasc · 2023-01-09T20:06:14Z

arenasc
Jan 9, 2023
Author

hi @chezou,
Thanks for the recommendation! I added "columns=[10.1, 20.2, 30.3]" to the code and the column is no longer cut off 👍 Am I

arenasc · 2023-01-09T20:09:31Z

arenasc
Jan 9, 2023
Author

Ugh, pressed the wrong button, sorry @chezou. My follow-up question is whether my syntax is correct. Using "columns=[10.1, 20.2, 30.3]" worked, but for future PDFs, I want to fully understand how to use the option. I know it's supposed to be: X coordinates of column boundaries. Can you please explain it or point me to a better explanation in the documentation? Thanks!

chezou · 2023-01-09T22:40:53Z

chezou
Jan 9, 2023
Maintainer

The description of the option comes from tabula-java's documentation. I'm not sure what your point is.
https://github.com/tabulapdf/tabula-java#commandline-usage-examples

Example code can be found in this article by @tdpetrou https://www.dunderdata.com/blog/read-trapped-tables-within-pdfs-as-pandas-dataframes

If you want to reuse the same columns option across different PDFs, that is not possible. You need to set the columns option per table.
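
To make "X coordinates of column boundaries" more concrete, the following is a simplified model and not tabula-java's actual algorithm: each piece of text has a left x position on the page, and every boundary is a vertical cut, so n boundaries give n + 1 columns. The function split_row and the coordinates are hypothetical and chosen only to show why a boundary that lands just after a character can separate it from the rest of its cell, and why each table needs boundaries that match its own layout.

```python
from bisect import bisect_right


def split_row(chunks, boundaries):
    """Group (x, text) chunks into len(boundaries) + 1 cells by their x position."""
    cells = [[] for _ in range(len(boundaries) + 1)]
    for x, text in sorted(chunks):
        # bisect_right counts how many boundaries lie at or to the left of x,
        # which is the index of the column this chunk falls into.
        cells[bisect_right(boundaries, x)].append(text)
    return ["".join(cell) for cell in cells]


# Hypothetical row: each chunk is (left x coordinate, text).
row = [(12.0, "Col1"), (150.0, "Col2"), (298.0, "C"), (305.0, "ol3")]

# Second boundary sits just after the "C", so the last cell loses its first character.
print(split_row(row, [100.0, 300.0]))  # ['Col1', 'Col2C', 'ol3']

# Boundary moved left of the "C": the last cell is intact.
print(split_row(row, [100.0, 290.0]))  # ['Col1', 'Col2', 'Col3']
```

In this toy model the stray character ends up in the neighbouring cell rather than disappearing, so it only approximates the behaviour reported above; the point is that the boundary values depend on where the text sits in one particular table.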

arenasc · 2023-01-10T15:22:04Z

arenasc
Jan 10, 2023
Author

Hi,
Thanks again for the documentation and the article. This clears up a lot for me. Thanks again! :)
