Many PDF documents don't parse correctly #33

Open
jasonperrone opened this issue Nov 12, 2015 · 2 comments

@jasonperrone

Not sure if other people have this problem, but about half of the PDFs I throw at this thing return gobbledygook for text. The other half are fine. Incidentally, pdf-reader processes those same docs with no problem.

@Erol
Member

Erol commented Nov 16, 2015

I've also noticed that annoying quirk of Tika. One solution I can think of is to drop Tika in favor of pdftotext when parsing PDF files.

@jasonperrone
Author

Exactly what I already did.
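
For anyone landing here later, the pdftotext workaround mentioned in this thread might look roughly like the sketch below. It is not code from this project: it is written in Python purely for illustration, the function names `build_pdftotext_command` and `extract_pdf_text` are made up, and it assumes the `pdftotext` command-line tool (shipped with Poppler or Xpdf) is installed and on the PATH.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, binary="pdftotext"):
    """Return the argument list for a pdftotext call that writes to stdout.

    Hypothetical helper for illustration; not part of any library.
    """
    # "-enc UTF-8" sets the output encoding; a trailing "-" as the
    # text-file argument sends the extracted text to stdout.
    return [binary, "-enc", "UTF-8", pdf_path, "-"]


def extract_pdf_text(pdf_path):
    """Extract text from a PDF by shelling out to pdftotext."""
    # Assumption: pdftotext is installed separately and is on the PATH.
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(
        build_pdftotext_command(pdf_path),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.decode("utf-8", errors="replace")
```

A caller could try `extract_pdf_text("some.pdf")` only for PDF inputs and keep the existing parser for other formats, which matches the "drop Tika for PDFs only" idea above.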
