This repository has been archived by the owner on Dec 9, 2018. It is now read-only.

Improve --space-as-offset: determine spaces by unicode #446

Open
wants to merge 1 commit into
base: incoming


duanyao
Collaborator

@duanyao duanyao commented Nov 15, 2014

Fix #445.
Now --space-as-offset checks for a "unicode space" in the decoded text, instead of checking for ASCII SPACE before the text is decoded.
This change should also increase the opportunities for converting spaces to offsets.
However, for PDFs with bad Unicode support, this may still drop chars, though I haven't found an example yet.
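
To make "unicode space" concrete, here is a minimal sketch of such a check. It is not the code from this PR: the helper name `is_unicode_space` is hypothetical, and it assumes that "space" means the Unicode Space Separator category (Zs). The actual commit may use a different set of code points.

```cpp
// Hypothetical helper, not taken from the PR: returns true when the decoded
// code point belongs to the Unicode Space Separator category (Zs).
// Assumption: Zs is the intended meaning of "unicode space" in this thread.
bool is_unicode_space(char32_t u) {
    switch (u) {
        case 0x0020: // SPACE
        case 0x00A0: // NO-BREAK SPACE
        case 0x1680: // OGHAM SPACE MARK
        case 0x202F: // NARROW NO-BREAK SPACE
        case 0x205F: // MEDIUM MATHEMATICAL SPACE
        case 0x3000: // IDEOGRAPHIC SPACE
            return true;
        default:
            // U+2000..U+200A: EN QUAD through HAIR SPACE
            return u >= 0x2000 && u <= 0x200A;
    }
}
```

A check like this only sees the value produced by decoding, so it is only as reliable as the font's ToUnicode mapping or encoding.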

@coolwanglu
Owner

--space-as-offset is not guaranteed to work if either the ToUnicode mapping or the font encoding is corrupted. In fact I had a few test cases before, where the font encoding is OK yet ToUnicode is missing or corrupted. In my experience, there are more issues in the ToUnicode mappings, especially for old PDF files.

Seems that old PDF generators/converters were not able to handle this well -- after all this has nothing to do with printing. And ToUnicode is indeed optional in the standard.

I'm not sure if this is a good solution. Or possibly we could take the --to-unicode parameter into account, i.e. whether we trust the mapping.

@duanyao
Collaborator Author

duanyao commented Nov 15, 2014

If ToUnicode is missing, can we just ignore --space-as-offset 1 for that font automatically? We can also add --space-as-offset 2 to force it on even if ToUnicode is missing. However, it seems impossible to detect whether ToUnicode or the font encoding is corrupted.
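
The per-font policy proposed above could be sketched roughly as follows. This is only an illustration of the proposal: the value 2 does not exist as an option in this thread's code, and the names `SpaceAsOffset` and `convert_spaces_for_font` are hypothetical.

```cpp
// Hypothetical modes mirroring the proposed --space-as-offset values.
// Value 2 is only a suggestion from this discussion, not an existing option.
enum class SpaceAsOffset { Off = 0, WhenToUnicodePresent = 1, Force = 2 };

// Decide per font whether spaces should be converted to offsets.
// `font_has_to_unicode` is assumed to be known from font loading.
bool convert_spaces_for_font(SpaceAsOffset mode, bool font_has_to_unicode) {
    switch (mode) {
        case SpaceAsOffset::Off:
            return false;
        case SpaceAsOffset::WhenToUnicodePresent:
            // Mode 1: silently skip fonts that lack a ToUnicode mapping.
            return font_has_to_unicode;
        case SpaceAsOffset::Force:
            // Mode 2: convert even when ToUnicode is missing.
            return true;
    }
    return false; // unreachable for valid enum values
}
```

Note that this only covers a missing mapping; a corrupted ToUnicode or font encoding would still pass the check.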
