This repository has been archived by the owner on Dec 9, 2018. It is now read-only.

pdf2htmlEX - output html source code #761

Open
MBhat6 opened this issue Mar 24, 2018 · 9 comments

Comments

MBhat6 commented Mar 24, 2018

I have an issue with pdf2htmlEX output. I created HTML output for my PDF document, and it renders nicely. But in the source code I see that the words are broken up and separated by $, !, spaces, and spans. At times there are lots of ! and $ signs.

In my program I generate the HTML file, search for keywords in the text, and add tags to highlight them. But because of the broken words, I can't find my keywords in this output. The browser search works great, however.

Any suggestions or workarounds are appreciated.

ebbandari commented Mar 27, 2018

I face the same issue. I'm not sure why there are so many $ and ! signs, and some words have a space added in the middle. The text also seems to be all on one line.
Is there a way to create cleaner files, or to convert this file to cleaner HTML?

@mortenmoulder

I have the same issue here. I need to replace a bunch of words in the HTML file, but because of these <span>-tags everywhere, I can't search and replace.

I wonder if it's possible to stop that from happening in the source. So it won't break up words.

ebbandari commented Apr 5, 2018 via email

@mortenmoulder

@ebbandari Non-commercial? As far as I know, GPLv3-licensed software can be used commercially as much as you want.

ebbandari commented Apr 5, 2018 via email

@mortenmoulder

ebbandari commented Apr 5, 2018 via email

@mortenmoulder

@ebbandari Exactly. I can use it for whatever I want, but if I go out and make a "PDF to HTML converter" and use pdf2htmlEX as my tool, my business is built solely from the code of pdf2htmlEX (if pdf2htmlEX did not exist, neither would my product).

As long as we use them as tools, we can use them as much as we want.

subodhkalika commented Jul 3, 2018

@MBhat6
Yes, there is a solution.
I had the same issue. I had to highlight keywords in an HTML document that rendered correctly but whose words were broken up by spans.

I used BeautifulSoup, a Python package, to parse the HTML and mark (highlight) the keywords.
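
The comment above gives no code, so here is a rough sketch of the general idea rather than the commenter's actual script. It swaps BeautifulSoup for Python's standard-library `html.parser` so it runs without extra dependencies; the same approach should carry over to BeautifulSoup. The names `_Tokenizer` and `highlight` are made up for illustration. The idea: join all visible text while ignoring the tags, search the joined text, then wrap each matching piece of each text node in `<mark>`, so the existing span nesting stays intact.

```python
import re
from html.parser import HTMLParser


class _Tokenizer(HTMLParser):
    """Split HTML into ('raw', markup) and ('text', visible text) tokens."""

    def __init__(self):
        # Keep entities such as &amp; as raw markup instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.tokens = []
        self._skip = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        self.tokens.append(("raw", self.get_starttag_text()))
        if tag in ("script", "style"):
            self._skip += 1

    def handle_startendtag(self, tag, attrs):
        self.tokens.append(("raw", self.get_starttag_text()))

    def handle_endtag(self, tag):
        self.tokens.append(("raw", f"</{tag}>"))
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # CSS and JavaScript are passed through untouched.
        self.tokens.append(("raw" if self._skip else "text", data))

    def handle_entityref(self, name):
        self.tokens.append(("raw", f"&{name};"))

    def handle_charref(self, name):
        self.tokens.append(("raw", f"&#{name};"))

    def handle_comment(self, data):
        self.tokens.append(("raw", f"<!--{data}-->"))

    def handle_decl(self, decl):
        self.tokens.append(("raw", f"<!{decl}>"))


def highlight(html, keyword):
    """Wrap case-insensitive matches of keyword in <mark>, even across tags."""
    parser = _Tokenizer()
    parser.feed(html)
    parser.close()

    # Offset of every token within the joined visible text.
    starts, pos = [], 0
    for kind, value in parser.tokens:
        starts.append(pos)
        if kind == "text":
            pos += len(value)
    joined = "".join(v for k, v in parser.tokens if k == "text")
    hits = [m.span() for m in re.finditer(re.escape(keyword), joined, re.IGNORECASE)]

    out = []
    for (kind, value), start in zip(parser.tokens, starts):
        if kind != "text":
            out.append(value)
            continue
        cursor = 0
        for lo, hi in hits:
            # Clip the match to the part that falls inside this text node.
            a = max(lo - start, cursor)
            b = min(hi - start, len(value))
            if a < b:
                out.append(value[cursor:a])
                out.append(f"<mark>{value[a:b]}</mark>")
                cursor = b
        out.append(value[cursor:])
    return "".join(out)
```

For example, `highlight('<div>key<span></span>word</div>', 'keyword')` yields `<div><mark>key</mark><span></span><mark>word</mark></div>`. Limitations of this sketch: text is joined across the whole document, so a match can straddle two separate lines; entities are treated like tags; and extra spaces inserted in the middle of words are not bridged.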
