-
Notifications
You must be signed in to change notification settings - Fork 0
nathanverrilli/doc2html
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
<!DOCTYPE html> <html lang="en"> <head> <meta charset="UTF-8"> <title>DOC2HTML</title> <link href="https://fonts.googleapis.com/css2?family=Fanwood+Text:ital@0;1&family=IBM+Plex+Mono:ital,wght@0,500;1,500&display=swap" rel="stylesheet"> <style type="text/css"> h1 { text-align: center; } h2 { text-align: center; margin: 2em 0 1em 0; } h3 { text-align: left; margin: 2em 0 .5em 0; } body { font-size: 1.25em; background-color: #eed; font-family: "Fanwood Text", serif; margin: 0 8px 2em 10px; } p { margin: 0; text-indent: 1.75em; } li p { text-indent: 0; margin: .25em 0 .25em 0; } tt { font-family: "IBM Plex Mono", monospace; color: navy; background-color: #ccfbcc; margin: 0; font-size: 90%; } </style> </head> <body> <h1>A Program to Transform<br /> Complex Text Documents<br />to<br /> Simple HTML</h1> <p>Word processors are wonderful: they manage multiple versions, and WYSIWYG editors such as Microsoft Word, Wordperfect, and LibreOffice can double as document publishing tools for simple (and even complex) documents. </p> <p> But this power comes with a price. Retrieving nicely formatted text as simplistic HTML (without the powerful formatting capabilities) is hard. Native output to HTML wants to include font information, color information, metadata, complex paragraph rules, &hellip </p> <p> Sometimes, all one wants is the simplest formatting. Headers, bold, italic, bold-italic, and paragraphs. Monospace, maybe. Retrieving that bare-bones HTML is difficult. </p> <p> <tt>DOC2HTML</tt> makes it easy. It uses a powerful parser <i>(Apache Tika)</i> to identify and parse input. It then renders the document into simple HTML. </p> <p><tt>DOC2HTML</tt> was written for Java 11. There is an all-in-one JAR file, so running it as simple as </p> <ul> <li><p>Making sure the right Java is present on the system. From a command prompt, run:<br/></p> <div class="indent"> <tt>java --version</tt> </div> <p>As long as the version is 11 or higher, all is well.</p> </li> <li> <p>Download the <tt>doc2html.jar</tt> from the Release folder. This JAR is specially compiled as an all-in-one jar, thanks to the immensely useful <a href="http://one-jar.sourceforge.net/">One-JAR™</a> project by Simon Tuffs. <i>Thank you, Simon, for this contribution to the open source community!</i> </li> <li> <p><tt>java -jar doc2html.jar -f MyDocument.docx -o MyOutput.html</tt> </li> <h2>Using the <b><tt>.exe</tt></b> version of doc2html</h2> <p>This is a Windows executable and it relies on having a Java engine available. I am not certain what version of Java it requires — quite possibly Java 11, but it <i>might</i> function with lower versions. I have tested it only with Java 11. </p> <p>The <tt>doc2html.exe</tt> file is available from the <tt>Releases</tt> folder.</p> <p>I am looking into something that will not require Java on Windows, but the program <i>is</i> written in Java, after all. </p> <h2>Building this Project</h2> <p>Building this project and the OneJar release jar is a two-step process (albeit big steps).</p> <h3>Step One <i>build project</i></h3> <p>Build the project. This is most easily done in Jetbrains IntelliJ, as all the IDE configuration files and project files are committed as part of the project. There is nothing magic about Intellij, however, other than its putting all the required JAR files (and their dependencies) in the <tt>…/out/artifacts/doc2html_jar</tt> directory as part of building the JAR file.</p> <h3>Step Two <i>run makefile</i></h3> <p>From the command line, run <tt>make onejar</tt>, but double-check your assumptions at the door:</p> <ul> <li>The makefile assumes one is running on Windows. Rewriting it to work on *nix is straightforward, but left to the reader.</li> <li>The makefile assumes it can find two critical files <tt>d2h_MANIFEST.mf</tt> and <tt>one-jar-boot-0.97.jar</tt> in the OneJar directory.</li> <li>The makefile assumes it can create (and delete) a temporary directory <tt>root</tt> in the project directory, and it will assemble the project there</li> <li>The makefile assumes it has access to <tt>7z</tt> (to extract files from a jar file, and no, JAR is too limited to do this).</li> <li>The makefile assumes it has access to the <tt>jar</tt> command to reassemble the jar file. Using 7z was tempting from a performance perspective, but I am not certain that 7z will stay within the rigid limits of the JAR file format (also, the <tt>JAR</tt> command makes putting the manifest in easier).</li> <li>This uses GNU <tt>make</tt>. Another flavor of <tt>make</tt> might not behave similarly. It <i>probably</i> will, but this is a list of assumptions, safe and otherwise.</li> </ul> </ul> <h4>Building the DOC2HTML executable <i>not a step, no explicit directions provided here</i></h4> <p>The project was written and compiled with GraalVM-CE, and uses the GraalVM native-image command to transform the One-JAR™ file into an executable, albeit an executable that still requires Java (this makes sense, as GraalVM’s handling of reflection is admitted to be iffy, and One-JAR™ depends on reflection to get its custom class loader working. I’m pleased it works at all, given the reservations that the GraalVM project team expresses about the use of reflection). <p>Building a native-image directly from the initial JAR and the JAR files it requires on its CLASSPATH should be a simpler exercise — but it fails with an error message about a class being missing <i>that is present</i> in the <tt>org.apache.tika.parsers&hellipjar</tt>. This may be another, subtler, manifestation of a reflection problem, as Tika attempts to ignore parsers that aren’t present (more reflection).</p> </body> </html>
About
strip complex word-processing documents (ODT, DOC, DOCX, RTF) into simple HTML
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published