Skip to content

strip complex word-processing documents (ODT, DOC, DOCX, RTF) into simple HTML

Notifications You must be signed in to change notification settings

nathanverrilli/doc2html

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

10 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>DOC2HTML</title>
  <link href="https://fonts.googleapis.com/css2?family=Fanwood+Text:ital@0;1&family=IBM+Plex+Mono:ital,wght@0,500;1,500&display=swap" rel="stylesheet">
  <style type="text/css">
    h1 {
      text-align: center;
    }
    h2 {
      text-align: center;
      margin: 2em 0 1em 0;
    }
    h3 {
      text-align: left;
      margin: 2em 0 .5em 0;
    }
    body {
      font-size: 1.25em;
      background-color: #eed;
      font-family: "Fanwood Text", serif;
      margin: 0 8px 2em 10px;
    }
    p {

      margin: 0;
      text-indent: 1.75em;
    }
    li p {
      text-indent: 0;
      margin: .25em 0 .25em 0;
    }
    tt {
      font-family: "IBM Plex Mono", monospace;
      color: navy;
      background-color: #ccfbcc;
      margin: 0;
      font-size: 90%;
    }
  </style>
</head>
<body>
     <h1>A Program to Transform<br />
       Complex Text Documents<br />to<br />
       Simple HTML</h1>
<p>Word processors are wonderful: they manage multiple versions,
  and WYSIWYG editors such as Microsoft Word, Wordperfect, and
  LibreOffice can double as document publishing tools for simple
  (and even complex) documents.
</p>
<p>
  But this power comes with a price. Retrieving nicely
  formatted text as simplistic HTML (without the powerful
  formatting capabilities) is hard. Native output to HTML
  wants to include font information, color information,
  metadata, complex paragraph rules, &hellip
</p>
<p>
  Sometimes, all one wants is the simplest formatting.
  Headers, bold, italic, bold-italic, and paragraphs.
  Monospace, maybe. Retrieving that bare-bones HTML
  is difficult.
</p>
<p>
  <tt>DOC2HTML</tt> makes it easy. It uses a powerful
  parser <i>(Apache Tika)</i>
  to identify and parse input. It then renders the document
  into simple HTML.
</p>
<p><tt>DOC2HTML</tt> was written for Java 11. There is an all-in-one
JAR file, so running it as simple as </p>
<ul>
  <li><p>Making sure the right Java is present on the system. From a
  command prompt, run:<br/></p>
  <div class="indent">
    <tt>java --version</tt>
  </div>
    <p>As long as the version is 11 or higher, all is well.</p>
  </li>
  <li>
    <p>Download the <tt>doc2html.jar</tt> from the Release folder.
    This JAR is specially compiled as an all-in-one jar, thanks to
      the immensely useful
      <a href="http://one-jar.sourceforge.net/">One-JAR&trade;</a>
      project by Simon Tuffs. <i>Thank you, Simon, for this contribution to the open
    source community!</i>
  </li>
  <li>
    <p><tt>java -jar doc2html.jar -f MyDocument.docx -o MyOutput.html</tt>
  </li>
  <h2>Using the <b><tt>.exe</tt></b> version of doc2html</h2>
  <p>This is a Windows executable
    and it relies on having a Java engine available.
    I am not certain what version of Java it requires &mdash; quite possibly Java 11, but
    it <i>might</i> function with lower versions. I have tested it only with Java 11.
  </p>
  <p>The <tt>doc2html.exe</tt> file is available from the <tt>Releases</tt> folder.</p>
  <p>I am looking into something that will not require Java on Windows, but the program <i>is</i>
  written in Java, after all.
  </p>


<h2>Building this Project</h2>
  <p>Building this project and the OneJar release jar is a two-step process (albeit big steps).</p>
  <h3>Step One <i>build project</i></h3>
  <p>Build the project. This is most easily done in Jetbrains IntelliJ, as all
  the IDE configuration files and project files are committed as part of the project. There is
  nothing magic about Intellij, however, other than its putting all the required
    JAR files (and their dependencies) in the <tt>&hellip;/out/artifacts/doc2html_jar</tt>
  directory as part of building the JAR file.</p>
  <h3>Step Two <i>run makefile</i></h3>
  <p>From the command line, run&nbsp;<tt>make onejar</tt>, but double-check your
  assumptions at the door:</p>
  <ul>
    <li>The makefile assumes one is running on Windows. Rewriting it to work on *nix
    is straightforward, but left to the reader.</li>
    <li>The makefile assumes it can find two critical files <tt>d2h_MANIFEST.mf</tt>
    and <tt>one-jar-boot-0.97.jar</tt> in the OneJar directory.</li>
    <li>The makefile assumes it can create (and delete)
      a temporary directory <tt>root</tt>
    in the project directory, and it will assemble the project there</li>
    <li>The makefile assumes it has access to <tt>7z</tt> (to extract files from a
      jar file, and no, JAR is too limited to do this).</li>
    <li>The makefile assumes it has access to the <tt>jar</tt>
      command to reassemble the jar file. Using 7z was tempting
    from a performance perspective, but I am not certain that
    7z will stay within the rigid limits of the JAR file format
      (also, the <tt>JAR</tt> command makes putting
      the manifest in easier).</li>
    <li>This uses GNU <tt>make</tt>.
      Another flavor of <tt>make</tt>
    might not behave similarly. It <i>probably</i> will,
    but this is a list of assumptions, safe and otherwise.</li>
  </ul>
</ul>
     <h4>Building the DOC2HTML executable <i>not a step, no explicit directions provided here</i></h4>
<p>The project was written and compiled with GraalVM-CE, and uses the GraalVM native-image
command to transform the One-JAR&trade; file into an executable, albeit an executable that
still requires Java (this makes sense, as GraalVM&rsquo;s handling of reflection is
  admitted to be iffy, and One-JAR&trade; depends on reflection to get its custom class loader
  working. I&rsquo;m pleased it works at all, given the reservations that the GraalVM
  project team expresses about the use of reflection).
<p>Building a native-image directly from the initial JAR and the JAR files it requires on its
     CLASSPATH should be a simpler exercise &mdash; but it fails with an error message about
     a class being missing <i>that is present</i> in the <tt>org.apache.tika.parsers&hellipjar</tt>.
     This may be another, subtler, manifestation of a reflection problem, as Tika attempts to
     ignore parsers that aren&rsquo;t present (more reflection).</p>
</body>
</html>

About

strip complex word-processing documents (ODT, DOC, DOCX, RTF) into simple HTML

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published