A LibreOffice server wrapper that is exposed over HTTP to allow easy conversions from supported documents to PDF.

moalhaddar/docx-to-pdf

Docx To PDF

Important

As of 31st May, 2024, the service has been rewritten using Kotlin and Spring Boot.

If you wish to use or look at the old source code, you can do so by checking out the nodejs branch.

Overview

This tool transforms Microsoft Word Documents (DOCX) into PDF format.

The Challenge

Though there's an abundance of tools for DOCX to PDF conversion, many face challenges such as:

  • Preserving document formatting.
  • Handling Right-to-Left (RTL) text.
  • Properly displaying various fonts.

Further complicating matters, many of these tools demand specific environments and have multiple, often redundant, layers of abstraction.

Why is accurate DOCX conversion so challenging? The root of the issue lies in the DOCX specification, known as OOXML.

  • Origin & Complexity: OOXML, an XML-based format for DOCX documents, was designed by Microsoft. Its specification is incredibly detailed, spanning well over 6,000 pages, and it contradicts itself in some cases, leading to varied interpretations.
  • Implementation Challenges: The intricate nature of OOXML means that even Microsoft doesn't perfectly implement it in their official applications. The experience of opening a DOCX file can differ between platforms like macOS and Windows.
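  • Contrast with PDFs: PDFs operate on a different structural and conceptual paradigm than DOCX files. A PDF describes a fixed page layout with content placed at exact positions, whereas a DOCX describes flowing content that the rendering application has to lay out.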

Given these complexities, creating a flawless DOCX to PDF converter remains a daunting challenge, with no foolproof solution in sight anytime soon.

LibreOffice to the rescue

LibreOffice, a powerful open-source office suite, has consistently strived to enhance compatibility with Microsoft Office formats, particularly the .docx format. This is crucial to ensure that users transitioning from Microsoft Office to LibreOffice can continue working with their documents seamlessly.

It offers built-in APIs and a server mode, enabling users to interface with it programmatically through the Universal Network Objects (UNO) API.

While not perfect, it is still the best freely available solution for rendering DOCX documents.

Proposed Solution

This service is built in Kotlin, using Spring Boot, and exposes a simple API that takes a DOCX file as input and responds with an application/pdf file.

It achieves this by launching a LibreOffice server in the background during initialization, then communicating with that server through the UNO APIs provided by OpenOffice (OO) to:

  1. Launch an instance.
  2. Load the document.
  3. Save the document as PDF.

The resulting PDF is then streamed back to the user.

Installation Guide

Using docker

You can pull the latest built Docker image from Docker Hub:

docker pull moalhaddar/docx-to-pdf:latest

Then you can run the service:

docker run \
 --rm --name docx-to-pdf \
 -p 8080:8080  \
 -e "pool.size=1" \
 -v ./fonts:/usr/share/fonts/custom \
  moalhaddar/docx-to-pdf:latest

Some details:

  • pool.size: The number of converter workers. Set it in case you need more workers; omit if not needed. Read more in the performance section.
  • /usr/share/fonts/custom: In case you need custom fonts. Omit if not needed. Read more in the fonts section.

Locally

To run this service locally, you will need the following:

  • Java 17

  • Maven

    • Download the zip archive.
    • Add the Maven binaries to your $PATH environment variable:
        $ export PATH=$PATH:/path/to/folder/apache-maven-3.9.4/bin
    • Verify the installation:
        $ mvn -v
        Apache Maven 3.9.4 (dfbb324ad4a7c8fb0bf182e6d91b0ae20e3d2dd9)
  • LibreOffice. Make sure it is available on your system through the $PATH variable:

        $ libreoffice --version
        LibreOffice 7.5.6.2 50(Build:2)
  • Clone this repository, then run the service from the repository root:

        $ mvn spring-boot:run
  • Expect the service to be running on port 8080 by default.

      2023-10-06T22:36:06.922+03:00  INFO 1513874 --- [           main] d.a.d.DocxToPdfKotlinApplicationKt       : No active profile set, falling back to 1 default profile: "default"
      2023-10-06T22:36:07.414+03:00  INFO 1513874 --- [           main] o.s.b.w.embedded.tomcat.TomcatWebServer  : Tomcat initialized with port(s): 8080 (http)
      2023-10-06T22:36:07.419+03:00  INFO 1513874 --- [           main] o.apache.catalina.core.StandardService   : Starting service [Tomcat]
      2023-10-06T22:36:07.419+03:00  INFO 1513874 --- [           main] o.apache.catalina.core.StandardEngine    : Starting Servlet engine: [Apache Tomcat/10.1.13]
      2023-10-06T22:36:07.463+03:00  INFO 1513874 --- [           main] o.a.c.c.C.[Tomcat].[localhost].[/]       : Initializing Spring embedded WebApplicationContext
      2023-10-06T22:36:07.464+03:00  INFO 1513874 --- [           main] w.s.c.ServletWebServerApplicationContext : Root WebApplicationContext: initialization completed in 513 ms
      2023-10-06T22:36:07.554+03:00  INFO 1513874 --- [atcher-worker-1] d.a.docxtopdf.server.LibreOfficeServer   : [LibreOffice/0] Starting server instance..
      2023-10-06T22:36:07.788+03:00  INFO 1513874 --- [           main] o.s.b.a.e.web.EndpointLinksResolver      : Exposing 1 endpoint(s) beneath base path '/actuator'
      2023-10-06T22:36:07.813+03:00  INFO 1513874 --- [           main] o.s.b.w.embedded.tomcat.TomcatWebServer  : Tomcat started on port(s): 8080 (http) with context path ''
      2023-10-06T22:36:07.821+03:00  INFO 1513874 --- [           main] d.a.d.DocxToPdfKotlinApplicationKt       : Started DocxToPdfKotlinApplicationKt in 1.071 seconds (process running for 1.22)
    

Usage

Once the service is up and running, you can send a POST request with the document as multipart form data to the /pdf endpoint.

Sample cURL:

curl --output output.pdf \
--location 'http://localhost:8080/pdf' \
--form 'document=@"/home/moalhaddar/example.docx"'

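Sample Fetch:

// fileInput is assumed to be an <input type="file"> element on the page.
const formdata = new FormData();
formdata.append("document", fileInput.files[0], "file.docx");

const requestOptions = {
  method: 'POST',
  body: formdata,
};

fetch("http://localhost:8080/pdf", requestOptions)
  .then(response => {
    if (!response.ok) throw new Error(`HTTP ${response.status}`);
    // The body is binary PDF data, so read it as a Blob rather than as text.
    return response.blob();
  })
  .then(pdf => console.log(`Received a PDF of ${pdf.size} bytes`))
  .catch(error => console.log('error', error));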
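
The same request can also be made from Node.js, writing the PDF to disk. The following is only a minimal sketch, assuming Node.js 18 or later (for the global fetch, FormData, and Blob). The helper names convertDocx and looksLikePdf are hypothetical and not part of this service; the check relies only on the fact that PDF files begin with the bytes "%PDF-".

```javascript
const fs = require("node:fs/promises");

// Returns true when the bytes start with the PDF magic header "%PDF-".
function looksLikePdf(bytes) {
  return Buffer.from(bytes.subarray(0, 5)).toString("latin1") === "%PDF-";
}

// Hypothetical helper: posts a DOCX file to the /pdf endpoint and saves the result.
// baseUrl assumes the default local setup; fetchImpl can be swapped out for testing.
async function convertDocx(docxPath, pdfPath, baseUrl = "http://localhost:8080", fetchImpl = fetch) {
  const docx = await fs.readFile(docxPath);

  // The service expects the file in a multipart field named "document".
  const form = new FormData();
  form.append("document", new Blob([docx]), "file.docx");

  const response = await fetchImpl(`${baseUrl}/pdf`, { method: "POST", body: form });
  if (!response.ok) {
    throw new Error(`Conversion failed with HTTP ${response.status}`);
  }

  // Read the body as raw bytes; decoding it as text would corrupt the PDF.
  const pdf = Buffer.from(await response.arrayBuffer());
  if (!looksLikePdf(pdf)) {
    throw new Error("Response does not look like a PDF");
  }
  await fs.writeFile(pdfPath, pdf);
  return pdf.length; // number of bytes written
}
```

Calling convertDocx("example.docx", "output.pdf") against a running instance should leave output.pdf next to the script. The header check is a cheap guard against saving an error page as a PDF.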

Examples

The example files below are included in this repository.

Example 1: Regular LTR Text with font

image

Example 2: Regular RTL Text with font

image

Example 3: An Image with text

image

Example 4: A table

image

Fonts

Your DOCX document may use a font that is not installed on the system running the service. In that case, the font needs to be made available either by:

  • Installing it on your system, if you run the service locally.
  • Mounting it into the container under /usr/share/fonts/custom, if you run the service with Docker.

Then you will need to restart the service.

If the font is not included within the system or Docker image, the resulting PDFs will not look the same as the original DOCX file.

Performance

This tool was designed with performance in mind.

By default, the service runs with a single converter worker. If you have enough system resources available and need to increase the worker pool size, you can do so by providing the environment variable pool.size to the service.

Alternatively, you can uncomment the pool.size line in the properties file.

I've stress tested this service with concurrent requests using a very simple one-page DOCX file (example 1). The test runs in a web browser with JavaScript; check out index.html and stress.js respectively. Here are the results:

Number of concurrent requests | Number of workers | Time to serve all requests
------------------------------|-------------------|---------------------------
50                            | 1                 | 5.594s
50                            | 4                 | 1.880s
50                            | 8                 | 1.277s
50                            | 16                | 1.081s
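
To make these numbers easier to compare, the short script below derives throughput, speedup over a single worker, and per-worker efficiency from the measurements above. It is only an illustration of the arithmetic; the names measuredRuns and summarize are made up for this example.

```javascript
// Measurements copied from the results above: 50 concurrent requests per run.
const REQUESTS = 50;
const measuredRuns = [
  { workers: 1, seconds: 5.594 },
  { workers: 4, seconds: 1.88 },
  { workers: 8, seconds: 1.277 },
  { workers: 16, seconds: 1.081 },
];

// Derives throughput, speedup relative to the first run, and per-worker efficiency.
function summarize(runs, requests) {
  const baseline = runs[0].seconds;
  return runs.map(({ workers, seconds }) => ({
    workers,
    requestsPerSecond: requests / seconds,
    speedup: baseline / seconds,
    efficiency: baseline / seconds / workers,
  }));
}

for (const row of summarize(measuredRuns, REQUESTS)) {
  console.log(
    `${row.workers} workers: ${row.requestsPerSecond.toFixed(1)} req/s, ` +
      `${row.speedup.toFixed(2)}x speedup, ` +
      `${(row.efficiency * 100).toFixed(0)}% efficiency`
  );
}
```

In these measurements the speedup is roughly 3x with 4 workers and a little over 5x with 16, so returns diminish as workers are added.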

Keep in mind that the input file complexity matters a lot when it comes to the conversion performance.
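
To repeat the measurement with your own documents, a stress run along the lines described above could look like the sketch below. This is not the repository's stress.js; runStress and sendConversion are hypothetical names, and the URL assumes the default local setup. It runs in a browser or in Node.js 18 or later.

```javascript
// Fires `count` requests at once and reports how long it takes for all of them to settle.
// `sendRequest` is any function that returns a promise, e.g. one conversion request.
async function runStress(count, sendRequest) {
  const startedAt = performance.now();
  const results = await Promise.allSettled(
    Array.from({ length: count }, (_, index) => sendRequest(index))
  );
  const elapsedSeconds = (performance.now() - startedAt) / 1000;
  const failed = results.filter((result) => result.status === "rejected").length;
  return { completed: count - failed, failed, elapsedSeconds };
}

// Hypothetical request function: posts one DOCX Blob to the /pdf endpoint.
// fetch() does not reject on HTTP errors, so non-2xx responses are turned into failures.
async function sendConversion(docxBlob, baseUrl = "http://localhost:8080") {
  const form = new FormData();
  form.append("document", docxBlob, "file.docx");
  const response = await fetch(`${baseUrl}/pdf`, { method: "POST", body: form });
  if (!response.ok) {
    throw new Error(`HTTP ${response.status}`);
  }
  return response.blob();
}
```

With a DOCX Blob at hand, runStress(50, () => sendConversion(docxBlob)) resolves to the number of completed and failed requests plus the total elapsed time, which can then be compared across different pool.size values.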

If you have any performance observations, don't hesitate to open an issue.

License

MIT License

Author

Mohammad Alhaddar
