Commit 0cbc2fa

Update README with DocxodusEngine as recommended engine
Rewrite README to prominently feature Docxodus as the recommended comparison engine, with a link back to the Docxodus repo. Reorganize sections around the dual-engine architecture and add a quick example.
1 parent 48269d9 commit 0cbc2fa

1 file changed

+119
-95
lines changed

README.md

Lines changed: 119 additions & 95 deletions
Original file line number | Diff line number | Diff line change
@@ -3,156 +3,180 @@
33
## Project Goal - Democratizing DOCX Comparisons
44

55
The main goal of this project is to address the significant gap in the open-source ecosystem around `.docx` document
6-
comparison tools. Currently, the process of comparing and generating redline documents (documents that highlight
7-
changes between versions) is complex and largely dominated by commercial software. These
8-
tools, while effective, often come with cost barriers and limitations in terms of accessibility and integration
6+
comparison tools. Currently, the process of comparing and generating redline documents (documents that highlight
7+
changes between versions) is complex and largely dominated by commercial software. These
8+
tools, while effective, often come with cost barriers and limitations in terms of accessibility and integration
99
flexibility.
1010

11-
`Python-redlines` aims to democratize the ability to run tracked change redlines for .docx, providing the
11+
`Python-redlines` aims to democratize the ability to run tracked change redlines for .docx, providing the
1212
open-source community with a tool to create `.docx` redlines without the need for commercial software. This will let
1313
more legal hackers and hobbyist innovators experiment and create tooling for enterprise and legal.
1414

15-
## Project Roadmap
15+
## Comparison Engines
1616

17-
### Step 1. Open-XML-PowerTools `WmlComparer` Wrapper
17+
Python-Redlines ships with **two comparison engines** — choose the one that best fits your needs:
1818

19-
The [Open-XML-PowerTools](https://github.com/OpenXmlDev/Open-Xml-PowerTools) project historically offered a solid
20-
foundation for working with `.docx` files and has an excellent (if imperfect) comparison engine in its `WmlComparer`
21-
class. However, Microsoft archived the repository almost five years ago, and a forked repo is not being actively
22-
maintained, as its most recent commits date from 2 years ago and the repo issues list is disabled.
19+
### `DocxodusEngine` — Recommended
2320

24-
As a first step, our project aims to bring the existing capabilities of WmlCompare into the Python world. Thankfully,
25-
XML Power Tools is fully cross-platform as it is written in .NET and compiles with the still-maintained .NET 8. The
26-
resulting binaries can be compiled for the latest versions of Windows, OSX and Linux (Ubuntu specifically, though other
27-
distributions should work fine too). We have included an OSX build but do not have an OSX machine to test on. Please
28-
report any issues by opening a new Issue.
21+
**[Docxodus](https://github.com/JSv4/Docxodus)** is a modernized .NET 8.0 fork of Open-XML-PowerTools with
22+
significant improvements:
2923

30-
The initial release has a single engine `XmlPowerToolsEngine`, which is just a Python wrapper for a simple C# utility
31-
written to leverage WmlComparer for 1-to-1 redlines. We hope this provides a stop-gap capability to Python developers
32-
seeking .docx redline capabilities.
24+
- **Move detection** — identifies content that was moved rather than deleted and re-inserted
25+
- **Format change detection** — detects changes to bold, italic, font size, and other run properties
26+
- **Better table handling** — LCS-based row matching for large tables
27+
- **Actively maintained** — regular bug fixes and new features
28+
- **Open XML SDK 3.x compatible** — uses the latest SDK version
3329

34-
**Note**, we don't plan to fork or maintain Open-XML-PowerTools. [Version 4.4.0](https://www.nuget.org/packages/Open-Xml-PowerTools/),
35-
which appears to only be compatible with [Open XML SDK < 3.0.0](https://www.nuget.org/packages/DocumentFormat.OpenXml), works
36-
for now, but it needs to be made compatible with the latest versions of the Open XML SDK to extend its life. **There are
37-
also some [issues](https://github.com/dotnet/Open-XML-SDK/issues/1634)**, and it seems the only maintainer of
38-
Open-XML-PowerTools probably won't fix, and understanding the existing code base is no small task. Please be aware that
39-
**Open XML PowerTools is not a perfect comparison engine, but it will work for many purposes. Use at your own risk.**
30+
```python
31+
from python_redlines import DocxodusEngine
4032

41-
### Step 2. Pure Python Comparison Engine
42-
43-
Looking towards the future, rather than reverse engineer `WmlComparer` and maintain a C# codebase, we envision a
44-
comparison engine written in python. We've done some experimentation with [`xmldiff`](https://github.com/Shoobx/xmldiff)
45-
as the engine to compare the underlying xml of docx files. Specifically, we've built a prototype to unzip `.docx` files,
46-
execute an xml comparison using `xmldiff`, and then reconstructed a tracked changes docx with the proper Open XML
47-
(ooxml) tracked change tags. Preliminary experimentation with this approach has shown promise, indicating its
48-
feasibility for handling modifications such as simple span inserts and deletes.
49-
50-
However, this ambitious endeavor is not without its challenges. The intricacies of `.docx` files and the potential for
51-
complex, corner-case scenarios necessitate a thoughtful and thorough development process. In the interim, `WmlComparer`
52-
is a great solution as it has clearly been built to account for many such corner cases, through a development process
53-
that clearly was influenced by issues discovered by a large user base. The XMLDiff engine will take some time to reach
54-
a level of maturity similar to WmlComparer. At the moment it is NOT included.
33+
engine = DocxodusEngine()
34+
redline_bytes, stdout, stderr = engine.run_redline("AuthorName", original_bytes, modified_bytes)
35+
```
5536

56-
## Getting started
37+
### `XmlPowerToolsEngine` — Legacy
5738

58-
### Install .NET Core 8
39+
Wraps the original [Open-XML-PowerTools](https://github.com/OpenXmlDev/Open-Xml-PowerTools) `WmlComparer`. This
40+
engine remains available for backward compatibility and for users who prefer the original comparison behavior.
5941

60-
The Open-XML-PowerTools engine we're using in the initial releases requires .NET to run (don't worry, this is very
61-
well-supported cross-platform at the moment). Our builds are targeting x86-64 Linux and Windows, however, so you'll
62-
need to modify the build script and build new binaries if you want to target another runtime / architecture.
42+
```python
43+
from python_redlines import XmlPowerToolsEngine
6344

64-
#### On Linux
45+
engine = XmlPowerToolsEngine()
46+
redline_bytes, stdout, stderr = engine.run_redline("AuthorName", original_bytes, modified_bytes)
47+
```
6548

66-
You can follow [Microsoft's instructions for your Linux distribution](https://learn.microsoft.com/en-us/dotnet/core/install/linux)
49+
> **Note:** Open-XML-PowerTools was archived by Microsoft and is no longer maintained. It uses an older
50+
> version of the Open XML SDK. While it works for many purposes, Docxodus is the recommended engine going forward.
6751
68-
#### On Windows
52+
Both engines share the same API — the only difference is the class you instantiate and the stdout format
53+
(see [Stdout Differences](#stdout-differences) below).
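
Because the README states that both engines expose the same `run_redline` call, file handling can be written once and reused with either engine. The following is a minimal sketch under that assumption; `redline_files` is a hypothetical helper, not part of `python_redlines`, and it only relies on the three-value return shown in the examples above.

```python
from pathlib import Path
from typing import Any, Tuple


def redline_files(
    engine: Any,
    author_tag: str,
    original_path: str,
    modified_path: str,
    output_path: str,
) -> Tuple[Any, Any]:
    """Compare two .docx files with any engine exposing run_redline().

    Hypothetical helper: `engine` is duck-typed and is expected to return
    (redline_bytes, stdout, stderr), as in the README examples.
    """
    # Read both documents as raw bytes, which is what run_redline() is shown accepting.
    original = Path(original_path).read_bytes()
    modified = Path(modified_path).read_bytes()

    redline_bytes, stdout, stderr = engine.run_redline(author_tag, original, modified)

    # Persist the tracked-changes document and hand the engine output back to the caller.
    Path(output_path).write_bytes(redline_bytes)
    return stdout, stderr
```

Passing `DocxodusEngine()` or `XmlPowerToolsEngine()` as `engine` should then be the only line that changes between the two.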
6954

70-
You can follow [Microsoft's instructions for your Windows version](https://learn.microsoft.com/en-us/dotnet/core/install/windows?tabs=net80)
55+
## Getting Started
7156

7257
### Install the Library
7358

74-
At the moment, we are not distributing via pypi. You can easily install directly from this repo, however.
75-
7659
```commandline
7760
pip install git+https://github.com/JSv4/Python-Redlines
7861
```
7962

80-
You can add this as a dependency like so
63+
You can add this as a dependency like so:
8164

8265
```requirements
83-
python_redlines @ git+https://github.com/JSv4/Python-Redlines@v0.0.1
66+
python_redlines @ git+https://github.com/JSv4/Python-Redlines@v0.0.4
8467
```
8568

8669
### Use the Library
8770

8871
If you just want to use the tool, jump into our [quickstart guide](docs/quickstart.md).
8972

90-
## Architecture Overview
73+
### Quick Example
74+
75+
```python
76+
from python_redlines import DocxodusEngine
9177

92-
`XmlPowerToolsEngine` is a Python wrapper class for the `redlines` C# command-line tool, source of which is available in
93-
[./csproj/Program.cs](./csproj/Program.cs). The redlines utility and wrapper let you compare two docx files and
94-
show the differences in tracked changes (a "redline" document).
78+
# Load your documents as bytes
79+
with open("original.docx", "rb") as f:
80+
original = f.read()
81+
with open("modified.docx", "rb") as f:
82+
modified = f.read()
9583

96-
### C# Functionality
84+
# Generate a redline document
85+
engine = DocxodusEngine()
86+
redline_bytes, stdout, stderr = engine.run_redline("Reviewer", original, modified)
9787

98-
The `redlines` C# utility is a command line tool that requires four arguments:
99-
1. `author_tag` - A tag to identify the author of the changes.
100-
2. `original_path.docx` - Path to the original document.
101-
3. `modified_path.docx` - Path to the modified document.
102-
4. `redline_path.docx` - Path where the redlined document will be saved.
88+
# Save the result
89+
with open("redline.docx", "wb") as f:
90+
f.write(redline_bytes)
10391

104-
The Python wrapper, `XmlPowerToolsEngine` and its main method `run_redline()`, simplifies the use of `redlines` by
105-
orchestrating its execution with Python and letting you pass in bytes or file paths for the original and modified
106-
documents.
92+
print(stdout) # e.g. "Redline complete: 9 revision(s) found"
93+
```
94+
95+
## Architecture Overview
10796

108-
### Packaging
97+
Both engines follow the same pattern: a Python wrapper class invokes a self-contained C# binary via subprocess.
98+
The binary takes four arguments: `<author_tag> <original.docx> <modified.docx> <output.docx>`.
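
To make the wrapper pattern concrete, here is a rough sketch of how a Python class might call such a binary with the four arguments listed above. It is an illustration only, not the code in `engines.py`: the function names, the temporary file names, and the `binary_path` parameter are all assumptions.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import List, Tuple


def build_redline_command(
    binary_path: str,
    author_tag: str,
    original_path: str,
    modified_path: str,
    output_path: str,
) -> List[str]:
    """Assemble argv in the order: <author_tag> <original> <modified> <output>."""
    return [str(binary_path), author_tag, str(original_path), str(modified_path), str(output_path)]


def run_redline_binary(
    binary_path: str, author_tag: str, original: bytes, modified: bytes
) -> Tuple[bytes, str, str]:
    """Hypothetical runner: write inputs to temp files, invoke the binary, read the result."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp_dir = Path(tmp)
        original_path = tmp_dir / "original.docx"
        modified_path = tmp_dir / "modified.docx"
        output_path = tmp_dir / "redline.docx"
        original_path.write_bytes(original)
        modified_path.write_bytes(modified)

        command = build_redline_command(
            binary_path, author_tag, str(original_path), str(modified_path), str(output_path)
        )
        # check=True raises CalledProcessError if the binary exits with a non-zero status.
        completed = subprocess.run(command, capture_output=True, text=True, check=True)
        return output_path.read_bytes(), completed.stdout, completed.stderr
```

Passing the arguments as a list rather than a shell string avoids quoting problems when an author tag or path contains spaces.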
10999

110-
The project is structured as follows:
111100
```
112101
python-redlines/
113102
114-
├── csproj/
115-
│ ├── bin/
116-
│ ├── obj/
103+
├── csproj/ # XmlPowerTools C# source
117104
│ ├── Program.cs
118-
│ ├── redlines.csproj
119-
│ └── redlines.sln
105+
│ └── redlines.csproj
120106
121-
├── docs/
122-
│ ├── developer-guide.md
123-
│ └── quickstart.md
107+
├── docxodus/ # Docxodus git submodule
108+
│ └── tools/redline/
109+
│ ├── Program.cs
110+
│ └── redline.csproj
124111
125112
├── src/
126113
│ └── python_redlines/
127-
│ ├── bin/
128-
│ │ └── .gitignore
129-
│ ├── dist/
130-
│ │ ├── .gitignore
131-
│ │ ├── linux-x64-0.0.1.tar.gz
132-
│ │ └── win-x64-0.0.1.zip
114+
│ ├── engines.py # BaseEngine, XmlPowerToolsEngine, DocxodusEngine
115+
│ ├── dist/ # XmlPowerTools compressed binaries
116+
│ ├── dist_docxodus/ # Docxodus compressed binaries
117+
│ ├── bin/ # XmlPowerTools extracted binaries (runtime)
118+
│ ├── bin_docxodus/ # Docxodus extracted binaries (runtime)
133119
│ ├── __about__.py
134-
│ ├── __init__.py
135-
│ └── engines.py
120+
│ └── __init__.py
136121
137122
├── tests/
138-
| ├── fixtures/
139-
| ├── test_openxml_differ.py
140-
| └── __init__.py
141-
|
142-
├── .gitignore
143-
├── build_differ.py
144-
├── extract_version.py
145-
├── License.md
123+
│ ├── fixtures/
124+
│ ├── test_openxml_differ.py # XmlPowerTools integration test
125+
│ ├── test_docxodus_engine.py # Docxodus integration test
126+
│ └── test_engine_contract.py # Shared contract tests for both engines
127+
128+
├── build_differ.py # Builds both engines for all platforms
146129
├── pyproject.toml
147130
└── README.md
148131
```
149132

150-
- `src/your_package/`: Contains the Python wrapper code.
151-
- `dist/`: Contains the zipped C# binaries for different platforms.
152-
- `bin/`: Target directory for extracted binaries.
153-
- `tests/`: Contains test cases and fixtures for the wrapper.
133+
Pre-compiled binaries for 6 platform targets (linux/win/osx × x64/arm64) are bundled in the wheel for each engine.
134+
On first use, the appropriate binary is extracted and cached.
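
Choosing "the appropriate binary" comes down to mapping the running OS and CPU architecture onto one of the six targets. The sketch below shows one way that mapping could look. The `<os>-<arch>` naming follows the archive names in the old tree (`linux-x64`, `win-x64`), but `runtime_target` and its lookup tables are assumptions, not the project's actual selection logic.

```python
import platform
from typing import Optional

# Assumed mappings from platform.system() / platform.machine() values to target names.
_OS_NAMES = {"linux": "linux", "windows": "win", "darwin": "osx"}
_ARCH_NAMES = {"x86_64": "x64", "amd64": "x64", "arm64": "arm64", "aarch64": "arm64"}


def runtime_target(system: Optional[str] = None, machine: Optional[str] = None) -> str:
    """Return a target identifier such as 'linux-x64' for the given or current platform."""
    system = (system or platform.system()).lower()
    machine = (machine or platform.machine()).lower()
    try:
        return f"{_OS_NAMES[system]}-{_ARCH_NAMES[machine]}"
    except KeyError:
        # Anything outside the 3 x 2 matrix has no bundled binary in this sketch.
        raise RuntimeError(f"Unsupported platform: {system}/{machine}") from None
```

Called with no arguments, the function inspects the current machine; the optional parameters exist so the mapping can be checked without access to every platform.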
135+
136+
### Stdout Differences
137+
138+
The two engines produce slightly different stdout messages:
139+
140+
| Engine | Example stdout |
141+
|---|---|
142+
| `XmlPowerToolsEngine` | `Revisions found: 9` |
143+
| `DocxodusEngine` | `Redline complete: 9 revision(s) found` |
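
Code that needs the revision count from either engine has to handle both message shapes. A small sketch, assuming `stdout` arrives as a string containing one of the two example messages from the table; `parse_revision_count` is a hypothetical helper and the patterns are inferred from those examples only.

```python
import re
from typing import Optional

# One pattern per engine, based solely on the example messages in the table.
_REVISION_PATTERNS = (
    re.compile(r"Revisions found:\s*(\d+)"),                         # XmlPowerToolsEngine
    re.compile(r"Redline complete:\s*(\d+)\s+revision\(s\) found"),  # DocxodusEngine
)


def parse_revision_count(stdout: str) -> Optional[int]:
    """Return the revision count reported in stdout, or None if no known message matches."""
    for pattern in _REVISION_PATTERNS:
        match = pattern.search(stdout)
        if match:
            return int(match.group(1))
    return None
```

Returning `None` rather than raising keeps the helper usable if an engine's message format changes in a later release.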
144+
145+
## Development
146+
147+
### Prerequisites
148+
149+
- Python 3.8+
150+
- .NET 8.0 SDK (for building C# binaries)
151+
152+
### Setup
153+
154+
```bash
155+
# Clone with submodules
156+
git clone --recurse-submodules https://github.com/JSv4/Python-Redlines
157+
cd Python-Redlines
158+
159+
# If you already cloned without submodules
160+
git submodule update --init --recursive
161+
```
162+
163+
### Commands
164+
165+
```bash
166+
# Run tests
167+
hatch run test
168+
169+
# Run a single test
170+
hatch run test tests/test_openxml_differ.py::test_run_redlines_with_real_files
171+
172+
# Build C# binaries for all platforms
173+
hatch run build
174+
175+
# Build Python package
176+
hatch build
177+
```
154178

155-
### Detailed Explanation and Dev Setup
179+
### Detailed Dev Setup
156180

157181
If you want to contribute to the library or want to dive into some of the C# packaging architecture, go to our
158182
[developer guide](docs/developer-guide.md).
