|
3 | 3 | ## Project Goal - Democratizing DOCX Comparisons |
4 | 4 |
|
5 | 5 | The main goal of this project is to address the significant gap in the open-source ecosystem around `.docx` document |
6 | | -comparison tools. Currently, the process of comparing and generating redline documents (documents that highlight |
7 | | -changes between versions) is complex and largely dominated by commercial software. These |
8 | | -tools, while effective, often come with cost barriers and limitations in terms of accessibility and integration |
| 6 | +comparison tools. Currently, the process of comparing and generating redline documents (documents that highlight |
| 7 | +changes between versions) is complex and largely dominated by commercial software. These |
| 8 | +tools, while effective, often come with cost barriers and limitations in terms of accessibility and integration |
9 | 9 | flexibility. |
10 | 10 |
|
11 | | -`Python-redlines` aims to democratize the ability to run tracked change redlines for .docx, providing the |
| 11 | +`Python-redlines` aims to democratize the ability to run tracked change redlines for .docx, providing the |
12 | 12 | open-source community with a tool to create `.docx` redlines without the need for commercial software. This will let |
13 | 13 | more legal hackers and hobbyist innovators experiment and create tooling for enterprise and legal. |
14 | 14 |
|
15 | | -## Project Roadmap |
| 15 | +## Comparison Engines |
16 | 16 |
|
17 | | -### Step 1. Open-XML-PowerTools `WmlComparer` Wrapper |
| 17 | +Python-Redlines ships with **two comparison engines** — choose the one that best fits your needs: |
18 | 18 |
|
19 | | -The [Open-XML-PowerTools](https://github.com/OpenXmlDev/Open-Xml-PowerTools) project historically offered a solid |
20 | | -foundation for working with `.docx` files and has an excellent (if imperfect) comparison engine in its `WmlComparer` |
21 | | -class. However, Microsoft archived the repository almost five years ago, and a forked repo is not being actively |
22 | | -maintained, as its most recent commits dates from 2 years ago and the repo issues list is disabled. |
| 19 | +### `DocxodusEngine` — Recommended |
23 | 20 |
|
24 | | -As a first step, our project aims to bring the existing capabilities of WmlCompare into the Python world. Thankfully, |
25 | | -XML Power Tools is full cross-platform as it is written in .NET and compiles with the still-maintained .NET 8. The |
26 | | -resulting binaries can be compiled for the latest versions of Windows, OSX and Linux (Ubuntu specifically, though other |
27 | | -distributions should work fine too). We have included an OSX build but do not have an OSX machine to test on. Please |
28 | | -report an issues by opening a new Issue. |
| 21 | +**[Docxodus](https://github.com/JSv4/Docxodus)** is a modernized .NET 8.0 fork of Open-XML-PowerTools with |
| 22 | +significant improvements: |
29 | 23 |
|
30 | | -The initial release has a single engine `XmlPowerToolsEngine`, which is just a Python wrapper for a simple C# utility |
31 | | -written to leverage WmlComparer for 1-to-1 redlines. We hope this provides a stop-gap capability to Python developers |
32 | | -seeking .docx redline capabilities. |
| 24 | +- **Move detection** — identifies content that was moved rather than deleted and re-inserted |
| 25 | +- **Format change detection** — detects changes to bold, italic, font size, and other run properties |
| 26 | +- **Better table handling** — LCS-based row matching for large tables |
| 27 | +- **Actively maintained** — regular bug fixes and new features |
| 28 | +- **Open XML SDK 3.x compatible** — uses the latest SDK version |
33 | 29 |
|
34 | | -**Note**, we don't plan to fork or maintain Open-XML-PowerTools. [Version 4.4.0](https://www.nuget.org/packages/Open-Xml-PowerTools/), |
35 | | -which appears to only be compatible with [Open XML SDK < 3.0.0](https://www.nuget.org/packages/DocumentFormat.OpenXml) works |
36 | | -for now, it needs to be made compatible with the latest versions of the Open XML SDK to extend its life. **There are |
37 | | -also some [issues](https://github.com/dotnet/Open-XML-SDK/issues/1634)**, and it seems the only maintainer of |
38 | | -Open-XML-PowerTools probably won't fix, and understanding the existing code base is no small task. Please be aware that |
39 | | -**Open XML PowerTools is not a perfect comparison engine, but it will work for many purposes. Use at your own risk.** |
| 30 | +```python |
| 31 | +from python_redlines import DocxodusEngine |
40 | 32 |
|
41 | | -### Step 2. Pure Python Comparison Engine |
42 | | - |
43 | | -Looking towards the future, rather than reverse engineer `WmlComparer` and maintain a C# codebase, we envision a |
44 | | -comparison engine written in python. We've done some experimentation with [`xmldiff`](https://github.com/Shoobx/xmldiff) |
45 | | -as the engine to compare the underlying xml of docx files. Specifically, we've built a prototype to unzip `.docx` files, |
46 | | -execute an xml comparison using `xmldiff`, and then reconstructed a tracked changes docx with the proper Open XML |
47 | | -(ooxml) tracked change tags. Preliminary experimentation with this approach has shown promise, indicating its |
48 | | -feasibility for handling modifications such as simple span inserts and deletes. |
49 | | - |
50 | | -However, this ambitious endeavor is not without its challenges. The intricacies of `.docx` files and the potential for |
51 | | -complex, corner-case scenarios necessitate a thoughtful and thorough development process. In the interim, `WmlComparer` |
52 | | -is a great solution as it has clearly been built to account for many such corner cases, through a development process |
53 | | -that clearly was influenced by issues discovered by a large user base. The XMLDiff engine will take some time to reach |
54 | | -a level of maturity similar to WmlComparer. At the moment it is NOT included. |
| 33 | +engine = DocxodusEngine() |
| 34 | +redline_bytes, stdout, stderr = engine.run_redline("AuthorName", original_bytes, modified_bytes) |
| 35 | +``` |
55 | 36 |
|
56 | | -## Getting started |
| 37 | +### `XmlPowerToolsEngine` — Legacy |
57 | 38 |
|
58 | | -### Install .NET Core 8 |
| 39 | +Wraps the original [Open-XML-PowerTools](https://github.com/OpenXmlDev/Open-Xml-PowerTools) `WmlComparer`. This |
| 40 | +engine remains available for backward compatibility and for users who prefer the original comparison behavior. |
59 | 41 |
|
60 | | -The Open-XML-PowerTools engine we're using in the initial releases requires .NET to run (don't worry, this is very |
61 | | -well-supported cross-platform at the moment). Our builds are targeting x86-64 Linux and Windows, however, so you'll |
62 | | -need to modify the build script and build new binaries if you want to target another runtime / architecture. |
| 42 | +```python |
| 43 | +from python_redlines import XmlPowerToolsEngine |
63 | 44 |
|
64 | | -#### On Linux |
| 45 | +engine = XmlPowerToolsEngine() |
| 46 | +redline_bytes, stdout, stderr = engine.run_redline("AuthorName", original_bytes, modified_bytes) |
| 47 | +``` |
65 | 48 |
|
66 | | -You can follow [Microsoft's instructions for your Linux distribution](https://learn.microsoft.com/en-us/dotnet/core/install/linux) |
| 49 | +> **Note:** Open-XML-PowerTools was archived by Microsoft and is no longer maintained. It uses an older |
| 50 | +> version of the Open XML SDK. While it works for many purposes, Docxodus is the recommended engine going forward. |
67 | 51 |
|
68 | | -#### On Windows |
| 52 | +Both engines share the same API — the only difference is the class you instantiate and the stdout format |
| 53 | +(see [Stdout Differences](#stdout-differences) below). |
69 | 54 |
|
70 | | -You can follow [Microsoft's instructions for your Windows vesrion](https://learn.microsoft.com/en-us/dotnet/core/install/windows?tabs=net80) |
| 55 | +## Getting Started |
71 | 56 |
|
72 | 57 | ### Install the Library |
73 | 58 |
|
74 | | -At the moment, we are not distributing via pypi. You can easily install directly from this repo, however. |
75 | | - |
76 | 59 | ```commandline |
77 | 60 | pip install git+https://github.com/JSv4/Python-Redlines |
78 | 61 | ``` |
79 | 62 |
|
80 | | -You can add this as a dependency like so |
| 63 | +You can add this as a dependency like so: |
81 | 64 |
|
82 | 65 | ```requirements |
83 | | -python_redlines @ git+https://github.com/JSv4/Python-Redlines@v0.0.1 |
| 66 | +python_redlines @ git+https://github.com/JSv4/Python-Redlines@v0.0.4 |
84 | 67 | ``` |
85 | 68 |
|
86 | 69 | ### Use the Library |
87 | 70 |
|
88 | 71 | If you just want to use the tool, jump into our [quickstart guide](docs/quickstart.md). |
89 | 72 |
|
90 | | -## Architecture Overview |
| 73 | +### Quick Example |
| 74 | + |
| 75 | +```python |
| 76 | +from python_redlines import DocxodusEngine |
91 | 77 |
|
92 | | -`XmlPowerToolsEngine` is a Python wrapper class for the `redlines` C# command-line tool, source of which is available in |
93 | | -[./csproj/Program.cs](./csproj/Program.cs). The redlines utility and wrapper let you compare two docx files and |
94 | | -show the differences in tracked changes (a "redline" document). |
| 78 | +# Load your documents as bytes |
| 79 | +with open("original.docx", "rb") as f: |
| 80 | + original = f.read() |
| 81 | +with open("modified.docx", "rb") as f: |
| 82 | + modified = f.read() |
95 | 83 |
|
96 | | -### C# Functionality |
| 84 | +# Generate a redline document |
| 85 | +engine = DocxodusEngine() |
| 86 | +redline_bytes, stdout, stderr = engine.run_redline("Reviewer", original, modified) |
97 | 87 |
|
98 | | -The `redlines` C# utility is a command line tool that requires four arguments: |
99 | | -1. `author_tag` - A tag to identify the author of the changes. |
100 | | -2. `original_path.docx` - Path to the original document. |
101 | | -3. `modified_path.docx` - Path to the modified document. |
102 | | -4. `redline_path.docx` - Path where the redlined document will be saved. |
| 88 | +# Save the result |
| 89 | +with open("redline.docx", "wb") as f: |
| 90 | + f.write(redline_bytes) |
103 | 91 |
|
104 | | -The Python wrapper, `XmlPowerToolsEngine` and its main method `run_redline()`, simplifies the use of `redlines` by |
105 | | -orchestrating its execution with Python and letting you pass in bytes or file paths for the original and modified |
106 | | -documents. |
| 92 | +print(stdout) # e.g. "Redline complete: 9 revision(s) found" |
| 93 | +``` |
| 94 | + |
| 95 | +## Architecture Overview |
107 | 96 |
|
108 | | -### Packaging |
| 97 | +Both engines follow the same pattern: a Python wrapper class invokes a self-contained C# binary via subprocess. |
| 98 | +The binary takes four arguments: `<author_tag> <original.docx> <modified.docx> <output.docx>`. |
109 | 99 |
|
110 | | -The project is structured as follows: |
111 | 100 | ``` |
112 | 101 | python-redlines/ |
113 | 102 | │ |
114 | | -├── csproj/ |
115 | | -│ ├── bin/ |
116 | | -│ ├── obj/ |
| 103 | +├── csproj/ # XmlPowerTools C# source |
117 | 104 | │ ├── Program.cs |
118 | | -│ ├── redlines.csproj |
119 | | -│ └── redlines.sln |
| 105 | +│ └── redlines.csproj |
120 | 106 | │ |
121 | | -├── docs/ |
122 | | -│ ├── developer-guide.md |
123 | | -│ └── quickstart.md |
| 107 | +├── docxodus/ # Docxodus git submodule |
| 108 | +│ └── tools/redline/ |
| 109 | +│ ├── Program.cs |
| 110 | +│ └── redline.csproj |
124 | 111 | │ |
125 | 112 | ├── src/ |
126 | 113 | │ └── python_redlines/ |
127 | | -│ ├── bin/ |
128 | | -│ │ └── .gitignore |
129 | | -│ ├── dist/ |
130 | | -│ │ ├── .gitignore |
131 | | -│ │ ├── linux-x64-0.0.1.tar.gz |
132 | | -│ │ └── win-x64-0.0.1.zip |
| 114 | +│ ├── engines.py # BaseEngine, XmlPowerToolsEngine, DocxodusEngine |
| 115 | +│ ├── dist/ # XmlPowerTools compressed binaries |
| 116 | +│ ├── dist_docxodus/ # Docxodus compressed binaries |
| 117 | +│ ├── bin/ # XmlPowerTools extracted binaries (runtime) |
| 118 | +│ ├── bin_docxodus/ # Docxodus extracted binaries (runtime) |
133 | 119 | │ ├── __about__.py |
134 | | -│ ├── __init__.py |
135 | | -│ └── engines.py |
| 120 | +│ └── __init__.py |
136 | 121 | │ |
137 | 122 | ├── tests/ |
138 | | -| ├── fixtures/ |
139 | | -| ├── test_openxml_differ.py |
140 | | -| └── __init__.py |
141 | | -| |
142 | | -├── .gitignore |
143 | | -├── build_differ.py |
144 | | -├── extract_version.py |
145 | | -├── License.md |
| 123 | +│ ├── fixtures/ |
| 124 | +│ ├── test_openxml_differ.py # XmlPowerTools integration test |
| 125 | +│ ├── test_docxodus_engine.py # Docxodus integration test |
| 126 | +│ └── test_engine_contract.py # Shared contract tests for both engines |
| 127 | +│ |
| 128 | +├── build_differ.py # Builds both engines for all platforms |
146 | 129 | ├── pyproject.toml |
147 | 130 | └── README.md |
148 | 131 | ``` |
149 | 132 |
|
150 | | -- `src/your_package/`: Contains the Python wrapper code. |
151 | | -- `dist/`: Contains the zipped C# binaries for different platforms. |
152 | | -- `bin/`: Target directory for extracted binaries. |
153 | | -- `tests/`: Contains test cases and fixtures for the wrapper. |
| 133 | +Pre-compiled binaries for 6 platform targets (linux, win, and osx, each for x64 and arm64) are bundled in the wheel for each engine.
| 134 | +On first use, the appropriate binary is extracted and cached. |
| 135 | + |
| 136 | +### Stdout Differences |
| 137 | + |
| 138 | +The two engines produce slightly different stdout messages: |
| 139 | + |
| 140 | +| Engine | Example stdout | |
| 141 | +|---|---| |
| 142 | +| `XmlPowerToolsEngine` | `Revisions found: 9` | |
| 143 | +| `DocxodusEngine` | `Redline complete: 9 revision(s) found` | |
| 144 | + |
| 145 | +## Development |
| 146 | + |
| 147 | +### Prerequisites |
| 148 | + |
| 149 | +- Python 3.8+ |
| 150 | +- .NET 8.0 SDK (for building C# binaries) |
| 151 | + |
| 152 | +### Setup |
| 153 | + |
| 154 | +```bash |
| 155 | +# Clone with submodules |
| 156 | +git clone --recurse-submodules https://github.com/JSv4/Python-Redlines |
| 157 | +cd Python-Redlines |
| 158 | + |
| 159 | +# If you already cloned without submodules |
| 160 | +git submodule update --init --recursive |
| 161 | +``` |
| 162 | + |
| 163 | +### Commands |
| 164 | + |
| 165 | +```bash |
| 166 | +# Run tests |
| 167 | +hatch run test |
| 168 | + |
| 169 | +# Run a single test |
| 170 | +hatch run test tests/test_openxml_differ.py::test_run_redlines_with_real_files |
| 171 | + |
| 172 | +# Build C# binaries for all platforms |
| 173 | +hatch run build |
| 174 | + |
| 175 | +# Build Python package |
| 176 | +hatch build |
| 177 | +``` |
154 | 178 |
|
155 | | -### Detailed Explanation and Dev Setup |
| 179 | +### Detailed Dev Setup |
156 | 180 |
|
157 | 181 | If you want to contribute to the library or want to dive into some of the C# packaging architecture, go to our |
158 | 182 | [developer guide](docs/developer-guide.md). |
|
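The new "Stdout Differences" section says the two engines report their revision counts in different formats. A caller that switches engines may want to read that count without caring which engine produced it. The sketch below is one possible way to do that. It is not part of `python_redlines`: the helper name `parse_revision_count` is hypothetical, the patterns are derived only from the two example stdout lines in the README table, and accepting either `str` or `bytes` for `stdout` is an assumption.

```python
import re
from typing import Optional, Union

# Patterns derived from the two example stdout lines in the README table.
# Real engine output may contain other text; these are assumptions.
_REVISION_PATTERNS = (
    re.compile(r"Revisions found:\s*(\d+)"),                         # XmlPowerToolsEngine example
    re.compile(r"Redline complete:\s*(\d+)\s+revision\(s\) found"),  # DocxodusEngine example
)


def parse_revision_count(stdout: Union[str, bytes, None]) -> Optional[int]:
    """Return the revision count reported in stdout, or None if no known pattern matches.

    Hypothetical helper, not part of python_redlines.
    """
    if stdout is None:
        return None
    if isinstance(stdout, bytes):
        # Assumption: output is UTF-8; undecodable bytes are replaced, not raised.
        stdout = stdout.decode("utf-8", errors="replace")
    for pattern in _REVISION_PATTERNS:
        match = pattern.search(stdout)
        if match:
            return int(match.group(1))
    return None
```

With either engine, usage would look like `count = parse_revision_count(stdout)` on the second element of the tuple returned by `engine.run_redline(...)`. Returning `None` rather than raising lets the caller decide whether an unrecognized message is an error.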