This is a basic wrapper for multiple Dutch compound splitters. The purpose of this wrapper is to:
- provide a unified API for multiple compound splitters. The package offers a simple socket server and a Flask application for this purpose.
- evaluate the accuracy of different compound splitters
The package was initially developed for T-scan, a natural language analysis application intended for research. For T-scan, we required that users could choose between different algorithms (hence the need for a unified API), and some evaluation of the quality of those algorithms.
The resulting package is useful if you want to run a compound splitting service (e.g. as part of an API or web application), or if you want to evaluate compound splitting methods. Adding new methods, even ones that are not Python packages, should be feasible if you have programming experience.
If you are looking for a simple, lightweight Python package for compound splitting, this is not it. compound-word-splitter may be a good alternative for you.
The following compound splitters are included:
- compound-splitter-nl, developed by Katja Hofmann, Valentin Jijkoun, Jaap Kamps, and Christof Monz (LGPL-3.0 license). See https://web.archive.org/web/20200813005715/https://ilps.science.uva.nl/resources/compound-splitter-nl/ for the archived website and https://github.com/bminixhofer/ilps-nl-splitter for an archive of the source code.
- SECOS, developed by Martin Riedl and Chris Biemann (Apache-2.0 license). See https://github.com/riedlma/SECOS
- MCS, developed by Patrick Ziering. See https://www.ims.uni-stuttgart.de/en/research/resources/tools/mcs/
As a baseline, we also include a "never" algorithm, which never splits.
Requirements:
- Python 3.6+
- Java (only required for MCS)
compound-splitters-nl is available as a Python package, which includes the data for all included compound splitter methods. This complete package is too large to be registered on PyPI, but you can download it from our releases.
The archived package can be installed via pip by installing the local file:
pip install compound-splitters-nl-*.tar.gz
# or substitute with your file path
If you want to use the web API, you will need to install additional dependencies:
pip install compound-splitters-nl-*.tar.gz[web_api]
You can also clone the source code repository. In this case, you will still need to download and unpack the data needed for the compound splitter methods. Run installation with:
pip install -r requirements.txt
python retrieve.py
python prepare.py
To run the unit tests:
python -m unittest discover tests/
The following command evaluates the different algorithms using the reference files in test_sets.
python -m compound_splitter.evaluate
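To make the idea of evaluating against a baseline concrete, here is a minimal sketch of an exact-match accuracy measure applied to a "never" splitter. It is not the package's evaluation code: the function names, the reference words, and the choice of exact-match accuracy as the metric are all assumptions made for illustration, and the format of the files in test_sets may differ.

```python
from typing import Callable, Dict, List

# Hypothetical reference data (word -> expected parts).
# These entries are made up for illustration and are NOT taken from test_sets.
REFERENCE: Dict[str, List[str]] = {
    "fietsbel": ["fiets", "bel"],
    "voordeur": ["voor", "deur"],
    "huis": ["huis"],
}


def never_split(word: str) -> List[str]:
    """Baseline that always returns the word as a single part."""
    return [word]


def accuracy(
    splitter: Callable[[str], List[str]],
    reference: Dict[str, List[str]],
) -> float:
    """Fraction of words whose predicted parts equal the reference exactly."""
    if not reference:
        return 0.0
    correct = sum(
        1 for word, parts in reference.items() if splitter(word) == parts
    )
    return correct / len(reference)


if __name__ == "__main__":
    # Only "huis" is a non-compound, so the baseline gets 1 of 3 right.
    print(f"never: {accuracy(never_split, REFERENCE):.2f}")
```

Any real splitter should score above this baseline on a test set that contains compounds; a score near the baseline suggests the method rarely splits correctly.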
To start the Flask web API:
python -m compound_splitter.api_web
GET /list
Lists the splitting methods.
GET /split/<method_name>/<compound>
Splits the compound using the specified method.
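The two endpoints can be called from any HTTP client. The sketch below uses only the Python standard library. The base URL (Flask's default development port 5000 on localhost) is an assumption, as are the helper names; the response format is not documented here, so the body is returned as raw text rather than parsed.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: the Flask app listens on Flask's default development address.
BASE_URL = "http://localhost:5000"


def list_url(base_url: str = BASE_URL) -> str:
    """Build the URL for GET /list."""
    return f"{base_url.rstrip('/')}/list"


def split_url(method_name: str, compound: str, base_url: str = BASE_URL) -> str:
    """Build the URL for GET /split/<method_name>/<compound>.

    Both path segments are percent-encoded so that characters such as
    '/' or non-ASCII letters cannot break the route.
    """
    method_part = quote(method_name, safe="")
    compound_part = quote(compound, safe="")
    return f"{base_url.rstrip('/')}/split/{method_part}/{compound_part}"


def fetch(url: str, timeout: float = 10.0) -> str:
    """Perform a GET request and return the response body as text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Usage, with the web API running:
#   print(fetch(list_url()))
#   print(fetch(split_url("secos", "bedrijfsaansprakelijkheidsverzekering")))
```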
To start the socket server:
python -m compound_splitter.socket_server
$ telnet localhost 7005
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
bedrijfsaansprakelijkheidsverzekering,secos
bedrijfs,aansprakelijkheids,verzekering
Connection closed by foreign host.
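The same exchange can be scripted instead of typed into telnet. The sketch below mirrors what the session shows: send one line of the form compound,method, then read until the server closes the connection. Port 7005 comes from the session above; the newline terminator, the UTF-8 encoding, and the helper names are assumptions.

```python
import socket
from typing import List

HOST = "localhost"
PORT = 7005  # port used in the telnet session


def encode_request(compound: str, method: str) -> bytes:
    """Encode a request as 'compound,method' followed by a newline.

    Assumption: the server accepts a newline-terminated UTF-8 line,
    as telnet would send it.
    """
    return f"{compound},{method}\n".encode("utf-8")


def parse_response(data: bytes) -> List[str]:
    """Split the comma-separated reply into its parts."""
    text = data.decode("utf-8").strip()
    return text.split(",") if text else []


def split_compound(
    compound: str,
    method: str,
    host: str = HOST,
    port: int = PORT,
    timeout: float = 10.0,
) -> List[str]:
    """Send one request and read the reply until the server disconnects."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(encode_request(compound, method))
        chunks = []
        while True:
            chunk = conn.recv(4096)
            if not chunk:  # empty read: the server closed the connection
                break
            chunks.append(chunk)
    return parse_response(b"".join(chunks))


# Usage, with the socket server running:
#   split_compound("bedrijfsaansprakelijkheidsverzekering", "secos")
```

Because the server closes the connection after each reply, every call opens a fresh connection.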