A Docker image to lemmatize German texts.
Built upon:
- IWNLP: uses the crowd-generated token tables from the German Wiktionary (de.wiktionary.org).
- GermaLemma: looks up lemmas in the TIGER Corpus and uses Pattern as a fallback for some rule-based lemmatizations.
It works as follows. First, spaCy tags each token with its part of speech. Then German Lemmatizer looks the token up in IWNLP and GermaLemma. If both tools agree, or only one of them finds a lemma, that lemma is taken; if they disagree, the lemma from IWNLP is chosen. The casing of the original token is preserved where possible.
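The merge strategy described above can be sketched as plain Python. This is a minimal illustration, not the actual implementation; `choose_lemma` and its arguments are hypothetical stand-ins for the real IWNLP and GermaLemma lookups:

```python
def choose_lemma(token, iwnlp_lemma, germalemma_lemma):
    """Pick a lemma from the two candidate sources (either may be None).

    IWNLP wins on disagreement; if only one source finds a lemma, that
    one is taken; if neither does, the token itself is kept. The casing
    of the original token is preserved.
    """
    if iwnlp_lemma and germalemma_lemma:
        # Both sources answered: prefer IWNLP (also trivially correct on agreement).
        lemma = iwnlp_lemma
    else:
        # Take whichever source found something, else fall back to the token.
        lemma = iwnlp_lemma or germalemma_lemma or token
    # Preserve the original token's casing on the first character.
    if token[:1].isupper():
        return lemma[:1].upper() + lemma[1:]
    return lemma[:1].lower() + lemma[1:]
```

For example, `choose_lemma("ging", "gehen", "gehn")` returns `"gehen"` (IWNLP wins), while `choose_lemma("Häuser", None, "Haus")` returns `"Haus"` (only GermaLemma found a lemma).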
You may want to use the Python wrapper: German Lemmatizer
- Install Docker.
- Read and accept the license terms of the TIGER Corpus (free to use for non-commercial purposes).
- Start Docker.

To execute, you have two options:
- To lemmatize a string from the terminal, run:

```
docker run -it filter/german-lemmatizer:0.5.0 "Was ist das für ein Leben?" [--remove_stop]
```
- To lemmatize a collection of texts, mount two local folders into the Docker container (NB: you have to give absolute paths):

```
docker run -it -v $(pwd)/some_input_folder:/input -v $(pwd)/some_output_folder:/output filter/german-lemmatizer:0.5.0 [--line] [--escape] [--remove_stop]
```
The flags work as follows:

- `--line`: each line is treated as a single document instead of the whole file.
- `--escape`: newlines are escaped (`\n` -> `\\n`) within each document (one document per line), so the text in the input files has to be preprocessed accordingly.
- `--remove_stop`: removes stop words as defined by spaCy.
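With `--escape`, each multi-line document must be collapsed onto a single input line before lemmatization. A minimal sketch of the required pre- and post-processing (the helper names are illustrative, not part of the tool):

```python
def escape_document(text):
    """Escape newlines so a multi-line document fits on one input line,
    as expected by the --escape flag ('\n' -> '\\n')."""
    return text.replace("\n", "\\n")


def unescape_document(line):
    """Reverse the escaping after reading a line of lemmatized output."""
    return line.replace("\\n", "\n")
```

Round-tripping a document through `escape_document` and `unescape_document` restores the original text, so the newline structure survives lemmatization.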
Everything – all the code and all the data – is packaged into the Docker image. This means that every lemmatization is reproducible. I may update the code and/or data in the future, but each image is tagged with a specific version.
- Tried to base it on a Docker Alpine image, but there were too many installation hassles.
- Tried to parallelize with joblib, but it created too much overhead.
- To build an image, run

```
docker build -t lemma .
```

in this folder.
- For debugging purposes, you may want to enter the container and override the entry point:

```
docker run -it --entrypoint /bin/bash lemma
```
To publish a release, run

```
docker build -t filter/german-lemmatizer:0.5.0 .
```

and

```
docker push filter/german-lemmatizer:0.5.0
```
MIT.
This work was created as part of a project that was funded by the German Federal Ministry of Education and Research.