Simplifies the Sanakirju XML dataset for easier parsing.
pipenv install
pipenv run python main.py
This generates the simplified XML dataset in src/sanakirju_simplifier/build.
The original dataset Sanakirju uses is a huge, deeply nested set of XML files. Parsing it automatically with common Node.js libraries produces some incorrectly parsed data. One example is an XML element nested inside the text content of another element. Most parsers pull the nested element out as a separate element, which makes it tricky to put its text content back in the correct location.
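To illustrate the kind of mixed content described above, here is a minimal sketch using Python's standard `xml.etree.ElementTree` (not the Node.js libraries mentioned in the text). The element names `definition` and `i` and the sample sentence are hypothetical and are not taken from the real dataset.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: an inline element sits in the middle of text content.
sample = "<definition>vanha <i>sana</i> merkitys</definition>"
root = ET.fromstring(sample)

# The sentence ends up split across three separate places in the tree.
leading = root.text       # text before the inline element
inline = root[0].text     # text inside the inline element
trailing = root[0].tail   # text after the inline element

# Reassembling the sentence requires walking all text nodes in order.
full_text = "".join(root.itertext())
print(full_text)
```

A parser that exposes only an element's own text would return just the leading fragment here, which is why the nested tags are simpler to remove before parsing.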
Sanakirju does not need most of that XML structure; it only needs the text content inside the elements.
Most of the problematic tags can simply be searched and replaced with regular expressions. The simplifier goes through the whole dataset and saves it as new XML files with fewer and less deeply nested elements. In short, the end goal is to find, replace, and remove markup that would be parsed incorrectly, while keeping the text content inside it.
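The search-and-replace idea can be sketched roughly as follows. This is an illustration only, not the code in main.py: the function name `strip_inline_tags`, the tag list `INLINE_TAGS`, and the sample string are all assumptions, and the real dataset's tag names may differ.

```python
import re

# Hypothetical inline tag names; replace with the tags that cause trouble.
INLINE_TAGS = ("i", "b", "sup")

def strip_inline_tags(xml_text, tags=INLINE_TAGS):
    """Remove opening, closing, and self-closing tags for the given names,
    leaving the text between them in place."""
    names = "|".join(re.escape(tag) for tag in tags)
    # Matches <i>, </i>, <i attr="x">, and <i/>, but not longer names such as <item>.
    pattern = re.compile(rf"</?(?:{names})(?:\s[^>]*)?/?>")
    return pattern.sub("", xml_text)

sample = "<example>koira <i>haukkuu</i> pihalla</example>"
simplified = strip_inline_tags(sample)
print(simplified)
```

Removing only the tags and leaving the text untouched keeps each sentence in one piece, so a downstream parser sees a single text node per element.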
Words and translations are from Karjalan Kielen Sanakirja, created by the Institute for the Languages of Finland. The original material is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0).