Sanakirju Simplifier

Simplifies the Sanakirju XML dataset for easier parsing.

Usage

pipenv install

pipenv run python main.py

Running these commands generates a simplified XML dataset in src/sanakirju_simplifier/build.

Motivation

The original dataset that Sanakirju uses is a huge, deeply nested set of XML files. Parsing it automatically with common Node.js libraries yields some incorrectly parsed data. One example is an XML element nested inside the text content of another element. Most parsers pop the nested element out as its own element, which makes it quite tricky to place its text content back in the correct location.
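To make the nested-element problem concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree rather than the Node.js parsers mentioned above. The tag names and the sample string are hypothetical and are not taken from the real dataset. It shows how a parser splits the surrounding text into separate pieces once an element appears in the middle of it.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: an element nested inside another element's text content.
SAMPLE = "<definition>first part <emphasis>nested</emphasis> last part</definition>"

root = ET.fromstring(SAMPLE)
child = root[0]

# The text before the nested element is stored on the parent.
print(repr(root.text))

# The nested element carries its own text, plus the text that follows it ("tail").
print(child.tag, repr(child.text), repr(child.tail))

# Reassembling the original sentence means stitching the pieces together in order.
print("".join(root.itertext()))
```

A consumer that only reads the parent's text would lose everything after the nested element, which is the kind of misplacement described above.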

Sanakirju does not really need most of that XML data; it only needs the text inside the elements.

What the simplifier does.

Most of the problematic tags can simply be searched and replaced with regular expressions. The simplifier goes through the whole dataset and resaves the files as new XML files with fewer and less deeply nested elements. In short, the end goal is to find, replace and remove content that would be parsed incorrectly, while keeping the text content inside it.
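The search-and-replace idea can be sketched roughly as follows. This is an illustration only, not the simplifier's actual code: the function name strip_inline_tags, the tag names in INLINE_TAGS, and the sample string are all assumptions, since this README does not list the real tags.

```python
import re

# Hypothetical tag names; the real dataset's problematic tags are not listed here.
INLINE_TAGS = ("emphasis", "hint")


def strip_inline_tags(xml_text, tags=INLINE_TAGS):
    """Remove opening and closing tags of the given elements, keeping their text."""
    names = "|".join(re.escape(tag) for tag in tags)
    # Matches <tag>, <tag attr="...">, and </tag> for the listed names only.
    pattern = re.compile(r"</?(?:%s)(?:\s[^>]*)?>" % names)
    return pattern.sub("", xml_text)


sample = '<definition>first part <emphasis type="x">nested</emphasis> last part</definition>'
simplified = strip_inline_tags(sample)
print(simplified)
```

After the inline tags are removed, the text sits directly inside the parent element, so a parser no longer has to split it around a nested element. Self-closing tags such as `<hint/>` are left untouched by this pattern and would need separate handling.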

Sources.

Words and translations are from Karjalan Kielen Sanakirja, created by the Institute for the Languages of Finland. The original material is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0).
