This parser extracts all data from the 1.5 GB XML file found here and loads it into a SQLite database so that it can be used for further processing.
This parser played a key part in building my final project for Harvard's CS50x course: it allowed me to extract about 100,000 lemmas of the Slovenian language and use them in an IndexedDB in a Chrome extension.
Use it like this:
# download the data
# activate a Python virtual environment
python3 -m venv env && source env/bin/activate
# install all dependencies
pip install -r requirements.txt
# run the parser
python -i -x sloleks_clarin_2.0.xml -v
After about 45 minutes, the data will have been transferred to a 1 GB SQLite database called sloleks.db.
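A file of this size is usually read as a stream rather than loaded into memory at once. The following is a minimal sketch of that idea, not the parser's actual code: it assumes LMF-style `LexicalEntry` elements whose `Lemma` child carries a `feat` element with `att="zapis_oblike"`, and it writes to a hypothetical `lemmas_demo` table. The function name, the table, and the sample XML are all made up for illustration.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET

# Tiny stand-in for the real file; the element layout is an assumption.
SAMPLE = """<LexicalResource><Lexicon>
<LexicalEntry id="LE_1"><Lemma><feat att="zapis_oblike" val="miza"/></Lemma></LexicalEntry>
<LexicalEntry id="LE_2"><Lemma><feat att="zapis_oblike" val="voda"/></Lemma></LexicalEntry>
<LexicalEntry id="LE_3"></LexicalEntry>
</Lexicon></LexicalResource>"""


def load_lemmas(xml_source, conn):
    """Stream LexicalEntry elements into the hypothetical lemmas_demo table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS lemmas_demo (entry_id TEXT, zapis_oblike TEXT)"
    )
    count = 0
    # iterparse hands over each element once its closing tag has been read,
    # so the document is processed piece by piece.
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag != "LexicalEntry":
            continue
        lemma = elem.find("Lemma/feat[@att='zapis_oblike']")
        if lemma is not None:
            conn.execute(
                "INSERT INTO lemmas_demo VALUES (?, ?)",
                (elem.get("id"), lemma.get("val")),
            )
            count += 1
        # Drop the processed entry's children and attributes to limit memory use.
        elem.clear()
    conn.commit()
    return count
```

Calling `load_lemmas(io.BytesIO(SAMPLE.encode("utf-8")), sqlite3.connect(":memory:"))` processes the sample above; for a real file, an open binary file object or a path can be passed instead.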
To extract the data for further use in my Chrome and Firefox extensions I exported it with this SQL query:
SELECT LOWER(fr.zapis_oblike) AS 'word',
LOWER(l.zapis_oblike) AS 'lemma'
FROM form_representations fr
JOIN word_forms wf on fr.word_form_id =
JOIN lemmas l ON fr.lexical_entry_id = l.lexical_entry_id
WHERE SUBSTR(l.zapis_oblike, 1, 1) NOT IN ('0', '1','2','3','4','5','6','7','8','9')
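The result of a query like this can be turned into JSON for loading into a browser extension. The sketch below is one possible way to do that, not the export actually used for the extensions: `pairs_to_json`, `DEMO_QUERY`, and the `pairs_demo` table are hypothetical names, and the demo query only mirrors the lowercasing and the leading-digit filter on a simplified two-column table.

```python
import json
import sqlite3

# Simplified stand-in query against a hypothetical pairs_demo table.
DEMO_QUERY = """
SELECT LOWER(word) AS word, LOWER(lemma) AS lemma
FROM pairs_demo
WHERE SUBSTR(lemma, 1, 1) NOT IN ('0','1','2','3','4','5','6','7','8','9')
ORDER BY word
"""


def pairs_to_json(conn, query):
    """Run a query that returns (word, lemma) rows and serialize them as JSON."""
    rows = conn.execute(query).fetchall()
    records = [{"word": word, "lemma": lemma} for word, lemma in rows]
    # ensure_ascii=False keeps characters such as č, š and ž readable.
    return json.dumps(records, ensure_ascii=False)
```

Note that SQLite's built-in `LOWER()` only folds ASCII characters by default, so forms beginning with Č, Š or Ž may need to be lowercased in Python (for example with `str.lower()`) after the rows are fetched.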
Sloleks is the reference morphological lexicon for the Slovenian language, developed to be used in NLP applications and language manuals. Encoded in LMF XML, the lexicon contains approx. 100,000 of the most frequent Slovenian lemmas, their inflected or derivative word forms, and the corresponding grammatical descriptions. Lemmatization rules, part-of-speech categorization, and the set of feature-value pairs follow the JOS morphosyntactic specifications. In addition to grammatical information, each word form is also annotated with its absolute corpus frequency and its compliance with the reference language standard.
More information about Sloleks can be found here.