It might be worth taking some inspiration from [Whoosh](https://github.com/whoosh-community/whoosh/blob/master/src/whoosh/analysis/ngrams.py). Refactoring the tokenizer could be coupled with serialization (#23).
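
For context, here is a rough sketch of the kind of character n-gram tokenizer the linked module deals with. This is not Whoosh's code and not a proposal for our API; `char_ngrams` and its `minsize`/`maxsize` parameters are hypothetical names picked for illustration only.

```python
from typing import Iterator


def char_ngrams(text: str, minsize: int, maxsize: int) -> Iterator[str]:
    """Yield character n-grams of length minsize..maxsize, ordered by start position.

    Hypothetical helper for illustration; not taken from Whoosh or this project.
    """
    if minsize < 1 or maxsize < minsize:
        raise ValueError("expected 1 <= minsize <= maxsize")
    for start in range(len(text)):
        for size in range(minsize, maxsize + 1):
            end = start + size
            if end > len(text):
                # Longer grams from this start would also run past the end.
                break
            yield text[start:end]


if __name__ == "__main__":
    print(list(char_ngrams("abcd", 2, 3)))
```

Keeping the tokenizer a plain generator with explicit parameters like this would also make it easy to serialize its configuration, which is why the two changes could go together.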