This project adheres to Semantic Versioning.
Fixed problems with edge cases in the IPA tokenization.
Bugfix: See cldf#46
Bugfix: Suppress csvw's UserWarning about unknown columns in orthography profiles with more than the default columns.
- Dropped Python 2 support
- Added compatibility with clldutils 3.x
- Fixed a bug where NULL values in orthography profiles could not be read when the profile was initialized with Unicode normalization.
segments now supports orthography profiles described by CSVW metadata.
Orthography profiles and the input of Tokenizer.__call__ are no longer Unicode normalized by default. That is, the user is responsible for making sure that profiles and tokenization input are normalized to the same form. Alternatively, profile data can be normalized by passing a form keyword argument when initializing a Profile instance. Even in this case, tokenization input must still be normalized by the user.
While this results in a more cumbersome API, it gives the user full control, e.g. to avoid incorrect segmentation when parts of decomposed graphemes are appended to preceding grapheme clusters.
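
The effect of mismatched normalization can be illustrated with a minimal sketch that uses only the standard library. The greedy_segment helper and the two-grapheme profile below are hypothetical and only stand in for a profile-based tokenizer; they are not the segments implementation.

```python
import unicodedata


def greedy_segment(text, graphemes):
    """Toy longest-match segmenter (hypothetical, not the segments API)."""
    longest = max(len(g) for g in graphemes)
    out, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(longest, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in graphemes:
                out.append(chunk)
                i += size
                break
        else:
            out.append("?")  # placeholder for a character not in the profile
            i += 1
    return out


# Assumed profile with a precomposed (NFC) grapheme: U+00E1.
profile_nfc = {"\u00e1", "t"}

# Input that arrives decomposed (NFD): "t", "a", U+0301.
word_nfd = unicodedata.normalize("NFD", "t\u00e1")

# Mismatched forms: neither "a" nor the combining accent is in the profile.
mismatched = greedy_segment(word_nfd, profile_nfc)

# Option 1: normalize the input to the profile's form.
matched = greedy_segment(unicodedata.normalize("NFC", word_nfd), profile_nfc)

# Option 2: normalize the profile to the input's form.
profile_nfd = {unicodedata.normalize("NFD", g) for g in profile_nfc}
also_matched = greedy_segment(word_nfd, profile_nfd)

print(mismatched, matched, also_matched)
```

Either option works as long as both sides end up in the same normalization form; which one to pick depends on the form the rest of the data pipeline expects.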