ssg is a syllable segmenter for Thai using Conditional Random Fields. It is part of the work of the Natural Language Processing Lab @Chula, under the supervision of Dr. Attapol Thamrongrattanarit.
foo@bar~$: pip install ssg
To use:
from ssg import syllable_tokenize
syllable_tokenize('ทดสอบ') # returns ['ทด', 'สอบ']
ssg also comes with its own CLI.
foo@bar~$: ssg-cli PATH_TO_INPUT PATH_TO_OUTPUT
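The same kind of file-to-file segmentation can also be done from Python. The following is a minimal sketch, not the CLI's actual implementation: the function name `segment_file`, the line-by-line processing, and the `~` separator are assumptions for illustration, and the output format of `ssg-cli` itself may differ.

```python
def segment_file(input_path, output_path, tokenize=None, sep="~"):
    """Segment each line of input_path and write the result to output_path.

    `sep` is an assumed syllable separator, not necessarily what ssg-cli uses.
    """
    if tokenize is None:
        # Imported lazily so the function can also be used with another tokenizer.
        from ssg import syllable_tokenize
        tokenize = syllable_tokenize
    with open(input_path, encoding="utf-8") as src, \
            open(output_path, "w", encoding="utf-8") as dst:
        for line in src:
            syllables = tokenize(line.rstrip("\n"))
            dst.write(sep.join(syllables) + "\n")
```

Calling `segment_file("input.txt", "output.txt")` would then write one segmented line per input line, assuming ssg is installed.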
The model itself is stored in ssg/artifacts/crf3_mix.crfsuite2.
The model was trained with python-crfsuite on a 5,600,000-character human-annotated subcorpus of the Thai National Corpus, with the following settings and features:
- L1 penalty: 1.0
- L2 penalty: 1e-3
- Includes possible transitions that are not observed (feature.possible_transitions is set to True)
- Sliding window features (all possible character (N-1)-grams within a radius of N characters on both sides of a potential boundary)
- Individual character features (each of the characters surrounding a potential boundary within the window of size N)
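As a rough illustration of the settings and feature types listed above, the sketch below collects the training parameters in a dict and builds character features around a potential boundary. The function names `boundary_features` and `train`, the feature-name format, the padding symbols, and the window radius `n=3` are assumptions for illustration and are not taken from ssg's source; the n-gram features show only one possible reading of the sliding-window description. The parameter names `c1`, `c2` and `feature.possible_transitions` are python-crfsuite's own.

```python
# Hypothetical sketch; not ssg's actual feature extractor or training script.

# Training settings from the list above, using python-crfsuite parameter names.
TRAINER_PARAMS = {
    "c1": 1.0,   # L1 penalty
    "c2": 1e-3,  # L2 penalty
    "feature.possible_transitions": True,  # include unobserved transitions
}


def boundary_features(text, i, n=3):
    """Return features for the potential boundary just before text[i].

    n is an assumed window radius.
    """
    # Pad so that windows near the edges of the text stay well defined.
    padded = "<" * n + text + ">" * n
    center = i + n  # index in `padded` of the character right after the boundary
    feats = []
    # Individual character features: each character within radius n.
    for offset in range(-n, n):
        feats.append(f"char[{offset}]={padded[center + offset]}")
    # Sliding window features: every (n-1)-gram that fits inside the window.
    width = n - 1
    for start in range(-n, n - width + 1):
        gram = padded[center + start:center + start + width]
        feats.append(f"gram[{start}]={gram}")
    return feats


def train(xseqs, yseqs, model_path):
    """Train a CRF on feature sequences and label sequences (assumed inputs)."""
    import pycrfsuite  # provided by the python-crfsuite package

    trainer = pycrfsuite.Trainer(verbose=False)
    trainer.set_params(TRAINER_PARAMS)
    for xseq, yseq in zip(xseqs, yseqs):
        trainer.append(xseq, yseq)
    trainer.train(model_path)
```

Here each element of `xseqs` would be a list of feature lists (one per character position) and each element of `yseqs` the matching list of boundary labels; the labeling scheme is left open because it is not described above.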
--- to be updated ---