This repository was archived by the owner on Mar 9, 2023. It is now read-only.
This repository was archived by the owner on Mar 9, 2023. It is now read-only.
Training Sudachi on new data #159
Open
Description
If I understand correctly, Sudachi is a lattice-based tokenizer and uses the occurrence probabilities and left-right probabilities (costs) for finding the best token sequence.
We would like to know whether we could customize these cost values. I imagine that in a niche domain like biomedicine with many unknown bacteria/disease names, we need domain-specific values to have the best tokenizer.
Metadata
Metadata
Assignees
Labels
No labels