word-segmentation

Here are 145 public repositories matching this topic...

google / sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.

natural-language-processing neural-machine-translation word-segmentation

Updated Dec 5, 2025
C++

baidu / lac

Baidu NLP: word segmentation, part-of-speech tagging, named entity recognition, word importance

python java named-entity-recognition lexical-analysis chinese-nlp word-segmentation part-of-speech-tagger chinese-word-segmentation

Updated May 25, 2021
C++

wolfgarbe / SymSpell

SymSpell: 1 million times faster spelling correction & fuzzy search through Symmetric Delete spelling correction algorithm

spellcheck fuzzy-search fuzzy-matching edit-distance levenshtein levenshtein-distance spelling spell-check chinese-text-segmentation word-segmentation approximate-string-matching spelling-correction damerau-levenshtein text-segmentation chinese-word-segmentation symspell

Updated Nov 5, 2025
C#

PyThaiNLP / pythainlp

Thai natural language processing in Python

python natural-language-processing thai-language thai computational-linguistics text-processing soundex nlp-library word-segmentation thai-nlp hacktoberfest thai-nlp-library thai-soundex hacktoberfest-accepted

Updated Dec 2, 2025
Python

VKCOM / YouTokenToMe

Unsupervised text tokenizer focused on computational efficiency

nlp natural-language-processing word-segmentation tokenization bpe

Updated Mar 29, 2024
C++

mammothb / symspellpy

Python port of SymSpell: 1 million times faster spelling correction & fuzzy search through Symmetric Delete spelling correction algorithm

python spellcheck fuzzy-search fuzzy-matching edit-distance levenshtein levenshtein-distance spelling spell-check chinese-text-segmentation word-segmentation approximate-string-matching spelling-correction damerau-levenshtein text-segmentation chinese-word-segmentation symspell

Updated Nov 28, 2025
Python

ckiplab / ckip-transformers

CKIP Transformers

transformers named-entity-recognition word-segmentation language-model ckip part-of-speech-tagging

Updated Apr 21, 2023
Python

Ekphrasis is a text processing tool geared towards text from social networks such as Twitter or Facebook. Ekphrasis performs tokenization, word normalization, word segmentation (for splitting hashtags), and spell correction, using word statistics from two large corpora (English Wikipedia and 330 million English tweets from Twitter).

nlp tokenizer text-processing semeval nlp-library word-segmentation spelling-correction tokenization text-segmentation spell-corrector word-normalization

Updated Jun 2, 2025
Python

vncorenlp / VnCoreNLP

A Vietnamese natural language processing toolkit (NAACL 2018)

java nlp natural-language-processing parsing vietnamese python3 named-entity-recognition ner word-segmentation pos-tagging dependency-parsing pos-tagger vietnamese-nlp sentence-segmentation vietnamese-tokenizer vncorenlp word-segmenter rdrsegmenter vnmarmot

Updated Feb 12, 2023
Java

bab2min / Kiwi

Kiwi (an intelligent Korean morphological analyzer)

nlp cpp morphology korean word-segmentation morphological-analysis korean-text-processing korean-tokenizer korean-nlp

Updated Dec 1, 2025
C++

JayYip / m3tl

BERT for Multitask Learning

nlp text-classification transformer named-entity-recognition pretrained-models part-of-speech ner word-segmentation bert cws encoder-decoder multi-task-learning multitask-learning

Updated Apr 12, 2023
Jupyter Notebook

modelscope / AdaSeq

AdaSeq: An All-in-One Library for Developing State-of-the-Art Sequence Understanding Models

nlp natural-language-processing crf pytorch information-extraction named-entity-recognition chinese-nlp ner word-segmentation bert sequence-labeling relation-extraction natural-language-understanding entity-typing token-classification multi-modal-ner

Updated Nov 15, 2023
Python

taishi-i / nagisa

A Japanese tokenizer based on recurrent neural networks

nlp natural-language-processing japanese tokenizer nlp-library word-segmentation dynet pos-tagging sequence-labeling

Updated Oct 29, 2025
Python

ku-nlp / jumanpp

Juman++ (a Morphological Analyzer Toolkit)

nlp japanese tokenizer cjk word-segmentation pos-tagging part-of-speech-tagger morphological-analysis pos-tagger morphological-analyser juman

Updated Oct 3, 2023
C++

jacksonllee / pycantonese

Cantonese Linguistics and NLP

python nlp natural-language-processing linguistics cantonese computational-linguistics word-segmentation jyutping pycantonese stop-words part-of-speech-tagging

Updated May 23, 2024
Python

yongzhuo / Pytorch-NLU

A Chinese text classification and sequence labeling toolkit (PyTorch). It supports multi-class and multi-label classification of long and short Chinese texts, text similarity, and sequence labeling tasks including Chinese named entity recognition, part-of-speech tagging, word segmentation, and extractive text summarization.