VnCoreNLP for sentence/word segmentation (Python version, not Java)

Python code for sentence and word segmentation, converted from VnCoreNLP.

Why use this code?

Some engineers want to use the sentence/word segmentation tool from VnCoreNLP without loading the whole Java model from VnCoreNLP's official repository, and without having to read Java code.

Installation

$ pip install regex

Example Usage

A simple example of how to use this tool:

#!/usr/bin/python
# -*- coding: utf-8 -*-
from Tokenizer import Tokenizer
from WordSegmenter import WordSegmenter

wordSegmenter = WordSegmenter()
s = 'VTV đồng ý chia sẻ bản quyền World Cup 2018 cho HTV để khai thác. Nhưng cả hai nhà đài đều phải chờ sự đồng ý của FIFA mới thực hiện được điều này.'

# Word segmentation: one list of tokens per sentence, multi-syllable words joined by '_'
print('tokenize: ', wordSegmenter.tokenize(s))
# Sentence segmentation: one string per sentence
print('sentence_tokenize: ', Tokenizer.sentence_tokenize(s))

And here is the output:

tokenize:  [['VTV', 'đồng_ý', 'chia_sẻ', 'bản_quyền', 'World_Cup', '2018', 'cho', 'HTV', 'để', 'khai_thác', '.'], ['Nhưng', 'cả', 'hai', 'nhà', 'đài', 'đều', 'phải', 'chờ', 'sự', 'đồng_ý', 'của', 'FIFA', 'mới', 'thực_hiện', 'được', 'điều', 'này', '.']]
sentence_tokenize:  ['VTV đồng ý chia sẻ bản quyền World Cup 2018 cho HTV để khai thác .', 'Nhưng cả hai nhà đài đều phải chờ sự đồng ý của FIFA mới thực hiện được điều này .']
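The nested list returned by `tokenize` usually needs a little post-processing before it is passed to other tools. The following is a minimal sketch that works only on the output shape shown above (shortened here); the helper names `to_lines`, `multi_syllable_words` and `restore_spaces` are hypothetical and are not part of this repository.

```python
# -*- coding: utf-8 -*-
# Shortened copy of the `tokenize` output shown above:
# one list of tokens per sentence, multi-syllable words joined by '_'.
segmented = [
    ['VTV', 'đồng_ý', 'chia_sẻ', 'bản_quyền', 'World_Cup', '2018',
     'cho', 'HTV', 'để', 'khai_thác', '.'],
    ['Nhưng', 'cả', 'hai', 'nhà', 'đài', 'đều', 'phải', 'chờ', 'sự',
     'đồng_ý', 'của', 'FIFA', 'mới', 'thực_hiện', 'được', 'điều', 'này', '.'],
]


def to_lines(sentences):
    """Join each sentence's tokens with single spaces (one string per sentence)."""
    return [' '.join(tokens) for tokens in sentences]


def multi_syllable_words(sentences):
    """Collect the tokens that were merged from several syllables."""
    return [tok for tokens in sentences for tok in tokens if '_' in tok]


def restore_spaces(token):
    """Turn a segmented token such as 'World_Cup' back into 'World Cup'."""
    return token.replace('_', ' ')


for line in to_lines(segmented):
    print(line)
print(multi_syllable_words(segmented))
```

Keeping the underscore form is convenient when a downstream model expects word-segmented input; `restore_spaces` is only needed when the text is shown to a reader.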

Disclaimer

This code is much slower than the official Java implementation. The purpose of this repository is to understand VnCoreNLP by converting it to Python. Besides, many Vietnamese NLP tasks require only a sentence and word tokenizer, so I decided not to rewrite the NER and POS tagging parts of the original codebase.
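To judge whether the slowdown matters for a given workload, a rough throughput measurement is usually enough. The sketch below is an assumption-laden illustration, not a benchmark from this repository: `sentences_per_second` is a hypothetical helper, and `str.split` stands in for the segmenter so the snippet runs on its own.

```python
# -*- coding: utf-8 -*-
import time


def sentences_per_second(segment, texts, repeats=3):
    """Return the best throughput (texts per second) of `segment` over `repeats` runs."""
    best = float('inf')
    for _ in range(repeats):
        start = time.perf_counter()
        for text in texts:
            segment(text)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero elapsed time on very coarse clocks.
    return len(texts) / best if best > 0 else float('inf')


if __name__ == '__main__':
    # Stand-in segmenter (plain whitespace split) so the sketch is self-contained.
    # To time this repository's code, pass WordSegmenter().tokenize instead.
    sample = ['VTV đồng ý chia sẻ bản quyền World Cup 2018 cho HTV để khai thác.'] * 1000
    print('%.0f sentences/second' % sentences_per_second(str.split, sample))
```

Taking the best of several runs reduces noise from other processes; the absolute number depends on the machine and on sentence length.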

License

MIT


duongkstn/vncorenlp_sentence_segmentation
