ru_punkt

Russian language support for NLTK's PunktSentenceTokenizer

ru_punkt has been part of nltk_data since 2019-07-04.

Installation

Install the NLTK Python package:

pip install nltk

Download the punkt data:

import nltk
nltk.download('punkt')

Usage

import nltk

text = "Ай да А.С. Пушкин! Ай да сукин сын!"
print("Before:", nltk.sent_tokenize(text))
print("After:", nltk.sent_tokenize(text, language="russian"))

Output:

Before: ['Ай да А.С.', 'Пушкин!', 'Ай да сукин сын!']
After: ['Ай да А.С. Пушкин!', 'Ай да сукин сын!']

Training data

The data for sentence tokenization was taken from three sources:
– Articles from Russian Wikipedia (about 1 million sentences);
– Common Russian abbreviations from the Russian orthographic dictionary edited by V. V. Lopatin;
– Generated name initials.

Implementation notes

Experiments showed that params.abbrev_types on its own performs better than in combination with params.collocations and params.ortho_context, so the latter two were removed from the trained tokenizer.
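
To illustrate what an abbreviation-only tokenizer looks like, here is a minimal sketch that builds a PunktSentenceTokenizer from a PunktParameters object in which only abbrev_types is filled in. The abbreviation set below is a tiny hypothetical sample chosen for this example, not the data shipped with ru_punkt, and the variable names are illustrative only. It needs no nltk.download() call, because no trained model is loaded.

```python
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

# Hypothetical, hand-picked abbreviations (lowercase, without the final period).
# The real ru_punkt model contains a much larger trained set.
sample_abbreviations = {"а.с", "т.е", "т.д", "г"}

params = PunktParameters()
params.abbrev_types.update(sample_abbreviations)
# collocations, sent_starters and ortho_context are left empty on purpose,
# mirroring the abbreviation-only setup described above.

tokenizer = PunktSentenceTokenizer(params)

text = "Ай да А.С. Пушкин! Ай да сукин сын!"
sentences = tokenizer.tokenize(text)
print(sentences)
```

With "а.с" registered as an abbreviation, the period after the initials should no longer be treated as a sentence boundary, which is the same effect as language="russian" in the Usage example.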
