Python package for normalizing Persian text.
- Text Cleaning
- URL Remover
- Emoji Remover
- Text Tokenization
- Punctuation Space Correction
- Half Space Correction (using Parsivar)
- Standardize Alphabet
- NLTK compatible
- Python 3 support
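To make two of the features above concrete, here is a minimal sketch of what "Standardize Alphabet" and "Punctuation Space Correction" can mean for Persian text. It is not Khoshnevis's implementation: the function names `standardize_alphabet` and `fix_punctuation_spacing`, the mapping table, and the punctuation set are all assumptions chosen for illustration.

```python
import re

# Arabic code points that often appear in Persian text, mapped to the
# Persian letters usually preferred. Illustrative table only (assumption).
_ARABIC_TO_PERSIAN = str.maketrans({
    "\u064a": "\u06cc",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06a9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
})

# Latin punctuation plus Arabic comma, semicolon and question mark (assumption).
_PUNCT = ".,:;!?\u060c\u061b\u061f"


def standardize_alphabet(text: str) -> str:
    """Replace Arabic yeh/kaf with their Persian counterparts."""
    return text.translate(_ARABIC_TO_PERSIAN)


def fix_punctuation_spacing(text: str) -> str:
    """Drop spaces before punctuation and add one space after it."""
    cls = re.escape(_PUNCT)
    # Remove spaces or tabs that precede a punctuation mark.
    text = re.sub(rf"[ \t]+([{cls}])", r"\1", text)
    # Insert a space after punctuation that is directly followed by
    # a character that is neither whitespace nor more punctuation.
    text = re.sub(rf"([{cls}])(?=[^\s{cls}])", r"\1 ", text)
    return text


if __name__ == "__main__":
    print(standardize_alphabet("\u0645\u064a \u0643\u0646\u062f"))
    print(fix_punctuation_spacing("hello ,world"))
```

This sketch is deliberately naive: it would also insert a space inside decimals such as `3.14` and inside URLs, so a real normalizer needs extra rules for those cases.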
>>> from khoshnevis import Normalizer
>>> normalizer = Normalizer()
>>> normalizer.normalize(text="استفاده از نیمفاصله متن را زیبا مي كند", zwnj="\u200c",
...                      clean_url=False, remove_emoji=False)
text (str): Input text.
zwnj (str, optional): Zero-width non-joiner character. Defaults to "\u200c".
clean_url (bool, optional): Remove all URLs from the text. Defaults to True.
remove_emoji (bool, optional): Remove all emojis from the text. Defaults to True.
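The effect of the `clean_url` and `remove_emoji` flags can be approximated with regular expressions. The sketch below is an assumption about the general technique, not the code Khoshnevis runs: the helper name `clean`, both patterns, and the emoji code point ranges are hypothetical and cover only common cases.

```python
import re

# Rough URL pattern: anything starting with http://, https:// or www.
# up to the next whitespace (assumption, not the library's pattern).
_URL_RE = re.compile(r"(?:https?://|www\.)\S+")

# A few common emoji blocks; real emoji coverage is wider than this.
_EMOJI_RE = re.compile(
    "["
    "\U0001F1E6-\U0001F1FF"  # regional indicator symbols (flags)
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, extended symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F"                 # variation selector-16
    "]+"
)


def clean(text: str, clean_url: bool = True, remove_emoji: bool = True) -> str:
    """Hypothetical helper mirroring the two boolean flags described above."""
    if clean_url:
        text = _URL_RE.sub("", text)
    if remove_emoji:
        text = _EMOJI_RE.sub("", text)
    # Collapse runs of plain spaces left behind by the removals.
    # Only U+0020 is touched, so a ZWNJ (U+200C) is preserved.
    return re.sub(r" {2,}", " ", text).strip()


if __name__ == "__main__":
    print(clean("visit https://example.com today"))
    print(clean("hello \U0001F600 world"))
```

Passing `clean_url=False, remove_emoji=False`, as in the usage example above, leaves URLs and emojis in place.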
The latest stable version of Khoshnevis can be installed through pip:
pip install khoshnevis
@misc{khoshnevis,
author = {HamidReza Attar and Milad Lotfi and Saied Alimoradi},
title = {Khoshnevis, a Python library for Persian text preprocessing},
year = {2022},
url = {https://www.khodnevisai.com/},
}