Commit 5d34e0a

Merge pull request #20 from sagorbrur/dev
merging dev with master for bnlp v3.1.0
2 parents 70984dd + 24eeb63 commit 5d34e0a

37 files changed, with 430 additions and 82 deletions

.github/stale.yml

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
+# Number of days of inactivity before an issue becomes stale
+daysUntilStale: 60
+# Number of days of inactivity before a stale issue is closed
+daysUntilClose: 7
+# Issues with these labels will never be considered stale
+exemptLabels:
+  - pinned
+  - security
+# Label to use when marking an issue as stale
+staleLabel: wontfix
+# Comment to post when marking an issue as stale. Set to `false` to disable
+markComment: >
+  This issue has been automatically marked as stale because it has not had
+  recent activity. It will be closed if no further activity occurs. Thank you
+  for your contributions.
+# Comment to post when closing a stale issue. Set to `false` to disable
+closeComment: false
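
The two thresholds in this config combine into a simple timeline. The following is a rough sketch of that arithmetic, not the stale bot's own logic: the helper name `stale_timeline` is hypothetical, and it assumes whole-day counting, no further activity on the issue, and that the close countdown starts on the day the issue is marked stale.

```python
from datetime import date, timedelta

# Values copied from the .github/stale.yml added in this commit.
DAYS_UNTIL_STALE = 60
DAYS_UNTIL_CLOSE = 7


def stale_timeline(last_activity):
    """Return (date marked stale, date closed) for an untouched issue.

    Hypothetical helper: assumes whole-day counting and no further activity.
    """
    marked_stale = last_activity + timedelta(days=DAYS_UNTIL_STALE)
    closed = marked_stale + timedelta(days=DAYS_UNTIL_CLOSE)
    return marked_stale, closed


marked, closed = stale_timeline(date(2021, 1, 1))
print(marked, closed)
```

Issues labelled `pinned` or `security` would be skipped entirely under this config, so the sketch applies only to unlabelled issues.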

.travis.yml

Lines changed: 3 additions & 2 deletions
@@ -10,19 +10,20 @@ os:
 
 # Set the python version to 3.6, 3.7
 python:
-- "3.5"
 - "3.6"
 - "3.7"
 - "3.8"
 
 # Install the pip dependency
 install:
 - pip install sentencepiece
-- pip install gensim
+- pip install gensim==4.0.1
 - pip install nltk
 - pip install numpy
 - pip install scipy
 - pip install sklearn-crfsuite
+- pip install wasabi
+- pip install python-Levenshtein
 
 # Run the unit test
 script:

.vscode/settings.json

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+{
+    "restructuredtext.confPath": ""
+}

CONTRIBUTING.md

Lines changed: 1 addition & 1 deletion
@@ -36,7 +36,7 @@ We hosted our code repositories in Github for better management of issues and de
 - Clone your forked repository locally
 (`git clone https://github.com/<your-github-username>/bnlp.git`);
 - Run `cd bnlp` to get to the root directory of the `bnlp` code base;
-- Install the dependencies (`pip install -r requirements.txt`);
+- checkout `dev` branch by `git checkout dev`
 - Download the pretrianed models for running tests
 (you can find the pretrained model details [here](https://github.com/sagorbrur/bnlp) in Readme
 

README.md

Lines changed: 49 additions & 11 deletions
@@ -3,9 +3,10 @@
33
# Bengali Natural Language Processing(BNLP)
44

55
[![Build Status](https://travis-ci.org/sagorbrur/bnlp.svg?branch=master)](https://travis-ci.org/sagorbrur/bnlp)
6+
[![arXiv](https://img.shields.io/badge/arXiv-2102.00405-b31b1b)](https://arxiv.org/abs/2102.00405)
67
[![PyPI version](https://img.shields.io/pypi/v/bnlp_toolkit)](https://pypi.org/project/bnlp-toolkit/)
78
[![release version](https://img.shields.io/github/v/release/sagorbrur/bnlp)](https://github.com/sagorbrur/bnlp/releases/tag/2.0.0)
8-
[![Support Python Version](https://img.shields.io/badge/python-3.5%7C3.6%7C3.7%7C3.8-brightgreen)](https://pypi.org/project/bnlp-toolkit/)
9+
[![Support Python Version](https://img.shields.io/badge/python-3.6%7C3.7%7C3.8-brightgreen)](https://pypi.org/project/bnlp-toolkit/)
910
[![Documentation Status](https://readthedocs.org/projects/bnlp/badge/?version=latest)](https://bnlp.readthedocs.io/en/latest/?badge=latest)
1011
[![Gitter](https://badges.gitter.im/bnlp_toolkit/community.svg)](https://gitter.im/bnlp_toolkit/community?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge)
1112

@@ -15,7 +16,7 @@ BNLP is a natural language processing toolkit for Bengali Language. This tool wi
1516

1617
## Installation
1718

18-
### PIP installer(Python: 3.5, 3.6, 3.7, 3.8 tested okay, OS: linux, windows tested okay )
19+
### PIP installer(Python: 3.6, 3.7, 3.8 tested okay, OS: linux, windows tested okay )
1920

2021
```
2122
pip install bnlp_toolkit
@@ -34,7 +35,7 @@ BNLP is a natural language processing toolkit for Bengali Language. This tool wi
3435
### Download Link
3536

3637
* [Bengali SentencePiece](https://github.com/sagorbrur/bnlp/tree/master/model)
37-
* [Bengali Word2Vec](https://drive.google.com/open?id=1DxR8Vw61zRxuUm17jzFnOX97j7QtNW7U)
38+
* [Bengali Word2Vec](https://drive.google.com/file/d/1cQ8AoSdiX5ATYOzcTjCqpLCV1efB9QzT/view?usp=sharing)
3839
* [Bengali FastText](https://drive.google.com/open?id=1CFA-SluRyz3s5gmGScsFUcs7AjLfscm2)
3940
* [Bengali GloVe Wordvectors](https://github.com/sagorbrur/GloVe-Bengali)
4041
* [Bengali POS Tag model](https://github.com/sagorbrur/bnlp/blob/master/model/bn_pos.pkl)
@@ -45,7 +46,7 @@ BNLP is a natural language processing toolkit for Bengali Language. This tool wi
4546
- [Bengali Wiki Dump](https://dumps.wikimedia.org/bnwiki/latest/)
4647
* SentencePiece Training Vocab Size=50000
4748
* Fasttext trained with total words = 20M, vocab size = 1171011, epoch=50, embedding dimension = 300 and the training loss = 0.318668,
48-
* Word2Vec word embedding dimension = 300
49+
* Word2Vec word embedding dimension = 100, min_count=5, window=5, epochs=10
4950
* To Know Bengali GloVe Wordvector and training process follow [this](https://github.com/sagorbrur/GloVe-Bengali) repository
5051
* Bengali CRF POS Tagging was training with [nltr](https://github.com/abhishekgupta92/bangla_pos_tagger/tree/master/data) dataset with 80% accuracy.
5152
* Bengali CRF NER Tagging was train with [this](https://github.com/MISabic/NER-Bangla-Dataset) data with 90% accuracy.
@@ -129,7 +130,7 @@ BNLP is a natural language processing toolkit for Bengali Language. This tool wi
129130

130131
bwv = BengaliWord2Vec()
131132
model_path = "bengali_word2vec.model"
132-
word = 'আমার'
133+
word = 'গ্রাম'
133134
vector = bwv.generate_word_vector(model_path, word)
134135
print(vector.shape)
135136
print(vector)
@@ -144,20 +145,43 @@ BNLP is a natural language processing toolkit for Bengali Language. This tool wi
144145
bwv = BengaliWord2Vec()
145146
model_path = "bengali_word2vec.model"
146147
word = 'গ্রাম'
147-
similar = bwv.most_similar(model_path, word)
148+
similar = bwv.most_similar(model_path, word, topn=10)
148149
print(similar)
149150

150151
```
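
For intuition about the new `topn` argument: a similarity query of this kind typically ranks the vocabulary by cosine similarity to the query word and keeps the best `topn` entries. The sketch below illustrates that idea with invented toy vectors. The names `cosine` and `most_similar_sketch` are hypothetical, and this is not how bnlp or gensim implement the lookup.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def most_similar_sketch(vectors, word, topn=10):
    """Rank every other word by cosine similarity to `word`, best first."""
    query = vectors[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Toy 2-dimensional vectors, invented for illustration only.
toy_vectors = {
    "village": [1.0, 0.0],
    "town": [0.9, 0.1],
    "river": [0.0, 1.0],
}
print(most_similar_sketch(toy_vectors, "village", topn=2))
```

With these toy vectors, `town` ranks ahead of `river`, and lowering `topn` simply truncates the ranked list.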
151152
- Train Bengali Word2Vec with your own data
152153

154+
Train Bengali word2vec with your custom raw data or tokenized sentences.
155+
156+
custom tokenized sentence format example:
157+
```
158+
sentences = [['আমি', 'ভাত', 'খাই', ''], ['সে', 'বাজারে', 'যায়', '']]
159+
```
160+
Check [gensim word2vec api](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec) for details of training parameter
161+
153162
```py
154163
from bnlp import BengaliWord2Vec
155164
bwv = BengaliWord2Vec()
156-
data_file = "raw_text.txt"
165+
data_file = "raw_text.txt" # or you can pass custom sentence tokens as list of list
157166
model_name = "test_model.model"
158167
vector_name = "test_vector.vector"
159-
bwv.train(data_file, model_name, vector_name)
168+
bwv.train(data_file, model_name, vector_name, epochs=5)
169+
170+
171+
```
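
The list-of-lists format mentioned above can be produced from raw text in several ways. As a minimal sketch, assuming sentences end with the danda (`।`) and tokens are separated by whitespace, the hypothetical helper `to_token_lists` below builds it with the standard library only. bnlp ships its own tokenizers, which may split text differently.

```python
def to_token_lists(raw_text, sentence_end="।"):
    """Split raw text on the danda, then on whitespace, into a list of token lists."""
    sentences = []
    for chunk in raw_text.split(sentence_end):
        tokens = chunk.split()
        if tokens:  # skip empty chunks, e.g. the one after the final danda
            sentences.append(tokens)
    return sentences


sentences = to_token_lists("আমি ভাত খাই। সে বাজারে যায়।")
print(sentences)
```

The resulting `sentences` value has the same shape as the custom tokenized sentence example in the README: one inner list of string tokens per sentence.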
172+
- Pre-train or resume word2vec training with same or new corpus or tokenized sentences
173+
174+
Check [gensim word2vec api](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec) for details of training parameter
175+
176+
```py
177+
from bnlp import BengaliWord2Vec
178+
bwv = BengaliWord2Vec()
160179

180+
trained_model_path = "mytrained_model.model"
181+
data_file = "raw_text.txt"
182+
model_name = "test_model.model"
183+
vector_name = "test_vector.vector"
184+
bwv.pretrain(trained_model_path, data_file, model_name, vector_name, epochs=5)
161185

162186
```
163187

@@ -184,6 +208,8 @@ BNLP is a natural language processing toolkit for Bengali Language. This tool wi
184208
```
185209
- Train Bengali FastText Model
186210

211+
Check [fasttext documentation](https://fasttext.cc/docs/en/options.html) for details of training parameter
212+
187213
```py
188214
from bnlp.embedding.fasttext import BengaliFasttext
189215

@@ -194,6 +220,17 @@ BNLP is a natural language processing toolkit for Bengali Language. This tool wi
194220
bft.train(data, model_name, epoch)
195221
```
196222

223+
- Generate Vector File from Fasttext Binary Model
224+
```py
225+
from bnlp.embedding.fasttext import BengaliFasttext
226+
227+
bft = BengaliFasttext()
228+
229+
model_path = "mymodel.bin"
230+
out_vector_name = "myvector.txt"
231+
bft.bin2vec(model_path, out_vector_name)
232+
```
233+
197234
* **Bengali GloVe Word Vectors**
198235

199236
We trained glove model with bengali data(wiki+news articles) and published bengali glove word vectors</br>
@@ -267,15 +304,17 @@ BNLP is a natural language processing toolkit for Bengali Language. This tool wi
267304

268305
```
269306

307+
270308
## Bengali Corpus Class
271309

272310
* Stopwords and Punctuations
273311
```py
274-
from bnlp.corpus import stopwords, punctuations
312+
from bnlp.corpus import stopwords, punctuations, letters, digits
275313

276-
stopwords = stopwords()
277314
print(stopwords)
278315
print(punctuations)
316+
print(letters)
317+
print(digits)
279318

280319
```
281320

@@ -285,7 +324,6 @@ BNLP is a natural language processing toolkit for Bengali Language. This tool wi
285324
from bnlp.corpus import stopwords
286325
from bnlp.corpus.util import remove_stopwords
287326

288-
stopwords = stopwords()
289327
raw_text = 'আমি ভাত খাই।'
290328
result = remove_stopwords(raw_text, stopwords)
291329
print(result)
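
Since `stopwords` is now a plain list rather than a function, it can be passed around like any other sequence. The sketch below shows one way stopword filtering could work on top of it. It is not the implementation in `bnlp.corpus.util`: the name `remove_stopwords_sketch`, the whitespace tokenization, and the list return type are all assumptions.

```python
# A handful of entries copied from the stopword list in bnlp/corpus/__init__.py.
STOPWORDS_SUBSET = ["আমি", "আমার", "এবং", "কিন্তু"]
# Punctuation string as defined in bnlp/corpus/__init__.py.
PUNCTUATIONS = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~।ঃ'


def remove_stopwords_sketch(text, stopwords, punctuations=PUNCTUATIONS):
    """Replace punctuation with spaces, split on whitespace, drop stopword tokens."""
    cleaned = "".join(" " if ch in punctuations else ch for ch in text)
    blocked = set(stopwords)  # set membership keeps the filter fast
    return [token for token in cleaned.split() if token not in blocked]


print(remove_stopwords_sketch("আমি ভাত খাই।", STOPWORDS_SUBSET))
```

Converting the list to a set before filtering is a design choice that matters once the full list of several hundred stopwords is used.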

bnlp/__init__.py

Lines changed: 2 additions & 1 deletion
@@ -1,4 +1,4 @@
-__version__="3.0.0"
+__version__="3.1.0"
 
 
 import os
@@ -13,3 +13,4 @@
 
 
 
+
540 Bytes
Binary file not shown.
620 Bytes
Binary file not shown.

bnlp/__pycache__/ner.cpython-37.pyc

3.21 KB
Binary file not shown.

bnlp/__pycache__/ner.cpython-38.pyc

3.28 KB
Binary file not shown.

bnlp/__pycache__/pos.cpython-37.pyc

3.01 KB
Binary file not shown.

bnlp/__pycache__/pos.cpython-38.pyc

3.06 KB
Binary file not shown.
Binary file not shown.

bnlp/corpus/__init__.py

Lines changed: 64 additions & 2 deletions
@@ -8,13 +8,75 @@
88
- Bengali Stopwords
99
Collected from: https://github.com/stopwords-iso/stopwords-bn
1010
11+
- Bengali letters and vowel mark
12+
collected from https://github.com/MinhasKamal/BengaliDictionary/blob/master/BengaliCharacterCombinations.txt
1113
1214
"""
1315

14-
from bnlp.corpus.util import stopwords
16+
1517
# return list of bengali stopwords
16-
stopwords = stopwords
18+
stopwords = [
19+
'অতএব', 'অথচ', 'অথবা', 'অনুযায়ী', 'অনেক', 'অনেকে',
20+
'অনেকেই', 'অন্তত', 'অন্য', 'অবধি', 'অবশ্য', 'অর্থাত', 'আই',
21+
'আগামী', 'আগে', 'আগেই', 'আছে', 'আজ', 'আদ্যভাগে', 'আপনার',
22+
'আপনি', 'আবার', 'আমরা', 'আমাকে', 'আমাদের', 'আমার', 'আমি',
23+
'আর', 'আরও', 'ই', 'ইত্যাদি', 'ইহা', 'উচিত', 'উত্তর', 'উনি',
24+
'উপর', 'উপরে', 'এ', 'এঁদের', 'এঁরা', 'এই', 'একই', 'একটি',
25+
'একবার', 'একে', 'এক্', 'এখন', 'এখনও', 'এখানে', 'এখানেই',
26+
'এটা', 'এটাই', 'এটি', 'এত', 'এতটাই', 'এতে', 'এদের', 'এব',
27+
'এবং', 'এবার', 'এমন', 'এমনকী', 'এমনি', 'এর', 'এরা', 'এল',
28+
'এস', 'এসে', 'ঐ', 'ও', 'ওঁদের', 'ওঁর', 'ওঁরা', 'ওই', 'ওকে',
29+
'ওখানে', 'ওদের', 'ওর', 'ওরা', 'কখনও', 'কত', 'কবে', 'কমনে',
30+
'কয়েক', 'কয়েকটি', 'করছে', 'করছেন', 'করতে', 'করবে', 'করবেন',
31+
'করলে', 'করলেন', 'করা', 'করাই', 'করায়', 'করার', 'করি',
32+
'করিতে', 'করিয়া', 'করিয়ে', 'করে', 'করেই', 'করেছিলেন', 'করেছে',
33+
'করেছেন', 'করেন', 'কাউকে', 'কাছ', 'কাছে', 'কাজ', 'কাজে',
34+
'কারও', 'কারণ', 'কি', 'কিংবা', 'কিছু', 'কিছুই', 'কিন্তু', 'কী',
35+
'কে', 'কেউ', 'কেউই', 'কেখা', 'কেন', 'কোটি', 'কোন', 'কোনও',
36+
'কোনো', 'ক্ষেত্রে', 'কয়েক', 'খুব', 'গিয়ে', 'গিয়েছে', 'গিয়ে', 'গুলি', 'গেছে',
37+
'গেল', 'গেলে', 'গোটা', 'চলে', 'চান', 'চায়', 'চার', 'চালু', 'চেয়ে',
38+
'চেষ্টা', 'ছাড়া', 'ছাড়াও', 'ছিল', 'ছিলেন', 'জন', 'জনকে', 'জনের',
39+
'জন্য', 'জন্যওজে', 'জানতে', 'জানা', 'জানানো', 'জানায়', 'জানিয়ে',
40+
'জানিয়েছে', 'জে', 'জ্নজন', 'টি', 'ঠিক', 'তখন', 'তত', 'তথা', 'তবু',
41+
'তবে', 'তা', 'তাঁকে', 'তাঁদের', 'তাঁর', 'তাঁরা', 'তাঁাহারা', 'তাই', 'তাও',
42+
'তাকে', 'তাতে', 'তাদের', 'তার', 'তারপর', 'তারা', 'তারৈ', 'তাহলে',
43+
'তাহা', 'তাহাতে', 'তাহার', 'তিনঐ', 'তিনি', 'তিনিও', 'তুমি', 'তুলে',
44+
'তেমন', 'তো', 'তোমার', 'থাকবে', 'থাকবেন', 'থাকা', 'থাকায়', 'থাকে',
45+
'থাকেন', 'থেকে', 'থেকেই', 'থেকেও', 'দিকে', 'দিতে', 'দিন', 'দিয়ে',
46+
'দিয়েছে', 'দিয়েছেন', 'দিলেন', 'দু', 'দুই', 'দুটি', 'দুটো', 'দেওয়া', 'দেওয়ার',
47+
'দেওয়া', 'দেখতে', 'দেখা', 'দেখে', 'দেন', 'দেয়', 'দ্বারা', 'ধরা', 'ধরে',
48+
'ধামার', 'নতুন', 'নয়', 'না', 'নাই', 'নাকি', 'নাগাদ', 'নানা', 'নিজে',
49+
'নিজেই', 'নিজেদের', 'নিজের', 'নিতে', 'নিয়ে', 'নিয়ে', 'নেই', 'নেওয়া',
50+
'নেওয়ার', 'নেওয়া', 'নয়', 'পক্ষে', 'পর', 'পরে', 'পরেই', 'পরেও', 'পর্যন্ত',
51+
'পাওয়া', 'পাচ', 'পারি', 'পারে', 'পারেন', 'পি', 'পেয়ে', 'পেয়্র্', 'প্রতি',
52+
'প্রথম', 'প্রভৃতি', 'প্রযন্ত', 'প্রাথমিক', 'প্রায়', 'প্রায়', 'ফলে', 'ফিরে', 'ফের',
53+
'বক্তব্য', 'বদলে', 'বন', 'বরং', 'বলতে', 'বলল', 'বললেন', 'বলা', 'বলে',
54+
'বলেছেন', 'বলেন', 'বসে', 'বহু', 'বা', 'বাদে', 'বার', 'বি', 'বিনা', 'বিভিন্ন',
55+
'বিশেষ', 'বিষয়টি', 'বেশ', 'বেশি', 'ব্যবহার', 'ব্যাপারে', 'ভাবে', 'ভাবেই',
56+
'মতো', 'মতোই', 'মধ্যভাগে', 'মধ্যে', 'মধ্যেই', 'মধ্যেও', 'মনে', 'মাত্র',
57+
'মাধ্যমে', 'মোট', 'মোটেই', 'যখন', 'যত', 'যতটা', 'যথেষ্ট', 'যদি', 'যদিও',
58+
'যা', 'যাঁর', 'যাঁরা', 'যাওয়া', 'যাওয়ার', 'যাওয়া', 'যাকে', 'যাচ্ছে', 'যাতে',
59+
'যাদের', 'যান', 'যাবে', 'যায়', 'যার', 'যারা', 'যিনি', 'যে', 'যেখানে', 'যেতে',
60+
'যেন', 'যেমন', 'র', 'রকম', 'রয়েছে', 'রাখা', 'রেখে', 'লক্ষ', 'শুধু', 'শুরু',
61+
'সঙ্গে', 'সঙ্গেও', 'সব', 'সবার', 'সমস্ত', 'সম্প্রতি', 'সহ', 'সহিত', 'সাধারণ',
62+
'সামনে', 'সি', 'সুতরাং', 'সে', 'সেই', 'সেখান', 'সেখানে', 'সেটা', 'সেটাই',
63+
'সেটাও', 'সেটি', 'স্পষ্ট', 'স্বয়ং', 'হইতে', 'হইবে', 'হইয়া', 'হওয়া', 'হওয়ায়',
64+
'হওয়ার', 'হচ্ছে', 'হত', 'হতে', 'হতেই', 'হন', 'হবে', 'হবেন', 'হয়', 'হয়তো',
65+
'হয়নি', 'হয়ে', 'হয়েই', 'হয়েছিল', 'হয়েছে', 'হয়েছেন', 'হল', 'হলে', 'হলেই',
66+
'হলেও', 'হলো', 'হাজার', 'হিসাবে', 'হৈলে', 'হোক', 'হয়'
67+
]
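
Several entries in this list appear to occur more than once. Whether any of those repeats differ only in how `য়` is encoded cannot be determined from this page, but that is a common source of hidden duplicates in Bengali text. The following sketch, using a hypothetical helper `dedupe_normalized`, shows how such a list could be deduplicated after Unicode normalization. It is a suggestion, not part of this commit.

```python
import unicodedata


def dedupe_normalized(words):
    """Drop repeats after NFC normalization, keeping first-seen order."""
    seen = set()
    unique = []
    for word in words:
        key = unicodedata.normalize("NFC", word)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


# The same word spelled with precomposed U+09DF and with U+09AF + U+09BC.
precomposed = "ক\u09dfেক"
decomposed = "ক\u09af\u09bcেক"
print(precomposed == decomposed)  # False: the code points differ
print(len(dedupe_normalized([precomposed, decomposed])))  # 1
```

Normalizing the input text the same way before a stopword lookup would keep both spellings from slipping past the filter.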
1768

1869
# return list of bengali punctuation
1970
punctuations = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~।ঃ'
2071

72+
# return bangla letters
73+
letters = 'অআইঈউঊঋএঐওঔকখগঘঙচছজঝঞটঠডঢণতথদধনপফবভমযরলশষসহড়ঢ়য়ৎংঃঁ'
74+
75+
# return bangla digits
76+
digits = '০১২৩৪৫৬৭৮৯'
77+
78+
# bengali vower mark
79+
vower_mark = 'া ি ী ু ৃ ে ৈ ো ৌ'
80+
81+
82+
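
One use for the new `digits` string is converting Bengali numerals to ASCII. A minimal sketch, assuming only the `digits` value shown in this diff; the helper `to_ascii_digits` is hypothetical and is not provided by bnlp.

```python
# Digit string as added to bnlp/corpus/__init__.py in this commit.
digits = "০১২৩৪৫৬৭৮৯"

# Map each Bengali digit to the ASCII digit at the same index.
BN_TO_ASCII = str.maketrans(digits, "0123456789")


def to_ascii_digits(text):
    """Replace Bengali digits with ASCII digits, leaving other characters alone."""
    return text.translate(BN_TO_ASCII)


print(to_ascii_digits("২০২১"))  # 2021
```

The mapping relies on `digits` listing the numerals in order from zero to nine, as it does in this commit.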
8.77 KB
Binary file not shown.
