Testing Bangla POS tags
Version: 0.0.1
LOCAL ENVIRONMENT
OS : Ubuntu 18.04.3 LTS (64-bit) Bionic Beaver
Memory : 7.7 GiB
Processor : Intel® Core™ i5-8250U CPU @ 1.60GHz × 8
Graphics : Intel® UHD Graphics 620 (Kabylake GT2)
Gnome : 3.28.2
Python requirements
pip3 install -r requirements.txt
It's better to install these inside a virtual environment.
- Microsoft India Format Annotation of NLTR DATA
- ILMT/SketchEngine Format Annotation of NLTR DATA
- Pretrained Model for the second format
- Resources: this folder holds all the resources shared by Mamun sir in the mail thread
- Place the Pretrained Model (keras_mlp_bangla.h5) under tests/ILMT_TAGSET_TEST/
- testing kernel: ilmt_test.ipynb
- Number of sentences in the dataset: 2927
- Total number of words in the dataset: 40554
- Number of unique words in the dataset: 12514
- The word-wise tags and wordcount csv is available at:
/tests/ILMT_TAGSET_TEST/tagged_data_wtc.csv
- Tags found: 33
'JJ', 'NC', 'PU', 'CCD', 'NP',
'VM', 'JQ', 'PRL', 'CX', 'DAB',
'PPR', 'CSB', 'PP', 'NV', 'CCL',
'AMN', 'RDS', 'VAUX', 'NST',
'ALC', 'PWH', 'RDF', 'PRF',
'PRC', 'LC', 'DRL', 'LV', 'DWH', 'CIN',
'RDX', 'VA', '?', 'CSD'
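The counts above can be reproduced with a few lines of standard-library Python. This is only a sketch: it assumes the dataset is already loaded as a list of sentences, each a list of (word, tag) tuples, and it counts every token (punctuation included) as a word, which may differ from how the notebook counts. The name `dataset_stats` and the toy data are hypothetical.

```python
from collections import Counter


def dataset_stats(tagged_sentences):
    """Summarise a list of sentences, each a list of (word, tag) tuples."""
    words = [word for sentence in tagged_sentences for word, _ in sentence]
    tags = [tag for sentence in tagged_sentences for _, tag in sentence]
    return {
        "sentences": len(tagged_sentences),
        "total_words": len(words),
        "unique_words": len(set(words)),
        "tag_counts": Counter(tags),  # how often each tag occurs
    }


# Toy data in the (word, tag) format; NOT taken from the real dataset.
sample = [
    [("তাজা", "JJ"), ("ফল", "NC"), ("৷", "PU")],
    [("শুকনা", "JJ"), ("ফল", "NC"), ("৷", "PU")],
]

stats = dataset_stats(sample)
print(stats["sentences"], stats["total_words"], stats["unique_words"])  # 2 6 4
print(sorted(stats["tag_counts"]))  # ['JJ', 'NC', 'PU']
```

On the real data, `len(stats["tag_counts"])` should correspond to the number of tags found.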
- A tagged_sentence is defined as a list of (word, tag) tuples
example:[('রপ্তানি', 'JJ'), ('দ্রব্য', 'NC'), ('-', 'PU'), ('তাজা', 'JJ'), ('ও', 'CCD'), ('শুকনা', 'JJ'), ('ফল', 'NC'), (',', 'PU'), ('আফিম', 'NC'), (',', 'PU'), ('পশুচর্ম', 'NC'), ('ও', 'CCD'), ('পশম', 'NC'), ('এবং', 'CCD'), ('কার্পেট', 'NC'), ('৷', 'PU')]
- For each term in a tagged_sentence, the features are built in the following format:
{
'nb_terms' : number of terms in the sentence,
'term' : the specific term,
'is_first' : True if the term is the first one in the sentence,
'is_last' : True if the term is the last one in the sentence,
'prefix-1' : term[0],
'prefix-2' : term[:2],
'prefix-3' : term[:3],
'suffix-1' : term[-1],
'suffix-2' : term[-2:],
'suffix-3' : term[-3:],
'prev_word' : the previous word ('' for the first term),
'next_word' : the next word
}
Example: for the term 'রপ্তানি' (the first word of the tagged sentence above), the features look as follows:
{
'nb_terms': 16,
'term': 'রপ্তানি',
'is_first': True,
'is_last': False,
'prefix-1': 'র',
'prefix-2': 'রপ',
'prefix-3': 'রপ্',
'suffix-1': 'ি',
'suffix-2': 'নি',
'suffix-3': 'ানি',
'prev_word': '',
'next_word': 'দ্রব্য'
}
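The feature format above can be sketched as a small Python function. This is an illustration, not necessarily the code used in ilmt_test.ipynb: the function name `add_basic_features` is hypothetical, terms are assumed to be non-empty strings, and returning '' as `next_word` for the last term is an assumption made by analogy with `prev_word`.

```python
def add_basic_features(sentence_terms, index):
    """Build the feature dict for the term at `index` in a list of words."""
    term = sentence_terms[index]  # assumed to be a non-empty string
    last = len(sentence_terms) - 1
    return {
        'nb_terms': len(sentence_terms),
        'term': term,
        'is_first': index == 0,
        'is_last': index == last,
        'prefix-1': term[0],
        'prefix-2': term[:2],
        'prefix-3': term[:3],
        'suffix-1': term[-1],
        'suffix-2': term[-2:],
        'suffix-3': term[-3:],
        'prev_word': '' if index == 0 else sentence_terms[index - 1],
        # Assumption: '' for the last term, mirroring prev_word.
        'next_word': '' if index == last else sentence_terms[index + 1],
    }


# First four tokens of the example sentence (shortened for illustration).
tagged_sentence = [('রপ্তানি', 'JJ'), ('দ্রব্য', 'NC'), ('-', 'PU'), ('তাজা', 'JJ')]
terms = [word for word, _ in tagged_sentence]
features = add_basic_features(terms, 0)
print(features['nb_terms'], features['is_first'], features['next_word'])
```

Python slices strings by Unicode code point, not by visible character, which is why 'prefix-3' in the example above ends on a hasanta (্) rather than on a full conjunct.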
- The model structure is as follows (surely we can do better, In Shaa Allah):
Model: "sequential_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_1 (Dense) (None, 512) 24136192
_________________________________________________________________
activation_1 (Activation) (None, 512) 0
_________________________________________________________________
dropout_1 (Dropout) (None, 512) 0
_________________________________________________________________
dense_2 (Dense) (None, 512) 262656
_________________________________________________________________
activation_2 (Activation) (None, 512) 0
_________________________________________________________________
dropout_2 (Dropout) (None, 512) 0
_________________________________________________________________
dense_3 (Dense) (None, 36) 18468
=================================================================
Total params: 24,417,316
Trainable params: 24,417,316
Non-trainable params: 0
_________________________
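The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The sketch below redoes that arithmetic. The input width of 47140 is not stated anywhere in this README; it is inferred from the dense_1 count, and `dense_params` is a hypothetical helper name.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus biases (n_units) of a dense layer."""
    return n_inputs * n_units + n_units


# Inferred, not stated in the source: solve (n + 1) * 512 = 24136192 for n.
n_features = 24136192 // 512 - 1

counts = [
    dense_params(n_features, 512),  # dense_1
    dense_params(512, 512),         # dense_2
    dense_params(512, 36),          # dense_3
]
print(n_features)   # 47140
print(counts)       # [24136192, 262656, 18468]
print(sum(counts))  # 24417316
```

The activation and dropout layers contribute no parameters, so the three dense layers account for the full total.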
| Category | Subcategory | Part-of-speech tag |
|---|---|---|
| NOUN | Common | NC.* |
| NOUN | Proper | NP.* |
| NOUN | Verbal | NV.* |
| NOUN | Spatio-temporal | NST |
| VERB | Main | VM.* |
| VERB | Auxiliary | VA.* |
| PRONOUN | Pronominal | PPR.* |
| PRONOUN | Reflexive | PRF.* |
| PRONOUN | Reciprocal | PRC.* |
| PRONOUN | Relative | PRL.* |
| PRONOUN | Wh-pronoun | PWH.* |
| NOMINAL MODIFIER | Adjective | JJ.* |
| NOMINAL MODIFIER | Quantifier | JQ.* |
| DEMONSTRATIVE | Absolute | DAB.* |
| DEMONSTRATIVE | Relative | DRL.* |
| DEMONSTRATIVE | Wh | DWH.* |
| ADVERB | Manner | AMN.* |
| ADVERB | Location | ALC.* |
| PARTICIPLE | Verbal (Adverbial) | LV.* |
| PARTICIPLE | Conditional | LC.* |
| PARTICLE | Coordinating | CCD.* |
| PARTICLE | Subordinating | CSB.* |
| PARTICLE | Classifier | CCL.* |
| PARTICLE | Interjection | CIN.* |
| PARTICLE | Others | CX.* |
| Postposition | | PP |
| Punctuation | | PU |
| RESIDUAL | Foreign word | RDF |
| RESIDUAL | Symbol | RDS |
| RESIDUAL | Others | RDX |
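As a rough illustration of how the tag table can be used, the sketch below maps a tag to its category and subcategory. It assumes the `.*` suffix in the table is meant as a regular-expression wildcard (any trailing characters), which is an interpretation rather than something this README states. `TAG_PATTERNS` and `categorize` are hypothetical names.

```python
import re

# (category, subcategory, pattern) rows copied from the tag table.
TAG_PATTERNS = [
    ("NOUN", "Common", r"NC.*"), ("NOUN", "Proper", r"NP.*"),
    ("NOUN", "Verbal", r"NV.*"), ("NOUN", "Spatio-temporal", r"NST"),
    ("VERB", "Main", r"VM.*"), ("VERB", "Auxiliary", r"VA.*"),
    ("PRONOUN", "Pronominal", r"PPR.*"), ("PRONOUN", "Reflexive", r"PRF.*"),
    ("PRONOUN", "Reciprocal", r"PRC.*"), ("PRONOUN", "Relative", r"PRL.*"),
    ("PRONOUN", "Wh-pronoun", r"PWH.*"),
    ("NOMINAL MODIFIER", "Adjective", r"JJ.*"),
    ("NOMINAL MODIFIER", "Quantifier", r"JQ.*"),
    ("DEMONSTRATIVE", "Absolute", r"DAB.*"),
    ("DEMONSTRATIVE", "Relative", r"DRL.*"),
    ("DEMONSTRATIVE", "Wh", r"DWH.*"),
    ("ADVERB", "Manner", r"AMN.*"), ("ADVERB", "Location", r"ALC.*"),
    ("PARTICIPLE", "Verbal (Adverbial)", r"LV.*"),
    ("PARTICIPLE", "Conditional", r"LC.*"),
    ("PARTICLE", "Coordinating", r"CCD.*"),
    ("PARTICLE", "Subordinating", r"CSB.*"),
    ("PARTICLE", "Classifier", r"CCL.*"),
    ("PARTICLE", "Interjection", r"CIN.*"),
    ("PARTICLE", "Others", r"CX.*"),
    ("Postposition", None, r"PP"), ("Punctuation", None, r"PU"),
    ("RESIDUAL", "Foreign word", r"RDF"), ("RESIDUAL", "Symbol", r"RDS"),
    ("RESIDUAL", "Others", r"RDX"),
]


def categorize(tag):
    """Return (category, subcategory) for a tag, or None if no row matches."""
    for category, subcategory, pattern in TAG_PATTERNS:
        # fullmatch keeps 'PP' from matching 'PPR' and vice versa.
        if re.fullmatch(pattern, tag):
            return category, subcategory
    return None


print(categorize("VAUX"))  # ('VERB', 'Auxiliary')
print(categorize("PP"))    # ('Postposition', None)
print(categorize("CSD"))   # None
```

Tags that appear in the dataset but have no row in the table, such as 'CSD' and '?', come back as None under this mapping.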