BengaliAI/banglaPOSTagTest

banglaPOSTagTest

Testing Bangla POS tags.

Version: 0.0.1     

LOCAL ENVIRONMENT

OS          : Ubuntu 18.04.3 LTS (64-bit) Bionic Beaver        
Memory      : 7.7 GiB  
Processor   : Intel® Core i5-8250U CPU @ 1.60GHz × 8
Graphics    : Intel® UHD Graphics 620 (Kabylake GT2)  
Gnome       : 3.28.2  

Setup

Python requirements

  • pip3 install -r requirements.txt

It's better to use a virtual environment.

Sources

Bangla Test-1: ILMT_TAGSET_TEST

  • Place the pretrained model (keras_mlp_bangla.h5) under tests/ILMT_TAGSET_TEST/
  • Testing notebook (kernel): ilmt_test.ipynb

Data EDA:

  • Number of sentences in the dataset: 2927
  • Number of total words in the dataset: 40554
  • Number of unique words in the dataset: 12514
  • The CSV of word-wise tags and word counts is available at: /tests/ILMT_TAGSET_TEST/tagged_data_wtc.csv
  • Tags found: 33
    'JJ', 'NC', 'PU', 'CCD', 'NP', 
    'VM', 'JQ', 'PRL', 'CX', 'DAB',
    'PPR', 'CSB', 'PP', 'NV', 'CCL', 
    'AMN', 'RDS', 'VAUX', 'NST',
    'ALC', 'PWH', 'RDF', 'PRF', 
    'PRC', 'LC', 'DRL', 'LV', 'DWH', 'CIN',
    'RDX', 'VA', '?', 'CSD'
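
The counts above can be reproduced with a few lines of Python. This is a minimal sketch, assuming the dataset is held in memory as a list of sentences, each a list of (word, tag) tuples. The function name dataset_stats and the two-sentence sample are hypothetical, and the notebook may count differently (for example, in how punctuation tokens are treated).

```python
from collections import Counter

def dataset_stats(tagged_sentences):
    """Return sentence count, word counts and the set of tags seen."""
    word_counts = Counter()
    tags = set()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            word_counts[word] += 1
            tags.add(tag)
    return {
        "sentences": len(tagged_sentences),
        "total_words": sum(word_counts.values()),
        "unique_words": len(word_counts),
        "tags": sorted(tags),
    }

# Tiny hypothetical sample, not the real dataset.
sample = [
    [("তাজা", "JJ"), ("ফল", "NC"), ("৷", "PU")],
    [("শুকনা", "JJ"), ("ফল", "NC"), ("৷", "PU")],
]
stats = dataset_stats(sample)
print(stats)
```

Here every token, punctuation included, is counted as a word; adjust the loop if the dataset statistics were computed otherwise.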

Feature Format (USED FOR THE PRETRAINED MODEL)

  • A tagged_sentence is defined as a list of (word, tag) tuples.

example:[('রপ্তানি', 'JJ'), ('দ্রব্য', 'NC'), ('-', 'PU'), ('তাজা', 'JJ'), ('ও', 'CCD'), ('শুকনা', 'JJ'), ('ফল', 'NC'), (',', 'PU'), ('আফিম', 'NC'), (',', 'PU'), ('পশুচর্ম', 'NC'), ('ও', 'CCD'), ('পশম', 'NC'), ('এবং', 'CCD'), ('কার্পেট', 'NC'), ('৷', 'PU')]

  • For the tagged_sentences, the feature format for each term is as follows:
    {
        'nb_terms'  : number of terms in the sentence,
        'term'      : the specific term,
        'is_first'  : True if the term is the first one in sentence,
        'is_last'   : True if the term is the last one in sentence,
        'prefix-1'  : term[0],
        'prefix-2'  : term[:2],
        'prefix-3'  : term[:3],
        'suffix-1'  : term[-1],
        'suffix-2'  : term[-2:],
        'suffix-3'  : term[-3:],
        'prev_word' : the previous word,
        'next_word' : the next word
    }

example: for the term 'রপ্তানি', the features are constructed as follows:

{
    'nb_terms': 16,
    'term': 'রপ্তানি',
    'is_first': True,
    'is_last': False,
    'prefix-1': 'র',
    'prefix-2': 'রপ',
    'prefix-3': 'রপ্',
    'suffix-1': 'ি',
    'suffix-2': 'নি',
    'suffix-3': 'ানি',
    'prev_word': '',
    'next_word': 'দ্রব্য'
}
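
The feature format above can be sketched as a small function. This is an illustration under assumptions, not necessarily the code used to train the pretrained model: the name add_basic_features and the use of an empty string for a missing neighbour are inferred from the example dict.

```python
def add_basic_features(sentence_terms, index):
    """Build the feature dict for the term at `index` in a list of words."""
    term = sentence_terms[index]
    last = len(sentence_terms) - 1
    return {
        'nb_terms': len(sentence_terms),
        'term': term,
        'is_first': index == 0,
        'is_last': index == last,
        'prefix-1': term[0],
        'prefix-2': term[:2],
        'prefix-3': term[:3],
        'suffix-1': term[-1],
        'suffix-2': term[-2:],
        'suffix-3': term[-3:],
        # Assumption: neighbours outside the sentence are encoded as ''.
        'prev_word': '' if index == 0 else sentence_terms[index - 1],
        'next_word': '' if index == last else sentence_terms[index + 1],
    }

# Words of the example sentence, with the tags dropped.
words = ['রপ্তানি', 'দ্রব্য', '-', 'তাজা', 'ও', 'শুকনা', 'ফল', ',',
         'আফিম', ',', 'পশুচর্ম', 'ও', 'পশম', 'এবং', 'কার্পেট', '৷']
features = add_basic_features(words, 0)
print(features)
```

Python slices strings by Unicode code point, so a Bangla prefix or suffix can end in the middle of a conjunct or carry a bare vowel sign, as in 'রপ্' and 'ি' above.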

Model Analysis (BNLTK USED MODEL)

  • The model structure is as follows (surely we can do better, In Shaa Allah):
Model: "sequential_1" 
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_1 (Dense)              (None, 512)               24136192  
_________________________________________________________________
activation_1 (Activation)    (None, 512)               0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 512)               262656    
_________________________________________________________________
activation_2 (Activation)    (None, 512)               0         
_________________________________________________________________
dropout_2 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 36)                18468     
=================================================================
Total params: 24,417,316
Trainable params: 24,417,316
Non-trainable params: 0
_________________________________________________________________
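
The parameter counts in the summary can be checked by hand. The sketch below assumes each Dense layer is a standard fully connected layer with a bias term, so its parameter count is n_in * n_out + n_out. The helper name dense_params is hypothetical.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

# Infer the input width from dense_1: params = (n_in + 1) * 512.
n_in = 24136192 // 512 - 1

layers = [(n_in, 512), (512, 512), (512, 36)]
counts = [dense_params(a, b) for a, b in layers]
total = sum(counts)
print(n_in, counts, total)
# prints: 47140 [24136192, 262656, 18468] 24417316
```

Under that assumption the network takes a 47140-dimensional input vector, and almost 99% of the parameters sit in the first layer. The source does not state how the feature dicts are encoded into that vector.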

Testing Tags

Category          Subcategory         Part-of-speech tag
----------------  ------------------  ------------------
NOUN              Common              NC.*
                  Proper              NP.*
                  Verbal              NV.*
                  Spatio-temporal     NST
VERB              Main                VM.*
                  Auxiliary           VA.*
PRONOUN           Pronominal          PPR.*
                  Reflexive           PRF.*
                  Reciprocal          PRC.*
                  Relative            PRL.*
                  Wh-pronoun          PWH.*
NOMINAL MODIFIER  Adjective           JJ.*
                  Quantifier          JQ.*
DEMONSTRATIVE     Absolute            DAB.*
                  Relative            DRL.*
                  Wh                  DWH.*
ADVERB            Manner              AMN.*
                  Location            ALC.*
PARTICIPLE        Verbal (Adverbial)  LV.*
                  Conditional         LC.*
PARTICLE          Coordinating        CCD.*
                  Subordinating       CSB.*
                  Classifier          CCL.*
                  Interjection        CIN.*
                  Others              CX.*
Postposition      -                   PP
Punctuation       -                   PU
RESIDUAL          Foreign word        RDF
                  Symbol              RDS
                  Others              RDX
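
The tag patterns in the table read naturally as regular expressions, so a predicted tag can be mapped back to its category with a lookup. This is a minimal sketch: the names TAG_TABLE and categorize are hypothetical, and treating Postposition and Punctuation as categories without a subcategory is an assumption.

```python
import re

# (category, subcategory, pattern) rows transcribed from the table.
TAG_TABLE = [
    ("NOUN", "Common", "NC.*"), ("NOUN", "Proper", "NP.*"),
    ("NOUN", "Verbal", "NV.*"), ("NOUN", "Spatio-temporal", "NST"),
    ("VERB", "Main", "VM.*"), ("VERB", "Auxiliary", "VA.*"),
    ("PRONOUN", "Pronominal", "PPR.*"), ("PRONOUN", "Reflexive", "PRF.*"),
    ("PRONOUN", "Reciprocal", "PRC.*"), ("PRONOUN", "Relative", "PRL.*"),
    ("PRONOUN", "Wh-pronoun", "PWH.*"),
    ("NOMINAL MODIFIER", "Adjective", "JJ.*"),
    ("NOMINAL MODIFIER", "Quantifier", "JQ.*"),
    ("DEMONSTRATIVE", "Absolute", "DAB.*"),
    ("DEMONSTRATIVE", "Relative", "DRL.*"),
    ("DEMONSTRATIVE", "Wh", "DWH.*"),
    ("ADVERB", "Manner", "AMN.*"), ("ADVERB", "Location", "ALC.*"),
    ("PARTICIPLE", "Verbal (Adverbial)", "LV.*"),
    ("PARTICIPLE", "Conditional", "LC.*"),
    ("PARTICLE", "Coordinating", "CCD.*"),
    ("PARTICLE", "Subordinating", "CSB.*"),
    ("PARTICLE", "Classifier", "CCL.*"),
    ("PARTICLE", "Interjection", "CIN.*"),
    ("PARTICLE", "Others", "CX.*"),
    # Assumption: these two have no subcategory.
    ("Postposition", None, "PP"), ("Punctuation", None, "PU"),
    ("RESIDUAL", "Foreign word", "RDF"), ("RESIDUAL", "Symbol", "RDS"),
    ("RESIDUAL", "Others", "RDX"),
]

def categorize(tag):
    """Return (category, subcategory) for a tag, or None if no row matches."""
    for category, subcategory, pattern in TAG_TABLE:
        # fullmatch keeps 'PP' from also matching 'PPR'.
        if re.fullmatch(pattern, tag):
            return category, subcategory
    return None

for tag in ["NC", "VAUX", "PP", "PPR", "CSD", "?"]:
    print(tag, categorize(tag))
```

With these patterns, 'VAUX' falls under VA.*, while 'CSD' and '?' from the dataset's tag list match no row and return None.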
