banglaPOSTagTest

Testing Bangla POS tags.

Version: 0.0.1

LOCAL ENVIRONMENT

OS          : Ubuntu 18.04.3 LTS (64-bit) Bionic Beaver        
Memory      : 7.7 GiB  
Processor   : Intel® Core™ i5-8250U CPU @ 1.60GHz × 8    
Graphics    : Intel® UHD Graphics 620 (Kabylake GT2)  
Gnome       : 3.28.2

Setup

Python requirements

pip3 install -r requirements.txt

It's better to use a virtual environment.

Sources

Microsoft India Format Annotation of NLTR DATA
ILMT/SketchEngine Format Annotation of NLTR DATA
Pretrained Model for the second format
Resources: this folder holds all the resources shared by Mamun sir in the mail thread

Bangla Test-1: ILMT_TAGSET_TEST

Place the Pretrained Model (keras_mlp_bangla.h5) under tests/ILMT_TAGSET_TEST/
Testing notebook (kernel): ilmt_test.ipynb

Data EDA:

Number of sentences in the dataset: 2927
Number of total words in the dataset: 40554
Number of unique words in the dataset: 12514
The word-wise tags and word counts are available as a CSV at: /tests/ILMT_TAGSET_TEST/tagged_data_wtc.csv
Tags found: 33

    'JJ', 'NC', 'PU', 'CCD', 'NP', 
    'VM', 'JQ', 'PRL', 'CX', 'DAB',
    'PPR', 'CSB', 'PP', 'NV', 'CCL', 
    'AMN', 'RDS', 'VAUX', 'NST',
    'ALC', 'PWH', 'RDF', 'PRF', 
    'PRC', 'LC', 'DRL', 'LV', 'DWH', 'CIN',
    'RDX', 'VA', '?', 'CSD'
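
Counts like the ones above can be reproduced with a few lines of Python. The following is only a minimal sketch: the function name `corpus_stats` is hypothetical, the sample data is made up, and it assumes the dataset is already loaded as a list of sentences, each a list of (word, tag) tuples, with punctuation tokens counted as words. The notebook may compute these numbers differently.

```python
from collections import Counter


def corpus_stats(tagged_sentences):
    """Return basic counts for a list of sentences of (word, tag) tuples."""
    # Count every token occurrence across all sentences.
    word_counts = Counter(word for sent in tagged_sentences for word, _ in sent)
    # Collect the distinct tags that actually occur in the data.
    tags = {tag for sent in tagged_sentences for _, tag in sent}
    return {
        'sentences': len(tagged_sentences),
        'total_words': sum(word_counts.values()),
        'unique_words': len(word_counts),
        'tags': sorted(tags),
    }


# Tiny made-up sample, only to show the expected input shape.
sample = [
    [('ফল', 'NC'), ('ও', 'CCD'), ('পশম', 'NC'), ('৷', 'PU')],
    [('তাজা', 'JJ'), ('ফল', 'NC'), ('৷', 'PU')],
]
stats = corpus_stats(sample)
print(stats)
```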

Feature Format (USED FOR THE PRETRAINED MODEL)

A tagged_sentence is defined as a list of (word, tag) tuples.

example:[('রপ্তানি', 'JJ'), ('দ্রব্য', 'NC'), ('-', 'PU'), ('তাজা', 'JJ'), ('ও', 'CCD'), ('শুকনা', 'JJ'), ('ফল', 'NC'), (',', 'PU'), ('আফিম', 'NC'), (',', 'PU'), ('পশুচর্ম', 'NC'), ('ও', 'CCD'), ('পশম', 'NC'), ('এবং', 'CCD'), ('কার্পেট', 'NC'), ('৷', 'PU')]

For each term in a tagged_sentence, the feature format is as follows:

    {
        'nb_terms'  : number of terms in the sentence,
        'term'      : the specific term,
        'is_first'  : True if the term is the first one in sentence,
        'is_last'   : True if the term is the last one in sentence,
        'prefix-1'  : term[0],
        'prefix-2'  : term[:2],
        'prefix-3'  : term[:3],
        'suffix-1'  : term[-1],
        'suffix-2'  : term[-2:],
        'suffix-3'  : term[-3:],
        'prev_word' : the previous word,
        'next_word' : the next word
    }

Example: for the term 'রপ্তানি' (the first term of the example sentence above), the features look as follows:

{
    'nb_terms': 16,
    'term': 'রপ্তানি',
    'is_first': True,
    'is_last': False,
    'prefix-1': 'র',
    'prefix-2': 'রপ',
    'prefix-3': 'রপ্',
    'suffix-1': 'ি',
    'suffix-2': 'নি',
    'suffix-3': 'ানি',
    'prev_word': '',
    'next_word': 'দ্রব্য'
}
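
The feature format above can be sketched as a small function. This is an illustration under stated assumptions, not the code used to train the model: the name `add_basic_features` is hypothetical, and returning an empty string as `next_word` for the last term is assumed by symmetry with the empty `prev_word` shown for the first term.

```python
def add_basic_features(sentence_terms, index):
    """Build the feature dict for the term at `index` in a list of terms."""
    term = sentence_terms[index]
    last = len(sentence_terms) - 1
    return {
        'nb_terms': len(sentence_terms),
        'term': term,
        'is_first': index == 0,
        'is_last': index == last,
        # Slices work on Unicode code points, not on visible glyphs.
        'prefix-1': term[0],
        'prefix-2': term[:2],
        'prefix-3': term[:3],
        'suffix-1': term[-1],
        'suffix-2': term[-2:],
        'suffix-3': term[-3:],
        'prev_word': '' if index == 0 else sentence_terms[index - 1],
        # Assumption: empty string when there is no next word.
        'next_word': '' if index == last else sentence_terms[index + 1],
    }


# The words of the example sentence, with the tags stripped.
terms = ['রপ্তানি', 'দ্রব্য', '-', 'তাজা', 'ও', 'শুকনা', 'ফল', ',',
         'আফিম', ',', 'পশুচর্ম', 'ও', 'পশম', 'এবং', 'কার্পেট', '৷']
first = add_basic_features(terms, 0)
final = add_basic_features(terms, len(terms) - 1)
print(first)
```

Because Python slices strings by code point, a prefix or suffix can end in a combining sign (such as the hasanta at the end of 'prefix-3' in the example), which is consistent with the feature values shown above.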

Model Analysis(BNLTK USED MODEL)

The model structure is as follows (surely we can do better, In Shaa Allah):

Model: "sequential_1" 
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_1 (Dense)              (None, 512)               24136192  
_________________________________________________________________
activation_1 (Activation)    (None, 512)               0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 512)               262656    
_________________________________________________________________
activation_2 (Activation)    (None, 512)               0         
_________________________________________________________________
dropout_2 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 36)                18468     
=================================================================
Total params: 24,417,316
Trainable params: 24,417,316
Non-trainable params: 0
_________________________
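
The parameter counts in the summary can be checked by hand. A Dense layer with a bias term (the Keras default) has (inputs + 1) × units parameters, so the first layer's count implies an input vector of 47,140 features. The sketch below only redoes that arithmetic; the helper name `dense_params` is hypothetical. Note that the output layer has 36 units while the EDA above lists 33 tags; the source does not explain the difference.

```python
def dense_params(n_inputs, n_units):
    """Parameter count of a fully connected layer with a bias term."""
    return (n_inputs + 1) * n_units


hidden = 512      # units in dense_1 and dense_2
n_classes = 36    # units in dense_3 (output layer)

# Solve (input_dim + 1) * 512 = 24136192 for the input dimension.
input_dim = 24136192 // hidden - 1

layer_counts = [
    dense_params(input_dim, hidden),   # dense_1
    dense_params(hidden, hidden),      # dense_2
    dense_params(hidden, n_classes),   # dense_3
]
total = sum(layer_counts)
print(input_dim, layer_counts, total)
```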

Testing Tags

Category	Subcategory	Part-of-speech tag
NOUN	Common	NC.*
	Proper	NP.*
	Verbal	NV.*
	Spatio-temporal	NST
VERB	Main	VM.*
	Auxiliary	VA.*
PRONOUN	Pronominal	PPR.*
	Reflexive	PRF.*
	Reciprocal	PRC.*
	Relative	PRL.*
	Wh-pronoun	PWH.*
NOMINAL MODIFIER	Adjective	JJ.*
	Quantifier	JQ.*
DEMONSTRATIVE	Absolute	DAB.*
	Relative	DRL.*
	Wh	DWH.*
ADVERB	Manner	AMN.*
	Location	ALC.*
PARTICIPLE	Verbal (Adverbial)	LV.*
	Conditional	LC.*
PARTICLE	Coordinating	CCD.*
	Subordinating	CSB.*
	Classifier	CCL.*
	Interjection	CIN.*
	Others	CX.*
Postposition		PP
Punctuation		PU
RESIDUAL	Foreign word	RDF
	Symbol	RDS
	Others	RDX
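
The patterns in the table can be used to map a fine-grained tag to its category. The following is a minimal sketch, assuming each entry is a regular expression that must match the whole tag; `TAG_TABLE` and `categorize` are hypothetical names, not part of this repository.

```python
import re

# (category, subcategory, pattern) rows copied from the table above.
TAG_TABLE = [
    ('NOUN', 'Common', r'NC.*'), ('NOUN', 'Proper', r'NP.*'),
    ('NOUN', 'Verbal', r'NV.*'), ('NOUN', 'Spatio-temporal', r'NST'),
    ('VERB', 'Main', r'VM.*'), ('VERB', 'Auxiliary', r'VA.*'),
    ('PRONOUN', 'Pronominal', r'PPR.*'), ('PRONOUN', 'Reflexive', r'PRF.*'),
    ('PRONOUN', 'Reciprocal', r'PRC.*'), ('PRONOUN', 'Relative', r'PRL.*'),
    ('PRONOUN', 'Wh-pronoun', r'PWH.*'),
    ('NOMINAL MODIFIER', 'Adjective', r'JJ.*'),
    ('NOMINAL MODIFIER', 'Quantifier', r'JQ.*'),
    ('DEMONSTRATIVE', 'Absolute', r'DAB.*'),
    ('DEMONSTRATIVE', 'Relative', r'DRL.*'),
    ('DEMONSTRATIVE', 'Wh', r'DWH.*'),
    ('ADVERB', 'Manner', r'AMN.*'), ('ADVERB', 'Location', r'ALC.*'),
    ('PARTICIPLE', 'Verbal (Adverbial)', r'LV.*'),
    ('PARTICIPLE', 'Conditional', r'LC.*'),
    ('PARTICLE', 'Coordinating', r'CCD.*'),
    ('PARTICLE', 'Subordinating', r'CSB.*'),
    ('PARTICLE', 'Classifier', r'CCL.*'),
    ('PARTICLE', 'Interjection', r'CIN.*'),
    ('PARTICLE', 'Others', r'CX.*'),
    ('Postposition', '', r'PP'), ('Punctuation', '', r'PU'),
    ('RESIDUAL', 'Foreign word', r'RDF'), ('RESIDUAL', 'Symbol', r'RDS'),
    ('RESIDUAL', 'Others', r'RDX'),
]


def categorize(tag):
    """Return (category, subcategory) for a tag, or None if no row matches."""
    for category, subcategory, pattern in TAG_TABLE:
        # fullmatch keeps 'PP' (postposition) apart from 'PPR' (pronoun).
        if re.fullmatch(pattern, tag):
            return category, subcategory
    return None


for tag in ['NC', 'VAUX', 'PP', 'PPR', 'CSD', '?']:
    print(tag, categorize(tag))
```

Under this matching rule, two of the tags found in the dataset ('CSD' and '?') match no row of the table and come back as None.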
