BanglaTLit: A Benchmark Dataset for Back-Transliteration of Romanized Bangla

Links: Anthology · arXiv · HuggingFace · Poster · Slides · Video

Md Fahim*, Fariha Tanjim Shifat*, Fabiha Haider*, Deeparghya Dutta Barua, Md Sakib Ul Rahman Sourove, Md Farhan Ishmam, and Farhad Alam Bhuiyan.

Dataset Overview

  • BanglaTLit-PT: A pre-training corpus of 245,727 transliterated (romanized) Bangla samples, used for further pre-training language models.

  • BanglaTLit: A subset of the BanglaTLit-PT corpus containing 42,705 romanized Bangla samples paired with their corresponding Bangla back-transliterations.

  • Summary statistics of the BanglaTLit dataset are provided below. TL: Transliterated and BTL: Back-Transliterated.

    Statistic               TL       BTL
    Mean Character Length   59.24    58.28
    Max Character Length    1406     1347
    Min Character Length    3        4
    Mean Word Count         10.35    10.51
    Max Word Count          212      226
    Min Word Count          2        2
    Unique Word Count       81848    60644
    Unique Sentence Count   42705    42471

Methodology Overview

[Figure: Overview of the proposed dual-encoder architecture for romanized Bangla back-transliteration]

Our proposed architecture is a dual-encoder setup in which the contextualized embeddings from both encoders are aggregated and passed to the T5 decoder. We use a T5 encoder together with a Transliterated Bangla (TB) encoder, i.e., an encoder-based model further pre-trained on the BanglaTLit-PT corpus. Features are aggregated by summation; alternative aggregation strategies are explored in the ablations.
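
A minimal PyTorch/Transformers sketch of this setup is shown below. It is not the repository's exact implementation: class, argument, and model names are illustrative, and the summation assumes both encoders receive inputs padded to the same sequence length.

import torch.nn as nn
from transformers import AutoModel, T5ForConditionalGeneration
from transformers.modeling_outputs import BaseModelOutput

class DualEncoderT5(nn.Module):
    # Dual-encoder seq2seq: T5 encoder states and TB encoder states are
    # summed and fed to the T5 decoder for back-transliteration.
    def __init__(self, t5_name, tb_name):
        super().__init__()
        self.t5 = T5ForConditionalGeneration.from_pretrained(t5_name)
        self.tb = AutoModel.from_pretrained(tb_name)
        # Project TB hidden states to the T5 hidden size before summation.
        self.proj = nn.Linear(self.tb.config.hidden_size, self.t5.config.d_model)

    def forward(self, t5_inputs, tb_inputs, labels=None):
        t5_states = self.t5.encoder(**t5_inputs).last_hidden_state
        tb_states = self.proj(self.tb(**tb_inputs).last_hidden_state)
        # Aggregate the contextualized embeddings by summation (assumes the
        # two tokenized inputs have the same padded length).
        fused = t5_states + tb_states
        return self.t5(
            encoder_outputs=BaseModelOutput(last_hidden_state=fused),
            attention_mask=t5_inputs["attention_mask"],
            labels=labels,
        )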

Quick Start

  • Further Pre-training on Romanized Bangla Corpus

  • Romanized Bangla Back-Transliteration

Installation

Create a virtual environment and install all the dependencies. Ensure that you have Python 3.8 or higher installed.

pip install -r requirements.txt

Further Pre-training (Optional)

If you wish to further pre-train the model on your specific dataset, you can do so by running the following script:

python scripts/further_pretraining.py

This step is optional; you can alternatively use the further pre-trained model weights provided on Hugging Face (see below).
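
For reference, this further pre-training amounts to masked language modeling on the romanized Bangla corpus. The sketch below shows the general pattern only; the base model, file path, and hyperparameters are illustrative assumptions rather than the script's exact settings.

from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "bert-base-multilingual-cased"  # assumed base encoder (e.g. mBERT)
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForMaskedLM.from_pretrained(base)

# Romanized Bangla corpus, one sample per line (file name is illustrative).
corpus = load_dataset("text", data_files={"train": "banglatlit_pt.txt"})
corpus = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="tb-encoder-fpt",
                           num_train_epochs=3,
                           per_device_train_batch_size=32),
    train_dataset=corpus["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer,
                                                  mlm_probability=0.15),
)
trainer.train()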

Further Pre-Trained (FPT) Model Weights

If you prefer not to run further pre-training yourself, you can directly use the further pre-trained weights from Hugging Face. Set the model name in the configuration to the corresponding Hugging Face repository.

FPT Model            Hugging Face Repo
tb-BERT-fpt          aplycaebous/tb-BERT-fpt
tb-mBERT-fpt         aplycaebous/tb-mBERT-fpt
tb-XLM-R-fpt         aplycaebous/tb-XLM-R-fpt
tb-BanglaBERT-fpt    aplycaebous/tb-BanglaBERT-fpt
tb-BanglishBERT-fpt  aplycaebous/tb-BanglishBERT-fpt
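
For example, a further pre-trained encoder can be loaded directly from the Hub as sketched below (the exact configuration key used by the scripts may differ, and the tokenizer is assumed to be bundled with each repository):

from transformers import AutoModel, AutoTokenizer

repo = "aplycaebous/tb-BanglaBERT-fpt"  # any repository from the table above
tokenizer = AutoTokenizer.from_pretrained(repo)
tb_encoder = AutoModel.from_pretrained(repo)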

Training and Evaluation

To train and evaluate the model on Bangla back-transliteration, use the following command:

python scripts/training_back_transliteration.py

Sample Testing

The trained model can be tested on a given sample by running the following command:

python scripts/inference_back_transliteration.py
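
For a rough idea of what inference does, the sketch below back-transliterates a single romanized Bangla sample. It assumes, for simplicity, a plain seq2seq checkpoint saved by the training script; the checkpoint path and example sentence are illustrative, and the actual script may wrap the dual-encoder model differently.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_dir = "outputs/back_transliteration"  # assumed path to the trained checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSeq2SeqLM.from_pretrained(model_dir)

sample = "ami banglay gan gai"  # romanized Bangla input
input_ids = tokenizer(sample, return_tensors="pt").input_ids
output_ids = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))  # Bangla-script output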
