This repo implements and compares two models, a Bidirectional LSTM and TENER (Transformer Encoder for Named Entity Recognition), on the ai4bharat-IndicNER dataset.
A live demo is available at https://tener-malayalam-k2i4cw0x9c.streamlit.app/
Please find the pretrained weights at: https://drive.google.com/drive/folders/13DQ7zTz8fiSTkwmScpd8ZuO5mVtC4y_C?usp=sharing
- modeling_TENER.ipynb contains the training code and some results.
- models/ contains the model definitions for both models.
- malayalam_ner.py implements a helper class that makes it easier to predict with either of the models.
- predict.py contains code for running inference on a single string (a hypothetical usage sketch follows below).
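A hypothetical usage sketch; the class name `MalayalamNER`, its constructor arguments, the `predict` method, and the weights path are all assumptions for illustration, not taken from the actual code:

```python
# Hypothetical usage sketch: class, method, and path names below are
# assumed for illustration and may differ from the repo's actual code.
from malayalam_ner import MalayalamNER

ner = MalayalamNER(model_type="tener", weights_path="weights/tener.pt")
tags = ner.predict("മമ്മൂട്ടി കേരളത്തിലെ ഒരു നടനാണ്")
print(tags)
```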
- Byte Pair Encoding (BPE) is used for tokenization.
- It was chosen after a comparison with fastText.
- The tokens were also vectorized using its pretrained embeddings (see the sketch below).
- The model was taken from BPEmb, which provides pretrained subword embedding models for 275 languages.
- More details on BPEmb, including the specific model I used, can be found here.
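A minimal sketch of loading a pretrained Malayalam BPE model from BPEmb; the vocabulary size and embedding dimension below are assumptions and may differ from the configuration actually used:

```python
from bpemb import BPEmb

# Pretrained Malayalam subword model from BPEmb. The vocabulary size
# (vs) and embedding dimension (dim) are assumptions for illustration.
bpemb_ml = BPEmb(lang="ml", vs=100000, dim=300)

tokens = bpemb_ml.encode("മലയാളം ഒരു ഭാഷയാണ്")    # subword tokens
vectors = bpemb_ml.embed("മലയാളം ഒരു ഭാഷയാണ്")    # (num_subwords, dim) embeddings
```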
Bidirectional LSTM
- Uses 3 layers with a hidden size of 200.
- Uses ReLU as the activation function.
- Combines manually initialized weights with LayerNorm layers for numerical stability (a sketch follows below).
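A minimal PyTorch sketch of such a tagger. The layer count, hidden size, ReLU, LayerNorm, and manual initialization come from the description above; the embedding dimension, tag count, and initialization scheme are assumptions:

```python
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, embed_dim=300, hidden_size=200, num_layers=3, num_tags=7):
        super().__init__()
        self.lstm = nn.LSTM(
            input_size=embed_dim,
            hidden_size=hidden_size,
            num_layers=num_layers,
            bidirectional=True,
            batch_first=True,
        )
        # Manual weight initialization (Xavier here is an assumption).
        for name, param in self.lstm.named_parameters():
            if "weight" in name:
                nn.init.xavier_uniform_(param)
        # LayerNorm over the concatenated forward/backward states
        # helps keep activations numerically stable.
        self.norm = nn.LayerNorm(2 * hidden_size)
        self.relu = nn.ReLU()
        self.fc = nn.Linear(2 * hidden_size, num_tags)

    def forward(self, x):  # x: (batch, seq_len, embed_dim)
        out, _ = self.lstm(x)
        return self.fc(self.relu(self.norm(out)))  # (batch, seq_len, num_tags)
```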
TENER
- Employs an adaptation of TENER (Yan et al., 2019).
- Compared to the paper, the CRF layer at the end has been dropped.
- Here, I have set `d_model = 512` and `n_heads = 16`.
- A weight vector has been used in the loss function to address the imbalance of tags in the dataset (see the sketch after this list).
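A minimal sketch of the encoder configuration and the weighted loss. Note that TENER proper uses relative positional encoding with unscaled attention, so the vanilla nn.TransformerEncoder below is only a stand-in to illustrate the stated hyperparameters and the linear tag head that replaces the CRF; the layer count and tag counts are assumptions:

```python
import torch
import torch.nn as nn

d_model, n_heads, num_tags = 512, 16, 7  # num_tags is an assumption

# Stand-in encoder: a vanilla Transformer encoder with the stated
# d_model and n_heads (TENER itself uses relative positional encoding).
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)

# Linear tag head in place of the CRF layer dropped from the paper.
tag_head = nn.Linear(d_model, num_tags)

# Inverse-frequency class weights to counter tag imbalance; the counts
# are illustrative and would be computed from the training split.
tag_counts = torch.tensor([90000.0, 3000.0, 2500.0, 1200.0, 800.0, 700.0, 500.0])
weights = tag_counts.sum() / (len(tag_counts) * tag_counts)
criterion = nn.CrossEntropyLoss(weight=weights)
```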
Results
- The highest F1-score obtained with the BiLSTM is 0.96, with a lowest validation loss of 0.09.
- The highest F1-score obtained with TENER is 0.98, with a lowest validation loss of 0.05.
- While testing, I have observed that the model performs better on sentences from the test set than on headlines or titles.