Cross-Lingual Named Entity Recognition between Hindi and Nepali

This repository hosts the code and resources for cross-lingual transfer learning experiments on Named Entity Recognition (NER) between Hindi and Nepali. The work is part of a research study evaluating the effectiveness of pre-trained multilingual BERT models in monolingual and cross-lingual settings, with particular attention to the linguistic similarities and differences between the two languages.

Overview

The core objective of this project is to explore the potential of cross-lingual transfer learning to improve NER performance in a resource-limited language like Nepali by leveraging models pre-trained on a relatively resource-rich language such as Hindi. The experiments were conducted using pre-trained multilingual BERT models.

These models were fine-tuned and evaluated on NER tasks using datasets specific to both Hindi and Nepali. The experiments involved monolingual evaluations to gauge performance within each language and cross-lingual evaluations to assess the models' ability to transfer linguistic knowledge across languages.

Repository Structure

  • crosslingual_ner_hindi_nepali.ipynb: The main notebook for conducting the experiments and analyzing the results.
  • trainer.py: Script dedicated to training the models on the NER datasets, including functions for fine-tuning and evaluation.
  • requirements.txt: A list of the Python packages required to run the experiments.

Installation

To set up the environment, install Python 3.8 or later. The required packages can be installed by executing:

pip install -r requirements.txt

The primary dependencies include:

  • PyTorch (with CUDA support) - for model training and inference.
  • Hugging Face Transformers - to leverage pre-trained BERT models and fine-tune them for NER tasks.
  • TensorBoard - for monitoring training metrics and visualizing performance.
  • Datasets from Hugging Face - for loading and processing the NER datasets.
  • Scikit-learn - used for evaluation metrics such as precision, recall, and F1 score.
  • Optuna - for hyperparameter optimization during fine-tuning.
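
As a quick sanity check after installation, the following snippet (a minimal sketch; the exact versions pinned in requirements.txt may differ) confirms that the core libraries import correctly and reports whether PyTorch can see a CUDA device:

import torch
import transformers
import datasets
import sklearn
import optuna

# Print library versions to confirm the environment matches requirements.txt.
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("datasets:", datasets.__version__)
print("scikit-learn:", sklearn.__version__)
print("optuna:", optuna.__version__)

# Training falls back to the CPU if no CUDA-capable GPU is visible.
print("CUDA available:", torch.cuda.is_available())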

Running the Experiments

  1. Prepare the datasets: Before running the experiments, ensure that the Hindi and Nepali NER datasets are preprocessed into the expected format. The datasets should follow the CoNLL-2003 standard, with entities labeled appropriately, and be placed in the designated directories within the repository (a minimal loading sketch is shown after this list).

  2. Training: To train the models on the Hindi and Nepali datasets, use the trainer.py script, which includes functions for fine-tuning the pre-trained BERT models on the NER tasks. Hyperparameters such as the learning rate, batch size, and number of epochs can be adjusted within the script to suit your requirements (a hedged fine-tuning and evaluation sketch follows this list).

  3. Evaluation: Once training is complete, the models will be evaluated on monolingual and cross-lingual NER tasks. The evaluation results, including precision, recall, and F1 scores, will be logged using TensorBoard. These metrics can be accessed and visualized to understand the models' performance across different tasks.
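
The helper below is a minimal sketch of step 1, assuming a CoNLL-style layout with one token per line, the NER tag in the last whitespace-separated column, and blank lines between sentences; the file path is hypothetical and should point at the actual dataset locations in the repository.

def read_conll(path):
    """Read a CoNLL-style NER file into parallel lists of tokens and tags."""
    sentences, labels = [], []
    tokens, tags = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # a blank line closes the current sentence
                if tokens:
                    sentences.append(tokens)
                    labels.append(tags)
                    tokens, tags = [], []
                continue
            parts = line.split()
            tokens.append(parts[0])   # surface token
            tags.append(parts[-1])    # NER tag in the last column
    if tokens:  # flush the final sentence if the file lacks a trailing blank line
        sentences.append(tokens)
        labels.append(tags)
    return sentences, labels

# Hypothetical path; replace with the actual Hindi/Nepali dataset files.
hindi_sentences, hindi_labels = read_conll("data/hindi_train.conll")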
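
The outline below is a hedged sketch of how steps 2 and 3 could be wired together with the Hugging Face Trainer; the checkpoint name, tag set, directory paths, and hyperparameter values are illustrative assumptions rather than the repository's actual configuration, which lives in trainer.py.

import numpy as np
from datasets import Dataset
from sklearn.metrics import precision_recall_fscore_support
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

# Assumed checkpoint and tag set; the actual experiments may use different ones.
model_name = "bert-base-multilingual-cased"
label_list = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]
label_to_id = {label: i for i, label in enumerate(label_list)}

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name, num_labels=len(label_list))

def encode(batch):
    # Tokenise pre-split words and align one label id per sub-word position;
    # special tokens and sub-word continuations are masked with -100.
    enc = tokenizer(batch["tokens"], is_split_into_words=True, truncation=True)
    aligned = []
    for i, tags in enumerate(batch["tags"]):
        prev, ids = None, []
        for wid in enc.word_ids(batch_index=i):
            ids.append(-100 if wid is None or wid == prev else label_to_id[tags[wid]])
            prev = wid
        aligned.append(ids)
    enc["labels"] = aligned
    return enc

# hindi_sentences / hindi_labels as produced by the read_conll sketch above.
dataset = Dataset.from_dict({"tokens": hindi_sentences, "tags": hindi_labels})
dataset = dataset.map(encode, batched=True).train_test_split(test_size=0.1)

def compute_metrics(eval_pred):
    # Micro-averaged token-level precision, recall, and F1, ignoring -100 positions.
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    mask = labels != -100
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels[mask], preds[mask], average="micro", zero_division=0)
    return {"precision": precision, "recall": recall, "f1": f1}

args = TrainingArguments(
    output_dir="outputs/hindi_ner",    # hypothetical output directory
    learning_rate=2e-5,                # illustrative hyperparameters only
    per_device_train_batch_size=16,
    num_train_epochs=3,
    evaluation_strategy="epoch",
    logging_dir="runs/hindi_ner",      # TensorBoard log directory
    report_to="tensorboard",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    data_collator=DataCollatorForTokenClassification(tokenizer),
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate())
# Logged metrics can be visualised with: tensorboard --logdir runs/hindi_ner

In the cross-lingual setting, the same pipeline can be reused by training on one language's dataset and running trainer.evaluate() on the other language's evaluation set.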
