Welcome to the PL-BERT training project! This repository contains code and resources for fine-tuning the PL-BERT model on the Hindi subset of the Wiki-40B dataset. Our goal is to enhance the model's ability to understand and generate Hindi text using a high-quality, cleaned Wikipedia corpus.
The fine-tuned checkpoint (training-step file) has been pushed to the Hugging Face Hub for easy access and deployment.
The Wiki-40B dataset offers a cleaned-up collection of Wikipedia articles in over 40 languages. It excludes disambiguation pages, redirects, and other non-entity content, focusing on relevant, high-quality text.
For this project, we utilized the Hindi subset, which consists of 51,000 rows. Each entry includes a processed Wikipedia page and its corresponding Wikidata ID.
- Dataset Name: wiki40b
- Language: Hindi
- Size: 51,000 rows
- Content: Cleaned Wikipedia articles with associated Wikidata IDs
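To inspect the data yourself, here is a minimal sketch using the Hugging Face `datasets` library; the `"hi"` config and the `wikidata_id`/`text` fields follow the public `wiki40b` loader, but verify them against your `datasets` version:

```python
# Minimal sketch: load the Hindi subset of Wiki-40B and peek at one entry.
from datasets import load_dataset

dataset = load_dataset("wiki40b", "hi")   # train / validation / test splits

example = dataset["train"][0]
print(example["wikidata_id"])             # Wikidata ID of the article
print(example["text"][:200])              # cleaned text, including structure markers
                                          # such as _START_ARTICLE_ / _START_PARAGRAPH_
```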
PL-BERT (Phoneme-Level BERT) is a multilingual BERT variant pre-trained on extensive text corpora. We fine-tuned it on the Hindi dataset described above to improve its handling of Hindi text.
- Model: PL-BERT
- Dataset: Hindi subset of Wiki-40B
- Batch Size: 64
- Mixed Precision: FP16
- Optimizer: AdamW
- Training Steps: 15,000
- Final Loss: 1.879
- Vocabulary Loss: 0.49
- Token Loss: 1.465
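For reference, below is a hedged sketch of what one training step with these settings could look like in PyTorch. The `DualHeadEncoder` model, its dimensions, the learning rate, and the dummy batches are hypothetical stand-ins rather than the project's actual code; only the batch size (64), optimizer (AdamW), FP16 setup, step count (15,000), and the two-part loss mirror the configuration above.

```python
import torch
import torch.nn.functional as F
from torch.optim import AdamW

# Hypothetical stand-in for PL-BERT: an encoder with two heads, one predicting
# token (phoneme-level) ids and one predicting word-level (vocabulary) ids.
class DualHeadEncoder(torch.nn.Module):
    def __init__(self, n_tokens=200, n_vocab=30_000, d_model=256):
        super().__init__()
        self.embed = torch.nn.Embedding(n_tokens, d_model)
        self.encoder = torch.nn.TransformerEncoder(
            torch.nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.token_head = torch.nn.Linear(d_model, n_tokens)
        self.vocab_head = torch.nn.Linear(d_model, n_vocab)

    def forward(self, x):
        h = self.encoder(self.embed(x))
        return self.vocab_head(h), self.token_head(h)

model = DualHeadEncoder().cuda()
optimizer = AdamW(model.parameters(), lr=1e-4)    # lr is an assumed value
scaler = torch.cuda.amp.GradScaler()              # FP16 mixed precision

for step in range(15_000):                        # the stated 15,000 training steps
    # Dummy batch of size 64; real code would draw phoneme/word ids from the
    # dataset and mask a subset of the input tokens before prediction.
    tokens = torch.randint(0, 200, (64, 128), device="cuda")
    words = torch.randint(0, 30_000, (64, 128), device="cuda")

    optimizer.zero_grad()
    with torch.cuda.amp.autocast():               # forward pass in FP16
        vocab_logits, token_logits = model(tokens)
        # Two objectives, summed: vocabulary (word) loss and token loss.
        vocab_loss = F.cross_entropy(vocab_logits.reshape(-1, 30_000), words.reshape(-1))
        token_loss = F.cross_entropy(token_logits.reshape(-1, 200), tokens.reshape(-1))
        loss = vocab_loss + token_loss
    scaler.scale(loss).backward()                 # scaled backward for FP16 stability
    scaler.step(optimizer)
    scaler.update()
```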
During training, we monitored performance with the following validation metrics:
- Validation Loss: 1.879
- Vocabulary Accuracy: 78.54%
- Token Accuracy: 82.30%
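The two accuracies can be computed as the fraction of non-ignored positions where the argmax prediction matches the label. A small illustrative helper (all names here are hypothetical, not the project's code):

```python
import torch

def masked_accuracy(logits: torch.Tensor, labels: torch.Tensor, ignore_id: int = -100) -> float:
    """Fraction of non-ignored positions where the argmax prediction matches the label."""
    preds = logits.argmax(dim=-1)                 # [batch, seq_len]
    valid = labels != ignore_id                   # skip padding / unmasked positions
    correct = ((preds == labels) & valid).sum()
    return (correct / valid.sum().clamp(min=1)).item()

# Example with random tensors; real use would pass the vocabulary or token
# logits and the corresponding label tensors from the validation loop.
logits = torch.randn(2, 8, 100)
labels = torch.randint(0, 100, (2, 8))
print(f"accuracy: {masked_accuracy(logits, labels):.2%}")
```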
We welcome contributions! If you have improvements or suggestions, please open a pull request or submit an issue.
Feel free to reach out if you have any questions or need further information.