
PL-BERT on Hindi Wikipedia Dataset

Welcome to the PL-BERT training project! This repository contains the code and resources for fine-tuning the PL-BERT model on a Hindi dataset extracted from the "wiki40b" dataset. The goal is to improve the model's handling of Hindi text using a high-quality, cleaned Wikipedia corpus.

The fine-tuned training checkpoint has been pushed to the Hugging Face Hub for easy access and deployment.
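
As a rough illustration, the checkpoint can be pulled with the huggingface_hub client. The repository id and filename below are placeholders, not the actual values; substitute the ones listed on the project's Hub page.

```python
# Minimal sketch of downloading the pushed checkpoint from the Hugging Face Hub.
# The repo_id and filename are hypothetical placeholders for illustration only.
from huggingface_hub import hf_hub_download

checkpoint_path = hf_hub_download(
    repo_id="Ionio-io/PL-BERT-Fine-Tuned-hi",  # placeholder repo id
    filename="step_15000.pt",                  # placeholder checkpoint name
)
print("Checkpoint downloaded to:", checkpoint_path)
```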

Dataset: Wiki40b

Overview

The Wiki40b dataset offers a cleaned-up collection of Wikipedia articles in over 40 languages. It excludes disambiguation pages, redirects, and other non-entity content, focusing on relevant, high-quality text.

For this project, we used the Hindi subset, which consists of 51,000 rows. Each entry pairs a processed Wikipedia page with its corresponding Wikidata ID (see the loading sketch after the list below).

  • Dataset Name: wiki40b
  • Language: Hindi
  • Size: 51,000 rows
  • Content: Cleaned Wikipedia articles with associated Wikidata IDs
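
The subset can be loaded with the Hugging Face datasets library. This is a minimal sketch, assuming the standard Wiki40b configuration names; depending on your datasets version the dataset may be exposed as "wiki40b" or "google/wiki40b", with "hi" as the Hindi config.

```python
# Minimal sketch of loading the Hindi subset of Wiki40b with `datasets`.
from datasets import load_dataset

wiki_hi = load_dataset("google/wiki40b", "hi")

print(wiki_hi)                                   # splits: train / validation / test
print(wiki_hi["train"][0]["wikidata_id"])        # associated Wikidata ID
print(wiki_hi["train"][0]["text"][:200])         # cleaned article text
```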

Model: PL-BERT

PL-BERT (Phoneme-Level BERT) is a BERT variant pre-trained on extensive text corpora to jointly predict masked phonemes and their corresponding graphemes. We fine-tuned this model on our Hindi dataset to boost its proficiency in handling Hindi text.

Training Details

  • Model: PL-BERT
  • Dataset: Hindi subset from Wiki40b
  • Batch Size: 64
  • Mixed Precision: FP16
  • Optimizer: AdamW
  • Training Steps: 15,000 (a minimal configuration sketch follows the list)
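
The sketch below mirrors the hyperparameters above (AdamW, FP16, batch size 64, 15,000 steps) in a generic PyTorch loop. The model and batch iterator are stand-ins for the actual PL-BERT model and Wiki40b dataloader, which are not reproduced here.

```python
# Minimal sketch of the training configuration described above.
# `model` and `batch_iterator` are placeholders, not the project's actual code.
import torch
from torch import nn
from torch.cuda.amp import GradScaler, autocast

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(768, 768).to(device)            # placeholder for PL-BERT
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = GradScaler(enabled=(device == "cuda"))   # FP16 loss scaling

def batch_iterator():
    # Placeholder batches of size 64; in practice these come from the tokenized dataset.
    while True:
        yield torch.randn(64, 768, device=device)

batches = batch_iterator()
for step in range(15_000):
    batch = next(batches)
    optimizer.zero_grad()
    with autocast(enabled=(device == "cuda")):    # mixed-precision forward pass
        loss = model(batch).pow(2).mean()         # stand-in for vocabulary + token loss
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```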

Training Progress

  • Final Loss: 1.879
  • Vocabulary Loss: 0.49
  • Token Loss: 1.465

Validation Results

During training, we monitored performance with the following validation metrics (an illustrative accuracy computation follows the list):

  • Validation Loss: 1.879
  • Vocabulary Accuracy: 78.54%
  • Token Accuracy: 82.30%
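
For reference, a masked-token accuracy of this kind can be computed as shown below. This is a generic formulation, not the project's exact evaluation code; it assumes ignored positions are labelled with -100, as is conventional for masked-prediction objectives.

```python
# Illustrative sketch of computing masked-token accuracy from logits and labels.
import torch

def masked_token_accuracy(logits: torch.Tensor, labels: torch.Tensor) -> float:
    """logits: (batch, seq_len, vocab_size); labels: (batch, seq_len)."""
    predictions = logits.argmax(dim=-1)
    mask = labels != -100                         # only score masked positions
    correct = (predictions[mask] == labels[mask]).sum().item()
    return correct / max(mask.sum().item(), 1)

# Toy usage with random tensors:
logits = torch.randn(2, 8, 100)
labels = torch.randint(0, 100, (2, 8))
labels[:, ::2] = -100                             # unmasked positions are ignored
print(f"accuracy: {masked_token_accuracy(logits, labels):.2%}")
```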

Contributing

We welcome contributions! If you have improvements or suggestions, please open a pull request or submit an issue.

Feel free to reach out if you have any questions or need further information.
