- Project Overview
- Folder Structure
- Requirements
- Installation
- Dataset Preparation
- How to Run
- Training the Model
- Translating New Sentences
- Acknowledgments
## Project Overview

The goal of this project is to translate sentences from Hindi to Telugu using a Transformer-based model. The model is built from scratch using PyTorch and includes various text preprocessing and feature extraction techniques to enhance translation quality.
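For orientation, here is a minimal sketch of the kind of encoder-decoder Transformer this involves; the class name, hyperparameters, and use of `torch.nn.Transformer` are illustrative assumptions, not the project's actual from-scratch model definition (see `models/`).

```python
import torch
import torch.nn as nn

class TranslationTransformer(nn.Module):
    """Illustrative Hindi -> Telugu encoder-decoder Transformer.
    All hyperparameters below are assumptions, not the project's actual values."""

    def __init__(self, src_vocab_size, tgt_vocab_size, d_model=256,
                 nhead=8, num_layers=3, dim_feedforward=512, max_len=128):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab_size, d_model)
        self.tgt_embed = nn.Embedding(tgt_vocab_size, d_model)
        self.pos_embed = nn.Embedding(max_len, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            dim_feedforward=dim_feedforward, batch_first=True)
        self.out = nn.Linear(d_model, tgt_vocab_size)

    def forward(self, src_ids, tgt_ids):
        # Add learned positional information to the token embeddings.
        src_pos = torch.arange(src_ids.size(1), device=src_ids.device)
        tgt_pos = torch.arange(tgt_ids.size(1), device=tgt_ids.device)
        src = self.src_embed(src_ids) + self.pos_embed(src_pos)
        tgt = self.tgt_embed(tgt_ids) + self.pos_embed(tgt_pos)
        # Causal mask so each target position only attends to earlier positions.
        tgt_mask = self.transformer.generate_square_subsequent_mask(
            tgt_ids.size(1)).to(src_ids.device)
        return self.out(self.transformer(src, tgt, tgt_mask=tgt_mask))
```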
## Folder Structure

- data_model_input/: Contains the dataset file in a tab-separated format with Hindi and Telugu sentence pairs.
- models/: Contains the Transformer model definition and trained model weights.
- vocab/: Stores the vocabulary files generated during preprocessing for Hindi (source) and Telugu (target).
- utils.py: Provides helper functions such as data loading, tokenization, and encoding.
- train.py: Main script to perform data preprocessing, training, and saving the model.
- translate.py: Script to translate input sentences from Hindi to Telugu using the trained model.
- requirements.txt: Lists all the Python packages required to run the project.
- README.md: Documentation on how to use the project, including installation and usage instructions.
- Additional NLP technique folders.

## Step 2: Training the Model
```bash
python train.py
```
The script will initialize the model, train it on the dataset, and save the weights to models/model_weights.pth.
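As a rough guide to what such a script does, here is a hedged sketch of a typical training loop; the dataset filename, the helpers `load_pairs`, `build_vocab`, and `encode_batch`, the import paths, and all hyperparameters are assumptions rather than the project's actual interfaces.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

# Hypothetical imports: the real project keeps equivalents in utils.py and models/.
from utils import load_pairs, build_vocab, encode_batch   # assumed helper names
from models import TranslationTransformer                 # assumed model class

def train(epochs=10, lr=3e-4, batch_size=64):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    pairs = load_pairs("data_model_input/data.tsv")        # assumed filename
    src_vocab, tgt_vocab = build_vocab(pairs)
    model = TranslationTransformer(len(src_vocab), len(tgt_vocab)).to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss(ignore_index=0)        # assumes id 0 = <pad>

    loader = DataLoader(pairs, batch_size=batch_size, shuffle=True,
                        collate_fn=lambda b: encode_batch(b, src_vocab, tgt_vocab))
    model.train()
    for epoch in range(epochs):
        for src_ids, tgt_ids in loader:
            src_ids, tgt_ids = src_ids.to(device), tgt_ids.to(device)
            # Teacher forcing: predict token t+1 from target tokens <= t.
            logits = model(src_ids, tgt_ids[:, :-1])
            loss = criterion(logits.reshape(-1, logits.size(-1)),
                             tgt_ids[:, 1:].reshape(-1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        print(f"epoch {epoch + 1}: loss {loss.item():.4f}")

    torch.save(model.state_dict(), "models/model_weights.pth")

if __name__ == "__main__":
    train()
```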
## Translating New Sentences

To translate new sentences from Hindi to Telugu, use:

```bash
python translate.py --input "आपका स्वागत है"
```
This will generate the Telugu translation for the given Hindi sentence.
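For illustration, a simplified greedy-decoding script of this shape might look as follows; the helper names (`load_vocab`, `tokenize`), import paths, and special-token ids are assumptions, not the project's actual translate.py.

```python
import argparse
import torch

from utils import load_vocab, tokenize          # assumed helpers
from models import TranslationTransformer       # assumed model class

BOS, EOS, PAD = 1, 2, 0                          # assumed special-token ids

@torch.no_grad()
def greedy_translate(model, sentence, src_vocab, tgt_vocab, max_len=64):
    # Encode the Hindi sentence with the source vocabulary (dict-like, assumed).
    src_ids = torch.tensor([[src_vocab.get(tok, src_vocab["<unk>"])
                             for tok in tokenize(sentence)]])
    tgt_ids = torch.tensor([[BOS]])
    for _ in range(max_len):
        logits = model(src_ids, tgt_ids)
        next_id = logits[0, -1].argmax().item()   # pick the most likely next token
        tgt_ids = torch.cat([tgt_ids, torch.tensor([[next_id]])], dim=1)
        if next_id == EOS:
            break
    inv_vocab = {i: w for w, i in tgt_vocab.items()}
    return " ".join(inv_vocab[i] for i in tgt_ids[0, 1:].tolist() if i not in (EOS, PAD))

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", required=True, help="Hindi sentence to translate")
    args = parser.parse_args()
    src_vocab, tgt_vocab = load_vocab("vocab/")                  # assumed interface
    model = TranslationTransformer(len(src_vocab), len(tgt_vocab))
    model.load_state_dict(torch.load("models/model_weights.pth", map_location="cpu"))
    model.eval()
    print(greedy_translate(model, args.input, src_vocab, tgt_vocab))
```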
## Dataset Preparation

- Preprocessing flow:
data -> data-encoded -> data-bg-cleaned -> data-punctuation-standardized ->
data-number-standardized -> data-lang-cleaned -> data-Html-cleaned ->
data-unprintable-cleaned -> data-invalid-lang-range-cleaned -> data-deaccented -> data-aligned ->
data-tokenized -> data-similarity-score
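A minimal sketch of how such staged processing can be chained is shown below; the stage functions here are placeholders standing in for the project's real implementations of each step above.

```python
from typing import Callable, List

def run_pipeline(lines: List[str],
                 stages: List[Callable[[List[str]], List[str]]]) -> List[str]:
    """Apply each preprocessing stage in order, feeding its output to the next."""
    for stage in stages:
        lines = stage(lines)   # e.g. encode -> bg-clean -> ... -> similarity-score
    return lines

# Two trivial placeholder stages, just to show the shape of a stage:
def normalize_whitespace(lines):
    return [" ".join(line.split()) for line in lines]

def drop_empty(lines):
    return [line for line in lines if line.strip()]

cleaned = run_pipeline(["  नमस्ते   दुनिया ", ""], [normalize_whitespace, drop_empty])
print(cleaned)   # ['नमस्ते दुनिया']
```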
We will be building a hybrid dataset (movie subtitles from OpenSubtitles.org + OPUS):
- Selection of movies
- Downloading the subtitles
- Making separate files (hindi.txt, telugu.txt)
- Individual Pre-Processing (for each language file separately)
- Initial Data Alignment
- Post-Alignment Processing and Quality Control
- Corpus Creation and Final Pre-Processing
- Subtitle format detection and conversion (use of only one source format: .srt files) (AJ Harsh Vardhan)
- Character encoding conversion (convert all files to UTF-8 for consistency using chardet; see the sketch after this list) (AJ Harsh Vardhan)
- Language checking (language detection using langdetect) (Sushant)
- Removing background noise (AJ Harsh Vardhan)
- Removing HTML tags and directional marks (e.g., `<i></i>`, U+202B and U+202C; cleaning sketch after this list) (AJ Harsh Vardhan)
- Tokenization and sentence splitting (Sushant)
- Removing unprintable characters (Srikar)
- Removing characters outside the language pair (AJ Harsh Vardhan)
- Normalizing whitespace (removal of extra spaces) (Srikar)
- Deaccenting accented characters (converting accented characters to their base form) (Parth)
- Standardizing punctuation (Parth)
- Standardizing numbers (Parth)
- Length-based sentence alignment (Parth)
- Alignment with time overlaps (Parth)
- Combining length- and time-based approaches (Parth) [the above three steps are actually a single step: we use a hybrid of the two algorithms, with some assumptions and trade-offs; a toy scoring sketch follows this list]
- Handling misalignments (AJ Harsh Vardhan)
- Similarity scoring (Sushant)
- File format (AJ Harsh Vardhan)
- Sentence pair shuffling (Sushant)
- Versioning (AJ Harsh Vardhan)
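The following sketch illustrates the character-encoding conversion (chardet) and language-checking (langdetect) steps from the list above; the file names are assumptions, and the real stages may differ in detail.

```python
import chardet
from langdetect import DetectorFactory, detect

DetectorFactory.seed = 0   # make langdetect deterministic across runs

def read_as_utf8(path: str) -> str:
    """Detect the file's encoding with chardet and return its text decoded for UTF-8 output."""
    raw = open(path, "rb").read()
    guess = chardet.detect(raw)          # e.g. {'encoding': 'UTF-16', 'confidence': 0.99, ...}
    return raw.decode(guess["encoding"] or "utf-8", errors="replace")

def keep_language(lines, expected):
    """Keep only lines that langdetect classifies as the expected language ('hi' or 'te')."""
    kept = []
    for line in lines:
        try:
            if detect(line) == expected:
                kept.append(line)
        except Exception:                # very short or symbol-only lines can fail detection
            continue
    return kept

hindi_lines = keep_language(read_as_utf8("hindi.txt").splitlines(), "hi")
```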
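This sketch illustrates the HTML-tag/directional-mark removal, unprintable-character removal, whitespace normalization, and (Latin-only) deaccenting steps; the exact rules used in the project may differ.

```python
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")            # subtitle markup such as <i>...</i>
DIRECTIONAL = {"\u202B", "\u202C"}         # the U+202B / U+202C marks mentioned above

def clean_line(line: str) -> str:
    line = TAG_RE.sub("", line)                                   # strip HTML-style tags
    line = "".join(ch for ch in line
                   if ch not in DIRECTIONAL                       # drop directional marks
                   and (ch.isprintable() or ch.isspace()))        # drop unprintable chars
    return " ".join(line.split())                                 # normalize whitespace

def deaccent_latin(line: str) -> str:
    """Fold accented Latin characters to their base form while leaving Indic
    combining signs (matras) untouched."""
    out = []
    for ch in line:
        if "LATIN" in unicodedata.name(ch, ""):
            decomposed = unicodedata.normalize("NFD", ch)
            ch = "".join(c for c in decomposed if not unicodedata.combining(c))
        out.append(ch)
    return "".join(out)

print(clean_line("<i>\u202Bनमस्ते   दुनिया\u202C</i>"))   # -> 'नमस्ते दुनिया'
print(deaccent_latin("café"))                             # -> 'cafe'
```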
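Finally, a toy sketch of the hybrid length/time-overlap alignment score and a crude similarity filter mentioned above; the weights, the example timestamps, and the function names are assumptions, not the project's actual algorithm.

```python
def time_overlap(a_start, a_end, b_start, b_end):
    """Fraction of the shorter subtitle's duration covered by the overlap (0..1)."""
    overlap = max(0.0, min(a_end, b_end) - max(a_start, b_start))
    shorter = max(1e-6, min(a_end - a_start, b_end - b_start))
    return overlap / shorter

def length_ratio(src: str, tgt: str):
    """Character-length ratio of the candidate pair (0..1, 1 = equal length)."""
    a, b = len(src), len(tgt)
    return min(a, b) / max(a, b, 1)

def alignment_score(src, tgt, src_span, tgt_span, w_time=0.6, w_len=0.4):
    """Hybrid score combining time overlap and length ratio; weights are assumptions."""
    return (w_time * time_overlap(*src_span, *tgt_span)
            + w_len * length_ratio(src, tgt))

# Candidate pairs whose score clears a chosen threshold can be kept as a
# crude similarity filter for the aligned corpus.
src, tgt = "आपका स्वागत है", "మీకు స్వాగతం"
score = alignment_score(src, tgt, (12.0, 14.5), (12.2, 14.4))
print(round(score, 2))   # a well-aligned pair like this scores around 0.9
```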
How to find the current output: the last completed step is the current output (check README.md for continuation).
- Encoding
- Bg-Removal
- Punctuation Standardize
- Number Standardize
## Acknowledgments

This project implements a Hindi-to-Telugu translation model using a Transformer architecture developed from scratch. It involves various NLP preprocessing tasks such as POS tagging, TF-IDF vectorization, stop-word removal, transliteration, and more.
- S20220010011 (Alagadapa Jaya Harsh Vardhan): Parts-of-speech (POS) tagging, N-grams (N = 2), transliterator, integrated the Transformer model by adjusting hyperparameters, translate script to utilize the model.
- S20220010166 (Parth Vijay): One-hot encoding, label encoding, FastText, bag of words, train.py functions.
- S20220010219 (Sushant Kuril): NEL, Gensim word vectors, dependency parsing, vocab script, model optimization.
- S20220010207 (Srikar Chaturvedula): Term frequency-inverse document frequency (TF-IDF), stop-word removal, named entity recognition (NER), model optimization.