- Project Overview
- Folder Structure
- Requirements
- Installation
- Dataset Preparation
- How to Run
- Training the Model
- Translating New Sentences
- Acknowledgments
## Project Overview

The goal of this project is to translate sentences from Hindi to Telugu using a Transformer-based model. The model is built from scratch using PyTorch and includes various text preprocessing and feature extraction techniques to enhance translation quality.
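For orientation, here is a minimal sketch of the kind of encoder-decoder Transformer this involves; the class name, hyperparameters, and use of `torch.nn.Transformer` are illustrative assumptions, not the project's actual from-scratch model definition (see `models/`).

```python
import torch
import torch.nn as nn

class TranslationTransformer(nn.Module):
    """Illustrative Hindi -> Telugu encoder-decoder Transformer.
    All hyperparameters below are assumptions, not the project's actual values."""

    def __init__(self, src_vocab_size, tgt_vocab_size, d_model=256,
                 nhead=8, num_layers=3, dim_feedforward=512, max_len=128):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab_size, d_model)
        self.tgt_embed = nn.Embedding(tgt_vocab_size, d_model)
        self.pos_embed = nn.Embedding(max_len, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            dim_feedforward=dim_feedforward, batch_first=True)
        self.out = nn.Linear(d_model, tgt_vocab_size)

    def forward(self, src_ids, tgt_ids):
        # Add learned positional information to the token embeddings.
        src_pos = torch.arange(src_ids.size(1), device=src_ids.device)
        tgt_pos = torch.arange(tgt_ids.size(1), device=tgt_ids.device)
        src = self.src_embed(src_ids) + self.pos_embed(src_pos)
        tgt = self.tgt_embed(tgt_ids) + self.pos_embed(tgt_pos)
        # Causal mask so each target position only attends to earlier positions.
        tgt_mask = self.transformer.generate_square_subsequent_mask(
            tgt_ids.size(1)).to(src_ids.device)
        return self.out(self.transformer(src, tgt, tgt_mask=tgt_mask))
```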
## Folder Structure

- data_model_input/: Contains the dataset file in a tab-separated format with Hindi and Telugu sentence pairs.
- models/: Contains the Transformer model definition and trained model weights.
- vocab/: Stores the vocabulary files generated during preprocessing for Hindi (source) and Telugu (target).
- utils.py: Provides helper functions such as data loading, tokenization, and encoding.
- train.py: Main script to perform data preprocessing, training, and saving the model.
- translate.py: Script to translate input sentences from Hindi to Telugu using the trained model.
- requirements.txt: Lists all the Python packages required to run the project.
- README.md: Documentation on how to use the project, including installation and usage instructions.
- Additional NLP technique folders.

## Step 2: Training the Model
```bash
python train.py
```
The script will initialize the model, train it on the dataset, and save the weights to models/model_weights.pth.
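As a rough guide to what such a script does, here is a hedged sketch of a typical training loop; the dataset filename, the helpers `load_pairs`, `build_vocab`, and `encode_batch`, the import paths, and all hyperparameters are assumptions rather than the project's actual interfaces.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

# Hypothetical imports: the real project keeps equivalents in utils.py and models/.
from utils import load_pairs, build_vocab, encode_batch   # assumed helper names
from models import TranslationTransformer                 # assumed model class

def train(epochs=10, lr=3e-4, batch_size=64):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    pairs = load_pairs("data_model_input/data.tsv")        # assumed filename
    src_vocab, tgt_vocab = build_vocab(pairs)
    model = TranslationTransformer(len(src_vocab), len(tgt_vocab)).to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss(ignore_index=0)        # assumes id 0 = <pad>

    loader = DataLoader(pairs, batch_size=batch_size, shuffle=True,
                        collate_fn=lambda b: encode_batch(b, src_vocab, tgt_vocab))
    model.train()
    for epoch in range(epochs):
        for src_ids, tgt_ids in loader:
            src_ids, tgt_ids = src_ids.to(device), tgt_ids.to(device)
            # Teacher forcing: predict token t+1 from target tokens <= t.
            logits = model(src_ids, tgt_ids[:, :-1])
            loss = criterion(logits.reshape(-1, logits.size(-1)),
                             tgt_ids[:, 1:].reshape(-1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        print(f"epoch {epoch + 1}: loss {loss.item():.4f}")

    torch.save(model.state_dict(), "models/model_weights.pth")

if __name__ == "__main__":
    train()
```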
## Translating New Sentences

To translate new sentences from Hindi to Telugu, use:

```bash
python translate.py --input "आपका स्वागत है"
```
This will generate the Telugu translation for the given Hindi sentence.
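For illustration, a simplified greedy-decoding script of this shape might look as follows; the helper names (`load_vocab`, `tokenize`), import paths, and special-token ids are assumptions, not the project's actual translate.py.

```python
import argparse
import torch

from utils import load_vocab, tokenize          # assumed helpers
from models import TranslationTransformer       # assumed model class

BOS, EOS, PAD = 1, 2, 0                          # assumed special-token ids

@torch.no_grad()
def greedy_translate(model, sentence, src_vocab, tgt_vocab, max_len=64):
    # Encode the Hindi sentence with the source vocabulary (dict-like, assumed).
    src_ids = torch.tensor([[src_vocab.get(tok, src_vocab["<unk>"])
                             for tok in tokenize(sentence)]])
    tgt_ids = torch.tensor([[BOS]])
    for _ in range(max_len):
        logits = model(src_ids, tgt_ids)
        next_id = logits[0, -1].argmax().item()   # pick the most likely next token
        tgt_ids = torch.cat([tgt_ids, torch.tensor([[next_id]])], dim=1)
        if next_id == EOS:
            break
    inv_vocab = {i: w for w, i in tgt_vocab.items()}
    return " ".join(inv_vocab[i] for i in tgt_ids[0, 1:].tolist() if i not in (EOS, PAD))

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", required=True, help="Hindi sentence to translate")
    args = parser.parse_args()
    src_vocab, tgt_vocab = load_vocab("vocab/")                  # assumed interface
    model = TranslationTransformer(len(src_vocab), len(tgt_vocab))
    model.load_state_dict(torch.load("models/model_weights.pth", map_location="cpu"))
    model.eval()
    print(greedy_translate(model, args.input, src_vocab, tgt_vocab))
```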
## Dataset Preparation

- Preprocessing flow:
data -> data-encoded -> data-bg-cleaned -> data-punctuation-standardized ->
data-number-standardized -> data-lang-cleaned -> data-Html-cleaned ->
data-unprintable-cleaned -> data-invalid-lang-range-cleaned -> data-deaccented -> data-aligned ->
data-tokenized -> data-similarity-score
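A minimal sketch of how such staged processing can be chained is shown below; the stage functions here are placeholders standing in for the project's real implementations of each step above.

```python
from typing import Callable, List

def run_pipeline(lines: List[str],
                 stages: List[Callable[[List[str]], List[str]]]) -> List[str]:
    """Apply each preprocessing stage in order, feeding its output to the next."""
    for stage in stages:
        lines = stage(lines)   # e.g. encode -> bg-clean -> ... -> similarity-score
    return lines

# Two trivial placeholder stages, just to show the shape of a stage:
def normalize_whitespace(lines):
    return [" ".join(line.split()) for line in lines]

def drop_empty(lines):
    return [line for line in lines if line.strip()]

cleaned = run_pipeline(["  नमस्ते   दुनिया ", ""], [normalize_whitespace, drop_empty])
print(cleaned)   # ['नमस्ते दुनिया']
```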
We will be building a hybrid dataset (movie subtitles from OpenSubtitles.org + OPUS):
- Selection of movies
- Downloading the subtitles
- Making separate files (hindi.txt, telugu.txt)
- Individual Pre-Processing (for each language file separately)
- Initial Data Alignment
- Post-Alignment Processing and Quality Control
- Corpus Creation and Final Pre-Processing
- Subtitle format detection and conversion (use of only one source format: .srt files) (AJ Harsh Vardhan)
- Character encoding conversion (convert all files to UTF-8 for consistency using chardet; see the sketch after this list) (AJ Harsh Vardhan)
- Language checking (language detection using langdetect) (Sushant)
- Removing background noise (AJ Harsh Vardhan)
- Removing HTML tags and directional marks (e.g., `<i></i>`, U+202B and U+202C; cleaning sketch after this list) (AJ Harsh Vardhan)
- Tokenization and sentence splitting (Sushant)
- Removing unprintable characters (Srikar)
- Removing characters outside the language pair (AJ Harsh Vardhan)
- Normalizing whitespace (removal of extra spaces) (Srikar)
- Deaccenting accented characters (converting accented characters to their base form) (Parth)
- Standardizing punctuation (Parth)
- Standardizing numbers (Parth)
- Length-based sentence alignment (Parth)
- Alignment with time overlaps (Parth)
- Combining length- and time-based approaches (Parth) [the above three steps are actually a single step: we use a hybrid of the two algorithms, with some assumptions and trade-offs; a toy scoring sketch follows this list]
- Handling misalignments (AJ Harsh Vardhan)
- Similarity scoring (Sushant)
- File format (AJ Harsh Vardhan)
- Sentence pair shuffling (Sushant)
- Versioning (AJ Harsh Vardhan)
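The following sketch illustrates the character-encoding conversion (chardet) and language-checking (langdetect) steps from the list above; the file names are assumptions, and the real stages may differ in detail.

```python
import chardet
from langdetect import DetectorFactory, detect

DetectorFactory.seed = 0   # make langdetect deterministic across runs

def read_as_utf8(path: str) -> str:
    """Detect the file's encoding with chardet and return its text decoded for UTF-8 output."""
    raw = open(path, "rb").read()
    guess = chardet.detect(raw)          # e.g. {'encoding': 'UTF-16', 'confidence': 0.99, ...}
    return raw.decode(guess["encoding"] or "utf-8", errors="replace")

def keep_language(lines, expected):
    """Keep only lines that langdetect classifies as the expected language ('hi' or 'te')."""
    kept = []
    for line in lines:
        try:
            if detect(line) == expected:
                kept.append(line)
        except Exception:                # very short or symbol-only lines can fail detection
            continue
    return kept

hindi_lines = keep_language(read_as_utf8("hindi.txt").splitlines(), "hi")
```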
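This sketch illustrates the HTML-tag/directional-mark removal, unprintable-character removal, whitespace normalization, and (Latin-only) deaccenting steps; the exact rules used in the project may differ.

```python
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")            # subtitle markup such as <i>...</i>
DIRECTIONAL = {"\u202B", "\u202C"}         # the U+202B / U+202C marks mentioned above

def clean_line(line: str) -> str:
    line = TAG_RE.sub("", line)                                   # strip HTML-style tags
    line = "".join(ch for ch in line
                   if ch not in DIRECTIONAL                       # drop directional marks
                   and (ch.isprintable() or ch.isspace()))        # drop unprintable chars
    return " ".join(line.split())                                 # normalize whitespace

def deaccent_latin(line: str) -> str:
    """Fold accented Latin characters to their base form while leaving Indic
    combining signs (matras) untouched."""
    out = []
    for ch in line:
        if "LATIN" in unicodedata.name(ch, ""):
            decomposed = unicodedata.normalize("NFD", ch)
            ch = "".join(c for c in decomposed if not unicodedata.combining(c))
        out.append(ch)
    return "".join(out)

print(clean_line("<i>\u202Bनमस्ते   दुनिया\u202C</i>"))   # -> 'नमस्ते दुनिया'
print(deaccent_latin("café"))                             # -> 'cafe'
```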
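Finally, a toy sketch of the hybrid length/time-overlap alignment score and a crude similarity filter mentioned above; the weights, the example timestamps, and the function names are assumptions, not the project's actual algorithm.

```python
def time_overlap(a_start, a_end, b_start, b_end):
    """Fraction of the shorter subtitle's duration covered by the overlap (0..1)."""
    overlap = max(0.0, min(a_end, b_end) - max(a_start, b_start))
    shorter = max(1e-6, min(a_end - a_start, b_end - b_start))
    return overlap / shorter

def length_ratio(src: str, tgt: str):
    """Character-length ratio of the candidate pair (0..1, 1 = equal length)."""
    a, b = len(src), len(tgt)
    return min(a, b) / max(a, b, 1)

def alignment_score(src, tgt, src_span, tgt_span, w_time=0.6, w_len=0.4):
    """Hybrid score combining time overlap and length ratio; weights are assumptions."""
    return (w_time * time_overlap(*src_span, *tgt_span)
            + w_len * length_ratio(src, tgt))

# Candidate pairs whose score clears a chosen threshold can be kept as a
# crude similarity filter for the aligned corpus.
src, tgt = "आपका स्वागत है", "మీకు స్వాగతం"
score = alignment_score(src, tgt, (12.0, 14.5), (12.2, 14.4))
print(round(score, 2))   # a well-aligned pair like this scores around 0.9
```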
How to find the current output: the last completed step is the current output (check README.md for continuation).
- Encoding
- Bg-Removal
- Punctuation Standardize
- Number Standardize
## Acknowledgments

This project implements a Hindi-to-Telugu translation model using a Transformer architecture developed from scratch. It involves various NLP preprocessing tasks such as POS tagging, TF-IDF vectorization, stop-word removal, transliteration, and more.
- S20220010011 (Alagadapa Jaya Harsh Vardhan): Parts-of-speech (POS) tagging, N-grams (N = 2), transliterator, integrated the Transformer model by adjusting hyperparameters, translate script to utilize the model.
- S20220010166 (Parth Vijay): One-hot encoding, label encoding, FastText, bag of words, train.py functions.
- S20220010219 (Sushant Kuril): NEL, Gensim word vectors, dependency parsing, vocab script, model optimization.
- S20220010207 (Srikar Chaturvedula): Term frequency-inverse document frequency (TF-IDF), stop-word removal, named entity recognition (NER), model optimization.