The goal of this project is to translate sentences from Hindi to Telugu using a Transformer-based model. The model is built from scratch using PyTorch and includes various text preprocessing and feature extraction techniques to enhance translation quality.

Project Overview

The goal of this project is to translate sentences from Hindi to Telugu using a Transformer-based model. The model is built from scratch using PyTorch and includes various text preprocessing and feature extraction techniques to enhance translation quality.

Folder Descriptions

  • data_model_input/: Contains the dataset file in a tab-separated format with Hindi and Telugu sentence pairs.

  • models/: Contains the Transformer model definition and trained model weights.

  • vocab/: Stores the vocabulary files generated during preprocessing for Hindi (source) and Telugu (target).

  • Provides helper functions like data loading, tokenization, and encoding.

  • Main script to perform data preprocessing, training, and saving the model.

  • Script to translate input sentences from Hindi to Telugu using the trained model.

  • requirements.txt: Lists all the Python packages required to run the project.

  • Documentation on how to use the project, including installation and usage instructions.

  • Additional NLP technique folders.

Step - 1: Training the Model

To train the Transformer model, run the main script listed under Folder Descriptions.


The script will initialize the model, train it on the dataset, and save the weights to models/model_weights.pth.
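Before training, the sentence pairs have to be loaded from the tab-separated dataset and encoded with the vocabularies. The following is only a rough sketch of that step, not the project's own helper code: the function names, the special tokens, and whitespace tokenization are all assumptions made for illustration.

```python
import csv
import io
from collections import Counter

# Assumed special tokens; the project's vocab files may use different ones.
SPECIALS = ["<pad>", "<unk>", "<sos>", "<eos>"]


def read_pairs(handle):
    """Read (hindi, telugu) pairs from a tab-separated file object."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(row[0].strip(), row[1].strip()) for row in reader if len(row) >= 2]


def build_vocab(sentences, min_freq=1):
    """Map each whitespace token to an integer id, after the special tokens."""
    counts = Counter(tok for sentence in sentences for tok in sentence.split())
    vocab = {tok: i for i, tok in enumerate(SPECIALS)}
    for tok, freq in sorted(counts.items()):
        if freq >= min_freq:
            vocab[tok] = len(vocab)
    return vocab


def encode(sentence, vocab):
    """Turn a sentence into ids, wrapped in <sos> ... <eos>."""
    unk = vocab["<unk>"]
    ids = [vocab.get(tok, unk) for tok in sentence.split()]
    return [vocab["<sos>"]] + ids + [vocab["<eos>"]]


# In-memory stand-in for the dataset file in data_model_input/.
sample = io.StringIO("आपका स्वागत है\tమీకు స్వాగతం\n")
pairs = read_pairs(sample)
src_vocab = build_vocab(hindi for hindi, _ in pairs)
tgt_vocab = build_vocab(telugu for _, telugu in pairs)
print(encode("आपका स्वागत है", src_vocab))
```

Tokens that are missing from the vocabulary fall back to the `<unk>` id, so unseen words at translation time do not raise an error.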

Step - 2: Translating New Sentences

To translate new sentences from Hindi to Telugu, pass the sentence to the translation script with the --input option:

python --input "आपका स्वागत है"

This will generate the Telugu translation for the given Hindi sentence.

The machine translation model is in the machine-translation folder.

  • Preprocessing flow:
data -> data-encoded -> data-bg-cleaned -> data-punctuation-standardized ->
data-number-standardized -> data-lang-cleaned -> data-Html-cleaned ->
data-unprintable-cleaned -> data-invalid-lang-range-cleaned -> data-deaccented -> data-aligned ->
data-tokenized -> data-similarity-score
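The flow above is a chain in which each stage consumes the output of the previous one. As a minimal sketch of that idea, the snippet below chains two of the stages (punctuation and number standardization) as plain functions. The function names and the exact character mappings are assumptions for illustration; the project's own rules may differ.

```python
# Map Devanagari (U+0966..U+096F) and Telugu (U+0C66..U+0C6F) digits to ASCII.
DIGITS = {0x0966 + i: str(i) for i in range(10)}
DIGITS.update({0x0C66 + i: str(i) for i in range(10)})

# A few typographic punctuation marks mapped to plain ASCII equivalents.
PUNCT = {
    ord("\u201c"): '"',    # left double quotation mark
    ord("\u201d"): '"',    # right double quotation mark
    ord("\u2018"): "'",    # left single quotation mark
    ord("\u2019"): "'",    # right single quotation mark
    ord("\u2026"): "...",  # horizontal ellipsis
    ord("\u2013"): "-",    # en dash
}


def standardize_punctuation(text):
    return text.translate(PUNCT)


def standardize_numbers(text):
    return text.translate(DIGITS)


# Stages run in order; each one receives the previous stage's output.
PIPELINE = [standardize_punctuation, standardize_numbers]


def run_pipeline(text, steps=PIPELINE):
    for step in steps:
        text = step(text)
    return text


print(run_pipeline("\u201c\u0968\u0966\u0968\u096a\u201d"))
```

Keeping every stage as a function from string to string makes it easy to reorder stages or to write the intermediate result of each stage to its own folder.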

Data Set

We are building a hybrid dataset (movie subtitles + OPUS).

Data Collection

  • Selecting the movies
  • Downloading the subtitles
  • Creating a separate file for each language (hindi.txt, telugu.txt)

Process (Data Alignment / Data Pre-processing)


  1. Individual Pre-Processing (for each language file separately)
  2. Initial Data Alignment
  3. Post-Alignment Processing and Quality Control
  4. Corpus Creation and Final Pre-Processing

Individual Pre-Processing

  • Subtitle format detection and conversion (only one source format, .srt, is used) ( AJ Harsh Vardhan )
  • Character encoding conversion (all files are converted to UTF-8 for consistency, using chardet) ( AJ Harsh Vardhan )
  • Language checking (language detection using langdetect) ( Sushant )
  • Removing background noise ( AJ Harsh Vardhan )
  • Removing HTML tags (e.g., <i></i>) and the directional formatting characters [U+202B] and [U+202C] ( AJ Harsh Vardhan )
  • Tokenization and sentence splitting ( Sushant )
  • Removing unprintable characters ( Srikar )
  • Removing characters outside the language pair ( AJ Harsh Vardhan )
  • Normalizing whitespace (removing extra spaces) ( Srikar )
  • Deaccenting accented characters (converting accented characters to their base form) ( Parth )
  • Standardizing punctuation ( Parth )
  • Standardizing numbers ( Parth )
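Several of the cleaning steps listed above can be sketched with the Python standard library alone. This is an illustrative sketch rather than the project's implementation: the function names, the order of the steps, and the choice to keep ASCII digits and punctuation are assumptions. Deaccenting here removes only the Latin combining marks (U+0300 to U+036F), because stripping every combining mark would also delete Devanagari and Telugu vowel signs.

```python
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")          # HTML-style tags such as <i> and </i>
BIDI_RE = re.compile("[\u202b\u202c]")   # the two formatting characters named above
KEEP_FORMAT = {"\u200c", "\u200d"}       # ZWNJ / ZWJ carry meaning in Indic scripts
SCRIPT_RANGES = {"hi": ("\u0900", "\u097f"), "te": ("\u0c00", "\u0c7f")}


def strip_markup(text):
    return BIDI_RE.sub("", TAG_RE.sub("", text))


def remove_unprintable(text):
    # Drop control/format characters, but keep whitespace and ZWJ/ZWNJ.
    return "".join(
        ch for ch in text
        if ch in KEEP_FORMAT or ch.isspace() or unicodedata.category(ch)[0] != "C"
    )


def deaccent(text):
    # Decompose, drop Latin combining diacritics only, then recompose.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not "\u0300" <= ch <= "\u036f")
    return unicodedata.normalize("NFC", stripped)


def keep_script(text, lang):
    # Keep the language's own script plus ASCII digits, punctuation and spaces.
    low, high = SCRIPT_RANGES[lang]
    return "".join(
        ch for ch in text
        if low <= ch <= high or ch in KEEP_FORMAT or (ch.isascii() and not ch.isalpha())
    )


def normalize_whitespace(text):
    return " ".join(text.split())


def clean_line(text, lang):
    for step in (strip_markup, remove_unprintable, deaccent):
        text = step(text)
    return normalize_whitespace(keep_script(text, lang))


print(clean_line("<i>आपका  स्वागत\u202b है</i> abc", "hi"))
```

The same `clean_line` function is meant to be applied to each language file separately, with `"hi"` for the Hindi file and `"te"` for the Telugu file.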

Initial Data Alignment

  • Length-Based Sentence Alignment. ( Parth )
  • Alignment with Time Overlaps. ( Parth )
  • Combining Length and Time-Based Approaches. ( Parth ) [The three steps above are effectively a single step: we use a hybrid of the two algorithms, with some assumptions and trade-offs.]
  • Handling Misalignments. ( AJ Harsh Vardhan )
  • Similarity Scoring. ( Sushant )
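One possible way to combine the time-based and length-based signals is a weighted score per candidate cue pair, with low-scoring pairs treated as misalignments and dropped. The sketch below is an assumption-laden illustration, not the project's algorithm: the `Cue` class, the weight, the threshold, and the greedy one-to-one matching are all hypothetical choices.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    start: float  # seconds
    end: float    # seconds
    text: str


def time_overlap(a, b):
    """Shared time divided by the total span covered by both cues (0..1)."""
    shared = min(a.end, b.end) - max(a.start, b.start)
    span = max(a.end, b.end) - min(a.start, b.start)
    return max(shared, 0.0) / span if span > 0 else 0.0


def length_ratio(a, b):
    """Shorter text length divided by longer text length (0..1)."""
    longest = max(len(a.text), len(b.text))
    return min(len(a.text), len(b.text)) / longest if longest else 0.0


def align(src, tgt, time_weight=0.7, threshold=0.5):
    """Greedily pair each source cue with its best unused target cue."""
    pairs, used = [], set()
    for s in src:
        best_j, best_score = None, threshold
        for j, t in enumerate(tgt):
            if j in used:
                continue
            score = time_weight * time_overlap(s, t) + (1 - time_weight) * length_ratio(s, t)
            if score > best_score:
                best_j, best_score = j, score
        if best_j is not None:  # cues with no match above the threshold are dropped
            used.add(best_j)
            pairs.append((s.text, tgt[best_j].text, round(best_score, 3)))
    return pairs


hindi = [Cue(0.0, 2.0, "आपका स्वागत है"), Cue(5.0, 7.0, "धन्यवाद")]
telugu = [Cue(0.1, 2.1, "మీకు స్వాగతం"), Cue(5.2, 7.0, "ధన్యవాదాలు"), Cue(20.0, 22.0, "x")]
aligned = align(hindi, telugu)
for pair in aligned:
    print(pair)
```

The score stored with each pair can double as a simple quality signal when filtering the corpus later.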

Post-Alignment Processing and Quality Control

  • File Format ( AJ Harsh Vardhan )
  • Sentence Pair Shuffling ( Sushant )
  • Versioning (AJ Harsh Vardhan)
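Shuffling has to move the Hindi and Telugu sides together so that the pairs stay aligned. A minimal sketch, assuming the pairs are held in memory as tuples and written out in the tab-separated format described earlier (the function names and the fixed seed are illustrative assumptions):

```python
import io
import random


def shuffle_pairs(pairs, seed=42):
    """Return a shuffled copy; a fixed seed keeps the order reproducible."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def write_tsv(pairs, handle):
    """Write one 'hindi<TAB>telugu' line per pair."""
    for src, tgt in pairs:
        # Collapse stray tabs/newlines so each pair stays on a single line.
        src, tgt = (" ".join(side.split()) for side in (src, tgt))
        handle.write(f"{src}\t{tgt}\n")


pairs = [("आपका स्वागत है", "మీకు స్వాగతం"), ("धन्यवाद", "ధన్యవాదాలు"), ("नमस्ते", "నమస్తే")]
shuffled = shuffle_pairs(pairs)
buffer = io.StringIO()  # stand-in for a real output file
write_tsv(shuffled, buffer)
print(buffer.getvalue())
```

Because whole tuples are shuffled, a sentence can never end up next to the wrong translation.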

How to Find the Current Output (the last completed step is the current output; check it before continuing)

  • Encoding
  • Bg-Removal
  • Punctuation Standardize
  • Number Standardize
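If every stage of the preprocessing flow writes its result to a folder named after the stage, the current output can be located by looking for the last such folder that exists and is not empty. The sketch below rests on that assumption; the folder names are taken from the preprocessing flow listed earlier, while the helper name and the root directory layout are hypothetical.

```python
import tempfile
from pathlib import Path

# Stage folders in flow order (names taken from the preprocessing flow).
STAGES = [
    "data", "data-encoded", "data-bg-cleaned", "data-punctuation-standardized",
    "data-number-standardized", "data-lang-cleaned", "data-Html-cleaned",
    "data-unprintable-cleaned", "data-invalid-lang-range-cleaned",
    "data-deaccented", "data-aligned", "data-tokenized", "data-similarity-score",
]


def current_output(root):
    """Return the name of the last stage folder that exists and has content."""
    latest = None
    for name in STAGES:
        folder = Path(root) / name
        if folder.is_dir() and any(folder.iterdir()):
            latest = name
    return latest


# Demo in a temporary directory: only the first two stages have been produced.
with tempfile.TemporaryDirectory() as root:
    for name in STAGES[:2]:
        stage_dir = Path(root) / name
        stage_dir.mkdir()
        (stage_dir / "hindi.txt").write_text("", encoding="utf-8")
    (Path(root) / STAGES[2]).mkdir()  # created but still empty, so not counted
    demo_result = current_output(root)

print(demo_result)
```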

Hindi-to-Telugu Translation Model

This project implements a Hindi-to-Telugu translation model using a Transformer architecture developed from scratch. It involves various NLP preprocessing tasks like POS tagging, TF-IDF vectorization, stop-word removal, transliteration, and more.

Individual Contribution

  • S20220010011 (Alagadapa Jaya Harsh Vardhan): Part-of-speech (POS) tagging, N-grams (N = 2), Transliterator, integration of the Transformer model with hyperparameter tuning, Translate script that uses the model

  • S20220010166 (Parth Vijay): One-hot encoding, Label encoding, FastText, Bag of words, functions

  • S20220010219 (Sushant Kuril): NEL, Gensim word vectors, Dependency parsing, Vocab script, Model optimization

  • S20220010207 (Srikar Chaturvedula): Term frequency-inverse document frequency (TF-IDF), Stop-word removal, Named entity recognition (NER), Model optimization
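For readers unfamiliar with some of the feature extraction techniques named above, here is a small standard-library sketch of bigrams, label encoding, one-hot encoding, bag of words, and TF-IDF. It is only an illustration of the ideas, not the code in the repository, and the TF-IDF formula is the plain unsmoothed variant (libraries often use smoothed ones).

```python
import math
from collections import Counter


def ngrams(tokens, n=2):
    """Consecutive n-token windows; n=2 gives bigrams."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def label_encode(tokens):
    """Assign each distinct token an integer id (sorted for determinism)."""
    return {tok: i for i, tok in enumerate(sorted(set(tokens)))}


def one_hot(index, size):
    return [1 if i == index else 0 for i in range(size)]


def bag_of_words(tokens, labels):
    """Count vector ordered by the label ids."""
    counts = Counter(tokens)
    return [counts.get(tok, 0) for tok in sorted(labels, key=labels.get)]


def tf_idf(docs):
    """tf = count / doc length, idf = log(N / document frequency)."""
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    return [
        {tok: (count / len(doc)) * math.log(n_docs / doc_freq[tok])
         for tok, count in Counter(doc).items()}
        for doc in docs
    ]


docs = [["a", "b", "a"], ["b", "c"]]
labels = label_encode(tok for doc in docs for tok in doc)
print(ngrams("आपका स्वागत है".split()))
print(bag_of_words(docs[0], labels))
print(tf_idf(docs))
```

A token that appears in every document gets a TF-IDF weight of zero under this formula, which is why very common words contribute little to the representation.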

