ACallglad/SPRECHEN

A Transformer-based translator for English-to-Hindi and German-to-English


Explore the docs »

View Demo · Report Bug · Request Feature

About The Project

This project implements the ideas from the original Transformer paper, "Attention Is All You Need", as a language translator that is trained for a number of epochs on parallel corpora in one language pair and learns to convert sentences from one language into the other. The proposed automated English-to-local-language system follows the transfer-based approach to machine translation:

  1. The first layer is a natural language processing tool that performs morphological analysis: sentence tokenization, part-of-speech tagging, phrase formation, and figure-of-speech tagging.
  2. The second layer comprises a grammar generator responsible for converting English sentence structure into the target language's structure.
  3. At the third layer, the grammar generator's output is mapped to matching terms in a bilingual dictionary.

The German-to-English and English-to-Hindi translators are each trained following this procedure.
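At the heart of the Transformer cited above is scaled dot-product attention. The sketch below is a minimal pure-Python illustration of that formula (Attention(Q, K, V) = softmax(QKᵀ/√d_k)V); it is not the project's actual PyTorch code, and the toy inputs are made up for demonstration:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q, K, V are lists of vectors (lists of floats); d_k is the key dimension.
    """
    d_k = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query against every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Output is the attention-weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Toy example: one query attending over two key/value pairs.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
print(scaled_dot_product_attention(Q, K, V))
```

In the full model this runs once per attention head; reducing the head count (as done here for compute reasons) simply reduces how many of these computations run in parallel.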

Architecture: the image depicts a flowchart of the encoder-decoder architecture from Vaswani et al.

Image taken from Research Paper »


GermanToEnglish :

Finding libraries to tokenize and set up the vocabulary was fairly simple. We used spaCy for this and trained the model on the Multi30k dataset for 150 epochs. One important thing to note: some hyperparameters, such as the number of attention heads and the number of encoder/decoder layers, were reduced relative to the original paper, given the limited computational power at our disposal. Pretrained weights for GerToEng can be found here
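The vocabulary-building step can be sketched as follows. This is a pure-Python stand-in for the torchtext pipeline (the real code tokenizes with spaCy rather than `str.split()`, and the special-token names here are conventional assumptions, not taken from the repository):

```python
from collections import Counter

def build_vocab(sentences, min_freq=2, specials=("<pad>", "<sos>", "<eos>", "<unk>")):
    """Map tokens to integer ids, keeping tokens seen at least min_freq times.

    Special tokens get the lowest ids so padding/start/end markers are stable.
    """
    counts = Counter(tok for sent in sentences for tok in sent.lower().split())
    vocab = {tok: i for i, tok in enumerate(specials)}
    for tok, freq in counts.most_common():
        if freq >= min_freq:
            vocab[tok] = len(vocab)
    return vocab

def encode(sentence, vocab):
    """Turn a sentence into ids, wrapping it in <sos>/<eos> markers."""
    unk = vocab["<unk>"]
    ids = [vocab.get(tok, unk) for tok in sentence.lower().split()]
    return [vocab["<sos>"]] + ids + [vocab["<eos>"]]

corpus = ["ein mann fährt ein rad", "ein hund läuft", "ein mann läuft"]
vocab = build_vocab(corpus)
print(encode("ein mann schwimmt", vocab))  # unseen word maps to <unk>
```

Tokens below the frequency cutoff fall back to `<unk>` at encoding time, which keeps the vocabulary (and hence the embedding table) small.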

EnglishToHindi :

Tokenizing and setting up the vocabulary was considerably harder given the complexities of Hindi grammar, but the task was done using the iNLTK library. The model was trained on an English-Hindi parallel corpus from Dataset, plus 80k more lines from a 2.4-million-line (24 lakh) parallel corpus from Dataset; a preprocessed and cleaned version can be found here. As with the German-to-English model, some hyperparameters such as the number of attention heads and encoder/decoder layers were reduced relative to the original paper, given the limited computational power at our disposal. Pretrained weights for EngToHin can be found here
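The "preprocessed and cleaned" corpus mentioned above implies a filtering pass over the raw sentence pairs. A hypothetical sketch of such cleaning is shown below; the function name and thresholds are illustrative assumptions, not the repository's actual script:

```python
def clean_parallel_corpus(pairs, max_len=50, max_ratio=3.0):
    """Filter (english, hindi) sentence pairs before training.

    Drops pairs where either side is empty, either side exceeds max_len
    tokens, or the token-length ratio between the two sides is implausible
    for a genuine translation.
    """
    cleaned = []
    for en, hi in pairs:
        en_toks, hi_toks = en.split(), hi.split()
        if not en_toks or not hi_toks:
            continue  # one side missing
        if len(en_toks) > max_len or len(hi_toks) > max_len:
            continue  # overlong sentence
        ratio = max(len(en_toks), len(hi_toks)) / min(len(en_toks), len(hi_toks))
        if ratio > max_ratio:
            continue  # suspicious length mismatch
        cleaned.append((en.strip(), hi.strip()))
    return cleaned

pairs = [
    ("hello world", "नमस्ते दुनिया"),
    ("", "खाली"),                                 # empty English side: dropped
    ("a " * 60, "लंबा वाक्य"),                     # too long: dropped
    ("one two three four five six seven", "एक"),  # 7:1 length ratio: dropped
]
print(len(clean_parallel_corpus(pairs)))
```

Filtering like this matters for low-resource pairs: noisy alignments in a web-scraped parallel corpus degrade translation quality more than the lost data helps.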

Built With

  1. TorchText
  2. Spacy
  3. INLTK
  4. Nvidia cuda toolkit

Getting Started

Step 1. Clone the repository.

Step 2. Download the dataset from Here and place it in the respective data folder. Note that the two translation pipelines have different data folders.

Installation

  • Python 3.7

  • Install python libraries

    pip install -r requirements.txt

Testing

  • Run the training file (engtohindi.py or gertoeng.py); it will build the tokenized vocabulary for you. (This only needs to be done once.)

  • Add checkpoints to the folder

    Download the checkpoints from <a href="https://drive.google.com/drive/folders/1ZezM4OWqsdPhYHQdbgmzaIP2BzDSNaKM?usp=sharing">Here</a> and place them in the respective checkpoints folder. Note that the two translation pipelines have different checkpoints folders.
  • Edit the sentence variable in the eval.py file to the sentence you want to translate.

  • run eval.py file

    python GerToEng/eval.py
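At evaluation time a trained Transformer typically generates the translation one token at a time with greedy decoding. The sketch below illustrates that loop with a toy stand-in for the model's forward pass; `step_fn` and the token ids are assumptions for illustration, not eval.py's actual interface:

```python
def greedy_decode(step_fn, sos_id, eos_id, max_len=20):
    """Autoregressively pick the argmax token at each step until <eos>.

    step_fn(prefix) returns a list of scores over the vocabulary for the
    next token, given the tokens generated so far. In the real model this
    would be a Transformer decoder forward pass.
    """
    prefix = [sos_id]
    for _ in range(max_len):
        scores = step_fn(prefix)
        next_id = max(range(len(scores)), key=scores.__getitem__)
        prefix.append(next_id)
        if next_id == eos_id:
            break  # end-of-sentence reached
    return prefix

# Toy "model": emit tokens 3, 4, then <eos> (id 2), regardless of input.
script = {1: 3, 2: 4, 3: 2}
def toy_step(prefix):
    target = script.get(len(prefix), 2)
    return [1.0 if i == target else 0.0 for i in range(5)]

print(greedy_decode(toy_step, sos_id=1, eos_id=2))  # → [1, 3, 4, 2]
```

The `max_len` cap guards against the model never emitting `<eos>`; beam search would replace the single argmax with the k best prefixes at each step.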

Training

  • Start from scratch
    python GerToEng/GermanToEnglish.py
  • To resume training: set the load parameter in the hyperparameter file to true; the model will automatically load the checkpoints
    python GerToEng/GermanToEnglish.py
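The load-flag behavior described above can be sketched as follows. This is a simplified stand-in using pickle on a plain dict; the repository presumably uses torch.save/torch.load on model and optimizer state, and the function names here are hypothetical:

```python
import os
import pickle
import tempfile

def save_checkpoint(path, state):
    """Persist training state (epoch counter, weights, ...) to disk."""
    with open(path, "wb") as f:
        pickle.dump(state, f)

def load_or_init(path, load):
    """Mirror the load flag in the hyperparameter file: resume if True."""
    if load and os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)  # resume from the saved state
    return {"epoch": 0, "weights": None}  # fresh start

# Demo: train "from scratch", save a checkpoint, then resume from it.
ckpt = os.path.join(tempfile.mkdtemp(), "checkpoint.pkl")
state = load_or_init(ckpt, load=False)
state["epoch"] = 150
save_checkpoint(ckpt, state)
resumed = load_or_init(ckpt, load=True)
print(resumed["epoch"])  # → 150
```

Saving the epoch counter alongside the weights is what lets a resumed run continue its schedule instead of restarting from epoch 0.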

References

  1. "Attention Is All You Need" paper »
  2. English-Hindi parallel corpus dataset »
  3. Base Transformer implementation »

Made By
