Explore the docs »
View Demo · Report Bug · Request Feature
Implemented takeaways from the original Transformer paper, "Attention Is All You Need", to build a language translator that trains on parallel data for a number of epochs and converts sentences from one language into another. The proposed automated English-to-local-language system is designed according to the transfer-based approach to machine translation. The first layer of the architecture is a natural language processing tool that performs morphological analysis: sentence tokenization, part-of-speech tagging, phrase formation and figure-of-speech tagging. The second layer comprises a grammar generator responsible for converting English language structures into target-language structures. At the third layer, the results produced by the grammar generator are mapped to matching terms in a bilingual dictionary. Transfer-based German-to-English and English-to-Hindi translators are trained following this procedure.
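The three layers above can be sketched as a toy pipeline. Everything here (the sentence, the POS tags, the reordering rule, the bilingual dictionary) is an illustrative stand-in for the real components, not the project's actual data or models:

```python
# Toy sketch of the three-layer transfer-based pipeline.
# All data below is illustrative, not from the real system.

def tokenize(sentence):
    # Layer 1 (morphological analysis): naive whitespace tokenization.
    return sentence.lower().split()

def reorder_svo_to_sov(tokens, pos_tags):
    # Layer 2 (grammar generator): move verbs to the end, mimicking an
    # English (SVO) -> Hindi (SOV) structural transfer.
    verbs = [t for t, p in zip(tokens, pos_tags) if p == "VERB"]
    rest = [t for t, p in zip(tokens, pos_tags) if p != "VERB"]
    return rest + verbs

def lookup(tokens, bilingual_dict):
    # Layer 3: map each token to its match in the bilingual dictionary.
    return [bilingual_dict.get(t, t) for t in tokens]

tokens = tokenize("I eat mangoes")
tags = ["PRON", "VERB", "NOUN"]  # assumed POS tags for this toy sentence
reordered = reorder_svo_to_sov(tokens, tags)
hindi = lookup(reordered, {"i": "main", "eat": "khata hoon", "mangoes": "aam"})
print(" ".join(hindi))  # -> "main aam khata hoon"
```

The real system replaces each stage with a learned component, but the layered flow is the same.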
Architecture: the image depicts a flowchart of the encoder-decoder architecture from Vaswani et al.
Image taken from Research Paper »
Finding libraries to tokenize the text and set up the vocabulary was fairly simple. We used Spacy for this purpose and trained the model on the Multi30k dataset for 150 epochs. One important thing to note: the values of some parameters, such as the head count and the number of encoder/decoder layers, were reduced compared to the original paper, given the limited computational power at our disposal. Pretrained weights for GerToEng can be found here
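To make the reduction concrete, here is a sketch comparing the base-model settings from "Attention Is All You Need" with the kind of scaled-down configuration used here. The reduced values are illustrative placeholders; the actual values live in this repo's hyperparameter files:

```python
# Base-model settings from the paper vs. an illustrative reduced config.
# The REDUCED values are placeholders, not this repo's exact settings.
PAPER_BASE = {
    "num_heads": 8,
    "num_encoder_layers": 6,
    "num_decoder_layers": 6,
    "d_model": 512,
    "d_ff": 2048,
}

REDUCED = {
    "num_heads": 4,            # fewer attention heads
    "num_encoder_layers": 3,   # fewer encoder/decoder iterations
    "num_decoder_layers": 3,
    "d_model": 512,
    "d_ff": 2048,
}

for key in PAPER_BASE:
    if REDUCED[key] != PAPER_BASE[key]:
        print(f"{key}: {PAPER_BASE[key]} -> {REDUCED[key]}")
```

Shrinking heads and layers cuts parameters and memory roughly proportionally, which is usually the first lever to pull on a single consumer GPU.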
Tokenizing and setting up the vocabulary was quite difficult given the complexities of Hindi grammar, but the task was done using the iNLTK library. The model was trained on an English-Hindi parallel corpus from Dataset plus 80k more lines from a 24-lakh-line parallel corpus from Dataset; a preprocessed and cleaned version can be found here. As with the German-English model, the values of some parameters, such as the head count and the number of encoder/decoder layers, were reduced compared to the original paper, given the limited computational power at our disposal. Pretrained weights for EngToHin can be found here
- TorchText
- Spacy
- INLTK
- Nvidia cuda toolkit
Step 1. Clone the repository.
Step 2. Download the dataset from Here and place it in the respective data folder. Remember that the two translation pipelines have different data folders.
-
Python 3.7
-
Install python libraries
pip install -r requirements.txt
-
Run the training file (engtohindi.py or gertoeng.py); it will build the tokenized vocab for you. (This step only needs to be done once.)
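The one-time vocab build boils down to counting tokens and assigning each an index. A minimal stand-in for what the training scripts do (the real scripts use Spacy/iNLTK tokenizers and torchtext vocab objects; the special tokens and example sentences here are assumptions):

```python
from collections import Counter

# Special tokens assumed for padding, sequence boundaries, and unknowns.
SPECIALS = ["<pad>", "<sos>", "<eos>", "<unk>"]

def build_vocab(sentences, min_freq=1):
    # Count whitespace tokens across the corpus, then map each token
    # (plus the specials) to a unique integer index.
    counts = Counter(tok for s in sentences for tok in s.lower().split())
    itos = SPECIALS + sorted(t for t, c in counts.items() if c >= min_freq)
    return {tok: i for i, tok in enumerate(itos)}

vocab = build_vocab(["ein Mann geht", "ein Hund läuft"])
print(len(vocab))  # 5 corpus tokens + 4 specials = 9
```

Since the mapping is deterministic given the corpus, it can be built once, saved, and reloaded on later runs, which is why the scripts only need this step the first time.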
-
Add checkpoints to the folder: download the checkpoints from Here and place them in the respective checkpoints folder. Remember that the two translation pipelines have different checkpoints folders.
-
Edit the sentence variable in the eval.py file to the sentence you want to translate.
-
Run the eval.py file
python GerToEng/eval.py
- Start from scratch
python GerToEng/GermanToEnglish.py
- To resume training: change the load parameter in the hyperparameter file to true; the model will automatically load the checkpoints
python GerToEng/GermanToEnglish.py
- Attention is all you need paper»
- Dataset of English-Hindi parallel corpus »
- Transformers base implementation »
- Contact Yatharth Kapadia @yatharthk2.nn@gmail.com
- Contact Abhinav Chandra @abhinavchandra0526@gmail.com
- Contact Siddharth Jain @jainsiddharth641@gmail.com