
Introduction

A few months back, while searching for subtitles for a French movie, I got the idea to build a mini version of a Neural Machine Translation system for French - English and see how it feels to build one. The courses CS 224n: Natural Language Processing with Deep Learning and Sequence Models from Coursera helped a lot in understanding sequence models, although there is a long way to go!

Setup for experimentation

Knowing that my laptop doesn't have the configuration to train deep neural networks, I planned my experimentation on GCP. FYI, first-time users get free credits worth $300. There are plenty of articles online showing the step-by-step procedure for setting up a GCP instance with a GPU; the article in the link explains the steps very clearly.

Neural Machine Translation

The NMT model I attempted to build here belongs to the family of encoder-decoder models, with attention added to learn alignment and translation jointly. The issue with a pure encoder-decoder model, as mentioned in the paper, is that the encoder tries to compress all the information into a fixed-length vector before the decoder translates, which makes it difficult for the network to cope with long sentences. The model used here, following the paper, instead encodes the input sentence into a sequence of vectors and adaptively attends to a subset of these vectors while decoding the translation.
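To make that attention step concrete, here is a minimal sketch of additive (Bahdanau-style) attention over the encoder annotations. The function and parameter names are illustrative assumptions, not taken from the project's code.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def additive_attention(encoder_states, decoder_state, W_a, U_a, v_a):
    """Additive attention sketch (not the project's exact code).

    encoder_states: (T, 2*h) annotations from the bidirectional encoder
    decoder_state:  (d,)     previous decoder hidden state
    W_a, U_a, v_a:  learned projection parameters
    Returns the context vector and the alignment weights over source positions.
    """
    # Score each source annotation against the current decoder state.
    scores = np.array([v_a @ np.tanh(W_a @ decoder_state + U_a @ h_j)
                       for h_j in encoder_states])
    alphas = softmax(scores)                                   # weights sum to 1
    context = (alphas[:, None] * encoder_states).sum(axis=0)   # weighted sum of annotations
    return context, alphas
```

The context vector is recomputed at every decoding step, which is what lets the decoder focus on different parts of the source sentence as the translation progresses.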

Courses like Sequence Models from Coursera and CS 224n are very helpful in understanding the differences between RNNs, LSTMs, GRUs, and bidirectional units, and in building intuition about encoders, decoders, the attention mechanism, etc. The NMT model built here for experimentation is a modified version of the network published in the paper. A shrunken version was built to reduce training time, which makes it possible to conduct more experiments, given the limited budget! Specifications of the architecture can be seen in the python files.

Dataset for the task

I used the French-English parallel corpora provided by ACL WMT'14, link here. This dataset contains some 40M-odd samples. I sampled approximately 140K samples for training and 2.3K for testing the model.
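A rough sketch of that sampling step is below; the file names are hypothetical placeholders for the aligned corpus files, and loading everything into memory is a simplification, not the project's actual script.

```python
import random

# Hypothetical file names for an aligned French-English corpus; adjust to the actual downloads.
with open("corpus.fr", encoding="utf-8") as f_fr, open("corpus.en", encoding="utf-8") as f_en:
    pairs = list(zip(f_fr, f_en))   # one (french, english) sentence pair per line

random.seed(0)
random.shuffle(pairs)
train_pairs = pairs[:140_000]        # ~140K sentence pairs for training
test_pairs = pairs[140_000:142_300]  # ~2.3K pairs held out for testing
```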

Training the model

The encoder part of the model is composed of a bidirectional LSTM and the decoder part of a unidirectional LSTM, with 512 hidden units each. An input length of 30 words is used (further experimentation should be done by increasing this), and a vocabulary of size 30K is used for French and English individually. Embeddings of dimension 100 are used.
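A minimal Keras sketch of an encoder-decoder with these sizes is shown below. This is my reconstruction from the description above, not the project's actual python files: the attention layer is omitted for brevity, and the projection of the bidirectional states down to the decoder size is an assumption.

```python
from tensorflow.keras.layers import Input, Embedding, LSTM, Bidirectional, Dense, Concatenate
from tensorflow.keras.models import Model

VOCAB = 30_000   # per-language vocabulary size
MAX_LEN = 30     # input sentence length in words
EMB_DIM = 100    # embedding dimension
HIDDEN = 512     # LSTM hidden units

# Encoder: embed the French sentence and run a bidirectional LSTM over it.
enc_inputs = Input(shape=(MAX_LEN,))
enc_emb = Embedding(VOCAB, EMB_DIM, mask_zero=True)(enc_inputs)
enc_outputs, fwd_h, fwd_c, bwd_h, bwd_c = Bidirectional(
    LSTM(HIDDEN, return_sequences=True, return_state=True))(enc_emb)
enc_h = Dense(HIDDEN, activation="tanh")(Concatenate()([fwd_h, bwd_h]))  # project the 1024-dim
enc_c = Dense(HIDDEN, activation="tanh")(Concatenate()([fwd_c, bwd_c]))  # states down to 512

# Decoder: unidirectional LSTM over the shifted English targets (teacher forcing).
dec_inputs = Input(shape=(None,))
dec_emb = Embedding(VOCAB, EMB_DIM, mask_zero=True)(dec_inputs)
dec_outputs, _, _ = LSTM(HIDDEN, return_sequences=True, return_state=True)(
    dec_emb, initial_state=[enc_h, enc_c])
preds = Dense(VOCAB, activation="softmax")(dec_outputs)

model = Model([enc_inputs, dec_inputs], preds)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```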

I trained the model for almost 5 days, up to 400 epochs. I used beam search to find a translation that approximately maximizes the conditional probability (link to the paper) and obtained a BLEU score of 10.9 on the test set sampled above.
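For reference, here is a minimal beam-search sketch over a step function that returns next-token log-probabilities. The `decode_step` interface, the beam width, and the token ids are assumptions for illustration, not the project's code.

```python
import numpy as np

def beam_search(decode_step, start_id, end_id, beam_width=5, max_len=30):
    """Return the highest-scoring token sequence under a simple beam search.

    decode_step(prefix) -> 1-D array of log-probabilities over the vocabulary
    for the next token, given the partial translation `prefix`.
    """
    beams = [([start_id], 0.0)]            # (token sequence, cumulative log-prob)
    completed = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            log_probs = decode_step(seq)
            # Expand each beam with its top `beam_width` next tokens.
            for tok in np.argsort(log_probs)[-beam_width:]:
                candidates.append((seq + [int(tok)], score + float(log_probs[tok])))
        # Keep only the best `beam_width` partial translations.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:
            (completed if seq[-1] == end_id else beams).append((seq, score))
        if not beams:
            break
    completed.extend(beams)
    return max(completed, key=lambda c: c[1])[0]
```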

Some examples to show

Future work

  • Visualize the model's attention on different source-language words while translating a sample (the source language being French here); a rough sketch of such a visualization follows this list
  • Play with the architecture and its hyperparameters: for example, changing the embedding dimensions and LSTM hidden units, varying the input sentence length, stacking more layers on the encoder and decoder, etc.
  • Explore [Attention is all you need] and implement a version of it for the translation task
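For the first item, a small matplotlib sketch like the following could plot the decoder's alignment weights as a heatmap; the function and its inputs are hypothetical, assuming the attention weights for a translated sentence are available as a 2-D array.

```python
import matplotlib.pyplot as plt

def plot_attention(alphas, source_tokens, target_tokens):
    """Plot attention weights as a heatmap: rows are generated English tokens,
    columns are French source tokens. `alphas` has shape (len(target), len(source))."""
    fig, ax = plt.subplots()
    ax.imshow(alphas, cmap="viridis")
    ax.set_xticks(range(len(source_tokens)))
    ax.set_xticklabels(source_tokens, rotation=90)
    ax.set_yticks(range(len(target_tokens)))
    ax.set_yticklabels(target_tokens)
    ax.set_xlabel("source (French)")
    ax.set_ylabel("translation (English)")
    plt.show()
```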

References
