A few months back, while searching for subtitles for a French movie, I got the idea to build a mini version of a Neural Machine Translation (NMT) system for French-English and see how it feels to build one. The courses CS224n: Natural Language Processing with Deep Learning and Sequence Models from Coursera helped a lot in understanding sequence models, although there is still a long way to go!
Knowing that my laptop doesn't have the configuration to train deep neural networks, I planned my experimentation on GCP. FYI, first-time users get free credits worth $300. There are many articles online showing the step-by-step procedure for setting up a GCP instance with a GPU; the article in the link explains the steps very clearly.
The NMT model I attempted to build here belongs to the family of encoder-decoder models, with attention added to learn alignment and translation jointly. The issue with a pure encoder-decoder model, as mentioned in the paper, is that the encoder tries to compress all the information into a fixed-length vector from which the decoder then translates, which makes it difficult for the network to cope with long sentences. The model used here, following the paper, instead encodes the input sentence into a sequence of vectors and chooses a subset of these vectors adaptively through attention while decoding the translation.
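To make the attention step concrete, here is a minimal sketch of Bahdanau-style additive attention in TensorFlow/Keras. The class and variable names are my own and this is not the exact code used in this project; it only illustrates how a context vector is formed from the encoder outputs at each decoding step.

```python
import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    """Additive attention: score(s, h_j) = v^T tanh(W1 s + W2 h_j)."""
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)  # projects the decoder state
        self.W2 = tf.keras.layers.Dense(units)  # projects the encoder outputs
        self.V = tf.keras.layers.Dense(1)       # scores each source position

    def call(self, decoder_state, encoder_outputs):
        # decoder_state: (batch, hidden) -> (batch, 1, hidden) for broadcasting
        query = tf.expand_dims(decoder_state, 1)
        # scores over source positions: (batch, src_len, 1)
        scores = self.V(tf.nn.tanh(self.W1(query) + self.W2(encoder_outputs)))
        attention_weights = tf.nn.softmax(scores, axis=1)
        # context vector: weighted sum of encoder outputs -> (batch, enc_hidden)
        context = tf.reduce_sum(attention_weights * encoder_outputs, axis=1)
        return context, attention_weights
```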
Courses like Sequence Models from Coursera and CS224n are very helpful in understanding the differences between RNNs, LSTMs, GRUs and bidirectional units, and in building intuition about encoders, decoders, attention mechanisms, etc. The NMT model built here for experimentation is a modified version of the network published in the paper. A shrunken version was built to reduce training time, which made it possible to conduct more experiments, given that the budget was limited! Specifications of the architecture can be seen in the python files.
Used the French-English parallel corpora provided by ACL WMT'14, link here. This dataset contains some 40M odd samples. Sampled approximately 140K samples for training and 2.3K for testing the model.
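As a rough illustration of this preprocessing step, here is a sketch for sampling and tokenizing the parallel corpus. The file paths, the `sample_pairs` helper and the split sizes are assumptions for illustration, not the actual scripts used here.

```python
import random
import tensorflow as tf

def sample_pairs(fr_path, en_path, n_train=140_000, n_test=2_300, seed=0):
    """Read aligned French/English lines and sample a small train/test split."""
    with open(fr_path, encoding="utf-8") as f_fr, open(en_path, encoding="utf-8") as f_en:
        pairs = list(zip(f_fr.read().splitlines(), f_en.read().splitlines()))
    random.Random(seed).shuffle(pairs)
    return pairs[:n_train], pairs[n_train:n_train + n_test]

def build_tokenizer(texts, vocab_size=30_000):
    """Fit a word-level tokenizer limited to the top vocab_size tokens."""
    tok = tf.keras.preprocessing.text.Tokenizer(num_words=vocab_size, oov_token="<unk>")
    tok.fit_on_texts(texts)
    return tok
```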
The encoder part of the model is composed of a bidirectional LSTM, and the decoder part is a unidirectional LSTM, with 512 hidden units each. An input length of 30 words is used (further experimentation should be done by increasing this), and a vocabulary of size 30K is used for French and English individually. Embeddings of size 100 are used.
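A minimal sketch of how an encoder and decoder with these dimensions could be wired up in Keras. The layer names are my own and the attention wiring is only indicated in a comment; the exact architecture is in the python files.

```python
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE = 30_000   # per language
EMBED_DIM = 100
HIDDEN_UNITS = 512
MAX_LEN = 30

# Encoder: embedding + bidirectional LSTM over the French sentence
encoder_inputs = tf.keras.Input(shape=(MAX_LEN,), name="french_tokens")
enc_emb = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(encoder_inputs)
encoder_outputs = layers.Bidirectional(
    layers.LSTM(HIDDEN_UNITS, return_sequences=True)
)(enc_emb)  # (batch, MAX_LEN, 2 * HIDDEN_UNITS), attended over while decoding

# Decoder: embedding + unidirectional LSTM over the (shifted) English sentence
decoder_inputs = tf.keras.Input(shape=(None,), name="english_tokens")
dec_emb = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(decoder_inputs)
decoder_outputs, _, _ = layers.LSTM(
    HIDDEN_UNITS, return_sequences=True, return_state=True
)(dec_emb)
# In the full model, a context vector from attention over encoder_outputs
# would be concatenated with decoder_outputs before this projection.
logits = layers.Dense(VOCAB_SIZE)(decoder_outputs)
```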
Trained the model for almost 5 days, up to 400 epochs. Used beam search to find a translation that approximately maximizes the conditional probability (link to the paper), and obtained a BLEU score of 10.9 on the above sampled test set.
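To illustrate the idea behind beam search decoding, here is a simplified sketch. The `decoder_step` function is a hypothetical stand-in for one decoding step that returns log-probabilities over the target vocabulary; the actual decoding code in this project differs.

```python
import heapq
import numpy as np

def beam_search(decoder_step, start_id, end_id, beam_width=5, max_len=30):
    """Keep the beam_width most probable partial translations at each step."""
    # Each hypothesis is (cumulative log-probability, token sequence).
    beams = [(0.0, [start_id])]
    for _ in range(max_len):
        candidates = []
        for log_prob, seq in beams:
            if seq[-1] == end_id:              # finished hypotheses are kept as-is
                candidates.append((log_prob, seq))
                continue
            next_log_probs = decoder_step(seq)  # (vocab_size,) log-probabilities
            top_ids = np.argsort(next_log_probs)[-beam_width:]
            for token_id in top_ids:
                candidates.append((log_prob + next_log_probs[token_id],
                                   seq + [int(token_id)]))
        # Prune back down to the beam_width best candidates
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return max(beams, key=lambda c: c[0])[1]
```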
Example 1:
French input : comité préparatoire de la conférence des nations unies sur le commerce illicite des armes légères sous tous ses aspects
Actual English Translation : preparatory committee for the united nations conference on the illicit trade in small arms and light weapons in all its aspects
Model's English Translation : preparatory committee for the united nations conference on the illicit trade in small arms and light weapons in all its aspects
Example 2:
French input : il est grand temps que la communauté internationale applique cette résolution
Actual English Translation : it was high time that the international community implemented that resolution
Model's English Translation : it is high time that the international community should be adopted by the resolution
Example 3:
French input : conclusions concertées sur l'élimination de toutes les formes de discrimination et de violence à l'égard des petites filles
Actual English Translation : agreed conclusions on the elimination of all forms of discrimination and violence against the girl child
Model's English Translation : conclusions conclusions on the elimination of all forms of discrimination and violence against the young people.
- Visualize the model's attention on different words while translating a source-language sample (the source language here being French); a small plotting sketch is given after this list.
- Play with the architecture and its hyperparameters, for example changing the embedding dimension and LSTM hidden-state size, varying the input sentence length, stacking more layers on the encoder and decoder, etc.
- Explore [Attention Is All You Need] and implement a version of it for the translation task
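For the attention visualization item above, a minimal matplotlib sketch, assuming the attention weights from decoding one sentence are available as a (target_len, source_len) matrix; function and argument names are placeholders.

```python
import matplotlib.pyplot as plt

def plot_attention(attention, source_tokens, target_tokens):
    """Show which source words the decoder attended to for each target word."""
    fig, ax = plt.subplots(figsize=(8, 8))
    ax.matshow(attention, cmap="viridis")   # rows: target words, cols: source words
    ax.set_xticks(range(len(source_tokens)))
    ax.set_xticklabels(source_tokens, rotation=90)
    ax.set_yticks(range(len(target_tokens)))
    ax.set_yticklabels(target_tokens)
    plt.show()
```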
- Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).
- Sutskever, I., Vinyals, O. and Le, Q.V., 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems (pp. 3104-3112).
- Papineni, K., Roukos, S., Ward, T. and Zhu, W.J., 2002, July. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (pp. 311-318). Association for Computational Linguistics.
- https://www.tensorflow.org/beta/tutorials/text/nmt_with_attention
- CS224n: Natural Language Processing with Deep Learning
- Sequence Models