The ability to communicate with one another is a fundamental part of our daily life. There are nearly 7,000 different languages worldwide. As our world becomes increasingly connected, language translation provides a critical cultural and economic bridge between people from different countries and ethnic groups. Some of the more obvious use-cases include:
- BUSINESS: international trade, investment, contracts, finance
- COMMERCE: travel, purchase of foreign goods and services, customer support
- MEDIA: accessing information via search, sharing information via social networks, localization of content and advertising
- EDUCATION: sharing of ideas, collaboration, translation of research papers
- GOVERNMENT: foreign relations, negotiation
To meet these needs, technology companies are investing heavily in machine translation. This investment and recent advancements in deep learning have yielded major improvements in translation quality.
According to Google, switching to deep learning produced a 60% increase in translation accuracy compared to the phrase-based approach previously used in Google Translate. Today, Google and Microsoft can translate more than 100 languages and are approaching human-level accuracy for many of them. However, while machine translation has made significant progress, it is still not perfect.
The goal is to have machines translate content well enough for human translators to understand its meaning and easily improve upon the text.
The dataset used in this project for the language-translation task was taken from the website below. It is an English-German dataset consisting of bilingual sentence pairs from the Tatoeba Project. In this project, English text is translated into German.
Each line in the dataset is tab-delimited and consists of an English text sequence, the translated German text sequence, and some attribution information. Note that a text sequence can be a single sentence or a paragraph of multiple sentences. In this machine translation problem, where English is translated into German, English is called the source language and German is called the target language.
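As a small illustration, each line can be split on tab characters to recover the sentence pair. This is only a sketch: the file name `deu.txt` and the helper name `load_pairs` are placeholders, not part of the original project.

```python
# Minimal sketch (assumed file name "deu.txt") for reading the tab-delimited pairs:
# English text <TAB> German text <TAB> attribution.
def load_pairs(path="deu.txt"):
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) >= 2:                      # skip malformed lines
                pairs.append((parts[0], parts[1]))   # drop the attribution column
    return pairs
```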
The text data available at the above-mentioned website is in raw form: it contains punctuation, symbols, and letters in both lower and upper case. Pre-processing has therefore been performed on the downloaded dataset to make it ready for implementing various deep learning models and architectures.
Below are the pre-processing steps applied to the downloaded dataset (a minimal code sketch follows the list):
- Load & examine the data.
- Cleaning the data. (includes the following)
- Splitting each sample/text into English-German pairs.
- Converting the data into an array for easy implementation.
- Reducing the size of dataset to save the computation cost.(Only in this case)
- Removing irrelevant text like attribution details
- Removing punctuations.
- Converting the text to lower case.
- Tokenizing & vectorizing the text into numerical sequences.
- Padding those sequences with 0’s to bring them to same length.
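A minimal sketch of these steps, assuming TensorFlow/Keras as the framework and reusing the hypothetical `load_pairs` helper from above; the sample count of 10,000 is only illustrative:

```python
import string
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

pairs = load_pairs()[:10000]        # reduce dataset size to save computation cost

def clean(text):
    # lower-case the text and strip punctuation
    return text.lower().translate(str.maketrans("", "", string.punctuation))

eng_texts = [clean(eng) for eng, ger in pairs]
ger_texts = [clean(ger) for eng, ger in pairs]

def tokenize(sentences):
    # map each word to an integer id and vectorize the sentences
    tok = Tokenizer()
    tok.fit_on_texts(sentences)
    return tok, tok.texts_to_sequences(sentences)

eng_tok, eng_seqs = tokenize(eng_texts)
ger_tok, ger_seqs = tokenize(ger_texts)

# pad with 0's so every sequence in a column has the same length
eng_pad = pad_sequences(eng_seqs, padding="post")
ger_pad = pad_sequences(ger_seqs, padding="post")
```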
After pre-processing, the dataset has two columns: one contains the English words/sentences and the other the corresponding German words/sentences. This dataset can now be used for the language-translation task.
- RNN + LSTM | Link
- RNN + Embedding + BiRNN | Link
- RNN + LSTM | Link
- LSTM + Attention Mechanism | Link
- Encoder-Decoder with Luong Attention Layer | Link
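For reference, here is a minimal sketch of the simplest variant above, an LSTM-based encoder-decoder in Keras. The layer sizes and training settings are illustrative only; the vocabulary sizes and padded arrays come from the pre-processing sketch above.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, RepeatVector, TimeDistributed, Dense

eng_vocab = len(eng_tok.word_index) + 1
ger_vocab = len(ger_tok.word_index) + 1
ger_len = ger_pad.shape[1]

model = Sequential([
    Embedding(eng_vocab, 256, mask_zero=True),                 # source word embeddings
    LSTM(256),                                                 # encoder: sentence -> fixed vector
    RepeatVector(ger_len),                                     # feed that vector to every decoder step
    LSTM(256, return_sequences=True),                          # decoder
    TimeDistributed(Dense(ger_vocab, activation="softmax")),   # word distribution per time step
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(eng_pad, ger_pad, batch_size=64, epochs=10, validation_split=0.1)
```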
Continuous improvement of the model implementations is important. The idea behind the attention mechanism is to permit the decoder to utilize the most relevant parts of the input sequence in a flexible manner: it forms a weighted combination of all the encoded input vectors, with the most relevant vectors receiving the highest weights.
Hence, among the implemented models, the one with the attention mechanism performs better than the others, as it captures the context behind the meaning of each input sentence.
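A tiny NumPy sketch of this idea, using dot-product (Luong-style) scoring with illustrative shapes: the decoder state scores every encoder output, the scores are soft-maxed into weights, and the context vector is the weighted combination of the encoded inputs.

```python
import numpy as np

def attention(decoder_state, encoder_outputs):
    # decoder_state: (hidden,)   encoder_outputs: (src_len, hidden)
    scores = encoder_outputs @ decoder_state       # one relevance score per source position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                       # softmax -> attention weights
    context = weights @ encoder_outputs            # weighted combination of encoded inputs
    return context, weights

context, weights = attention(np.random.rand(64), np.random.rand(10, 64))
```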
We can also implement neural machine translation using Transformer networks, which can outperform the recurrent models implemented above.
For example, if the input data is a natural language sentence, the Transformer does not have to process it one word at a time. This allows for more parallelization than RNNs and therefore reduces training times.
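As a rough illustration (the batch size, sequence length, and layer sizes below are arbitrary), a self-attention layer processes every position of a padded batch in a single matrix operation, whereas an RNN would have to loop over the 20 time steps one by one:

```python
import tensorflow as tf
from tensorflow.keras.layers import MultiHeadAttention

x = tf.random.normal((32, 20, 256))       # 32 sentences, 20 tokens, 256-dim embeddings
attn = MultiHeadAttention(num_heads=8, key_dim=32)
out = attn(query=x, value=x, key=x)       # all 20 positions are attended to at once
print(out.shape)                          # (32, 20, 256)
```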