This repository contains an implementation of unsupervised machine translation, primarily aimed at translating between text styles without a parallel translation dataset. It is built with PyTorch neural networks and standard natural language processing methods.
Given a dataset of foreign-language sentences, a dataset of target-language sentences, and an initial rough translation of the foreign-language data into the target language, the unsupervised model can be trained as follows:
    usage: unsupervisedTranslation.py [-h] [-f DATAF] [-t DATAT] [--initial INITIAL]
                                      [--percentTrain PERCENTTRAIN] [--epochs EPOCHS]
                                      [--iterations ITERATIONS] [-o OUTFILE] [--load LOAD]
                                      [--savetf SAVETF] [--saveft SAVEFT]

    optional arguments:
      -h, --help            show this help message and exit
      -f DATAF, --dataf DATAF
                            foreign language data
      -t DATAT, --datat DATAT
                            target language data
      --initial INITIAL     Initial rough translation of target language from
                            foreign language data
      --percentTrain PERCENTTRAIN
                            Percent to be used for training data (in decimal
                            form), the remaining will be split between dev and test
      --epochs EPOCHS, -e EPOCHS
                            Number of epochs to train per model per iteration
      --iterations ITERATIONS, -i ITERATIONS
                            Number of iterations to go through
      -o OUTFILE, --outfile OUTFILE
                            write translations to file
      --load LOAD           load model from file
      --savetf SAVETF       save target to foreign model in file
      --saveft SAVEFT       save foreign to target model in file
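One plausible reading of the `--epochs`, `--iterations`, `--initial`, `--savetf`, and `--saveft` options is an iterative back-translation loop: two models (foreign-to-target and target-to-foreign) are trained in turn, each round regenerating the synthetic data the other trains on, seeded by the initial rough translation. The sketch below illustrates that loop; every function and variable name here is an illustrative placeholder, not the actual code in unsupervisedTranslation.py.

```python
# Illustrative back-translation loop; placeholder names, not the script's real API.

def train_model(model, source_sentences, target_sentences, epochs):
    """Placeholder: train a seq2seq model on (source, target) sentence pairs."""
    return model

def translate(model, sentences):
    """Placeholder: translate a list of sentences with a trained model."""
    return sentences

def back_translate(foreign, target, rough_translation, epochs, iterations):
    ft_model, tf_model = object(), object()   # foreign->target and target->foreign models
    synthetic_target = rough_translation      # --initial seeds the first round
    for _ in range(iterations):
        # Learn target->foreign from the current synthetic target-side data,
        # then use it to produce synthetic foreign-side data from real target text.
        tf_model = train_model(tf_model, synthetic_target, foreign, epochs)
        synthetic_foreign = translate(tf_model, target)
        # Learn foreign->target from the new synthetic foreign-side data,
        # then regenerate the synthetic target-side data from real foreign text.
        ft_model = train_model(ft_model, synthetic_foreign, target, epochs)
        synthetic_target = translate(ft_model, foreign)
    return ft_model, tf_model   # saved via --saveft and --savetf respectively
```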
There are two example data sources included in this repository for the unsupervised training: ironyTranslation and backstrokeSubtitles. The ironyTranslation example is the primary one, built for a natural language processing project, while backstrokeSubtitles is an example used for experimentation.
The ironyTranslation example uses a combination of prelabeled datasets from Kaggle, one of Twitter tweets and one of Reddit comments, with the goal of labeling ironic statements and then translating them into unironic statements.
- baseline.py: Baseline classification of ironic sentences using Naive Bayes, scoring 53.12% prediction accuracy.
- baselineNegation.py: Script that roughly negates a sentence, which in theory turns an ironic statement into an unironic one.
- rnnClassifier.py: Classifier built with a recurrent neural network in PyTorch, scoring 92.3% prediction accuracy (see the sketch after this list).
- interactive.py: Interactive system that takes user input, predicts whether it is ironic, and, if so, translates it into an unironic statement.
- data: Contains the initial data used for the example
- models: Saved models from training; foreign-target.torch and target-forein.torch come from the unsupervised translation script, while the ModelRnnClassifier-%accuracy.torch files are the RNN classification models.
- outputs: Various outputs from the trained models, comparing the input text with the translated output or predicted classification.
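As a rough illustration of the RNN classification approach, here is a minimal PyTorch sketch of a recurrent sentence classifier. The GRU architecture, layer sizes, tokenization, and label convention are assumptions made for the example, not the hyperparameters actually used in rnnClassifier.py.

```python
# Minimal recurrent sentence classifier; sizes and labels are illustrative assumptions.
import torch
import torch.nn as nn

class RnnClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)   # (batch, seq_len, embed_dim)
        _, hidden = self.rnn(embedded)         # hidden: (1, batch, hidden_dim)
        return self.out(hidden.squeeze(0))     # (batch, num_classes) logits

# Toy usage: classify a batch of two already-numericalized 12-token sentences.
model = RnnClassifier(vocab_size=1000)
batch = torch.randint(1, 1000, (2, 12))
logits = model(batch)
prediction = logits.argmax(dim=1)              # assumed labels: 0 = unironic, 1 = ironic
```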
Example unsupervised training command, from the root directory:
    python3 unsupervisedTranslation.py \
        --dataf examples/ironyTranslation/data/ironicSentences.txt \
        --datat examples/ironyTranslation/data/unironicSentences.txt \
        --initial examples/ironyTranslation/data/roughUnironicSent.txt \
        -o examples/ironyTranslation/outputs/ironic_to_unironic \
        --savetf examples/ironyTranslation/models/target-forein.torch \
        --saveft examples/ironyTranslation/models/foreign-target.torch \
        --epochs 2 --iterations 4
The backstrokeSubtitles example uses subtitle translations from the movie Star Wars: Episode III – Revenge of the Sith to experiment with the unsupervised translation approach.
- parse.py: Parse initial text data.
- roughEnglish.txt: Initial rough translation into the target language.
- models: Saved models from training; foreign-target.torch and target-forein.torch come from the unsupervised translation script.
- outputs: Various outputs from the trained models, comparing the input text with the translated output.
Example unsupervised training command, from the root directory:
    python3 unsupervisedTranslation.py \
        --dataf examples/backstrokeSubtitles/chinese.txt \
        --datat examples/backstrokeSubtitles/english.txt \
        --initial examples/backstrokeSubtitles/roughEnglish.txt \
        -o examples/backstrokeSubtitles/outputs/chinese_to_english \
        --savetf examples/backstrokeSubtitles/models/english-chinese.torch \
        --saveft examples/backstrokeSubtitles/models/chinese-english.torch \
        --epochs 2 --iterations 4