Data Pre-processing for Neural Machine Translation

These scripts provide an example of pre-processing data for the NMT task in our paper, adapted from the original fairseq repo.

Preprocessing

mosesdecoder for evaluation

Please clone the mosesdecoder repo under the data-preprocessing directory for WMT evaluation.

git clone https://github.com/moses-smt/mosesdecoder

prepare-iwslt14.sh

Provides an example of pre-processing for the IWSLT'14 German-to-English translation task, described in "Report on the 11th IWSLT evaluation campaign" by Cettolo et al.

Example usage for reproduction:

# Download and prepare raw data:
$ cd trans-scripts/data-preprocessing/
$ bash prepare-iwslt14.sh
$ cd ../..

# Binarize the dataset:
$ TEXT=trans-scripts/data-preprocessing/iwslt14.tokenized.de-en
$ python preprocess.py --source-lang de --target-lang en \
  --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
  --destdir data-bin/iwslt14.tokenized.de-en.joined \
  --joined-dictionary
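
Before binarizing, it can help to confirm that the prepared text files are in place. The following is a minimal sketch, not part of this repo: `check_parallel_files` is a hypothetical helper name, and it assumes that each prefix passed to `--trainpref`, `--validpref` and `--testpref` resolves to one file per language (for example `train.de` and `train.en`), with the two sides line-aligned.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this repo): check that the source and
# target files for each split exist and have the same number of lines.
# Usage: check_parallel_files <text_dir> <src_lang> <tgt_lang>
check_parallel_files() {
  local text_dir="$1" src="$2" tgt="$3"
  local split src_file tgt_file src_lines tgt_lines
  for split in train valid test; do
    src_file="$text_dir/$split.$src"
    tgt_file="$text_dir/$split.$tgt"
    # Both sides of the parallel corpus must be present.
    if [ ! -f "$src_file" ] || [ ! -f "$tgt_file" ]; then
      echo "missing: $src_file or $tgt_file" >&2
      return 1
    fi
    # Strip whitespace because some wc implementations pad their output.
    src_lines=$(wc -l < "$src_file" | tr -d '[:space:]')
    tgt_lines=$(wc -l < "$tgt_file" | tr -d '[:space:]')
    if [ "$src_lines" -ne "$tgt_lines" ]; then
      echo "line count mismatch in $split: $src_lines vs $tgt_lines" >&2
      return 1
    fi
  done
  echo "ok: $text_dir"
}
```

With `TEXT` set as above, the check could be run as `check_parallel_files "$TEXT" de en`. A non-zero exit status suggests the preparation script did not finish cleanly.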

prepare-wmt14en2de.sh

Provides an example of pre-processing for the WMT'14 English-to-German translation task. By default it produces a dataset that is modeled after "Attention Is All You Need" by Vaswani et al. and includes news-commentary-v12 data.

Example usage for reproduction:

# Download and prepare raw data:
$ cd trans-scripts/data-preprocessing/
$ bash prepare-wmt14en2de.sh
$ cd ../..

# Binarize the dataset:
$ TEXT=trans-scripts/data-preprocessing/wmt14_en_de
$ python preprocess.py --source-lang en --target-lang de \
  --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
  --destdir data-bin/wmt14_en_de_joined_dict \
  --joined-dictionary
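
Since both examples pass `--joined-dictionary`, one way to sanity-check the binarized output is to compare the two vocabulary files. This is only a sketch under an assumption: it supposes that `preprocess.py` follows the fairseq convention of writing `dict.<lang>.txt` files into the `--destdir` directory, which may not hold for this adapted version. `check_joined_dict` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this repo): check that the source and
# target dictionaries in a binarized dataset directory are identical.
# Assumes dict.<lang>.txt files exist in the destination directory.
# Usage: check_joined_dict <dest_dir> <src_lang> <tgt_lang>
check_joined_dict() {
  local dest_dir="$1" src="$2" tgt="$3"
  local src_dict="$dest_dir/dict.$src.txt"
  local tgt_dict="$dest_dir/dict.$tgt.txt"
  if [ ! -f "$src_dict" ] || [ ! -f "$tgt_dict" ]; then
    echo "missing dictionary file in $dest_dir" >&2
    return 1
  fi
  # cmp -s is silent and exits 0 only when the files are byte-identical.
  if cmp -s "$src_dict" "$tgt_dict"; then
    echo "joined dictionary: $(wc -l < "$src_dict" | tr -d '[:space:]') entries"
  else
    echo "dictionaries differ" >&2
    return 1
  fi
}
```

For the WMT'14 example this could be called as `check_joined_dict data-bin/wmt14_en_de_joined_dict en de`.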