This repository contains the preprocessing scripts used for the WMT17 English-Chinese translation task at the 2017 Conference on Machine Translation (WMT17). The preprocessing mostly follows Hassan et al. (2018) and, with some minor changes, results in roughly 20M sentence pairs. In particular, we filter the bilingual corpus according to the following criteria (a sketch of the combined filters follows the list):
- Both the source and target sentences should contain at most 80 words.
- Sentence pairs with blank lines are removed (`remove_blanks.py`).
- Chinese sentences without any Chinese characters are removed (`is_chinese.py`).
- Duplicated sentence pairs are removed (`deduplicate_lines.py`).
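For reference, the following is a minimal sketch of how these filters could be combined in a single pass; the function and variable names are illustrative and do not correspond to the exact code in `remove_blanks.py`, `is_chinese.py`, or `deduplicate_lines.py`.

```python
# Illustrative sketch of the filtering criteria above (assumed names, not
# the exact repository scripts).
import re

MAX_WORDS = 80
HAS_CHINESE = re.compile(r'[\u4e00-\u9fff]')  # CJK Unified Ideographs range

def keep_pair(en, zh, seen):
    """Return True if an (English, Chinese) sentence pair passes all filters."""
    en, zh = en.strip(), zh.strip()
    # Drop pairs with a blank side (cf. remove_blanks.py).
    if not en or not zh:
        return False
    # Keep only pairs with at most 80 whitespace-separated words on each side;
    # this assumes the Chinese side is already word-segmented.
    if len(en.split()) > MAX_WORDS or len(zh.split()) > MAX_WORDS:
        return False
    # Drop Chinese sentences without any Chinese characters (cf. is_chinese.py).
    if not HAS_CHINESE.search(zh):
        return False
    # Drop duplicated sentence pairs (cf. deduplicate_lines.py).
    if (en, zh) in seen:
        return False
    seen.add((en, zh))
    return True
```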
Using the preprocessed dataset, we train Transformer models in both the base and big configurations (Vaswani et al., 2017) with the fairseq toolkit on 8 Tesla V100 GPUs. The training script is:
python train.py \
data-bin/wmt17.en-zh \
--source-lang en --target-lang zh \
--arch transformer_wmt_en_de \
--save-dir model_dir \
--ddp-backend=no_c10d \
--criterion label_smoothed_cross_entropy \
--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr 0.0005 --lr-scheduler inverse_sqrt \
--min-lr 1e-09 --warmup-updates 4000 \
--warmup-init-lr 1e-07 --label-smoothing 0.1 \
--dropout 0.25 --weight-decay 0.0 \
--max-tokens 16000 \
--log-format 'simple' --log-interval 100 \
--fixed-validation-seed 7 \
--save-interval-updates 10000 \
--max-update 300000 \
--update-freq 1 \
--fp16 \
--save-interval 1
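The command above uses the base architecture (`transformer_wmt_en_de`); for the big configuration, the `--arch` flag would presumably be switched to fairseq's `transformer_vaswani_wmt_en_de_big`, with hyperparameters such as dropout and learning rate adjusted accordingly.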
We use newsdev2017 and newstest2017 as the development and test sets, respectively. We apply beam search with a beam width of 5 and tune the length penalty over [0.0, 0.2, ..., 2.0] on the development set. SacreBLEU (Post, 2018) is used to evaluate translation performance on the WMT17 English->Chinese test set; the results are shown in the table below.
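As a rough illustration, the test-set score can be computed with SacreBLEU's Python API along the following lines; the file names are hypothetical placeholders, and `tokenize='zh'` applies SacreBLEU's built-in Chinese tokenization before scoring.

```python
# Hedged sketch: score detokenized system output against the reference with
# SacreBLEU. File names below are placeholders, not files in this repository.
import sacrebleu

with open('newstest2017.hyp.zh', encoding='utf-8') as f:
    hyps = [line.strip() for line in f]
with open('newstest2017.ref.zh', encoding='utf-8') as f:
    refs = [line.strip() for line in f]

bleu = sacrebleu.corpus_bleu(hyps, [refs], tokenize='zh')
print(f'BLEU = {bleu.score:.2f}')
```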
| Model | Transformer-Base | Transformer-Big |
|---|---|---|
| BLEU | 35.37 | 36.73 |