This repository contains the preprocessing scripts used for the WMT17 English-Chinese translation task at the 2017 Conference on Machine Translation (WMT17). The preprocessing mostly follows Hassan et al. (2018) and, with some minor changes, results in roughly 20M sentence pairs. In particular, we filter the bilingual corpus according to the following criteria (a sketch of the combined filters follows the list):
- Both the source and target sentences should contain at most 80 words.
- Sentence pairs with blank lines are removed (`remove_blanks.py`).
- Chinese sentences without any Chinese characters are removed (`is_chinese.py`).
- Duplicated sentence pairs are removed (`deduplicate_lines.py`).
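For reference, the following is a minimal sketch of how these filters could be combined in a single pass; the function and variable names are illustrative and do not correspond to the exact code in `remove_blanks.py`, `is_chinese.py`, or `deduplicate_lines.py`.

```python
# Illustrative sketch of the filtering criteria above (assumed names, not
# the exact repository scripts).
import re

MAX_WORDS = 80
HAS_CHINESE = re.compile(r'[\u4e00-\u9fff]')  # CJK Unified Ideographs range

def keep_pair(en, zh, seen):
    """Return True if an (English, Chinese) sentence pair passes all filters."""
    en, zh = en.strip(), zh.strip()
    # Drop pairs with a blank side (cf. remove_blanks.py).
    if not en or not zh:
        return False
    # Keep only pairs with at most 80 whitespace-separated words on each side;
    # this assumes the Chinese side is already word-segmented.
    if len(en.split()) > MAX_WORDS or len(zh.split()) > MAX_WORDS:
        return False
    # Drop Chinese sentences without any Chinese characters (cf. is_chinese.py).
    if not HAS_CHINESE.search(zh):
        return False
    # Drop duplicated sentence pairs (cf. deduplicate_lines.py).
    if (en, zh) in seen:
        return False
    seen.add((en, zh))
    return True
```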
Using the preprocessed dataset, we train Transformer models in both the base and big configurations (Vaswani et al., 2017) with the fairseq toolkit on 8 Tesla V100 GPUs. The training script is:
python train.py \
data-bin/wmt17.en-zh \
--source-lang en --target-lang zh \
--arch transformer_wmt_en_de \
--save-dir model_dir \
--ddp-backend=no_c10d \
--criterion label_smoothed_cross_entropy \
--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr 0.0005 --lr-scheduler inverse_sqrt \
--min-lr 1e-09 --warmup-updates 4000 \
--warmup-init-lr 1e-07 --label-smoothing 0.1 \
--dropout 0.25 --weight-decay 0.0 \
--max-tokens 16000 \
--log-format 'simple' --log-interval 100 \
--fixed-validation-seed 7 \
--save-interval-updates 10000 \
--max-update 300000 \
--update-freq 1 \
--fp16 \
--save-interval 1
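The command above uses the base architecture (`transformer_wmt_en_de`); for the big configuration, the `--arch` flag would presumably be switched to fairseq's `transformer_vaswani_wmt_en_de_big`, with hyperparameters such as dropout and learning rate adjusted accordingly.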
We use newsdev2017 and newstest2017 as the development and test sets, respectively. We apply beam search with a beam width of 5 and tune the length penalty over [0.0, 0.2, ..., 2.0] on the development set. SacreBLEU (Post, 2018) is used to evaluate translation performance on the WMT17 English->Chinese test set; the results are shown in the table below.
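As a rough illustration, the test-set score can be computed with SacreBLEU's Python API along the following lines; the file names are hypothetical placeholders, and `tokenize='zh'` applies SacreBLEU's built-in Chinese tokenization before scoring.

```python
# Hedged sketch: score detokenized system output against the reference with
# SacreBLEU. File names below are placeholders, not files in this repository.
import sacrebleu

with open('newstest2017.hyp.zh', encoding='utf-8') as f:
    hyps = [line.strip() for line in f]
with open('newstest2017.ref.zh', encoding='utf-8') as f:
    refs = [line.strip() for line in f]

bleu = sacrebleu.corpus_bleu(hyps, [refs], tokenize='zh')
print(f'BLEU = {bleu.score:.2f}')
```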
| Model | Transformer-Base | Transformer-Big |
|---|---|---|
| BLEU | 35.37 | 36.73 |