This is the implementation of the Tencent AI Lab - Shanghai Jiao Tong University (TAL-SJTU) English-Livonian systems submitted to the Seventh Conference on Machine Translation (WMT22). We provide all the models, data, code and scripts in this repository. More details are available in our system description paper.
Note: We find that Liv4ever-MT has been underestimated due to inconsistent Unicode normalization. Please see liv4ever-mt-re-eval to reproduce our results.
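As a quick illustration of why normalization matters, the same accented letter can be encoded either as one precomposed code point or as a base letter plus a combining diacritic; string-level metrics such as BLEU treat the two encodings as different unless both hypothesis and reference are normalized:

import unicodedata

# "ō" as a single precomposed code point vs. "o" + a combining macron
composed = "\u014d"        # ō (NFC form)
decomposed = "o\u0304"     # o + U+0304 COMBINING MACRON (NFD form)

print(composed == decomposed)                                # False: different code-point sequences
print(unicodedata.normalize("NFC", decomposed) == composed)  # True after normalization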
News
- We won 1st place 🥇 for English⇒Livonian and 2nd place 🥈 for Livonian⇒English (Unconstrained System). [Official Results]
Our approach consists of five steps:
- Cross-model word embedding alignment: transfer the word embeddings of Liv4ever-MT to M2M100, enabling it to support Livonian (a minimal sketch of the idea follows this list).
- 4-lingual M2M training: many-to-many translation training for all language pairs in {En, Liv, Et, Lv}, using only parallel data.
- Synthetic data generation: generate synthetic bi-text for En-Liv, using Et and Lv as pivot languages.
- Combine data and retrain: combine all the authentic and synthetic bi-text and retrain the model.
- Fine-tune & post-process: fine-tune the model on En⇔Liv using the validation set and perform online back-translation using monolingual data. Finally, apply rule-based post-processing to the model output.
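For intuition, cross-model word embedding alignment learns a mapping from Liv4ever-MT's 512-dimensional embedding space into M2M100's 1024-dimensional space, supervised by the tokens the two vocabularies share. The snippet below is only a minimal least-squares sketch of that idea with random placeholder embeddings; the actual alignment is performed by the CMEA scripts further down.

import numpy as np

rng = np.random.default_rng(0)

# Placeholder embeddings for the overlapping vocabulary (the real ones are
# extracted from the two checkpoints): n shared tokens, 512-d source vs. 1024-d target.
n, d_src, d_tgt = 5000, 512, 1024
X = rng.standard_normal((n, d_src))   # Liv4ever-MT embeddings of the shared tokens
Y = rng.standard_normal((n, d_tgt))   # M2M100 embeddings of the same tokens

# Least-squares linear map W (d_src x d_tgt) minimizing ||XW - Y||^2.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Any Liv4ever-MT embedding can now be projected into M2M100's embedding space.
projected = X @ W
print(projected.shape)  # (5000, 1024)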
# M2M100 1.2B
mkdir -p PTModels/M2M100
wget -P PTModels/M2M100 https://dl.fbaipublicfiles.com/m2m_100/1.2B_last_checkpoint.pt
wget -P PTModels/M2M100 https://dl.fbaipublicfiles.com/m2m_100/model_dict.128k.txt
wget -P PTModels/M2M100 https://dl.fbaipublicfiles.com/m2m_100/language_pairs_small_models.txt
# Liv4ever-MT
yum install git-lfs
git lfs install
git clone https://huggingface.co/tartuNLP/liv4ever-mt PTModels/Liv4ever-MT
Requirements:
- python==3.8.12
- pytorch==1.10.0
- sentencepiece==0.1.96
- nltk==3.7
- sacrebleu==2.0.0
- fairseq: install the local copy with pip3 install -e ./fairseq
Cross-model word embedding alignment (CMEA)
- Processed model: 1.2B_last_checkpoint_cmea_emb.pt
- Dictionary: merge_dict.txt
- CMEA scripts

Note: You can use --help to see the full usage of each script.

SRC_MODEL_NAME=liv4ever_mt
TGT_MODEL_NAME=m2m100_1_2B
CEMA_DIR=PTModels/M2M100-CMEA
mkdir -p $CEMA_DIR

# Obtain the overlapping vocabulary
python3 tools/get-overlap.py \
    --d1 PTModels/Liv4ever-MT/dict.src.txt \
    --d2 PTModels/M2M100/model_dict.128k.txt \
    > $CEMA_DIR/overlap-voc.$SRC_MODEL_NAME-$TGT_MODEL_NAME.txt

# Extract word embeddings from the models
python3 tools/extract-word-emb.py \
    --model PTModels/Liv4ever-MT/checkpoint_best.pt \
    --dict PTModels/Liv4ever-MT/dict.src.txt \
    --name $SRC_MODEL_NAME \
    --dest $CEMA_DIR/word-emb-$SRC_MODEL_NAME.pth
python3 tools/extract-word-emb.py \
    --model PTModels/M2M100/1.2B_last_checkpoint.pt \
    --dict PTModels/M2M100/model_dict.128k.txt \
    --name $TGT_MODEL_NAME \
    --dest $CEMA_DIR/word-emb-$TGT_MODEL_NAME.pth

# Cross-model word embedding alignment
python3 tools/CMEA/supervised-inconsistent-dimensions.py \
    --exp_path $CEMA_DIR \
    --exp_name $SRC_MODEL_NAME-$TGT_MODEL_NAME-cema \
    --exp_id main \
    --src_lang $SRC_MODEL_NAME \
    --tgt_lang $TGT_MODEL_NAME \
    --src_emb_dim 512 \
    --tgt_emb_dim 1024 \
    --n_refinement 0 \
    --cuda False \
    --dico_train $CEMA_DIR/overlap-voc.$SRC_MODEL_NAME-$TGT_MODEL_NAME.txt \
    --src_emb $CEMA_DIR/word-emb-$SRC_MODEL_NAME.pth \
    --tgt_emb $CEMA_DIR/word-emb-$TGT_MODEL_NAME.pth \
    --export pth

# Get the final dictionary (Liv4ever-MT's dict + Lang tokens + madeupwords)
cat PTModels/Liv4ever-MT/dict.trg.txt > $CEMA_DIR/merge_dict.txt
echo "__liv__ 1" >> $CEMA_DIR/merge_dict.txt
sed -n '128001,128100p' PTModels/M2M100/model_dict.128k.txt >> $CEMA_DIR/merge_dict.txt
echo "madeupwordforbt 1" >> $CEMA_DIR/merge_dict.txt
echo "madeupword0000 0" >> $CEMA_DIR/merge_dict.txt
echo "madeupword0001 0" >> $CEMA_DIR/merge_dict.txt

# Replace the original embedding with the new one
python3 tools/CMEA/change-emb.py \
    --model PTModels/M2M100/1.2B_last_checkpoint.pt \
    --emb1 $CEMA_DIR/$SRC_MODEL_NAME-$TGT_MODEL_NAME-cema/main/vectors-$SRC_MODEL_NAME.pth \
    --emb2 $CEMA_DIR/$SRC_MODEL_NAME-$TGT_MODEL_NAME-cema/main/vectors-$TGT_MODEL_NAME.pth \
    --dict $CEMA_DIR/merge_dict.txt \
    --add-mask \
    --dest $CEMA_DIR/1.2B_last_checkpoint_cmea_emb.pt

echo "The processed model is stored in $CEMA_DIR/1.2B_last_checkpoint_cmea_emb.pt"
echo "The processed dictionary is stored in $CEMA_DIR/merge_dict.txt"
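After the pipeline finishes, it can be worth sanity-checking that the new embedding matrix lines up with the merged dictionary. The snippet below is only an illustrative check under two assumptions: that the checkpoint follows the usual fairseq layout (a top-level "model" state dict with the shared embeddings under "encoder.embed_tokens.weight"), and that fairseq prepends its four special symbols (<s>, <pad>, </s>, <unk>) to the entries of the on-disk dictionary.

import torch

ckpt = torch.load("PTModels/M2M100-CMEA/1.2B_last_checkpoint_cmea_emb.pt", map_location="cpu")
emb = ckpt["model"]["encoder.embed_tokens.weight"]  # assumed key; adjust if the layout differs

with open("PTModels/M2M100-CMEA/merge_dict.txt", encoding="utf-8") as f:
    dict_size = sum(1 for _ in f)

# The two numbers should agree if the assumptions above hold.
print(emb.shape[0], dict_size + 4)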
We provide filtered data for download, both authentic and synthetic (En-Liv only). Download the files to the data/mono or data/para directory; the resulting structure should be:
data
├── data-bin
├── eval
│ ├── benchmark-test.en
│ ├── benchmark-test.et
│ ├── benchmark-test.liv
│ ├── benchmark-test.lv
│ ├── process-eval-data.sh
│ └── wmttest2022.en-de.en
├── mono
│ ├── clean.en
│ ├── clean.liv
│ └── process-mono-data.sh
└── para
├── clean.auth.en-et.en
├── clean.auth.en-et.et
├── clean.auth.en-liv.en
├── clean.auth.en-liv.liv
├── clean.auth.en-lv.en
├── clean.auth.en-lv.lv
├── clean.auth.et-liv.et
├── clean.auth.et-liv.liv
├── clean.auth.et-lv.et
├── clean.auth.et-lv.lv
├── clean.auth.liv-lv.liv
├── clean.auth.liv-lv.lv
├── clean.syn.en-liv.en
├── clean.syn.en-liv.liv
└── process-para-data.sh
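Each clean.*.<pair>.<lang> file pair is line-aligned parallel text. If you want a quick sanity check after downloading, something like the following works (an illustrative snippet, not part of the provided scripts; adjust the paths to the pairs you downloaded):

from pathlib import Path

# Hypothetical check: both sides of a parallel pair must have the same number of lines.
pairs = [
    ("data/para/clean.auth.en-liv.en", "data/para/clean.auth.en-liv.liv"),
    ("data/para/clean.syn.en-liv.en", "data/para/clean.syn.en-liv.liv"),
]
for src_path, tgt_path in pairs:
    n_src = sum(1 for _ in Path(src_path).open(encoding="utf-8"))
    n_tgt = sum(1 for _ in Path(tgt_path).open(encoding="utf-8"))
    assert n_src == n_tgt, f"{src_path} and {tgt_path} are not line-aligned"
    print(f"{src_path} <-> {tgt_path}: {n_src} lines")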
Encode raw text into sentence pieces and binarize (this may take a long time):
# apply spm and binarize
sh data/eval/process-eval-data.sh
sh data/para/process-para-data.sh
sh data/mono/process-mono-data.sh
# create data-bins
sh data/data-bin/create-data-bin.sh
The binary files will be stored in data/data-bin/auth (authentic) and data/data-bin/auth-syn (authentic + synthetic).
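The processing scripts take care of the SentencePiece step; for reference, this is roughly what encoding a raw line looks like with the sentencepiece Python API (the model path is a placeholder for whichever .model file the scripts use):

import sentencepiece as spm

# Placeholder path: point this at the SentencePiece model used by the processing scripts.
sp = spm.SentencePieceProcessor(model_file="path/to/spm.model")

line = "This is a raw English sentence."
pieces = sp.encode(line, out_type=str)
print(" ".join(pieces))            # space-joined pieces, the format fairseq-preprocess expects
print(sp.decode(pieces) == line)   # decoding restores the original text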
4-lingual M2M training
- GPUs: 4 nodes x 8 A100-SXM4-40GB/node
- Trained model: m2m04.pt
- Training script:
EXP_NAME=ptm.mm100-1.2b-cmea+task.mt+lang.enlvetli+temp.5+data.auth
mkdir -p $EXP_NAME
python3 -m torch.distributed.launch --nproc_per_node=8 \
    --nnodes=4 --node_rank=0 --master_addr="xxx.xxx.xxx.xxx" \
    --master_port=xxxxx \
    $(which fairseq-train) data/data-bin/auth \
    --finetune-from-model PTModels/M2M100-CMEA/1.2B_last_checkpoint_cmea_emb.pt \
    --num-workers 0 \
    --encoder-normalize-before \
    --decoder-normalize-before \
    --arch transformer_wmt_en_de_big \
    --task multilingual_semisupervised_translation \
    --train-tasks mt \
    --share-all-embeddings \
    --share-decoder-input-output-embed \
    --encoder-layerdrop 0.05 \
    --decoder-layerdrop 0.05 \
    --activation-dropout 0.0 \
    --encoder-layers 24 \
    --decoder-layers 24 \
    --encoder-ffn-embed-dim 8192 \
    --decoder-ffn-embed-dim 8192 \
    --encoder-embed-dim 1024 \
    --decoder-embed-dim 1024 \
    --sampling-method temperature \
    --sampling-temperature 5 \
    --encoder-langtok src \
    --decoder-langtok \
    --langs en,liv,et,lv \
    --lang-pairs en-liv,liv-en,en-et,et-en,en-lv,lv-en,liv-et,et-liv,liv-lv,lv-liv,et-lv,lv-et \
    --criterion label_smoothed_cross_entropy \
    --label-smoothing 0.2 \
    --optimizer adam \
    --adam-eps 1e-08 \
    --adam-betas 0.9,0.98 \
    --lr-scheduler inverse_sqrt \
    --lr 0.0005 \
    --warmup-init-lr 1e-07 \
    --warmup-updates 2000 \
    --max-update 10000 \
    --dropout 0.3 \
    --attention-dropout 0.1 \
    --weight-decay 0.0 \
    --max-tokens 1024 \
    --max-tokens-valid 1024 \
    --update-freq 2 \
    --virtual-epoch-size 10000000 \
    --skip-remainder-batch \
    --no-progress-bar \
    --log-format simple \
    --log-interval 2 \
    --best-checkpoint-metric loss \
    --patience 5 \
    --skip-invalid-size-inputs-valid-test \
    --no-epoch-checkpoints \
    --eval-lang-pairs et-liv,liv-et,lv-liv,liv-lv \
    --valid-subset valid \
    --validate-interval-updates 500 \
    --save-interval-updates 500 \
    --keep-interval-updates 5 \
    --fp16 \
    --seed 42 \
    --ddp-backend no_c10d \
    --save-dir $EXP_NAME/ckpts \
    --distributed-no-spawn \
    --tensorboard-logdir $EXP_NAME/tensorboard
mv $EXP_NAME/ckpts/checkpoint_best.pt $EXP_NAME/ckpts/m2m04.pt
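The --sampling-temperature 5 flag controls how strongly low-resource pairs are up-sampled during multilingual training: a pair with n_i sentence pairs is sampled with probability proportional to (n_i / N)^(1/T). A small illustration with made-up corpus sizes (not the real data statistics):

# Illustrative only: the corpus sizes below are invented.
sizes = {"en-lv": 4_000_000, "en-et": 2_000_000, "en-liv": 20_000}
T = 5

total = sum(sizes.values())
weights = {pair: (n / total) ** (1 / T) for pair, n in sizes.items()}
z = sum(weights.values())
for pair, w in weights.items():
    print(f"{pair}: raw share {sizes[pair] / total:.4f} -> sampled share {w / z:.4f}")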
Combine data and retrain
- GPUs: 4 nodes x 8 A100-SXM4-40GB/node
- Trained model: m2m04-retrained.pt (slightly different from that in the paper)
- Training script:
EXP_NAME=ptm.mm100-1.2b-cema+task.mt+lang.enlvetli+samp.concat+data.auth-syn
mkdir -p $EXP_NAME
python3 -m torch.distributed.launch --nproc_per_node=8 \
    --nnodes=4 --node_rank=0 --master_addr="xxx.xxx.xxx.xxx" \
    --master_port=xxxxx \
    $(which fairseq-train) data/data-bin/auth-syn \
    --finetune-from-model PTModels/M2M100-CMEA/1.2B_last_checkpoint_cmea_emb.pt \
    --num-workers 0 \
    --encoder-normalize-before \
    --decoder-normalize-before \
    --arch transformer_wmt_en_de_big \
    --task multilingual_semisupervised_translation \
    --train-tasks mt \
    --share-all-embeddings \
    --share-decoder-input-output-embed \
    --encoder-layerdrop 0.05 \
    --decoder-layerdrop 0.05 \
    --activation-dropout 0.0 \
    --encoder-layers 24 \
    --decoder-layers 24 \
    --encoder-ffn-embed-dim 8192 \
    --decoder-ffn-embed-dim 8192 \
    --encoder-embed-dim 1024 \
    --decoder-embed-dim 1024 \
    --encoder-langtok src \
    --decoder-langtok \
    --langs en,liv,et,lv \
    --lang-pairs en-liv,liv-en,en-et,et-en,en-lv,lv-en,liv-et,et-liv,liv-lv,lv-liv,et-lv,lv-et \
    --criterion label_smoothed_cross_entropy \
    --label-smoothing 0.2 \
    --optimizer adam \
    --adam-eps 1e-08 \
    --adam-betas 0.9,0.98 \
    --lr-scheduler inverse_sqrt \
    --lr 0.0005 \
    --warmup-init-lr 1e-07 \
    --warmup-updates 2000 \
    --max-update 10000 \
    --dropout 0.3 \
    --attention-dropout 0.1 \
    --weight-decay 0.0 \
    --max-tokens 1024 \
    --max-tokens-valid 1024 \
    --update-freq 2 \
    --virtual-epoch-size 10000000 \
    --skip-remainder-batch \
    --no-progress-bar \
    --log-format simple \
    --log-interval 2 \
    --best-checkpoint-metric loss \
    --patience 10 \
    --skip-invalid-size-inputs-valid-test \
    --no-epoch-checkpoints \
    --eval-lang-pairs en-liv,liv-en \
    --valid-subset valid \
    --validate-interval-updates 500 \
    --save-interval-updates 500 \
    --keep-interval-updates 5 \
    --fp16 \
    --seed 42 \
    --ddp-backend no_c10d \
    --save-dir $EXP_NAME/ckpts \
    --distributed-no-spawn \
    --tensorboard-logdir $EXP_NAME/tensorboard
mv $EXP_NAME/ckpts/checkpoint_best.pt $EXP_NAME/ckpts/m2m04-retrained.pt
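Both large-scale training stages use the inverse_sqrt learning-rate scheduler (--lr 0.0005, --warmup-init-lr 1e-07, --warmup-updates 2000): the learning rate warms up linearly to the peak value and then decays with the inverse square root of the update number. A compact sketch of that schedule, following the standard fairseq formulation:

def inverse_sqrt_lr(step, peak_lr=5e-4, warmup_init_lr=1e-7, warmup_updates=2000):
    """Learning rate at a given update for an inverse_sqrt schedule."""
    if step < warmup_updates:
        # linear warmup from warmup_init_lr to peak_lr
        return warmup_init_lr + (peak_lr - warmup_init_lr) * step / warmup_updates
    # decay proportionally to 1/sqrt(step) after warmup
    return peak_lr * (warmup_updates ** 0.5) / (step ** 0.5)

for step in (1, 1000, 2000, 5000, 10000):
    print(step, f"{inverse_sqrt_lr(step):.2e}")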
Fine-tuning
- GPUs: 1 node x 1 A100-SXM4-40GB
- Trained model: m2m04-retrained-finetuned.pt (slightly different from that in the paper)
- Training script:
EXP_NAME=ptm.retrained+task.mt-bt+lang.enliv+samp.uni+data.valid-and-mono
mkdir -p $EXP_NAME
fairseq-train data/data-bin/auth-syn \
    --train-subset finetune \
    --finetune-from-model ptm.mm100-1.2b-cema+task.mt+lang.enlvetli+samp.concat+data.auth-syn/ckpts/m2m04-retrained.pt \
    --num-workers 0 \
    --encoder-normalize-before \
    --decoder-normalize-before \
    --arch transformer_wmt_en_de_big \
    --task multilingual_semisupervised_translation \
    --train-tasks mt,bt \
    --share-all-embeddings \
    --share-decoder-input-output-embed \
    --encoder-layerdrop 0.05 \
    --decoder-layerdrop 0.05 \
    --activation-dropout 0.0 \
    --encoder-layers 24 \
    --decoder-layers 24 \
    --encoder-ffn-embed-dim 8192 \
    --decoder-ffn-embed-dim 8192 \
    --encoder-embed-dim 1024 \
    --decoder-embed-dim 1024 \
    --sampling-method uniform \
    --encoder-langtok src \
    --decoder-langtok \
    --langs en,liv,et,lv \
    --lang-pairs liv-en,en-liv \
    --criterion label_smoothed_cross_entropy \
    --label-smoothing 0.2 \
    --optimizer adam \
    --adam-eps 1e-08 \
    --adam-betas 0.9,0.98 \
    --lr-scheduler inverse_sqrt \
    --lr 0.0001 \
    --warmup-init-lr 1e-07 \
    --warmup-updates 2000 \
    --max-update 500 \
    --dropout 0.3 \
    --attention-dropout 0.1 \
    --weight-decay 0.0 \
    --max-tokens 1024 \
    --max-tokens-valid 1024 \
    --update-freq 2 \
    --virtual-epoch-size 10000000 \
    --no-progress-bar \
    --log-format simple \
    --log-interval 2 \
    --no-epoch-checkpoints \
    --save-interval-updates 50 \
    --keep-interval-updates 2 \
    --disable-validation \
    --fp16 \
    --seed 42 \
    --ddp-backend no_c10d \
    --save-dir $EXP_NAME/ckpts \
    --distributed-no-spawn \
    --tensorboard-logdir $EXP_NAME/tensorboard
mv $EXP_NAME/ckpts/checkpoint_last.pt $EXP_NAME/ckpts/m2m04-retrained-finetuned.pt
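All three training stages optimize label-smoothed cross-entropy with --label-smoothing 0.2. For reference, here is a minimal PyTorch version of one common formulation of the loss (fairseq's criterion differs in details such as padding handling and the exact smoothing normalization):

import torch
import torch.nn.functional as F

def label_smoothed_loss(logits, targets, eps=0.2):
    """(1 - eps) * NLL of the gold token + eps * mean negative log-prob over the vocabulary."""
    lprobs = F.log_softmax(logits, dim=-1)                                   # (positions, vocab)
    nll = -lprobs.gather(dim=-1, index=targets.unsqueeze(-1)).squeeze(-1)    # gold-token NLL
    smooth = -lprobs.mean(dim=-1)                                            # uniform smoothing term
    return ((1.0 - eps) * nll + eps * smooth).mean()

logits = torch.randn(4, 10)             # toy batch: 4 positions, vocabulary of 10
targets = torch.randint(0, 10, (4,))
print(label_smoothed_loss(logits, targets))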
Generate translations
MODEL_PATH=ptm.retrained+task.mt-bt+lang.enliv+samp.uni+data.valid-and-mono/ckpts/m2m04-retrained-finetuned.pt
DICT_PATH=PTModels/M2M100-CMEA/merge_dict.txt
LNG_PAIRS=liv-en,en-liv
LNGS=en,liv,et,lv
for lng_pair in en-liv liv-en
do
SRC=${lng_pair%%-*}
TGT=${lng_pair##*-}
# generate
fairseq-generate data/data-bin/auth \
--batch-size 128 \
--path $MODEL_PATH \
--fixed-dictionary $DICT_PATH \
-s $SRC -t $TGT \
--remove-bpe 'sentencepiece' \
--beam 5 \
--task multilingual_semisupervised_translation \
--lang-pairs $LNG_PAIRS \
--langs $LNGS \
--decoder-langtok \
--encoder-langtok src \
--gen-subset test > wmttest2022.$SRC-$TGT.gen
cat wmttest2022.$SRC-$TGT.gen | grep -P "^H" | sort -V | cut -f 3- > wmttest2022.$SRC-$TGT.hyp
# generate (no-repeat)
fairseq-generate data/data-bin/auth \
--batch-size 128 \
--path $MODEL_PATH \
--fixed-dictionary $DICT_PATH \
-s $SRC -t $TGT \
--remove-bpe 'sentencepiece' \
--beam 5 \
--no-repeat-ngram-size 2 \
--task multilingual_semisupervised_translation \
--lang-pairs $LNG_PAIRS \
--langs $LNGS \
--decoder-langtok \
--encoder-langtok src \
--gen-subset test > wmttest2022.$SRC-$TGT.no-repeat.gen
cat wmttest2022.$SRC-$TGT.no-repeat.gen | grep -P "^H" | sort -V | cut -f 3- > wmttest2022.$SRC-$TGT.no-repeat.hyp
done
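The grep/sort/cut pipeline above extracts the hypotheses from fairseq-generate's log, whose hypothesis lines have the form H-<sentence_id><TAB><score><TAB><text>. If you prefer doing this in Python for further processing, an equivalent parser looks like this:

def read_hypotheses(gen_file):
    """Collect the H-lines of a fairseq-generate log and return them in sentence order."""
    hyps = {}
    with open(gen_file, encoding="utf-8") as f:
        for line in f:
            if line.startswith("H-"):
                tag, _score, text = line.rstrip("\n").split("\t", 2)
                hyps[int(tag[2:])] = text
    return [hyps[i] for i in sorted(hyps)]

hypotheses = read_hypotheses("wmttest2022.en-liv.gen")
print(len(hypotheses))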
Post-processing
for lng_pair in en-liv liv-en
do
SRC=${lng_pair%%-*}
TGT=${lng_pair##*-}
if [[ $TGT == "liv" ]]
then
python3 tools/post-process.py \
--src-file data/eval/wmttest2022.$SRC-$TGT.$SRC \
--hyp-file wmttest2022.$SRC-$TGT.hyp \
--no-repeat-hyp-file wmttest2022.$SRC-$TGT.no-repeat.hyp \
--lang $TGT > wmttest2022.$SRC-$TGT.post-processed.hyp
else
python3 tools/post-process.py \
--src-file data/eval/wmttest2022.$SRC-$TGT.$SRC \
--hyp-file wmttest2022.$SRC-$TGT.hyp \
--lang $TGT > wmttest2022.$SRC-$TGT.post-processed.hyp
fi
done
Evaluate
echo "Before post-processing:"
cat wmttest2022.en-liv.hyp | sacrebleu data/references/generaltest2022.en-liv.ref.A.liv
cat wmttest2022.liv-en.hyp | sacrebleu data/references/generaltest2022.liv-en.ref.A.en
echo "After post-processing:"
cat wmttest2022.en-liv.post-processed.hyp | sacrebleu data/references/generaltest2022.en-liv.ref.A.liv
cat wmttest2022.liv-en.post-processed.hyp | sacrebleu data/references/generaltest2022.liv-en.ref.A.en
Outputs:
Before post-processing:
{
"name": "BLEU",
"score": 16.1,
"signature": "nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0",
"verbose_score": "47.0/21.0/10.9/6.2 (BP = 1.000 ratio = 1.050 hyp_len = 9713 ref_len = 9251)",
"nrefs": "1",
"case": "mixed",
"eff": "no",
"tok": "13a",
"smooth": "exp",
"version": "2.0.0"
}
{
"name": "BLEU",
"score": 30.8,
"signature": "nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0",
"verbose_score": "62.3/37.0/24.0/16.2 (BP = 1.000 ratio = 1.003 hyp_len = 10628 ref_len = 10599)",
"nrefs": "1",
"case": "mixed",
"eff": "no",
"tok": "13a",
"smooth": "exp",
"version": "2.0.0"
}
After post-processing:
{
"name": "BLEU",
"score": 17.0,
"signature": "nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0",
"verbose_score": "49.7/22.3/11.6/6.6 (BP = 1.000 ratio = 1.010 hyp_len = 9342 ref_len = 9251)",
"nrefs": "1",
"case": "mixed",
"eff": "no",
"tok": "13a",
"smooth": "exp",
"version": "2.0.0"
}
{
"name": "BLEU",
"score": 30.8,
"signature": "nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0",
"verbose_score": "62.3/37.0/24.0/16.2 (BP = 1.000 ratio = 1.003 hyp_len = 10628 ref_len = 10599)",
"nrefs": "1",
"case": "mixed",
"eff": "no",
"tok": "13a",
"smooth": "exp",
"version": "2.0.0"
}
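If you prefer computing the scores programmatically, sacrebleu's Python API gives the same BLEU as the CLI calls above; a short sketch:

import sacrebleu

def bleu_from_files(hyp_path, ref_path):
    with open(hyp_path, encoding="utf-8") as f:
        hyps = [line.rstrip("\n") for line in f]
    with open(ref_path, encoding="utf-8") as f:
        refs = [line.rstrip("\n") for line in f]
    return sacrebleu.corpus_bleu(hyps, [refs])

bleu = bleu_from_files("wmttest2022.en-liv.post-processed.hyp",
                       "data/references/generaltest2022.en-liv.ref.A.liv")
print(bleu.score)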
Generate round-trip translations (En⇒Liv⇒En)
MODEL_PATH=ptm.retrained+task.mt-bt+lang.enliv+samp.uni+data.valid-and-mono/ckpts/m2m04-retrained-finetuned.pt
DICT_PATH=PTModels/M2M100-CMEA/merge_dict.txt
LNG_PAIRS=liv-en,en-liv
LNGS=en,liv,et,lv
EVAL_DIR=data/eval
SOURCE_FILE=$EVAL_DIR/wmttest2022.en-de.en
SOURCE_SPM_FILE=$EVAL_DIR/wmttest2022.spm.en-de.en
# generate
cat $SOURCE_SPM_FILE | fairseq-interactive $EVAL_DIR \
--batch-size 128 \
--buffer-size 1024 \
--path $MODEL_PATH \
--fixed-dictionary $DICT_PATH \
-s en -t liv \
--beam 5 \
--task multilingual_semisupervised_translation \
--lang-pairs $LNG_PAIRS \
--langs $LNGS \
--decoder-langtok \
--encoder-langtok src | grep -P "^H" | sort -V | cut -f 3- > round-trip.spm.en-liv
cat round-trip.spm.en-liv | fairseq-interactive $EVAL_DIR \
--batch-size 128 \
--buffer-size 1024 \
--path $MODEL_PATH \
--fixed-dictionary $DICT_PATH \
--remove-bpe 'sentencepiece' \
-s liv -t en \
--beam 5 \
--task multilingual_semisupervised_translation \
--lang-pairs $LNG_PAIRS \
--langs $LNGS \
--decoder-langtok \
--encoder-langtok src | grep -P "^H" | sort -V | cut -f 3- > round-trip.en-liv-en
Evaluate
cat round-trip.en-liv-en | sacrebleu $SOURCE_FILE
Outputs:
{
"name": "BLEU",
"score": 36.8,
"signature": "nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0",
"verbose_score": "69.9/45.6/31.7/22.4 (BP = 0.950 ratio = 0.951 hyp_len = 337570 ref_len = 354789)",
"nrefs": "1",
"case": "mixed",
"eff": "no",
"tok": "13a",
"smooth": "exp",
"version": "2.0.0"
}
Please cite our system description paper if you find the resources in this repository useful.
@inproceedings{he-etal-2022-tencent,
title = "Tencent {AI} Lab - Shanghai Jiao Tong University Low-Resource Translation System for the {WMT}22 Translation Task",
author = "He, Zhiwei and
Wang, Xing and
Tu, Zhaopeng and
Shi, Shuming and
Wang, Rui",
booktitle = "Proceedings of the Seventh Conference on Machine Translation (WMT)",
month = dec,
year = "2022",
address = "Abu Dhabi, United Arab Emirates (Hybrid)",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.wmt-1.18",
pages = "260--267",
}