This repo contains the code and a model for the first-place solution of RuSimpleSentEval.
It is heavily based on the ideas from Multilingual Unsupervised Sentence Simplification (MUSS).
In short, it is mBART fine-tuned on paraphrases from ParaPhraserPlus and an automatically translated WikiSimple, conditioned on specific control tokens.
The control tokens make it possible to train the model on everything that is semantically related and then, at inference time, pick the control token values that work best for simplification (according to some metric).
The following control tokens were implemented (a sketch of how their values are computed is given after this list):
- Levenshtein similarity: how similar (by the Levenshtein metric) the output should be to the input;
- Chars fraction: how long the output should be, i.e. the ratio between the output and input lengths in characters;
- Word rank: how simple the output is expected to be (again a ratio, between the ranks of the words of the two texts in fastText embeddings). In my opinion it didn't work well;
- Lexeme similarity: how similar the output should be to the input in terms of lexeme matching. It wasn't used in the final model, though.
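A minimal sketch of how per-pair control values can be computed (this is not the repo's exact code; the `word_rank` dictionary mapping a word to its frequency rank in the fastText vocabulary is a hypothetical input, and the exact aggregation for the word rank ratio may differ):

```python
import math

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                   # deletion
                           cur[j - 1] + 1,                # insertion
                           prev[j - 1] + (ca != cb)))     # substitution
        prev = cur
    return prev[-1]

def lev_sim(src: str, dst: str) -> float:
    # 1.0 means identical strings, 0.0 means completely different.
    return 1.0 - levenshtein(src, dst) / max(len(src), len(dst), 1)

def chars_fraction(src: str, dst: str) -> float:
    # Length ratio between the output and the input, in characters.
    return len(dst) / max(len(src), 1)

def word_rank_ratio(src: str, dst: str, word_rank: dict) -> float:
    # Ratio of mean log word ranks (words missing from the vocabulary
    # get the worst rank); the aggregation in the repo may differ.
    def mean_log_rank(text: str) -> float:
        ranks = [word_rank.get(w.lower(), len(word_rank)) for w in text.split()]
        return sum(math.log(1 + r) for r in ranks) / max(len(ranks), 1)
    return mean_log_rank(dst) / max(mean_log_rank(src), 1e-9)
```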
Downloads:
- Fairseq checkpoint
- Huggingface checkpoint + control token mapping
- Preprocessed train data + some files used for preprocessing
The training process consists of the following steps.
The instructions are based on the baseline solution.
- Install SentencePiece:
sudo apt-get install cmake build-essential pkg-config libgoogle-perftools-dev
git clone https://github.com/google/sentencepiece.git
cd sentencepiece
mkdir build
cd build
cmake ..
make
sudo make install
sudo ldconfig -v
- Install fairseq (the current pip version doesn't have all the required features, but that should change at some point):
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable ./
I also added the following line:
diff --git a/fairseq/tasks/translation_from_pretrained_bart.py b/fairseq/tasks/translation_from_pretrained_bart.py
index 8710b7f..a7ff1db 100644
--- a/fairseq/tasks/translation_from_pretrained_bart.py
+++ b/fairseq/tasks/translation_from_pretrained_bart.py
@@ -51,6 +51,7 @@ class TranslationFromPretrainedBARTTask(TranslationTask):
     def __init__(self, args, src_dict, tgt_dict):
         super().__init__(args, src_dict, tgt_dict)
+        self.args = args
         self.langs = args.langs.split(",")
         for d in [src_dict, tgt_dict]:
             for l in self.langs:
I have no idea whether it is still needed in your (newer) version of fairseq.
- I stored everything in the data folder and ran my scripts from the solution folder, so the hierarchy looks like this:
- data/
-- data/data-bin/
-- data/ParaPhraserPlus
-- data/preprocessed_data/
-- data/WikiSimple-translated
- solution/
Assuming that you are in the solution folder, run:
mkdir ../data
mkdir ../data/preprocessed_data/
And download the mBART checkpoint:
wget https://dl.fbaipublicfiles.com/fairseq/models/mbart/mbart.cc25.v2.tar.gz
tar -xzvf mbart.cc25.v2.tar.gz -C ../data
Everything below assumes that you are running the commands from the solution folder; sorry, I was too lazy to set this up in a nicer way.
However, you can skip these steps and use the data from the downloads section.
- Specify the following environment variables:
SPM=<path to the sentencepiece spm_encode binary>
BPE_MODEL=../data/mbart.cc25.v2/sentence.bpe.model
DATA_DIR=../data/preprocessed_data
PREPROCESSED_DATA_DIR=../data/data-bin
DICT=../data/mbart.cc25.v2/dict.txt
- Prepare the data: download ParaPhraserPlus and the automatically translated WikiSimple and extract them into the data folder. Run:
python prepare_control_tokens.py
python merge_data.py
to prepare data.
- Preprocess everything with sentencepiece:
${SPM} --model=${BPE_MODEL} < ${DATA_DIR}/train.src > ${DATA_DIR}/train.spm.src &
${SPM} --model=${BPE_MODEL} < ${DATA_DIR}/valid.src > ${DATA_DIR}/valid.spm.src &
${SPM} --model=${BPE_MODEL} < ${DATA_DIR}/train.dst > ${DATA_DIR}/train.spm.dst &
${SPM} --model=${BPE_MODEL} < ${DATA_DIR}/valid.dst > ${DATA_DIR}/valid.spm.dst &
- Add the control tokens to the data:
python add_control_tokens.py
It computes the control token values for each (src, dst) pair and maps them to unused tokens from the dictionary, which are then used for the conditioning. It would have been cleaner to add brand-new tokens for this purpose, but I didn't know for sure how to add new tokens to a model in fairseq, so I stuck with the hackier option. A sketch of the idea is shown below.
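A minimal sketch of this trick (not the repo's actual implementation; the bin step, the unused token names, and the helper names are hypothetical):

```python
def bin_value(value: float, step: float = 0.05) -> float:
    # Discretize a continuous control value, e.g. 0.43 -> 0.45.
    return round(value / step) * step

def build_mapping(controls, bins, unused_tokens):
    # Assign one unused dictionary token to every (control, bin) combination.
    mapping, it = {}, iter(unused_tokens)
    for control in controls:
        for b in bins:
            mapping[(control, round(b, 2))] = next(it)
    return mapping

def condition(src: str, control_values: dict, mapping: dict) -> str:
    # Prepend the tokens that encode the desired control values.
    prefix = [mapping[(name, round(bin_value(v), 2))]
              for name, v in control_values.items()]
    return " ".join(prefix + [src])

# Example usage with made-up "unused" tokens standing in for rarely used
# entries of dict.txt:
bins = [round(0.05 * i, 2) for i in range(1, 41)]            # 0.05 .. 2.0
mapping = build_mapping(["NbChars", "LevSim", "WordRank"], bins,
                        unused_tokens=[f"▁unused{i}" for i in range(1000)])
print(condition("Это исходное предложение.",
                {"NbChars": 0.95, "LevSim": 0.4, "WordRank": 1.6}, mapping))
```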
- Binarize the data with fairseq-preprocess:
fairseq-preprocess \
--source-lang src \
--target-lang dst \
--trainpref ${DATA_DIR}/train.spm \
--validpref ${DATA_DIR}/valid.spm \
--destdir ${PREPROCESSED_DATA_DIR} \
--thresholdtgt 0 \
--thresholdsrc 0 \
--srcdict ${DICT} \
--tgtdict ${DICT} \
--workers 70
- Train:
PRETRAIN=../data/mbart.cc25.v2/model.pt
langs=ar_AR,cs_CZ,de_DE,en_XX,es_XX,et_EE,fi_FI,fr_XX,gu_IN,hi_IN,it_IT,ja_XX,kk_KZ,ko_KR,lt_LT,lv_LV,my_MM,ne_NP,nl_XX,ro_RO,ru_RU,si_LK,tr_TR,vi_VN,zh_CN
CUDA_VISIBLE_DEVICES=0 \
fairseq-train ../data/data-bin \
--encoder-normalize-before --decoder-normalize-before \
--arch mbart_large --layernorm-embedding \
--task translation_from_pretrained_bart \
--criterion label_smoothed_cross_entropy --label-smoothing 0.2 \
--optimizer adam --adam-eps 1e-06 --adam-betas '(0.9, 0.98)' \
--lr-scheduler polynomial_decay --lr 3e-05 --warmup-updates 10000 --total-num-update 100000 \
--dropout 0.3 --attention-dropout 0.1 --weight-decay 0.0 \
--max-tokens 1024 \
--source-lang src --target-lang dst \
--batch-size 8 \
--update-freq 4 \
--validate-interval 1 \
--patience 3 \
--max-epoch 25 \
--save-interval-updates 500 --keep-interval-updates 1 --keep-best-checkpoints 1 --no-save-optimizer-state \
--seed 42 --log-format tqdm \
--restore-file ${PRETRAIN} \
--reset-optimizer --reset-meters --reset-dataloader --reset-lr-scheduler \
--ddp-backend no_c10d \
--langs $langs \
--scoring bleu \
--save-dir ../checkpoints
Choose the batch-size and update-freq values that suit your GPU best, and use fp16 whenever possible.
- Generate:
SAVE_DIR=../checkpoints
CUDA_VISIBLE_DEVICES=0 LANG=C.UTF-8 LC_ALL=C.UTF-8 \
fairseq-generate ${DATA_DIR} \
--path ${SAVE_DIR}/checkpoint_best.pt \
--task translation_from_pretrained_bart \
--gen-subset test \
--source-lang src --target-lang dst \
--bpe 'sentencepiece' --sentencepiece-model ${BPE_MODEL} \
--sacrebleu --remove-bpe 'sentencepiece' \
--batch-size 32 --langs $langs > model_prediction.txt &
cat model_prediction.txt | grep -P "^H" | sort -V | cut -f 3- > model_prediction.hyp
- Find the control token values that work best. Generate data with random control token combinations and choose the best one by SARI on the dev set:
python generate_devs_with_control_tokens.py
The script generates many copies of the dev set, each conditioned on different control token values. Generate simplifications for them with your model and evaluate them using the easse evaluate
script; the combination with the highest SARI wins. A sketch of this selection step is shown below.
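For illustration, a hedged sketch of the selection step, assuming easse's corpus_sari interface and one hypothesis file per control token combination (the file names and the candidates dict are hypothetical placeholders):

```python
from easse.sari import corpus_sari

def read_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f]

orig = read_lines("dev.orig")            # original dev sentences
refs = [read_lines("dev.ref")]           # reference simplifications (one list per reference set)

# Map each tried (NbChars, LevSim, WordRank) combination to the hypotheses
# generated for the dev set conditioned on it.
candidates = {
    (0.95, 0.4, 1.6): "hyp_nbchars0.95_levsim0.4_wordrank1.6.txt",
    (0.80, 0.6, 1.0): "hyp_nbchars0.80_levsim0.6_wordrank1.0.txt",
    # ... one entry per combination produced by generate_devs_with_control_tokens.py
}

best_params, best_sari = None, -1.0
for params, hyp_file in candidates.items():
    sys_sents = read_lines(hyp_file)
    score = corpus_sari(orig_sents=orig, sys_sents=sys_sents, refs_sents=refs)
    if score > best_sari:
        best_params, best_sari = params, score

print("best (NbChars, LevSim, WordRank):", best_params, "SARI:", best_sari)
```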
I used the following params:
NbChars = 0.95
LevSim = 0.4
WordRank = 1.6
To be honest, the model hallucinates a lot with this setup, but SARI prefers it to any other (saner, in my opinion) combination of tokens...
Better control token values could have been selected with some human evaluation if it weren't so expensive, of course.
- Preprocess the test dataset as in Data Preprocessing step 3 and run:
python generate_test_with_control_tokens.py
using your optimal control token values.