This repo contains the code and a model for the first-place solution of RuSimpleSentEval.
It is heavily based on the ideas from Multilingual Unsupervised Sentence Simplification (MUSS).
In short, it is mBART fine-tuned on paraphrases from ParaPhraserPlus and an automatically translated WikiSimple, conditioned on specific control tokens.
The control tokens make it possible to train the model on everything that is semantically related and then, at inference time, pick the control token values that work best for simplification (according to some metric).
The following control tokens were implemented (a sketch of how their values are computed is given after this list):
- Levenshtein similarity: how similar (by the Levenshtein metric) the output should be to the input;
- Chars fraction: how long the output should be, i.e. the ratio between the output and input lengths in characters;
- Word rank: how simple the output is expected to be (again a ratio, between the ranks of the words of the two texts in fastText embeddings). In my opinion it didn't work well;
- Lexeme similarity: how similar the output should be to the input in terms of lexeme matching. It wasn't used in the final model, though.
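A minimal sketch of how per-pair control values can be computed (this is not the repo's exact code; the `word_rank` dictionary mapping a word to its frequency rank in the fastText vocabulary is a hypothetical input, and the exact aggregation for the word rank ratio may differ):

```python
import math

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                   # deletion
                           cur[j - 1] + 1,                # insertion
                           prev[j - 1] + (ca != cb)))     # substitution
        prev = cur
    return prev[-1]

def lev_sim(src: str, dst: str) -> float:
    # 1.0 means identical strings, 0.0 means completely different.
    return 1.0 - levenshtein(src, dst) / max(len(src), len(dst), 1)

def chars_fraction(src: str, dst: str) -> float:
    # Length ratio between the output and the input, in characters.
    return len(dst) / max(len(src), 1)

def word_rank_ratio(src: str, dst: str, word_rank: dict) -> float:
    # Ratio of mean log word ranks (words missing from the vocabulary
    # get the worst rank); the aggregation in the repo may differ.
    def mean_log_rank(text: str) -> float:
        ranks = [word_rank.get(w.lower(), len(word_rank)) for w in text.split()]
        return sum(math.log(1 + r) for r in ranks) / max(len(ranks), 1)
    return mean_log_rank(dst) / max(mean_log_rank(src), 1e-9)
```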
Downloads:
- Fairseq checkpoint
- Huggingface checkpoint + control token mapping
- Preprocessed train data + some files used for preprocessing
The training process consists of the following steps.
The instructions are based on the baseline solution.
- Install SentencePiece:
sudo apt-get install cmake build-essential pkg-config libgoogle-perftools-dev
git clone https://github.com/google/sentencepiece.git
cd sentencepiece
mkdir build
cd build
cmake ..
make
sudo make install
sudo ldconfig -v
- Install fairseq (the current pip version doesn't have all the required features, but that should change at some point):
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable ./
I also added the following line:
diff --git a/fairseq/tasks/translation_from_pretrained_bart.py b/fairseq/tasks/translation_from_pretrained_bart.py
index 8710b7f..a7ff1db 100644
--- a/fairseq/tasks/translation_from_pretrained_bart.py
+++ b/fairseq/tasks/translation_from_pretrained_bart.py
@@ -51,6 +51,7 @@ class TranslationFromPretrainedBARTTask(TranslationTask):
     def __init__(self, args, src_dict, tgt_dict):
         super().__init__(args, src_dict, tgt_dict)
+        self.args = args
         self.langs = args.langs.split(",")
         for d in [src_dict, tgt_dict]:
             for l in self.langs:
I have no idea whether it is still needed in your (newer) version of fairseq.
- I stored everything in the data folder and ran my scripts from the solution folder, so the hierarchy looks like this:
- data/
-- data/data-bin/
-- data/ParaPhraserPlus
-- data/preprocessed_data/
-- data/WikiSimple-translated
- solution/
Assuming that you are in the solution folder, run:
mkdir ../data
mkdir ../data/preprocessed_data/
And download the mBART checkpoint:
wget https://dl.fbaipublicfiles.com/fairseq/models/mbart/mbart.cc25.v2.tar.gz
tar -xzvf mbart.cc25.v2.tar.gz -C ../data
Everything below assumes that you are running the commands from the solution folder; sorry, I was too lazy to set this up in a nicer way.
However, you can skip these steps and use the data from the downloads section.
- Specify the following environment variables:
SPM=<path to the sentencepiece spm_encode binary>
BPE_MODEL=../data/mbart.cc25.v2/sentence.bpe.model
DATA_DIR=../data/preprocessed_data
PREPROCESSED_DATA_DIR=../data/data-bin
DICT=../data/mbart.cc25.v2/dict.txt
- Prepare the data: download ParaPhraserPlus and the automatically translated WikiSimple and extract them into the data folder. Run:
python prepare_control_tokens.py
python merge_data.py
to prepare data.
- Preprocess everything with sentencepiece:
${SPM} --model=${BPE_MODEL} < ${DATA_DIR}/train.src > ${DATA_DIR}/train.spm.src &
${SPM} --model=${BPE_MODEL} < ${DATA_DIR}/valid.src > ${DATA_DIR}/valid.spm.src &
${SPM} --model=${BPE_MODEL} < ${DATA_DIR}/train.dst > ${DATA_DIR}/train.spm.dst &
${SPM} --model=${BPE_MODEL} < ${DATA_DIR}/valid.dst > ${DATA_DIR}/valid.spm.dst &
- Add the control tokens to the data:
python add_control_tokens.py
It computes the control token values for each (src, dst) pair and maps them to unused tokens from the dictionary, which are then used for the conditioning. It would have been cleaner to add brand-new tokens for this purpose, but I didn't know for sure how to add new tokens to a model in fairseq, so I stuck with the hackier option. A sketch of the idea is shown below.
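A minimal sketch of this trick (not the repo's actual implementation; the bin step, the unused token names, and the helper names are hypothetical):

```python
def bin_value(value: float, step: float = 0.05) -> float:
    # Discretize a continuous control value, e.g. 0.43 -> 0.45.
    return round(value / step) * step

def build_mapping(controls, bins, unused_tokens):
    # Assign one unused dictionary token to every (control, bin) combination.
    mapping, it = {}, iter(unused_tokens)
    for control in controls:
        for b in bins:
            mapping[(control, round(b, 2))] = next(it)
    return mapping

def condition(src: str, control_values: dict, mapping: dict) -> str:
    # Prepend the tokens that encode the desired control values.
    prefix = [mapping[(name, round(bin_value(v), 2))]
              for name, v in control_values.items()]
    return " ".join(prefix + [src])

# Example usage with made-up "unused" tokens standing in for rarely used
# entries of dict.txt:
bins = [round(0.05 * i, 2) for i in range(1, 41)]            # 0.05 .. 2.0
mapping = build_mapping(["NbChars", "LevSim", "WordRank"], bins,
                        unused_tokens=[f"▁unused{i}" for i in range(1000)])
print(condition("Это исходное предложение.",
                {"NbChars": 0.95, "LevSim": 0.4, "WordRank": 1.6}, mapping))
```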
- Binarize the data with fairseq-preprocess:
fairseq-preprocess \
--source-lang src \
--target-lang dst \
--trainpref ${DATA_DIR}/train.spm \
--validpref ${DATA_DIR}/valid.spm \
--destdir ${PREPROCESSED_DATA_DIR} \
--thresholdtgt 0 \
--thresholdsrc 0 \
--srcdict ${DICT} \
--tgtdict ${DICT} \
--workers 70
- Train:
PRETRAIN=../data/mbart.cc25.v2/model.pt
langs=ar_AR,cs_CZ,de_DE,en_XX,es_XX,et_EE,fi_FI,fr_XX,gu_IN,hi_IN,it_IT,ja_XX,kk_KZ,ko_KR,lt_LT,lv_LV,my_MM,ne_NP,nl_XX,ro_RO,ru_RU,si_LK,tr_TR,vi_VN,zh_CN
CUDA_VISIBLE_DEVICES=0 \
fairseq-train ../data/data-bin \
--encoder-normalize-before --decoder-normalize-before \
--arch mbart_large --layernorm-embedding \
--task translation_from_pretrained_bart \
--criterion label_smoothed_cross_entropy --label-smoothing 0.2 \
--optimizer adam --adam-eps 1e-06 --adam-betas '(0.9, 0.98)' \
--lr-scheduler polynomial_decay --lr 3e-05 --warmup-updates 10000 --total-num-update 100000 \
--dropout 0.3 --attention-dropout 0.1 --weight-decay 0.0 \
--max-tokens 1024 \
--source-lang src --target-lang dst \
--batch-size 8 \
--update-freq 4 \
--validate-interval 1 \
--patience 3 \
--max-epoch 25 \
--save-interval-updates 500 --keep-interval-updates 1 --keep-best-checkpoints 1 --no-save-optimizer-state \
--seed 42 --log-format tqdm \
--restore-file ${PRETRAIN} \
--reset-optimizer --reset-meters --reset-dataloader --reset-lr-scheduler \
--ddp-backend no_c10d \
--langs $langs \
--scoring bleu \
--save-dir ../checkpoints
Choose the batch-size and update-freq values that suit your GPU best, and use fp16 whenever possible.
- Generate:
SAVE_DIR=../checkpoints
CUDA_VISIBLE_DEVICES=0 LANG=C.UTF-8 LC_ALL=C.UTF-8 \
fairseq-generate ${DATA_DIR} \
--path ${SAVE_DIR}/checkpoint_best.pt \
--task translation_from_pretrained_bart \
--gen-subset test \
--source-lang src --target-lang dst \
--bpe 'sentencepiece' --sentencepiece-model ${BPE_MODEL} \
--sacrebleu --remove-bpe 'sentencepiece' \
--batch-size 32 --langs $langs > model_prediction.txt &
cat model_prediction.txt | grep -P "^H" | sort -V | cut -f 3- > model_prediction.hyp
- Find the control token values that work best. Generate data with random control token combinations and choose the best one by SARI on the dev set:
python generate_devs_with_control_tokens.py
The script generates many copies of the dev set, each conditioned on different control token values. Generate simplifications for them with your model and evaluate them using the easse evaluate
script; the combination with the highest SARI wins. A sketch of this selection step is shown below.
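For illustration, a hedged sketch of the selection step, assuming easse's corpus_sari interface and one hypothesis file per control token combination (the file names and the candidates dict are hypothetical placeholders):

```python
from easse.sari import corpus_sari

def read_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f]

orig = read_lines("dev.orig")            # original dev sentences
refs = [read_lines("dev.ref")]           # reference simplifications (one list per reference set)

# Map each tried (NbChars, LevSim, WordRank) combination to the hypotheses
# generated for the dev set conditioned on it.
candidates = {
    (0.95, 0.4, 1.6): "hyp_nbchars0.95_levsim0.4_wordrank1.6.txt",
    (0.80, 0.6, 1.0): "hyp_nbchars0.80_levsim0.6_wordrank1.0.txt",
    # ... one entry per combination produced by generate_devs_with_control_tokens.py
}

best_params, best_sari = None, -1.0
for params, hyp_file in candidates.items():
    sys_sents = read_lines(hyp_file)
    score = corpus_sari(orig_sents=orig, sys_sents=sys_sents, refs_sents=refs)
    if score > best_sari:
        best_params, best_sari = params, score

print("best (NbChars, LevSim, WordRank):", best_params, "SARI:", best_sari)
```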
I used the following params:
NbChars = 0.95
LevSim = 0.4
WordRank = 1.6
To be honest, the model hallucinates a lot with this setup, but SARI prefers it to any other (saner, in my opinion) combination of tokens...
Better control token values could have been selected with some human evaluation if it weren't so expensive, of course.
- Preprocess the test dataset as in Data Preprocessing step 3 and run:
python generate_test_with_control_tokens.py
using your optimal control token values.