Source code for paper "Document-level Text Simplification with Coherence Evaluation" at the TSAR 2023 workshop.

lmvasque/ts-coherence

Document-level Text Simplification with Coherence Evaluation

Source code for the paper "Document-level Text Simplification with Coherence Evaluation", accepted at the Second Workshop on Text Simplification, Accessibility and Readability (TSAR 2023), held at the Recent Advances in Natural Language Processing conference (RANLP 2023).

By @lmvasquezr, @MattShardlow, Piotr Przybyła and @SAnaniadou. If you have any questions, please don't hesitate to contact us.

Setup

This code was tested with Python 3.7+. You can set up the repo as follows:

git clone https://github.com/lmvasque/ts-coherence.git
cd ts-coherence
pip install -r requirements.txt

Datasets

We have selected the D-Wikipedia and Cochrane datasets for training. For testing, we have used the OneStopCorpus. These datasets can be downloaded from:

Models

1. Fine-tuning of MUSS model

We fine-tuned the MUSS model using the script below. For the model setup, refer to the original MUSS model repo.

python scripts/train_model.py

We ran the original MUSS code with minor modifications to train_model.py that extend the length of the outputs; the latest MUSS version already includes this setting. We also modified training.py to add our training and test datasets.

2. Generation of system outputs

After fine-tuning the MUSS model, we generated our simplifications with code based on this script:

python scripts/simplify.py

Out of the box, this script only works with the original MUSS models, so additional changes are needed to load a new fine-tuned model.

3. Evaluation of simplifications

For the FKGL scores, we evaluated the model outputs using EASSE:

easse evaluate --orig_sents_path complex.txt --test_set custom -i simplified.txt --refs_sents_paths references.txt
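For readers unfamiliar with the metric, FKGL (Flesch-Kincaid Grade Level) combines average sentence length and average syllables per word: 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59. The sketch below only illustrates that formula. It is not the EASSE implementation: the sentence splitter, the tokenizer and the vowel-group syllable counter are simplifying assumptions, and the function names are hypothetical, so its scores will differ from the ones EASSE reports.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption, not EASSE's method): count vowel groups."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing "e" as silent (e.g. "make"), except in "-le" endings.
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    # Every word counts as at least one syllable.
    return max(count, 1)


def fkgl(text: str) -> float:
    """Flesch-Kincaid Grade Level of a text, using naive tokenization."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (
        0.39 * len(words) / len(sentences)
        + 11.8 * syllables / len(words)
        - 15.59
    )


if __name__ == "__main__":
    simple = "The cat sat. The dog ran."
    complex_text = "Comprehensive documentation facilitates understanding."
    # Lower scores indicate text that is easier to read.
    print(f"simple:  {fkgl(simple):.2f}")
    print(f"complex: {fkgl(complex_text):.2f}")
```

Short sentences made of short words score lower, which is why FKGL is commonly reported as a readability proxy for simplification outputs.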

For D-SARI, we used the evaluation scripts published on the original repository.

For the Coherence evaluation, we requested the code & dataset for the GCDC corpus here.

Using this code, we retrained the Parseq model on the Yahoo dataset as follows:

python main.py --model_name yahoo_class_model --train_corpus Yahoo --model_type par_seq --task class
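If you want to launch the retraining from a Python script (for example, to log the exact configuration of each run), the command above can be assembled as an argument list. This is only a convenience sketch: the flag names and values are copied from the command above, while `build_train_command`, `run_training` and the `code_dir` argument are hypothetical names introduced here, and `main.py` is assumed to sit in the directory where you placed the requested coherence code.

```python
import subprocess
import sys
from typing import List


def build_train_command(
    model_name: str = "yahoo_class_model",
    train_corpus: str = "Yahoo",
    model_type: str = "par_seq",
    task: str = "class",
) -> List[str]:
    """Assemble the retraining command as an argument list (no shell quoting needed)."""
    return [
        sys.executable, "main.py",
        "--model_name", model_name,
        "--train_corpus", train_corpus,
        "--model_type", model_type,
        "--task", task,
    ]


def run_training(code_dir: str) -> None:
    """Run the command inside `code_dir` (hypothetical: wherever main.py lives)."""
    # check=True raises CalledProcessError if training exits with an error.
    subprocess.run(build_train_command(), cwd=code_dir, check=True)


if __name__ == "__main__":
    # Print the command instead of running it, since main.py is not part of this repo.
    print(" ".join(build_train_command()))
```

Using `sys.executable` keeps the subprocess in the same Python environment as the calling script.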

4. About the source code

We include our source code as a reference for the solution we developed for our specific case. Most of it covers the evaluation and analysis of our results. Nevertheless, we believe the overall approach can be applied directly to other research by following the steps above. We are happy to answer any questions :)

Citation

If you use our results in your research, please cite our work: Document-level Text Simplification with Coherence Evaluation

@inproceedings{vasquez-rodriguez-etal-2023-document,
    title = "Document-level Text Simplification with Coherence Evaluation",
    author = "V{\'a}squez-Rodr{\'\i}guez, Laura  and
      Shardlow, Matthew  and
      Przyby{\l}a, Piotr  and
      Ananiadou, Sophia",
    booktitle = "Proceedings of the Second Workshop on Text Simplification, Accessibility, and Readability (TSAR-2023)",
    month = sep,
    year = "2023",
    address = "Varna, Bulgaria",
    url = "https://tsar-workshop.github.io/program/papers/vasquez-rodriguez-etal-2023-document.pdf"
}
