Code for the paper: Improving Multi-Document Summarization through Referenced Flexible Extraction with Credit-Awareness
Authors: Yun-Zhu Song, Yi-Syuan Chen, Hong-Han Shuai
pip install -r requirements.txt
NOTE:
- We provide an example for processing multi-news end-to-end in src/scripts/construct_dataset_end2end.sh
- The names of datasets can be found in src/data/build_datasets.py.
Steps: (1) download the dataset; (2) get the pseudo extraction oracle and ROUGE scores for each document sentence; (3) generate summaries from the fine-tuned abstractor and merge them into the dataset.
(1) download dataset
cd src
./scripts/step1_download_dataset.sh
(2) get the pseudo extraction oracle (taking multi_news as an example)
cd src
./scripts/step2_build_POR_label.sh
(3) generate summaries from the finetuned abstractor and merge the generated results into the dataset (taking multi_news as an example)
cd src
./scripts/step3_generate_SR_to_dataset.sh
Please place the datasets under datasets/ext_oracle/ according to the following code structure, or change the dataset directory path written in src/data/build_datasets.py.
Multi-News, Xscience, WikiCatSum
src\
|_main.py -> main function
|_process.py -> for defining different operation process
|_scripts
|_args\
|_finetune_abs_base_O.json -> configuration of finetuning abstractor (base, oracle input) for supporting extractor RL training
|_finetune_abs_large_O.json -> configuration of finetuning abstractor (large, oracle input) for test time inference
|_finetune_abs_large_A.json -> configuration of finetuning abstractor (large, article input) for providing summary reference
|_train_ext_mle.json -> configuration of training extractor with MLE (extractor pretraining)
|_train_ext_rl.json -> configuration of training extractor with RL (extractor training)
|_pred.json -> configuration of obtaining the extraction prediction
|_eval.json -> configuration of evaluating the extraction results
|_run.sh -> recording the scripts for the training and evaluation steps
|_construct_dataset_end2end.sh -> an example for constructing the multi_news end-to-end
datasets\
|_origin\
|_multi_news\
|_xscience\
|_wikicatsum\
|_ext_oracle\ -> put the processed datasets in this directory
|_multi_news\
|_xscience\
|_wikicatsum\
|_animal\
|_company\
|_film\
outputs\ -> directory for saving experiments
|_multi_news\
|_finetuned_abs\
|_bart-base-O\ -> for supporting extractor RL training
|_bart-large-O\ -> for test time inference
|_bart-large-A\ -> for generating summary reference
|_extractor_mle\
|_SR_POR\ -> pretrained extractor
|_extractor_rl\
|_final\ -> final model
Dataset | Finetuned Abstractor | Pretrained (REFLECT-MLE) | Final (REFLECT) |
---|---|---|---|
Multi-News | Bart-base-Oracle, Bart-large-Oracle, Bart-large-Article | SR_POR | final |
XScience | | | final |
There are 4 different configs for the abstractor.
Model Size | Input Type |
---|---|
BART Base | Oracle |
BART Base | Article |
BART Large | Oracle |
BART Large | Article |
How to switch between the different configs:
dataset_name | Oracle Text Column | Article Text Column |
---|---|---|
multi_news_bl_own | summary_ext | document |
xscience_bl_own | summary_ext | document |
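For example, to finetune an abstractor on Multi-News with oracle input, the config takes the dataset name from the table above and points the input text column at summary_ext. This is a sketch of only these two fields; the key name for the input column, written here as "text_column", is an assumption, so check the provided finetune_abs_*.json files for the exact schema.
{
  "dataset_name": "multi_news_bl_own",
  "text_column": "summary_ext"
}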
cd src
python main.py ./scripts/args/finetune_abs_base_O.json
python main.py ./scripts/args/finetune_abs_large_O.json
python main.py ./scripts/args/finetune_abs_large_A.json
cd src
python main.py ./scripts/args/train_ext_mle.json
cd src
python main.py ./scripts/args/train_ext_rl.json
cd src
python main.py ./scripts/args/pred.json
python main.py ./scripts/args/eval.json
Arguments for switching between abstractor training and extractor training:
"task_type": "seq2seq" for abstractor. "two_stage_extraction" for extractor.
"training_type": "mle" for abstractor finetuning. "ext_mle" for extractor pretraining. "ext_rl" for extractor training.
"data_preprocess": "doc_trun" for abstractor. "doc_trun_and_build_sent_index" for extractor.
Arguments for extractor only
"summary_ext_column": "summary_ext"
Arguments for training our extractor:
"ext_model_name_or_path": Specify the model name or path to give the extractor config. default: deepset/roberta-base-squad2.
"different_base_model_for_two_stage": Specify true when the extractor config and abstractor config are different. default: true.
"load_trained_abstractor_from": Specify the model path for finetuned abstractor.
"load_trained_extractor_from": Specify the model path for pretrained extractor.
"train_only": Specify module name for training the module. default:"extractor"
Arguments for model configuration:
"score_cls_weighting": whether to adopt Peudo Oracle Relaxation (POR), true or false.
"reference_extraction": wether to adopt Summary Referencing (SR), true or false. If true, need to assign the "reference_column" to the column of pregenearted summary.
"reference_column": Assign the column of pregenerated summary in dataset. Only activate when \"reference_extraction\" is true. default: "summary_gen".
""num_hierarchical_layer"": Number of hierarchical layers in extractor, 0 means flat structure for controlling loading pretrained model. Used in main.py. default:3.
Arguments for reinforcement learning:
"use_mixer_loss": Whether to consider the MLE loss. dedault: true.
"mixer_weight": The weight for mixing the MLE and RL loss. default: 0.1.
"update_full_action": Wether to update the full action or only update the output with the sampled action that are different from the greedy action. false for CASC, true for SC.
Citation
@inproceedings{song-etal-2022-improving,
title = "Improving Multi-Document Summarization through Referenced Flexible Extraction with Credit-Awareness",
author = "Song, Yun-Zhu and
Chen, Yi-Syuan and
Shuai, Hong-Han",
booktitle = "Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
month = jul,
year = "2022",
address = "Seattle, United States",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.naacl-main.120",
doi = "10.18653/v1/2022.naacl-main.120",
pages = "1667--1681",
}