Code for the paper: Improving Multi-Document Summarization through Referenced Flexible Extraction with Credit-Awareness
Authors: Yun-Zhu Song, Yi-Syuan Chen, Hong-Han Shuai
pip install -r requirements.txt
NOTE:
- We provide an example for processing multi-news end-to-end in src/scripts/construct_dataset_end2end.sh
- The names of datasets can be found in src/data/build_datasets.py.
Steps: (1) download the dataset; (2) get the pseudo extraction oracle and ROUGE scores for each document sentence; (3) generate summaries from the fine-tuned abstractor and merge them into the dataset.
(1) download dataset
cd src
./scripts/step1_download_dataset.sh
(2) get the pseudo extraction oracle (taking multi_news as an example)
cd src
./scripts/step2_build_POR_label.sh
(3) generate summaries from the finetuned abstractor and merge the generated results into the dataset (taking multi_news as an example)
cd src
./scripts/step3_generate_SR_to_dataset.sh
Please place the datasets under datasets/ext_oracle/ according to the following code structure, or change the dataset directory path written in src/data/build_datasets.py.
Multi-News, Xscience, WikiCatSum
src\
|_main.py -> main function
|_process.py -> for defining different operation process
|_scripts
|_args\
|_finetune_abs_base_O.json -> configuration of finetuning abstractor (base, oracle input) for supporting extractor RL training
|_finetune_abs_large_O.json -> configuration of finetuning abstractor (large, oracle input) for test time inference
|_finetune_abs_large_A.json -> configuration of finetuning abstractor (large, article input) for providing summary reference
|_train_ext_mle.json -> configuration of training extractor with MLE (extractor pretraining)
|_train_ext_rl.json -> configuration of training extractor with RL (extractor training)
|_pred.json -> configuration of obtaining the extraction prediction
|_eval.json -> configuration of evaluating the extraction results
|_run.sh -> recording the scripts for the training and evaluation steps
|_construct_dataset_end2end.sh -> an example for constructing the multi_news end-to-end
datasets\
|_origin\
|_multi_news\
|_xscience\
|_wikicatsum\
|_ext_oracle\ -> put the processed datasets in this directory
|_multi_news\
|_xscience\
|_wikicatsum\
|_animal\
|_company\
|_film\
outputs\ -> directory for saving experiments
|_multi_news\
|_finetuned_abs\
|_bart-base-O\ -> for supporting extractor RL training
|_bart-large-O\ -> for test time inference
|_bart-large-A\ -> for generating summary reference
|_extractor_mle\
|_SR_POR\ -> pretrained extractor
|_extractor_rl\
|_final\ -> final model
Dataset | Finetuned Abstractor | Pretrained (REFLECT-MLE) | Final (REFLECT) |
---|---|---|---|
Multi-News | Bart-base-Oracle, Bart-large-Oracle, Bart-large-Article | SR_POR | final |
XScience | | | final |
There are 4 different configs for the abstractor.
Model Size | Input Type |
---|---|
BART Base | Oracle |
BART Base | Article |
BART Large | Oracle |
BART Large | Article |
How to switch between the different configs:
dataset_name | Oracle Text Column | Article Text Column |
---|---|---|
multi_news_bl_own | summary_ext | document |
xscience_bl_own | summary_ext | document |
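For example, to finetune an abstractor on Multi-News with oracle input, the config takes the dataset name from the table above and points the input text column at summary_ext. This is a sketch of only these two fields; the key name for the input column, written here as "text_column", is an assumption, so check the provided finetune_abs_*.json files for the exact schema.
{
  "dataset_name": "multi_news_bl_own",
  "text_column": "summary_ext"
}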
cd src
python main.py ./scripts/args/finetune_abs_base_O.json
python main.py ./scripts/args/finetune_abs_large_O.json
python main.py ./scripts/args/finetune_abs_large_A.json
cd src
python main.py ./scripts/args/train_ext_mle.json
cd src
python main.py ./scripts/args/train_ext_rl.json
cd src
python main.py ./scripts/args/pred.json
python main.py ./scripts/args/eval.json
Arguments for switching between abstractor training and extractor training:
"task_type": "seq2seq" for abstractor. "two_stage_extraction" for extractor.
"training_type": "mle" for abstractor finetuning. "ext_mle" for extractor pretraining. "ext_rl" for extractor training.
"data_preprocess": "doc_trun" for abstractor. "doc_trun_and_build_sent_index" for extractor.
Arguments for extractor only
"summary_ext_column": "summary_ext"
Arguments for training our extractor:
"ext_model_name_or_path": Specify the model name or path to give the extractor config. default: deepset/roberta-base-squad2.
"different_base_model_for_two_stage": Specify true when the extractor config and abstractor config are different. default: true.
"load_trained_abstractor_from": Specify the model path for finetuned abstractor.
"load_trained_extractor_from": Specify the model path for pretrained extractor.
"train_only": Specify module name for training the module. default:"extractor"
Arguments for model configuration:
"score_cls_weighting": whether to adopt Peudo Oracle Relaxation (POR), true or false.
"reference_extraction": wether to adopt Summary Referencing (SR), true or false. If true, need to assign the "reference_column" to the column of pregenearted summary.
"reference_column": Assign the column of pregenerated summary in dataset. Only activate when \"reference_extraction\" is true. default: "summary_gen".
""num_hierarchical_layer"": Number of hierarchical layers in extractor, 0 means flat structure for controlling loading pretrained model. Used in main.py. default:3.
Arguments for reinforcement learning:
"use_mixer_loss": Whether to consider the MLE loss. dedault: true.
"mixer_weight": The weight for mixing the MLE and RL loss. default: 0.1.
"update_full_action": Wether to update the full action or only update the output with the sampled action that are different from the greedy action. false for CASC, true for SC.
Citation
@inproceedings{song-etal-2022-improving,
title = "Improving Multi-Document Summarization through Referenced Flexible Extraction with Credit-Awareness",
author = "Song, Yun-Zhu and
Chen, Yi-Syuan and
Shuai, Hong-Han",
booktitle = "Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
month = jul,
year = "2022",
address = "Seattle, United States",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.naacl-main.120",
doi = "10.18653/v1/2022.naacl-main.120",
pages = "1667--1681",
}