Code for the paper: Improving Multi-Document Summarization through Referenced Flexible Extraction with Credit-Awareness


Authors: @Yun-Zhu Song, @Yi-Syuan Chen, Hong-Han Shuai

Referenced Environment Setup

pip install -r requirements.txt

Dataset Preparation

Option 1.


  1. We provide an example for processing multi-news end-to-end in src/scripts/
  2. The names of datasets can be found in src/data/

Steps: (1) download the dataset; (2) get the pseudo extraction oracle and ROUGE scores for each document sentence; (3) generate summaries from the fine-tuned abstractor and merge the generated summaries into the dataset.

(1) Download the dataset

cd src

(2) Get the pseudo extraction oracle (using multi_news as an example)

cd src

(3) Generate summaries from the fine-tuned abstractor and merge the generated results into the dataset (using multi_news as an example)

cd src
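Step (2) produces a pseudo extraction oracle and a per-sentence ROUGE score. The sketch below shows one common way to build such an oracle, greedy selection, and is not taken from the repository. It assumes a unigram-overlap F1 as a simplified stand-in for ROUGE-1, and the function names `unigram_f1`, `greedy_oracle`, and `sentence_scores` are hypothetical.

```python
from collections import Counter


def unigram_f1(candidate, reference):
    """Unigram-overlap F1, a simplified stand-in for ROUGE-1 (assumption)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def sentence_scores(sentences, summary):
    """Score each document sentence against the reference summary."""
    return [unigram_f1(s, summary) for s in sentences]


def greedy_oracle(sentences, summary, max_sents=3):
    """Greedily add the sentence that most improves the score.

    Stops when no remaining sentence improves the score or when
    max_sents sentences have been selected.
    """
    selected, best = [], 0.0
    while len(selected) < max_sents:
        candidates = []
        for i in range(len(sentences)):
            if i in selected:
                continue
            # Keep document order when joining the tentative selection.
            text = " ".join(sentences[j] for j in sorted(selected + [i]))
            candidates.append((unigram_f1(text, summary), i))
        if not candidates:
            break
        score, idx = max(candidates)
        if score <= best:
            break
        selected.append(idx)
        best = score
    return sorted(selected), best
```

The selected indices would play the role of the oracle extraction, and `sentence_scores` the per-sentence scores; the actual scripts in src/scripts/ may use a different ROUGE variant or stopping rule.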

Option 2. Download Our Processed Dataset

Please place the dataset at datasets/ext_oracle/ according to the following code structure, or change the dataset directory path written in src/data/

Multi-News, XScience, WikiCatSum

Code Structure

  | -> main function
  | -> for defining different operation process
      |_finetune_abs_base_O.json -> configuration of finetuning abstractor (base, oracle input) for supporting extractor RL training
      |_finetune_abs_large_O.json -> configuration of finetuning abstractor (large, oracle input) for test time inference
      |_finetune_abs_large_A.json -> configuration of finetuning abstractor (large, article input) for providing summary reference
      |_train_ext_mle.json -> configuration of training extractor with MLE (extractor pretraining)
      |_train_ext_rl.json -> configuration of training extractor with RL (extractor training)
      |_pred.json -> configuration of obtaining the extraction prediction
      |_eval.json -> configuration of evaluating the extraction results
    | -> recording the scripts for the training and evaluation steps
    | -> an example for constructing the multi_news end-to-end

  |_ext_oracle\ -> put the processed datasets in this directory
outputs\ -> directory for saving experiments
      |_bart-base-O\ -> for supporting extractor RL training
      |_bart-large-O\ -> for test time inference
      |_bart-large-A\ -> for generating summary reference
      |_SR_POR\ -> pretrained extractor
      |_final\ -> final model

Trained Model

| Dataset | Finetuned Abstractor | Pretrained (REFLECT-MLE) | Final (REFLECT) |
| --- | --- | --- | --- |
| Multi-News | Bart-base-Oracle, Bart-large-Oracle, Bart-large-Article | SR_POR | final |
| XScience | | | final |


1. Abstractor Training

There are 4 different configs for abstractor.

| Model | Size | Input Type |
| --- | --- | --- |
| BART | Base | Oracle |
| BART | Base | Article |
| BART | Large | Oracle |
| BART | Large | Article |

How to change to different configs

| dataset_name | Oracle Text Column | Article Text Column |
| --- | --- | --- |
| multi_news_bl_own | summary_ext | document |
| xscience_bl_own | summary_ext | document |

cd src
python ./scripts/args/finetune_abs_base_O.json
python ./scripts/args/finetune_abs_large_O.json
python ./scripts/args/finetune_abs_large_A.json
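Switching between configs amounts to changing the dataset name and the input text column in a JSON argument file. A minimal sketch of doing this programmatically follows. The key `"text_column"` and the helper `switch_config` are assumptions for illustration; check the actual key names in the files under scripts/args/ before relying on them.

```python
import json
from pathlib import Path

# Column names taken from the table above; they are the same for both datasets.
TEXT_COLUMNS = {"oracle": "summary_ext", "article": "document"}


def switch_config(src, dst, dataset_name, input_type):
    """Copy a JSON config, overriding the dataset and the input text column."""
    config = json.loads(Path(src).read_text())
    config["dataset_name"] = dataset_name
    # "text_column" is an assumed key name, not confirmed by the repository.
    config["text_column"] = TEXT_COLUMNS[input_type]
    Path(dst).write_text(json.dumps(config, indent=2))
    return config
```

For example, `switch_config("finetune_abs_base_O.json", "finetune_abs_base_A.json", "xscience_bl_own", "article")` would write a copy of the base config that reads the `document` column of XScience.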

2. Extractor Pretraining

cd src
python ./scripts/args/train_ext_mle.json

3. Extractor Training

cd src
python ./scripts/args/train_ext_rl.json

4. Model Evaluation

cd src
python ./scripts/args/pred.json
python ./scripts/args/eval.json

Argument Description

Arguments for switching between abstractor training and extractor training

"task_type": "seq2seq" for abstractor. "two_stage_extraction" for extractor.
"training_type": "mle" for abstractor finetuning. "ext_mle" for extractor pretraining. "ext_rl" for extractor training.
"data_preprocess": "doc_trun" for abstractor. "doc_trun_and_build_sent_index" for extractor.
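The three arguments above have to agree with each other for a given training stage. The sketch below groups the documented values per stage and checks a loaded config against them. The stage labels and the helper `check_stage` are hypothetical names used only for illustration.

```python
# Values copied from the argument list above, grouped per training stage.
STAGE_ARGS = {
    "abstractor_finetuning": {
        "task_type": "seq2seq",
        "training_type": "mle",
        "data_preprocess": "doc_trun",
    },
    "extractor_pretraining": {
        "task_type": "two_stage_extraction",
        "training_type": "ext_mle",
        "data_preprocess": "doc_trun_and_build_sent_index",
    },
    "extractor_training": {
        "task_type": "two_stage_extraction",
        "training_type": "ext_rl",
        "data_preprocess": "doc_trun_and_build_sent_index",
    },
}


def check_stage(config):
    """Return the stage a config matches, or raise if the values are inconsistent."""
    for stage, expected in STAGE_ARGS.items():
        if all(config.get(key) == value for key, value in expected.items()):
            return stage
    raise ValueError("task_type, training_type and data_preprocess do not match any stage")
```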

Arguments for extractor only

"summary_ext_column": "summary_ext"

Arguments for training our extractor:

"ext_model_name_or_path": Specify the model name or path to give the extractor config. default: deepset/roberta-base-squad2.
"different_base_model_for_two_stage": Specify true when the extractor config and abstractor config are different. default: true.
"load_trained_abstractor_from": Specify the model path for finetuned abstractor.
"load_trained_extractor_from": Specify the model path for pretrained extractor.
"train_only": Specify the name of the module to train. default: "extractor".

Arguments for model configuration:

"score_cls_weighting": Whether to adopt Pseudo Oracle Relaxation (POR), true or false.
"reference_extraction": Whether to adopt Summary Referencing (SR), true or false. If true, "reference_column" must be set to the column of the pregenerated summary.
"reference_column": The column of the pregenerated summary in the dataset. Only active when "reference_extraction" is true. default: "summary_gen".
"num_hierarchical_layer": Number of hierarchical layers in the extractor; 0 means a flat structure. Also used to control how the pretrained model is loaded. default: 3.
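Because "reference_column" only matters when "reference_extraction" is true, it can help to validate the pair before training. A minimal sketch, assuming a missing "reference_extraction" key means false (the README does not state its default); `resolve_reference_column` is a hypothetical helper.

```python
def resolve_reference_column(config, dataset_columns):
    """Return the reference column to use, or None when SR is disabled."""
    # Assumption: a missing "reference_extraction" key is treated as false.
    if not config.get("reference_extraction", False):
        return None
    # Documented default for the pregenerated summary column.
    column = config.get("reference_column", "summary_gen")
    if column not in dataset_columns:
        raise KeyError(f"reference column {column!r} not found in the dataset")
    return column
```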

Arguments for reinforcement learning:

"use_mixer_loss": Whether to include the MLE loss. default: true.
"mixer_weight": The weight for mixing the MLE and RL losses. default: 0.1.
"update_full_action": Whether to update the full action, or only the outputs where the sampled action differs from the greedy action. false for CASC, true for SC.
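The interaction of these three arguments can be illustrated with a small sketch in plain Python. It is not the repository's implementation: the advantage form (sampled reward minus greedy reward) and the mixing convention `mixer_weight * mle + (1 - mixer_weight) * rl` are assumptions, and both function names are hypothetical.

```python
def self_critical_loss(log_probs, sampled, greedy, reward_sampled, reward_greedy,
                       update_full_action=False):
    """Policy-gradient loss for one sampled action sequence.

    log_probs[i] is the log-probability of the sampled action at step i.
    With update_full_action=False, only steps where the sampled action
    differs from the greedy action contribute to the loss.
    """
    # Assumed advantage: reward of the sample relative to the greedy baseline.
    advantage = reward_sampled - reward_greedy
    if update_full_action:
        mask = [1.0] * len(sampled)
    else:
        mask = [1.0 if s != g else 0.0 for s, g in zip(sampled, greedy)]
    return -advantage * sum(m * lp for m, lp in zip(mask, log_probs))


def mixed_loss(mle_loss, rl_loss, use_mixer_loss=True, mixer_weight=0.1):
    """Combine the MLE and RL losses (the mixing convention is an assumption)."""
    if not use_mixer_loss:
        return rl_loss
    return mixer_weight * mle_loss + (1.0 - mixer_weight) * rl_loss
```

In a real training loop the log-probabilities would be tensors so that gradients flow through them; plain floats are used here only to keep the arithmetic easy to follow.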


    title = "Improving Multi-Document Summarization through Referenced Flexible Extraction with Credit-Awareness",
    author = "Song, Yun-Zhu  and
      Chen, Yi-Syuan  and
      Shuai, Hong-Han",
    booktitle = "Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jul,
    year = "2022",
    address = "Seattle, United States",
    publisher = "Association for Computational Linguistics",
    url = "",
    doi = "10.18653/v1/2022.naacl-main.120",
    pages = "1667--1681",

