Comparative analysis of summarization models
SummComparer is a project for compiling, scrutinizing, and analyzing a Summarization Gauntlet, with the goal of understanding, and ultimately improving, what makes a summarization model do well in practical, everyday use cases.
The latest version of the dataset can also be found on Hugging Face here and can be loaded with the `datasets` library.
SummComparer's main aim is to test how well various summarization models work on long documents from a wide range of topics, none of which are part of standard training data[^1]. This "gauntlet" of topics helps us see how well the models can summarize both familiar and unfamiliar content. By doing this, we can understand how these models might perform in real-world situations where the content is unpredictable[^2]. This also helps us identify their limitations and, ideally, understand what makes them work well.
Put another way, SummComparer can be thought of as a case study for the following scenario:
- You have a collection of documents that you need to summarize/understand for <reason>.
- You don't know what domain(s) these documents belong to because you haven't read them, and you don't have the time or inclination to read them fully.
- You're hoping to get a general understanding of these documents from summaries, and then plan to decide which ones to do more in-depth reading on.
- You're not sure what the ideal summaries of these documents are because if you knew that, you wouldn't need to summarize them with a language model.
- So: Which model(s) should you use? How can you determine if the outputs are faithful without reading the source documents? How can you determine whether the model is performing well or not?
The idea for this project was born out of necessity: to test whether a summarization model was "good" or not, I would run it on a consistent set of documents and compare the generated summaries with the outputs of other models and my growing understanding of the documents themselves.
If a <new summarization model or technique> that claims to be amazing is unable to summarize the navy seals copypasta, OCR'd PowerPoint slides, or a short story, then it's probably not going to be very useful in the real world.
From `pandas-profiling` (dataset profile report):
To install the necessary packages, run the following command:

```bash
pip install -r requirements.txt
```
To install the package requirements for using the scripts in `bin/`, run the following from the repository root:

```bash
pip install -r bin/requirements.txt
```
As the dataset is already compiled, you can skip to the Working with the Dataset section for most use cases.
The current version supports command-line interface (CLI) usage. The recommended sequence of operations is as follows:

1. `export_gauntlet.py`
2. `map_gauntlet_files.py`
3. `build_src_df.py`

All CLI scripts use the `fire` package for CLI generation. For more information on how to use a script, run:

```bash
python <script_name>.py --help
```
Note: The current version of the dataset is in a "raw" format. It has not been cleaned or pruned of unnecessary columns. This will be addressed in a future release.
The dataset files are located in `as-dataset/` and are saved as `.parquet` files. The dataset comprises two files, which can be thought of as two tables in a relational database:

- `as-dataset/gauntlet_input_documents.parquet`: contains the input documents for the gauntlet, along with metadata/ID fields as defined in `gauntlet_master_data.json`.
- `as-dataset/summary_gauntlet_dataset_mapped_src_docs.parquet`: contains the output summaries for the gauntlet, with hyperparameters/models as columns. Each summary (row) is mapped back to its source document by the columns prefixed with `source_doc`.
You can load the data using `pandas`:

```python
import pandas as pd

df = pd.read_parquet("as-dataset/summary_gauntlet_dataset_mapped_src_docs.parquet")
df.info()
```
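To see how summaries are tied back to their source documents, you can inspect the `source_doc`-prefixed mapping columns (a short sketch; the exact set of columns may vary between dataset versions):

```python
# list the mapping columns shared with the input-documents table
src_cols = [c for c in df.columns if c.startswith("source_doc")]
print(src_cols)

# count how many summaries were generated per source document
print(df["source_doc_id"].value_counts())
```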
The `gauntlet_input_documents.parquet` file is required only if you need to examine the source documents themselves or perform any analysis using their text; most of the necessary information is available in the `summary_gauntlet_dataset_mapped_src_docs.parquet` file.

The `gauntlet_input_documents.parquet` file contains the following columns:
```python
>>> import pandas as pd
>>> df = pd.read_parquet("as-dataset/gauntlet_input_documents.parquet").convert_dtypes()
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19 entries, 0 to 18
Data columns (total 4 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   source_doc_filename  19 non-null     string
 1   source_doc_id        19 non-null     string
 2   source_doc_domain    19 non-null     string
 3   document_text        19 non-null     string
dtypes: string(4)
memory usage: 736.0 bytes
```
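If you want to read one of the source documents directly, filter on `source_doc_filename` (a minimal sketch; the filename below is hypothetical):

```python
# fetch the full text of a single source document (hypothetical filename)
row = df.loc[df["source_doc_filename"] == "navy_seals_copypasta.txt"]
print(row["document_text"].iloc[0][:500])  # preview the first 500 characters
```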
The `source_doc_id` column, present in both files, can be used to join them together. A script that does this for you can be found in `bin/`:

```bash
python bin/create_merged_df.py
```
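The script handles the details, but the core of the operation is roughly the following (a sketch only; `create_merged_df.py` may differ in its exact arguments and output path):

```python
import pandas as pd

# load both tables
summaries = pd.read_parquet("as-dataset/summary_gauntlet_dataset_mapped_src_docs.parquet")
inputs = pd.read_parquet("as-dataset/gauntlet_input_documents.parquet")

# attach each summary's source document via the shared key
merged = summaries.merge(inputs, on="source_doc_id", how="left", suffixes=("", "_input"))
merged.to_parquet("as-dataset/gauntlet_merged.parquet")  # hypothetical output path
```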
There are numerous Exploratory Data Analysis (EDA) tools available. For initial exploration and testing, `dtale` is recommended due to its flexibility and user-friendly interface. Install it with:

```bash
pip install dtale
```

You can then launch a UI instance from the command line with:

```bash
dtale --parquet-path as-dataset/summary_gauntlet_dataset_mapped_src_docs.parquet
```
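Alternatively, `dtale` can be launched from a Python session (a minimal sketch using the same summaries file):

```python
import dtale
import pandas as pd

# load the summaries table and open it in the dtale grid
df = pd.read_parquet("as-dataset/summary_gauntlet_dataset_mapped_src_docs.parquet")
d = dtale.show(df)
d.open_browser()  # opens the interactive UI in your default browser
```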
Please note that this project is a work in progress. Future updates will include data cleaning, removal of unnecessary columns, and additional features to enhance the usability and functionality of the project.
Footnotes

[^1]: As it turns out, the practical application of summarization models is not the ritual of summarizing documents you already know the summary of and benchmarking their ability to regurgitate these back to you via ROUGE scores as a testament to their performance. Who knew?

[^2]: i.e., you are not trying to hit a high score on the test set of arXiv summarization as a measure of a "good model", but rather to actually read and use the summaries in real life.