SummComparer

Comparative analysis of summarization models

⚠️ This project is currently under active development and will continue to evolve over time. ⚠️

SummComparer is an initiative aimed at compiling, scrutinizing, and analyzing a Summarization Gauntlet with the goal of understanding/improving what makes a summarization model do well in practical everyday use cases.

The latest version of the dataset is also available on the Hugging Face Hub and can be loaded with the datasets library.



About

SummComparer's main aim is to test how well various summarization models work on long documents from a wide range of topics, none of which are part of standard training data [1]. This "gauntlet" of topics helps us see how well the models can summarize both familiar and unfamiliar content. By doing this, we can understand how these models might perform in real-world situations where the content is unpredictable [2]. This also helps us identify their limitations and ideally, understand what makes them work well.

A Case Study

Put another way, SummComparer can be thought of as a case study for the following scenario:

  • You have a collection of documents that you need to summarize/understand for <reason>
  • You don't know what domain(s) these documents belong to because you haven't read them, and you don't have the time or inclination to read them fully.
    • You're hoping to get a general understanding of these documents from summaries, and then plan to decide which ones to do more in-depth reading on.
  • You're not sure what the ideal summaries of these documents are because if you knew that, you wouldn't need to summarize them with a language model.
  • So: Which model(s) should you use? How can you determine if the outputs are faithful without reading the source documents? How can you determine whether the model is performing well or not?

The idea for this project was born out of necessity: to test whether a summarization model was "good" or not, I would run it on a consistent set of documents and compare the generated summaries with the outputs of other models and my growing understanding of the documents themselves.

If <new summarization model or technique> claiming to be amazing is unable to summarize the navy seals copypasta, OCR'd powerpoint slides, or a short story, then it's probably not going to be very useful in the real world.

EDA links

From pandas-profiling:

Installation

To install the necessary packages, run the following command:

pip install -r requirements.txt

To install the additional requirements for the scripts in bin/, run the following from the repository root:

pip install -r bin/requirements.txt

Usage

As the dataset is already compiled, you can skip to the Working with the Dataset section for most use cases.

Compiling the Gauntlet

The current version supports Command Line Interface (CLI) usage. The recommended sequence of operations is as follows:

  1. export_gauntlet.py
  2. map_gauntlet_files.py
  3. build_src_df.py

All CLI scripts use the fire package to generate their command-line interfaces. To see the options a script accepts, run:

python <script_name>.py --help

Working with the Dataset

Note: The current version of the dataset is in a "raw" format. It has not been cleaned or pruned of unnecessary columns. This will be addressed in a future release.

The dataset files are located in as-dataset/ and are saved as .parquet files. The dataset comprises two files, which can be conceptualized as two tables in a relational database:

  • as-dataset/gauntlet_input_documents.parquet: This file contains the input documents for the gauntlet along with metadata/id fields as defined in gauntlet_master_data.json.
  • as-dataset/summary_gauntlet_dataset_mapped_src_docs.parquet: This file contains the output summaries for the gauntlet, with the models and hyperparameters that produced them stored as columns. Each row is one summary, and the columns prefixed with source_doc map it back to its source document.

You can load the data using pandas:

import pandas as pd
df = pd.read_parquet('as-dataset/summary_gauntlet_dataset_mapped_src_docs.parquet')
df.info()
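Because the dataset is still in a "raw" format, it can help to separate the source-document columns from everything else before digging in. The following is a minimal sketch that relies only on the source_doc prefix described above. The helper name split_source_doc_columns and the toy columns model_name and summary are hypothetical and used purely for illustration; the real file may name its columns differently.

```python
import pandas as pd


def split_source_doc_columns(df: pd.DataFrame, prefix: str = "source_doc"):
    """Return (source-document columns, remaining columns) based on a name prefix."""
    source_cols = [c for c in df.columns if str(c).startswith(prefix)]
    other_cols = [c for c in df.columns if c not in source_cols]
    return source_cols, other_cols


# Toy frame standing in for the real parquet file. "model_name" and "summary"
# are hypothetical column names, not taken from the actual dataset.
toy = pd.DataFrame(
    {
        "source_doc_id": ["doc-a", "doc-a", "doc-b"],
        "source_doc_filename": ["a.txt", "a.txt", "b.txt"],
        "model_name": ["model-1", "model-2", "model-1"],
        "summary": ["...", "...", "..."],
    }
)

source_cols, other_cols = split_source_doc_columns(toy)
print("source document columns:", source_cols)
print("other columns:", other_cols)
```

On the real data, the same helper could be called on the df loaded with pd.read_parquet above to get a quick view of which columns describe the source document and which describe the summary run.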

Input Documents

The gauntlet_input_documents.parquet file is required only if you need to examine the source documents themselves or perform any analysis using their text. Most of the necessary information is available in the summary_gauntlet_dataset_mapped_src_docs.parquet file.

The gauntlet_input_documents.parquet file contains the following columns:

>>> import pandas as pd
>>> df = pd.read_parquet("as-dataset/gauntlet_input_documents.parquet").convert_dtypes()
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19 entries, 0 to 18
Data columns (total 4 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   source_doc_filename  19 non-null     string
 1   source_doc_id        19 non-null     string
 2   source_doc_domain    19 non-null     string
 3   document_text        19 non-null     string
dtypes: string(4)
memory usage: 736.0 bytes

The source_doc_id column, present in both files, can be used to join them together. A script that does this for you can be found in bin/:

python bin/create_merged_df.py
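If you would rather do the join yourself, a left merge on source_doc_id is one way to go about it. The sketch below is an assumption about how such a join could look, not a description of what bin/create_merged_df.py actually does. The function name merge_summaries_with_sources and the toy summary column are hypothetical; the toy frames only mimic the shape of the two parquet files.

```python
import pandas as pd


def merge_summaries_with_sources(
    summaries: pd.DataFrame,
    documents: pd.DataFrame,
    key: str = "source_doc_id",
) -> pd.DataFrame:
    """Attach source-document columns to each summary row via a left join."""
    # Only bring over document columns the summary table does not already have,
    # which avoids duplicated columns with _x/_y suffixes.
    extra_cols = [c for c in documents.columns if c not in summaries.columns]
    return summaries.merge(
        documents[[key, *extra_cols]],
        on=key,
        how="left",
        validate="many_to_one",  # raises if a document id is duplicated
    )


# Toy stand-ins for the two parquet files ("summary" is a hypothetical column).
toy_summaries = pd.DataFrame(
    {
        "source_doc_id": ["doc-a", "doc-a", "doc-b"],
        "source_doc_filename": ["a.txt", "a.txt", "b.txt"],
        "summary": ["first summary", "second summary", "third summary"],
    }
)
toy_documents = pd.DataFrame(
    {
        "source_doc_filename": ["a.txt", "b.txt"],
        "source_doc_id": ["doc-a", "doc-b"],
        "source_doc_domain": ["domain-1", "domain-2"],
        "document_text": ["text of a", "text of b"],
    }
)

merged = merge_summaries_with_sources(toy_summaries, toy_documents)
print(merged.columns.tolist())
```

To run this on the real data, load both files from as-dataset/ with pd.read_parquet and pass the resulting DataFrames to the function. A left join keeps every summary row even if its source document is missing from the input table.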

Exploring the Dataset

There are numerous Exploratory Data Analysis (EDA) tools available. For initial exploration and testing, dtale is recommended due to its flexibility and user-friendly interface. Install it with:

pip install dtale

You can then launch a UI instance from the command line with:

dtale --parquet-path as-dataset/summary_gauntlet_dataset_mapped_src_docs.parquet

Please note that this project is a work in progress. Future updates will include data cleaning, removal of unnecessary columns, and additional features to enhance the usability and functionality of the project.

Footnotes

  1. As it turns out, the practical application of summarization models is not the ritual of summarizing documents you already know the summary of and benchmarking their ability to regurgitate these back to you via ROUGE scores as a testament of their performance. Who knew?

  2. i.e. you are not trying to hit a high score on the test set of arXiv summarization as a measure of a "good model", but rather actually read and use the summaries in real life.