This project focuses on creating a smaller, faster student text summarization model by distilling knowledge from a larger pre-trained teacher model (PEGASUS). Key aspects include:
- Using the PEGASUS model as the teacher
- Creating a smaller student model with fewer parameters (see the layer-copying sketch after this list)
- Applying knowledge distillation to transfer the teacher's capabilities to the student
- Training student model on the CNN/DailyMail dataset
- Evaluating models using ROUGE metrics
- Comparing performance and speed of teacher vs student
- Student model is 1.4 GB in size and 2x faster at inference
- Student model achieves ROUGE scores comparable to the teacher:
  - ROUGE-1: 39.27
  - ROUGE-2: 18.65
  - ROUGE-L: 29.06
  - ROUGE-Lsum: 35.93
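The core idea behind building the student is to copy a subset of the teacher's layers and then fine-tune it ("shrink and fine-tune"). Below is a minimal sketch of that step; the teacher checkpoint name, the set of layers kept, and the output directory are illustrative assumptions, not the exact settings used in distillation.py.

```python
# Minimal sketch of building the student by keeping a subset of the teacher's
# decoder layers ("shrink and fine-tune"). Checkpoint name, layers kept, and the
# output directory are illustrative, not the exact settings of distillation.py.
import copy

import torch.nn as nn
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

teacher_name = "google/pegasus-cnn_dailymail"   # assumed teacher checkpoint
teacher = PegasusForConditionalGeneration.from_pretrained(teacher_name)
tokenizer = PegasusTokenizer.from_pretrained(teacher_name)

# The student starts as a full copy of the teacher...
student = copy.deepcopy(teacher)

# ...and then keeps only every other decoder layer, which is where most of the
# size and speed savings come from.
kept = list(range(0, teacher.config.decoder_layers, 2))
student.model.decoder.layers = nn.ModuleList(
    [student.model.decoder.layers[i] for i in kept]
)
student.config.decoder_layers = len(kept)

# Save the shrunken student as the starting point for distillation fine-tuning.
student.save_pretrained("student_init")
tokenizer.save_pretrained("student_init")
```

Fine-tuning this smaller copy on CNN/DailyMail summaries is what recovers most of the teacher's ROUGE performance.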
The code is implemented in Python, using Hugging Face's Transformers library and PyTorch Lightning. All dependencies are listed in the "requirements.txt" file.
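For a local setup they can typically be installed with `pip install -r requirements.txt`; the Colab-specific steps are listed further below.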
Download the dataset from https://www.kaggle.com/datasets/jainishadesara/cnn-dailymail. Create a folder named "Data" in your working directory and place all six downloaded data files in it.
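As a quick sanity check you can verify the folder contents before training. The `.source`/`.target` file names below follow the Hugging Face seq2seq convention and are an assumption about the Kaggle download; adjust them if your files are named differently.

```python
# Check that the "Data" folder contains the six expected files. The .source/.target
# naming is an assumed convention (Hugging Face seq2seq style); rename as needed.
import os

data_dir = "Data"
expected = [
    "train.source", "train.target",
    "val.source", "val.target",
    "test.source", "test.target",
]
missing = [name for name in expected if not os.path.exists(os.path.join(data_dir, name))]
print("All data files found." if not missing else f"Missing files: {missing}")
```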
Run "distillation.py" to download and preprocess the data, build the student model from the teacher model, and fine-tune the student. The ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-Lsum scores are written to Result/metric.json.
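To inspect the reported scores afterwards, or to recompute ROUGE on your own outputs, you can use the `rouge_score` package that is already in the dependency list. The key names inside metric.json are assumed to mirror the ROUGE variants listed above.

```python
# Print the scores written by distillation.py and show a one-off ROUGE computation
# with rouge_score (the metric.json key names are assumed; check your own file).
import json

from rouge_score import rouge_scorer

with open("Result/metric.json") as f:
    print(json.load(f))

scorer = rouge_scorer.RougeScorer(
    ["rouge1", "rouge2", "rougeL", "rougeLsum"], use_stemmer=True
)
reference = "The student model matches the teacher on summarization quality."
prediction = "The student model is close to the teacher in summary quality."
print(scorer.score(reference, prediction))
```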
To run on Google Colab:
Step 1 - Change the Python version by running the following in a cell and entering 2 to select Python 3.8 (or any version below 3.9)
!sudo update-alternatives --config python3
Step 2 - Run this
!python --version
!sudo apt install python3-pip
Step 3 - Install dependencies by running this
!pip install fire
!pip install transformers==4.1.0
!pip install sentencepiece
!pip install pytorch-lightning==1.0.4
!pip install datasets
!pip install rouge_score
!pip install GitPython
!pip install sacrebleu
!pip install protobuf==3.20.*
Step 4 - Change the data_dir variable to point to your data files folder.
"data_dir": "your path to the data files folder",
Step 5 - Run distillation.py with the following command
!python /path/to/your/dir/distillation.py
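After fine-tuning, the student can be loaded and used like any other seq2seq checkpoint. The sketch below generates a summary and times it, which is the simplest way to check the speed-up over the teacher; the checkpoint directory is an assumption about where distillation.py saves the student, so adjust it to your setup.

```python
# Summarize an article with the fine-tuned student and time generation. The
# checkpoint path is an assumed location; point it at your saved student model.
import time

from transformers import PegasusForConditionalGeneration, PegasusTokenizer

model_dir = "Result/student_model"              # assumed output directory
tokenizer = PegasusTokenizer.from_pretrained(model_dir)
model = PegasusForConditionalGeneration.from_pretrained(model_dir)

article = "Replace this string with the CNN/DailyMail article you want to summarize."
inputs = tokenizer(article, truncation=True, max_length=512, return_tensors="pt")

start = time.time()
summary_ids = model.generate(**inputs, num_beams=4, max_length=128)
print("Summary:", tokenizer.decode(summary_ids[0], skip_special_tokens=True))
print(f"Generation time: {time.time() - start:.2f}s")
```

Running the same snippet against the teacher checkpoint is one way to reproduce the teacher-vs-student speed comparison.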
Parts of our approach are adapted from the following repository: https://github.com/huggingface/transformers/blob/e2c935f5615a3c15ee7439fa8a560edd5f13a457/examples/seq2seq