A production-ready text summarization pipeline built with Hugging Face Transformers. This project demonstrates fine-tuning state-of-the-art models on conversational data and deploying them as a robust, scalable service.
This project provides an end-to-end workflow for building a custom text summarization model, from data ingestion to deployment. It uses pretrained transformer models (Pegasus, T5, FLAN-T5) and fine-tunes them on the SAMSum conversational dataset to generate high-quality abstractive summaries.
- State-of-the-art Models: Leverage pretrained transformers from Hugging Face Model Hub
- Conversational Focus: Fine-tuned on SAMSum dataset for dialogue and chat summarization
- Modular Architecture: Clean, reusable components for data processing, training, and inference
- Flexible Backbone: Easy switching between Pegasus, T5, FLAN-T5, and other models
- Production Ready: Includes REST API, batch processing, and containerization support
- Comprehensive Evaluation: ROUGE metrics and custom evaluation pipelines
## Table of Contents

- Installation
- Project Structure
- Quick Start
- Dataset
- Models
- Usage
- Configuration
- Development Workflow
- API Reference
- Contributing
- License
## Installation

Prerequisites:

- Python 3.8 or higher
- CUDA-compatible GPU (recommended for training)
- 8GB+ RAM
```bash
# Clone the repository
git clone https://github.com/<your-username>/text-summarizer.git
cd text-summarizer
# Create virtual environment
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# (Optional) Login to Hugging Face for model access
# (Optional) Log in to Hugging Face for model access
huggingface-cli login
```

## Project Structure

```
text-summarizer/
├── README.md
├── requirements.txt
├── setup.py
├── .env.example                     # Environment variables template
├── Dockerfile                       # Container configuration
├── config/
│   ├── config.yaml                  # Main configuration file
│   ├── params.yaml                  # Training hyperparameters
│   └── logging.yaml                 # Logging configuration
├── data/
│   ├── raw/                         # Original SAMSum dataset
│   ├── processed/                   # Tokenized and preprocessed data
│   └── samples/                     # Test samples for quick validation
├── notebooks/
│   ├── 01_eda_samsum.ipynb          # Exploratory data analysis
│   ├── 02_model_comparison.ipynb    # Model performance comparison
│   └── 03_error_analysis.ipynb      # Failed predictions analysis
├── src/
│   ├── __init__.py
│   ├── config/
│   │   ├── __init__.py
│   │   ├── configuration.py         # Configuration manager
│   │   └── entity.py                # Config entity classes
│   ├── components/
│   │   ├── __init__.py
│   │   ├── data_ingestion.py
│   │   ├── data_transformation.py
│   │   ├── model_trainer.py
│   │   └── model_evaluation.py
│   ├── pipeline/
│   │   ├── __init__.py
│   │   ├── training_pipeline.py
│   │   └── prediction_pipeline.py
│   ├── utils/
│   │   ├── __init__.py
│   │   ├── common.py                # Utility functions
│   │   └── logger.py                # Custom logging
│   └── api.py                       # FastAPI application
├── scripts/
│   ├── run_pipeline.sh              # Full pipeline execution
│   ├── train.sh                     # Training script
│   ├── evaluate.sh                  # Evaluation script
│   └── serve.sh                     # API server launcher
├── models/
│   ├── checkpoints/                 # Fine-tuned model weights
│   └── tokenizer/                   # Saved tokenizer configs
├── tests/
│   ├── unit/
│   │   ├── test_data.py
│   │   ├── test_models.py
│   │   └── test_pipeline.py
│   └── integration/
│       └── test_api.py
├── examples/
│   ├── sample_conversation.txt
│   ├── sample_summary.txt
│   └── api_examples.py
└── artifacts/                       # Training artifacts and logs
    ├── logs/
    └── metrics/
```
## Quick Start

```bash
# Execute the full workflow (data → training → evaluation)
python main.py
```

```python
from src.pipeline.prediction_pipeline import PredictionPipeline
# Initialize pipeline
predictor = PredictionPipeline(model_path="models/checkpoints/best-model")
# Summarize text
conversation = """
John: Hey, are we still meeting at 3pm today?
Sarah: Yes! I'll bring the project files.
John: Perfect. Should we invite Mike too?
Sarah: Good idea, I'll send him a message now.
"""
summary = predictor.predict(conversation)
print(f"Summary: {summary}")# Launch FastAPI server
uvicorn src.api:app --host 0.0.0.0 --port 8000 --reload
# Test the endpoint
curl -X POST "http://localhost:8000/summarize" \
-H "Content-Type: application/json" \
-d '{"text": "Your long text here..."}'- Source: Hugging Face Datasets
- Description: Messenger-style conversations with human-written summaries
- Size: ~16,000 conversation-summary pairs
- Task: Abstractive dialogue summarization
- Split: Train (14,732) / Validation (818) / Test (819)
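For a quick look at the data, it can be loaded straight from the Hub with the `datasets` library (a minimal sketch, independent of the project's ingestion component):

```python
# Minimal sketch: pull SAMSum from the Hugging Face Hub.
from datasets import load_dataset

samsum = load_dataset("knkarthick/samsum")
print(samsum)              # DatasetDict with train / validation / test splits
print(samsum["train"][0])  # keys: 'id', 'dialogue', 'summary'
```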
Sample Entry:
```json
{
  "dialogue": "Amanda: I baked cookies. Do you want some?\nJerry: Sure!\nAmanda: I'll bring you tomorrow :-)",
  "summary": "Amanda baked cookies and will bring Jerry some tomorrow."
}
```

To use your own dataset, modify `src/components/data_ingestion.py`:

```python
# Load custom CSV/JSON
custom_data = load_custom_dataset("path/to/your/data.csv")
```
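Note that `load_custom_dataset` above is illustrative rather than part of the released code; a minimal sketch of such a helper, assuming a CSV with `dialogue` and `summary` columns like SAMSum:

```python
# Hypothetical helper built on the datasets CSV loader.
from datasets import load_dataset

def load_custom_dataset(path: str):
    """Load dialogue/summary pairs from a CSV into a Hugging Face Dataset."""
    # data_files may also be a list of paths or a dict of named splits.
    return load_dataset("csv", data_files=path, split="train")
```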
## Models

| Model | Size | Best For | Training Time* |
|---|---|---|---|
| google/pegasus-cnn_dailymail | 568M | News articles, formal text | ~4 hours |
| t5-small | 60M | Quick experiments, low resources | ~1 hour |
| facebook/bart-large-cnn | 406M | Balanced performance | ~3 hours |
| Falconsai/text_summarization | 60M | Dialogue, conversational text | ~1 hour |
*Approximate training time on SAMSum with a single V100 GPU
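All of these checkpoints load through the same Auto classes, which is what makes the backbone swappable; a minimal sketch (in practice `model_name` would come from the config below):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/pegasus-cnn_dailymail"  # or "t5-small", "facebook/bart-large-cnn", ...
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
```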
Select the backbone in the configuration file:

```yaml
# In config/config.yaml
model:
  name: "google/pegasus-cnn_dailymail"
  max_input_length: 512
  max_target_length: 128
```

## Usage

Training:

```bash
# Using command line
python -m src.pipeline.training_pipeline \
--model_name google/pegasus-cnn_dailymail \
--dataset knkarthick/samsum \
--epochs 3 \
--batch_size 4 \
--output_dir models/checkpoints/pegasus-samsum
# Or use the shell script
bash scripts/train.sh
```

Evaluation:

```bash
# Evaluate on test set
python -m src.components.model_evaluation \
--model_path models/checkpoints/pegasus-samsum \
--dataset knkarthick/samsum
# Output: ROUGE-1, ROUGE-2, ROUGE-L scores
```

Batch processing:

```python
from src.pipeline.prediction_pipeline import BatchPrediction
# Process multiple texts
texts = [
"First conversation...",
"Second conversation...",
"Third conversation..."
]
batch_predictor = BatchPrediction("models/checkpoints/best-model")
summaries = batch_predictor.predict_batch(texts, batch_size=8)
```

## Configuration

Main settings live in `config/config.yaml`:

```yaml
data:
  dataset_name: knkarthick/samsum
  train_batch_size: 4
  eval_batch_size: 8
  max_source_length: 512
  max_target_length: 128

model:
  name: google/pegasus-cnn_dailymail

training:
  num_epochs: 3
  learning_rate: 2e-5
  warmup_steps: 500
  weight_decay: 0.01
  gradient_accumulation_steps: 4

paths:
  data_dir: data/
  model_dir: models/checkpoints/
  log_dir: artifacts/logs/
```

Training hyperparameters live in `config/params.yaml`:

```yaml
TrainingArguments:
  num_train_epochs: 3
  per_device_train_batch_size: 4
  per_device_eval_batch_size: 8
  warmup_steps: 500
  learning_rate: 2e-5
  evaluation_strategy: "epoch"
  save_strategy: "epoch"
  load_best_model_at_end: true
  metric_for_best_model: "rouge1"
```
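These keys mirror `transformers.Seq2SeqTrainingArguments`. A sketch of how a trainer component might load them (the file path and output directory are illustrative; note that PyYAML parses `2e-5` as a string, hence the explicit cast):

```python
import yaml
from transformers import Seq2SeqTrainingArguments

with open("config/params.yaml") as f:
    params = yaml.safe_load(f)["TrainingArguments"]

training_args = Seq2SeqTrainingArguments(
    output_dir="models/checkpoints/pegasus-samsum",     # illustrative
    num_train_epochs=params["num_train_epochs"],
    per_device_train_batch_size=params["per_device_train_batch_size"],
    per_device_eval_batch_size=params["per_device_eval_batch_size"],
    warmup_steps=params["warmup_steps"],
    learning_rate=float(params["learning_rate"]),       # PyYAML reads 2e-5 as a str
    evaluation_strategy=params["evaluation_strategy"],  # eval_strategy in newer releases
    save_strategy=params["save_strategy"],
    load_best_model_at_end=params["load_best_model_at_end"],
    metric_for_best_model=params["metric_for_best_model"],
)
```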
## Development Workflow

1. Configuration Setup: Define parameters in `config.yaml` and `params.yaml`
2. Entity Definition: Create configuration entities in `src/config/entity.py` (see the sketch after this list)
3. Configuration Manager: Implement in `src/config/configuration.py`
4. Component Development:
   - Data Ingestion (`src/components/data_ingestion.py`)
   - Data Transformation (`src/components/data_transformation.py`)
   - Model Training (`src/components/model_trainer.py`)
   - Model Evaluation (`src/components/model_evaluation.py`)
5. Pipeline Creation:
   - Training Pipeline (`src/pipeline/training_pipeline.py`)
   - Prediction Pipeline (`src/pipeline/prediction_pipeline.py`)
6. API Development: REST endpoints in `src/api.py`
7. Testing: Unit and integration tests
8. Deployment: Containerization and cloud deployment
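For illustration, the entity for the training stage (step 2) might be a frozen dataclass; the field names below mirror the configuration files but are otherwise hypothetical:

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass(frozen=True)
class ModelTrainerConfig:
    """Typed view of the training-related keys in config.yaml / params.yaml."""
    model_name: str
    num_epochs: int
    learning_rate: float
    warmup_steps: int
    weight_decay: float
    output_dir: Path
```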
## API Reference

### POST /summarize

Generate a summary for the input text.
Request:
```json
{
  "text": "Your long conversation or article here...",
  "max_length": 128,
  "min_length": 30
}
```

Response:
```json
{
  "summary": "Concise summary of the input text.",
  "model": "pegasus-samsum",
  "processing_time": 0.45
}
```

Other endpoints:

- Batch summarization: process multiple texts at once.
- `GET /health`: check API health status.
- Model listing: list available models.
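The same call from Python with `requests` (assumes the server from the Quick Start is running locally):

```python
import requests

resp = requests.post(
    "http://localhost:8000/summarize",
    json={"text": "Your long conversation or article here...", "max_length": 128},
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["summary"])
```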
ROUGE scores on the SAMSum test set:

| Model | ROUGE-1 | ROUGE-2 | ROUGE-L |
|---|---|---|---|
| Pegasus-CNN/DM (fine-tuned) | 42.5 | 20.8 | 34.2 |
| T5-small (fine-tuned) | 40.1 | 18.5 | 32.7 |
| BART-large-CNN (fine-tuned) | 43.2 | 21.4 | 35.1 |
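Scores of this kind can be reproduced with the `evaluate` library; a minimal sketch (the project's evaluation component presumably wraps the same metric):

```python
import evaluate

rouge = evaluate.load("rouge")
scores = rouge.compute(
    predictions=["Amanda baked cookies and will bring Jerry some tomorrow."],
    references=["Amanda baked cookies and will bring some to Jerry tomorrow."],
)
# evaluate returns fractions in [0, 1]; the table above reports values x100.
print(scores)  # {'rouge1': ..., 'rouge2': ..., 'rougeL': ..., 'rougeLsum': ...}
```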
Deployment with Docker:

```bash
# Build image
docker build -t text-summarizer:latest .
# Run container
docker run -p 8000:8000 \
-v $(pwd)/models:/app/models \
text-summarizer:latest
# Access API
curl http://localhost:8000/health
```
Roadmap:

- Core summarization pipeline
- SAMSum fine-tuning
- REST API implementation
- Support for CNN/DailyMail and XSum datasets
- Multi-model ensemble predictions
- Hugging Face Hub integration for model versioning
- Streamlit/Gradio web interface
- Kubernetes deployment configurations
- Real-time streaming summarization
- Multi-language support
## Contributing

We welcome contributions! Please follow these steps:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
Please ensure:
- Code follows PEP 8 style guidelines
- All tests pass (`pytest tests/`)
- Documentation is updated
- Commit messages are descriptive
## License

This project is licensed under the MIT License - see the LICENSE file for details.

Third-party licenses:
- Hugging Face Transformers: Apache 2.0
- SAMSum Dataset: Non-commercial research purposes
- PEGASUS Model: Apache 2.0
Acknowledgments:

- Hugging Face for the Transformers library
- SAMSum dataset creators
- The open-source NLP community
Contact:

- Project Maintainer: Prakash Kantumutchu
- Email: k.prakashofficial@gmail.com
- GitHub: @your-username
- Issues: GitHub Issues
⭐ If you find this project helpful, please consider giving it a star!