This project automatically summarizes Sustainability (SDG) Reports using Transformer-based NLP models such as T5 and BART.
The goal is to condense lengthy corporate sustainability documents into concise, informative summaries that highlight key environmental, social, and governance (ESG) insights.
- Preprocessing pipeline for large and unstructured text data (PDF → cleaned text).
- Summarization models: Transformer-based architectures (e.g., T5, BART).
- Evaluation metrics: ROUGE and BLEU, used to measure summary quality against a reference summary.
- Configurable parameters: summary length, model choice, evaluation scope.
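As a rough illustration of the "PDF → cleaned text" step, the sketch below cleans text that has already been extracted from a PDF. It is not the code in `data_preprocessing.py`: the function name `clean_extracted_text` and the three cleaning rules (rejoining hyphenated line breaks, dropping page-number lines, collapsing whitespace) are assumptions chosen for illustration.

```python
import re


def clean_extracted_text(raw: str) -> str:
    """Clean PDF-extracted text (hypothetical helper, not the repo's code)."""
    # Rejoin words split across line breaks: "sustain-\nability" -> "sustainability"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Drop lines that hold only a page number, e.g. "12" or "Page 12"
    page_line = re.compile(r"\s*(?:page\s+)?\d+\s*", flags=re.IGNORECASE)
    lines = [ln for ln in text.split("\n") if not page_line.fullmatch(ln)]
    # Collapse every run of whitespace (including newlines) into one space
    return re.sub(r"\s+", " ", " ".join(lines)).strip()


if __name__ == "__main__":
    sample = "Our sustain-\nability goals\n\nPage 3\nreduce   emissions.\n12\n"
    print(clean_extracted_text(sample))
```

Note that the hyphen rule also merges genuine compounds that happen to break at a line end (for example "low-\ncarbon" becomes "lowcarbon"), so a real pipeline may need a dictionary check or a more conservative rule.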
SDG-Report-Summarization-AI/
│
├── data_preprocessing.py # Text cleaning and preparation
├── summarization.py # Summarization pipeline (T5/BART)
├── evaluation.py # Evaluation with ROUGE & BLEU
├── requirements.txt # Dependencies
└── README.md # Project documentation
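To give a sense of what the ROUGE score reported by `evaluation.py` measures, here is a minimal sketch of ROUGE-1 (unigram overlap) F1 in plain Python. It is a simplified illustration under stated assumptions: tokenization is lowercase whitespace splitting, no stemming is applied, and the function name `rouge1_f1` is hypothetical. The repository's script may rely on a dedicated library and will likely produce slightly different numbers.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference and a candidate summary."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())  # share of candidate words matched
    recall = overlap / sum(ref_counts.values())      # share of reference words recovered
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    ref = "the company reduces carbon emissions"
    cand = "the company cuts emissions"
    print(round(rouge1_f1(ref, cand), 3))
```

In this example three of the four candidate words appear in the five-word reference, so precision is 0.75, recall is 0.6, and the F1 score is about 0.667.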
git clone https://github.com/erenbg1/SDG-Report-Summarization-AI.git
cd SDG-Report-Summarization-AI
pip install -r requirements.txt
python data_preprocessing.py --input data/raw_report.pdf --output data/cleaned_report.txt
python summarization.py --input data/cleaned_report.txt --model t5-small --max_length 300
python evaluation.py --reference data/reference_summary.txt --candidate data/generated_summary.txt
Input length: ~20,000 tokens
Generated summary length: ~500 tokens
"The company’s SDG strategy focuses primarily on reducing carbon emissions, improving supply chain transparency, and investing in community education projects…"
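An input of roughly 20,000 tokens is much longer than what models such as T5 and BART are usually run on in a single pass (commonly a few hundred to about a thousand tokens), so a pipeline of this kind typically splits the text into overlapping chunks, summarizes each chunk, and then combines the partial summaries. This README does not say how `summarization.py` handles long inputs, so the sketch below is an assumption. The helper name `chunk_words` and its defaults are hypothetical, and it counts whitespace-separated words as a rough stand-in for model tokens.

```python
def chunk_words(text: str, max_words: int = 400, overlap: int = 50) -> list[str]:
    """Split text into word chunks of at most max_words, sharing `overlap` words."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


if __name__ == "__main__":
    report = " ".join(f"w{i}" for i in range(10))
    for chunk in chunk_words(report, max_words=4, overlap=1):
        print(chunk)
```

Each chunk would then be passed to the summarization model separately. Because subword tokenizers usually produce more tokens than words, `max_words` should be set comfortably below the model's token limit.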
- Expand dataset with multi-company SDG reports.
- Fine-tune domain-specific summarization models.
- Add a hybrid extractive + abstractive approach.
- Deploy as a web-based summarization tool (Flask/Streamlit).
This project is released under the MIT License.