Commit 4990092

Merge pull request #173 from Chakravartinsamrat/PIYUSH

Create text_summarization.md

2 parents 5de2029 + 6b034db · 1 file changed: +267 −0

# 📜 Text Summarization

### 🎯 AIM

Develop a model to summarize long articles into short, concise summaries.

### 📊 DATASET LINK

[CNN DailyMail News Dataset](https://www.kaggle.com/datasets/gowrishankarp/newspaper-text-summarization-cnn-dailymail/)

### 📓 NOTEBOOK LINK

??? abstract "Kaggle Notebook"

    <iframe src="https://www.kaggle.com/embed/piyushchakarborthy/text-summary-via-textrank-transformers-tf-idf?kernelSessionId=219171135" height="800" style="margin: 0 auto; width: 100%; max-width: 950px;" frameborder="0" scrolling="auto" title="Text Summary Via TextRank, Transformers, TF-IDF"></iframe>
### ⚙️ LIBRARIES NEEDED

??? quote "LIBRARIES USED"

    - pandas
    - numpy
    - scikit-learn (including `TfidfVectorizer`)
    - matplotlib
    - keras
    - tensorflow
    - spacy
    - pytextrank
    - transformers (BART)

---
### 📝 DESCRIPTION

??? info "What is the requirement of the project?"

    - A robust system to summarize text efficiently is essential for handling large volumes of information.
    - It helps users quickly grasp key insights without reading lengthy documents.

??? info "Why is it necessary?"

    - Large amounts of text can be overwhelming and time-consuming to process.
    - Automated summarization improves productivity and aids decision-making in various fields like journalism, research, and customer support.

??? info "How is it beneficial and used?"

    - Provides a concise summary while preserving essential information.
    - Used in news aggregation, academic research, and AI-powered assistants for quick content consumption.

??? info "How did you start approaching this project? (Initial thoughts and planning)"

    - Explored different text summarization techniques, including extractive and abstractive methods.
    - Implemented models like TextRank, BART, and T5 to compare their effectiveness.

??? info "Mention any additional resources used (blogs, books, chapters, articles, research papers, etc.)."

    - Documentation from Hugging Face Transformers
    - Research paper: "Text Summarization using Deep Learning"
    - Blog: "Introduction to NLP-based Summarization Techniques"

---
## 🔍 EXPLANATION

#### 🧩 DETAILS OF THE DIFFERENT FEATURES

#### 📂 dataset.csv

The dataset contains features such as sentence importance, word frequency, and linguistic structure that help in generating meaningful summaries.

| Feature Name | Description |
|--------------|-------------|
| Id | A unique ID for each row |
| Article | Full text of the article published by CNN/Daily Mail |
| Highlights | Key points of the article |

#### 🛠 Developed Features

| Feature | Description |
|----------------------|-------------------------------------------------|
| `sentence_rank` | Rank of a sentence based on importance, computed with TextRank |
| `word_freq` | Frequency of key terms in the document |
| `tf-idf_score` | Term Frequency-Inverse Document Frequency score for words |
| `summary_length` | Desired length of the summary |
| `generated_summary` | AI-generated condensed version of the original text |

---
### 🛤 PROJECT WORKFLOW

!!! success "Project flowchart"

    ``` mermaid
    graph LR
        A[Start] --> B[Load Dataset]
        B --> C[Preprocessing]
        C --> D[TextRank + TF-IDF / Transformer Models]
        D --> E{Compare Performance}
        E -->|Best Model| F[Deploy]
        E -->|Retry| C
    ```

#### PROCEDURE
=== "Step 1"
94+
95+
Exploratory Data Analysis:
96+
97+
- Loaded the CNN/DailyMail dataset using pandas.
98+
- Explored dataset features like article and highlights, ensuring the correct format for summarization.
99+
- Analyzed the distribution of articles and their corresponding summaries.
100+
101+
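    A minimal loading sketch, assuming the Kaggle CSV layout with `article` and `highlights` columns (the file path is illustrative):

    ``` python
    import pandas as pd

    # Load the CNN/DailyMail data (adjust the path to your download).
    df = pd.read_csv("cnn_dailymail/test.csv")

    # Inspect the columns used for summarization.
    print(df[["article", "highlights"]].head())

    # Compare article and summary lengths (in words).
    df["article_len"] = df["article"].str.split().str.len()
    df["highlights_len"] = df["highlights"].str.split().str.len()
    print(df[["article_len", "highlights_len"]].describe())
    ```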
=== "Step 2"
102+
103+
Data cleaning and preprocessing:
104+
105+
- Removed unnecessary columns (like id) and checked for missing values.
106+
- Tokenized articles into sentences and words, removing stopwords and special characters.
107+
- Preprocessed the text using basic NLP techniques such as lowercasing, lemmatization, and removing non-alphanumeric characters.
108+
109+
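    A simplified preprocessing sketch using spaCy (assumes the `en_core_web_sm` model is installed; the notebook's exact pipeline may differ):

    ``` python
    import spacy

    nlp = spacy.load("en_core_web_sm")

    def preprocess(text: str) -> str:
        """Lowercase, lemmatize, and drop stopwords and non-alphabetic tokens."""
        doc = nlp(text.lower())
        return " ".join(tok.lemma_ for tok in doc if tok.is_alpha and not tok.is_stop)

    print(preprocess("The Apollo program landed humans on the Moon."))
    # e.g. "apollo program land human moon"
    ```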
=== "Step 3"
110+
111+
Feature engineering and selection:
112+
113+
- For TextRank-based summarization, calculated sentence similarity using TF-IDF (Term Frequency-Inverse Document Frequency) and Cosine Similarity.
114+
- Selected top-ranked sentences based on their importance and relevance to the article.
115+
- Applied transformers-based models like BART and T5 for abstractive summarization.
116+
- Applied transformers-based models like BART and T5 for abstractive summarization.
117+
118+
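    A minimal sketch of building the sentence-similarity matrix that TextRank consumes (toy sentences; the notebook operates on full preprocessed articles):

    ``` python
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    sentences = [
        "apollo program nasa initiative landed humans moon",
        "apollo 11 first mission land moon",
        "apollo program achievement space exploration",
    ]

    # One TF-IDF vector per sentence, then pairwise cosine similarity.
    tfidf_matrix = TfidfVectorizer().fit_transform(sentences)
    similarity_matrix = cosine_similarity(tfidf_matrix)
    print(similarity_matrix.shape)  # (3, 3): one row and column per sentence
    ```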
=== "Step 4"
119+
120+
Model training and evaluation:
121+
122+
- For the TextRank summarization approach, created a similarity matrix based on TF-IDF and Cosine Similarity.
123+
- For transformer-based methods, used Hugging Face's BART and T5 models, summarizing articles with their pre-trained weights.
124+
- Evaluated the summarization models based on BLEU, ROUGE, and Cosine Similarity metrics.
125+
126+
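    A sketch of scoring one summary with ROUGE, assuming the `rouge-score` package (the strings are illustrative):

    ``` python
    from rouge_score import rouge_scorer  # pip install rouge-score

    reference = "Apollo 11 was the first crewed mission to land on the Moon."
    candidate = "Apollo 11 was the first mission that landed humans on the Moon."

    # F-measures for unigram overlap (ROUGE-1) and longest common subsequence (ROUGE-L).
    scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
    scores = scorer.score(reference, candidate)
    print(scores["rouge1"].fmeasure, scores["rougeL"].fmeasure)
    ```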
=== "Step 5"
127+
128+
Validation and testing:
129+
130+
- Tested both extractive and abstractive summarization models on unseen data to ensure generalizability.
131+
- Plotted confusion matrices to visualize True Positives, False Positives, and False Negatives, ensuring effective model performance.
132+
---
133+
134+
### 🖥 CODE EXPLANATION

<!-- Provide an explanation for your essential code, highlighting key sections and their functionalities. -->
<!-- This will help beginners understand the core components and how they contribute to the overall project. -->

=== "TextRank algorithm"

    Important Function:

    ``` python
    graph = nx.from_numpy_array(similarity_matrix)
    scores = nx.pagerank(graph)
    ```

    Example Input:

    ``` python
    import networkx as nx
    import numpy as np

    similarity_matrix = np.array([
        [0.0, 0.2, 0.1],   # Sentence 1
        [0.2, 0.0, 0.3],   # Sentence 2
        [0.1, 0.3, 0.0]])  # Sentence 3

    graph = nx.from_numpy_array(similarity_matrix)
    scores = nx.pagerank(graph)
    ```

    Output:

    ``` python
    {0: 0.25, 1: 0.45, 2: 0.30}  # approximate scores: sentence 2 (index 1) is the most important
    ```
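    To turn these scores into an extractive summary, one simple follow-up (a sketch; the `sentences` list of original sentence strings is an assumption) is to keep the top-ranked sentences in their original order:

    ``` python
    sentences = ["First sentence.", "Second sentence.", "Third sentence."]  # illustrative

    # Pick the indices of the 2 highest-scoring sentences, then restore document order.
    top = sorted(sorted(scores, key=scores.get, reverse=True)[:2])
    summary = " ".join(sentences[i] for i in top)
    print(summary)
    ```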
=== "Transformers"
160+
161+
Important Function:
162+
163+
pipeline("summarization") - Initializes a pre-trained transformer model for summarization.
164+
generated_summary = summarization_pipeline(article, max_length=150, min_length=50, do_sample=False)
165+
This Generates a summary using a transformer model.
166+
167+
Example Input:
168+
article = "The Apollo program was a NASA initiative that landed humans on the Moon between 1969 and 1972,
169+
with Apollo 11 being the first mission."
170+
171+
Output:
172+
The Apollo program was a NASA initiative that landed humans on the Moon between 1969 and 1972.
173+
Apollo 11 was the first mission.
174+
175+
176+
177+
178+
=== "TTF-IDF Algorithm"
179+
180+
Important Function:
181+
182+
vectorizer = TfidfVectorizer()
183+
tfidf_matrix = vectorizer.fit_transform(processed_sentences)
184+
185+
Example Input:
186+
processed_sentences = [
187+
"apollo program nasa initiative landed humans moon 1969 1972",
188+
"apollo 11 first mission land moon neil armstrong buzz aldrin walked surface",
189+
"apollo program significant achievement space exploration cold war space race"]
190+
191+
Output:
192+
['1969', '1972', 'achievement', 'aldrin', 'apollo', 'armstrong', 'buzz', 'cold', 'exploration',
193+
'first', 'humans', 'initiative', 'land', 'landed', 'moon', 'nasa', 'neil', 'program', 'race',
194+
'significant', 'space', 'surface', 'walked', 'war']
195+
196+
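    One simple extractive heuristic built on this matrix (a sketch; this scoring rule is an assumption, not necessarily the notebook's) is to rank sentences by their mean TF-IDF weight:

    ``` python
    # Score each sentence by the mean TF-IDF weight across the vocabulary.
    sentence_scores = tfidf_matrix.mean(axis=1).A1  # .A1 flattens to a 1-D array
    best = sentence_scores.argmax()
    print(best, processed_sentences[best])
    ```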
---
#### ⚖️ PROJECT TRADE-OFFS AND SOLUTIONS

=== "Trade-off 1"

    The training split is over 1.2 GB, which is too large for local machines.

    - **Solution**: Instead of training on the full training split, used the smaller test split for training and validation.
=== "Trade-off 2"
207+
208+
Transformer models (BART/T5) required high computational resources and long inference times for summarizing large articles.
209+
210+
- **Solution**: Model Pruning: Used smaller versions of transformer models (e.g., distilBART or distilT5) to reduce the computational load without compromising much on performance.
211+
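    A sketch of swapping in a distilled checkpoint (assumes the publicly available `sshleifer/distilbart-cnn-12-6` model on the Hugging Face Hub):

    ``` python
    from transformers import pipeline

    # Distilled BART fine-tuned on CNN/DailyMail: faster inference, similar quality.
    summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")
    article = "The Apollo program was a NASA initiative that landed humans on the Moon."
    print(summarizer(article, max_length=60, min_length=20, do_sample=False)[0]["summary_text"])
    ```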
=== "Trade-off 3"

    TextRank summaries may miss nuance and context, leading to less accurate or overly simplistic output compared to transformer-based models.

    - **Solution**: Combined TextRank and transformer-based summarization models in a hybrid approach to leverage the best of both worlds: speed from TextRank and accuracy from transformers.

---
### 🖼 SCREENSHOTS

??? example "Confusion Matrix"

    === "TF-IDF Confusion Matrix"
        ![tfidf](https://github.com/user-attachments/assets/28f257e1-2529-48f1-81e5-e058a50fb351)

    === "TextRank Confusion Matrix"
        ![textrank](https://github.com/user-attachments/assets/cb748eff-e4f3-4096-ab2b-cf2e4b40186f)

    === "Transformers Confusion Matrix"
        ![trans](https://github.com/user-attachments/assets/7e99887b-e225-4dd0-802d-f1c2b0e89bef)
### ✅ CONCLUSION

#### 🔑 KEY LEARNINGS

!!! tip "Insights gained from the data"

    - Data Complexity: News articles vary in length and structure, requiring different summarization techniques.
    - Text Preprocessing: Cleaning text (e.g., stopword removal, tokenization) significantly improves summarization quality.
    - Feature Extraction: Techniques like TF-IDF, TextRank, and transformer embeddings help in effective text representation for summarization models.

??? tip "Improvements in understanding machine learning concepts"

    - Model Selection: Compared extractive (TextRank, TF-IDF) and abstractive (transformer) models to determine the best summarization approach.

??? tip "Challenges faced and how they were overcome"

    - Long Text Processing: Split lengthy articles into manageable sections before summarization.
    - Computational Efficiency: Used batch processing and model optimization to handle large datasets efficiently.

---
#### 🌍 USE CASES

=== "Application 1"

    **News Aggregation & Personalized Summaries**

    - Automating news summarization helps users quickly grasp key events without reading lengthy articles.
    - Used in news apps, digital assistants, and content curation platforms.

=== "Application 2"

    **Legal & Academic Document Summarization**

    - Helps professionals extract critical insights from lengthy legal or research documents.
    - Reduces the time needed for manual reading and analysis.
