# 📜 Text Summarization

### 🎯 AIM
Develop a model to summarize long articles into short, concise summaries.

### 📊 DATASET LINK
[CNN DailyMail News Dataset](https://www.kaggle.com/datasets/gowrishankarp/newspaper-text-summarization-cnn-dailymail/)

### 📓 NOTEBOOK LINK
??? abstract "Kaggle Notebook"

    <iframe src="https://www.kaggle.com/embed/piyushchakarborthy/text-summary-via-textrank-transformers-tf-idf?kernelSessionId=219171135" height="800" style="margin: 0 auto; width: 100%; max-width: 950px;" frameborder="0" scrolling="auto" title="Text Summary Via TextRank, Transformers, TF-IDF"></iframe>

### ⚙️ LIBRARIES NEEDED
??? quote "LIBRARIES USED"

    - pandas
    - numpy
    - scikit-learn (`TfidfVectorizer`, `cosine_similarity`)
    - matplotlib
    - keras
    - tensorflow
    - spacy
    - pytextrank
    - networkx
    - transformers (BART, T5)

---

### 📝 DESCRIPTION

??? info "What is the requirement of the project?"
    - A robust system to summarize text efficiently is essential for handling large volumes of information.
    - It helps users quickly grasp key insights without reading lengthy documents.

??? info "Why is it necessary?"
    - Large amounts of text can be overwhelming and time-consuming to process.
    - Automated summarization improves productivity and aids decision-making in fields such as journalism, research, and customer support.

??? info "How is it beneficial and used?"
    - Provides a concise summary while preserving essential information.
    - Used in news aggregation, academic research, and AI-powered assistants for quick content consumption.

??? info "How did you start approaching this project? (Initial thoughts and planning)"
    - Explored different text summarization techniques, including extractive and abstractive methods.
    - Implemented models such as TextRank, BART, and T5 to compare their effectiveness.

??? info "Mention any additional resources used (blogs, books, chapters, articles, research papers, etc.)."
    - Documentation from Hugging Face Transformers
    - Research paper: "Text Summarization using Deep Learning"
    - Blog: "Introduction to NLP-based Summarization Techniques"

---
## 🔍 EXPLANATION

#### 🧩 DETAILS OF THE DIFFERENT FEATURES

#### 📂 dataset.csv

The dataset contains features like sentence importance, word frequency, and linguistic structures that help in generating meaningful summaries.

| Feature Name | Description |
|--------------|-------------|
| Id | A unique ID for each row |
| Article | The full article as published by CNN/Daily Mail |
| Highlights | Key points of the article |

#### 🛠 Developed Features

| Feature | Description |
|----------------------|-------------------------------------------------|
| `sentence_rank` | Rank of a sentence based on importance, computed with TextRank |
| `word_freq` | Frequency of key terms in the document |
| `tf-idf_score` | Term Frequency-Inverse Document Frequency score for words |
| `summary_length` | Desired length of the summary |
| `generated_summary` | AI-generated condensed version of the original text |

---
| 77 | +--- |
| 78 | +### 🛤 PROJECT WORKFLOW |
| 79 | +!!! success "Project flowchart" |
| 80 | + |
| 81 | + ``` mermaid |
| 82 | + graph LR |
| 83 | + A[Start] --> B[Load Dataset] |
| 84 | + B --> C[Preprocessing] |
| 85 | + C --> D[TextRank + TF-IDF / Transformer Models] |
| 86 | + D --> E{Compare Performance} |
| 87 | + E -->|Best Model| F[Deploy] |
| 88 | + E -->|Retry| C; |
| 89 | + ``` |

#### PROCEDURE

=== "Step 1"

    Exploratory Data Analysis:

    - Loaded the CNN/DailyMail dataset using pandas.
    - Explored dataset features such as `article` and `highlights`, ensuring the correct format for summarization.
    - Analyzed the length distributions of articles and their corresponding summaries (see the sketch below).
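
    A minimal sketch of this step, assuming the Kaggle CSV layout with `article` and `highlights` columns (the file path is illustrative):

    ```python
    import pandas as pd

    # Load one split of the CNN/DailyMail dataset (path is illustrative)
    df = pd.read_csv("cnn_dailymail/test.csv")

    # Inspect the two columns used for summarization
    print(df[["article", "highlights"]].head())

    # Word-count distributions of articles vs. their reference summaries
    print(df["article"].str.split().str.len().describe())
    print(df["highlights"].str.split().str.len().describe())
    ```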

=== "Step 2"

    Data cleaning and preprocessing:

    - Removed unnecessary columns (such as `id`) and checked for missing values.
    - Tokenized articles into sentences and words, removing stopwords and special characters.
    - Preprocessed the text with basic NLP techniques such as lowercasing, lemmatization, and removal of non-alphanumeric characters (a sketch follows below).
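
    A minimal preprocessing sketch with spaCy, under the assumption that the `en_core_web_sm` pipeline is installed (`python -m spacy download en_core_web_sm`):

    ```python
    import spacy

    nlp = spacy.load("en_core_web_sm")

    def preprocess(text: str) -> list[str]:
        """Lowercase, lemmatize, and drop stopwords and non-alphanumeric
        tokens, returning one cleaned string per sentence."""
        doc = nlp(text)
        cleaned = []
        for sent in doc.sents:
            tokens = [tok.lemma_.lower() for tok in sent
                      if tok.text.isalnum() and not tok.is_stop]
            cleaned.append(" ".join(tokens))
        return cleaned

    print(preprocess("The Apollo program landed humans on the Moon in 1969."))
    # ['apollo program land human moon 1969']
    ```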

=== "Step 3"

    Feature engineering and selection:

    - For TextRank-based summarization, computed sentence similarity using TF-IDF (Term Frequency-Inverse Document Frequency) and cosine similarity.
    - Selected the top-ranked sentences based on their importance and relevance to the article.
    - Applied transformer-based models such as BART and T5 for abstractive summarization (a similarity-matrix sketch follows below).
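
    A minimal sketch of building the sentence-similarity matrix that TextRank consumes, assuming `processed_sentences` comes from the preprocessing step:

    ```python
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    processed_sentences = [
        "apollo program nasa initiative landed humans moon",
        "apollo 11 first mission land moon",
        "apollo program achievement space exploration",
    ]

    # One TF-IDF vector per sentence
    tfidf_matrix = TfidfVectorizer().fit_transform(processed_sentences)

    # Pairwise cosine similarities -> square matrix used as the TextRank graph
    similarity_matrix = cosine_similarity(tfidf_matrix)
    print(similarity_matrix.shape)  # (3, 3)
    ```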

=== "Step 4"

    Model training and evaluation:

    - For the TextRank approach, created a similarity matrix from TF-IDF vectors and cosine similarity.
    - For transformer-based methods, used Hugging Face's BART and T5 models, summarizing articles with their pre-trained weights.
    - Evaluated the summarization models with BLEU, ROUGE, and cosine-similarity metrics (a ROUGE sketch follows below).
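
    A minimal ROUGE evaluation sketch, assuming the `rouge-score` package is installed (`pip install rouge-score`); the example strings are illustrative:

    ```python
    from rouge_score import rouge_scorer

    # ROUGE-1 and ROUGE-L between a generated summary and the reference
    scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)

    reference = "Apollo 11 was the first crewed Moon landing."
    generated = "Apollo 11 was the first mission to land humans on the Moon."

    scores = scorer.score(reference, generated)
    print(scores["rouge1"].fmeasure, scores["rougeL"].fmeasure)
    ```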

=== "Step 5"

    Validation and testing:

    - Tested both extractive and abstractive summarization models on unseen data to check generalization.
    - Plotted confusion matrices to visualize true positives, false positives, and false negatives (a sketch follows below).
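
    One way such a confusion matrix can be built for the extractive models is to treat "sentence included in the summary" as a binary label per sentence; the labels below are purely illustrative, not taken from the notebook:

    ```python
    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

    # 1 = sentence belongs in the summary, 0 = it does not (illustrative)
    y_true = np.array([1, 0, 1, 0, 0, 1])  # reference selection
    y_pred = np.array([1, 0, 0, 0, 1, 1])  # model's selection

    cm = confusion_matrix(y_true, y_pred)
    ConfusionMatrixDisplay(cm, display_labels=["excluded", "included"]).plot()
    plt.show()
    ```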
---

### 🖥 CODE EXPLANATION

=== "TextRank algorithm"

    Key functions (with the imports they need):

    ```python
    import numpy as np
    import networkx as nx

    # Example input: pairwise similarities between three sentences
    similarity_matrix = np.array([
        [0.0, 0.2, 0.1],  # Sentence 1
        [0.2, 0.0, 0.3],  # Sentence 2
        [0.1, 0.3, 0.0],  # Sentence 3
    ])

    # Build a weighted graph and score each sentence with PageRank
    graph = nx.from_numpy_array(similarity_matrix)
    scores = nx.pagerank(graph)
    print(scores)
    ```

    Approximate output:

    ```python
    {0: 0.25, 1: 0.45, 2: 0.30}  # Sentence 2 (0.45) is the most important
    ```
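
    A short usage sketch for turning these scores into an extractive summary (the sentence texts are illustrative):

    ```python
    # Pick the top-2 sentences by score, then restore document order
    sentences = ["Sentence one.", "Sentence two.", "Sentence three."]
    top = sorted(scores, key=scores.get, reverse=True)[:2]
    summary = " ".join(sentences[i] for i in sorted(top))
    print(summary)  # "Sentence two. Sentence three."
    ```
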
=== "Transformers"

    Key functions (with the import they need):

    ```python
    from transformers import pipeline

    # Initialize a pre-trained transformer summarization model
    summarization_pipeline = pipeline("summarization")

    # Example input
    article = ("The Apollo program was a NASA initiative that landed humans "
               "on the Moon between 1969 and 1972, with Apollo 11 being the "
               "first mission.")

    # Generate a summary within a length budget
    generated_summary = summarization_pipeline(
        article, max_length=150, min_length=50, do_sample=False
    )
    print(generated_summary[0]["summary_text"])
    ```

    Example output:

    ```text
    The Apollo program was a NASA initiative that landed humans on the Moon
    between 1969 and 1972. Apollo 11 was the first mission.
    ```

=== "TF-IDF algorithm"

    Key functions (with the import they need):

    ```python
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Example input: preprocessed sentences
    processed_sentences = [
        "apollo program nasa initiative landed humans moon 1969 1972",
        "apollo 11 first mission land moon neil armstrong buzz aldrin walked surface",
        "apollo program significant achievement space exploration cold war space race",
    ]

    # Fit the vectorizer and turn each sentence into a TF-IDF vector
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(processed_sentences)
    print(vectorizer.get_feature_names_out().tolist())
    ```

    Output (the learned vocabulary):

    ```python
    ['11', '1969', '1972', 'achievement', 'aldrin', 'apollo', 'armstrong', 'buzz',
     'cold', 'exploration', 'first', 'humans', 'initiative', 'land', 'landed',
     'mission', 'moon', 'nasa', 'neil', 'program', 'race', 'significant', 'space',
     'surface', 'walked', 'war']
    ```

---

#### ⚖️ PROJECT TRADE-OFFS AND SOLUTIONS

=== "Trade-off 1"

    The training split is over 1.2 GB, which is too large for local machines.

    - **Solution**: Instead of training on the full train split, used the smaller test split for training and validation.

=== "Trade-off 2"

    Transformer models (BART/T5) required high computational resources and long inference times when summarizing large articles.

    - **Solution**: Model pruning: used smaller, distilled versions of the transformer models (e.g., distilBART or distilT5) to reduce the computational load without sacrificing much performance (see the sketch below).
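
    A minimal sketch of swapping in a distilled checkpoint; `sshleifer/distilbart-cnn-12-6` is a publicly available distilBART model fine-tuned on CNN/DailyMail (the article text is a placeholder):

    ```python
    from transformers import pipeline

    # Distilled BART: smaller and faster than the full bart-large-cnn model
    summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

    article = "Long news article text goes here..."
    print(summarizer(article, max_length=150, min_length=50, do_sample=False))
    ```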

=== "Trade-off 3"

    TextRank summaries can miss nuance and context, producing less accurate or overly simplistic output compared with transformer-based models.

    - **Solution**: Combined TextRank and transformer-based summarization in a hybrid approach to get the best of both worlds: speed from TextRank and accuracy from transformers (a sketch follows below).
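    One shape such a hybrid can take, under the assumption that TextRank first trims the article and a transformer then rewrites the trimmed text (`extract_top_sentences` is a hypothetical helper, e.g. the TextRank snippet above):

    ```python
    from transformers import pipeline

    def hybrid_summarize(article: str, extract_top_sentences, top_k: int = 10) -> str:
        """Extract the top_k sentences cheaply with TextRank, then compress
        them abstractively with a transformer."""
        extractive = " ".join(extract_top_sentences(article, top_k))
        summarizer = pipeline("summarization")
        result = summarizer(extractive, max_length=150, min_length=50, do_sample=False)
        return result[0]["summary_text"]
    ```
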
---

### 🖼 SCREENSHOTS

??? example "Confusion Matrix"

    === "TF-IDF Confusion Matrix"
        ![confusion_matrix](https://github.com/user-attachments/assets/7e60e75e-668b-42ff-9459-dd8a76f50c3f)

    === "TextRank Confusion Matrix"
        ![confusion_matrix](https://github.com/user-attachments/assets/c6cfd42d-a92b-4cfb-be66-d73978b0bcdb)

    === "Transformers Confusion Matrix"
        ![confusion_matrix](https://github.com/user-attachments/assets/57b8d298-2b85-4ebd-9a39-4ae1aa22a2a1)

### ✅ CONCLUSION

#### 🔑 KEY LEARNINGS

!!! tip "Insights gained from the data"
    - Data complexity: news articles vary in length and structure, requiring different summarization techniques.
    - Text preprocessing: cleaning text (e.g., stopword removal, tokenization) significantly improves summarization quality.
    - Feature extraction: techniques such as TF-IDF, TextRank, and transformer embeddings provide effective text representations for summarization models.

??? tip "Improvements in understanding machine learning concepts"
    - Model selection: compared extractive (TextRank, TF-IDF) and abstractive (transformer) models to determine the best summarization approach.

??? tip "Challenges faced and how they were overcome"
    - Long text processing: split lengthy articles into manageable sections before summarization.
    - Computational efficiency: used batch processing and model optimization to handle large datasets efficiently.

---

#### 🌍 USE CASES

=== "Application 1"

    **News Aggregation & Personalized Summaries**

    - Automated news summarization helps users quickly grasp key events without reading lengthy articles.
    - Used in news apps, digital assistants, and content curation platforms.

=== "Application 2"

    **Legal & Academic Document Summarization**

    - Helps professionals extract critical insights from lengthy legal or research documents.
    - Reduces the time needed for manual reading and analysis.