
🚀 Natural Language Processing (NLP) - Comprehensive Tutorial Collection


A production-ready collection of 14 professionally documented NLP scripts covering fundamental to advanced techniques

Features · Installation · Scripts · Usage · Contributing


📋 Table of Contents

  • Overview
  • Key Features
  • Prerequisites
  • Installation
  • Script Catalog
  • Usage Examples
  • Technical Architecture
  • Project Structure
  • Learning Path
  • Contributing
  • License
  • Contact
  • Acknowledgments


🎯 Overview

This repository contains a comprehensive, production-ready collection of Natural Language Processing (NLP) scripts designed for both learning and practical application. Each script is meticulously documented with:

  • Executive summaries explaining the "what" and "why"
  • Step-by-step code walkthroughs with inline comments
  • Real-world applications and use cases
  • Technical deep-dives into algorithms and mathematics
  • Best practices for NLP pipelines

Perfect for:

  • 🎓 Students learning NLP fundamentals
  • 💼 Data Scientists preparing for interviews (FAANG-level)
  • 🔬 Researchers exploring NLP techniques
  • 👨‍💻 Developers building text processing applications

✨ Key Features

🔥 Comprehensive Coverage

  • Text Preprocessing: Regex, tokenization, stemming, lemmatization
  • Feature Engineering: Bag of Words (BoW), TF-IDF, N-Grams
  • Machine Learning: Text classification, spam detection, sentiment analysis
  • Deep Learning: Word embeddings (Word2Vec), neural networks with TensorFlow
  • Advanced NLP: Text summarization, Named Entity Recognition (NER), POS tagging

📚 Professional Documentation

  • Every script includes detailed comments explaining logic and rationale
  • Mathematical formulas and algorithmic explanations
  • Comparison of different approaches (e.g., stemming vs. lemmatization)
  • Performance considerations and optimization tips

🛠️ Production-Ready Code

  • Clean, modular, and reusable code structure
  • Error handling and edge case management
  • Efficient implementations using industry-standard libraries
  • Ready for integration into larger projects

🔧 Prerequisites

Required Knowledge

  • Basic Python programming (functions, loops, data structures)
  • Understanding of machine learning concepts (optional but helpful)
  • Familiarity with command line/terminal

System Requirements

  • Python: 3.8 or higher (TensorFlow 2.13 does not support 3.7)
  • RAM: Minimum 4GB (8GB recommended for deep learning scripts)
  • Storage: ~500MB for libraries and datasets

📦 Installation

1. Clone the Repository

git clone https://github.com/imdataScientistSachin/NLP.git
cd NLP

2. Create Virtual Environment (Recommended)

# Windows
python -m venv venv
venv\Scripts\activate

# macOS/Linux
python3 -m venv venv
source venv/bin/activate

3. Install Dependencies

pip install -r requirements.txt

4. Download NLTK Data

import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')                      # needed by WordNetLemmatizer
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

Required Libraries

nltk==3.8.1
numpy==1.24.3
pandas==2.0.3
scikit-learn==1.3.0
tensorflow==2.13.0
beautifulsoup4==4.12.2
gensim==4.3.1
imbalanced-learn==0.11.0
matplotlib==3.7.2

📖 Script Catalog

🔤 Fundamentals: Text Processing

1️⃣ 01_reModule.py - Regular Expressions Mastery

Purpose: Master Python's re module for pattern matching and text manipulation

Key Concepts:

  • Substitution with re.sub()
  • Pattern matching with re.search()
  • Metacharacters: \d, \D, \w, \W, \s, \S
  • Anchors: ^ (start), $ (end)
  • Character classes: [a-z], [^rw]

Use Cases: Data cleaning, text anonymization, input validation
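
A minimal sketch of these building blocks, using only the standard library (the sample text is illustrative):

import re

text = "Order #4521 shipped on 2024-03-15 to Alice."

# Substitution with re.sub(): mask every run of digits
masked = re.sub(r"\d+", "#", text)           # 'Order ## shipped on #-#-# to Alice.'

# Pattern matching with re.search(): find the first date-like token
match = re.search(r"\d{4}-\d{2}-\d{2}", text)
if match:
    print(match.group())                     # '2024-03-15'

# Anchors and character classes
print(bool(re.search(r"^Order", text)))      # True: string starts with 'Order'
print(re.findall(r"[A-Z]\w+", text))         # ['Order', 'Alice']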


2️⃣ 02_regex.py - Advanced Regex Patterns

Purpose: Deep dive into complex regex patterns for real-world text processing

Highlights:

  • Email and URL extraction
  • Phone number validation
  • HTML tag removal
  • Advanced pattern matching
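
For instance, a rough extractor along these lines (the patterns are simplified illustrations, not RFC-complete validators):

import re

raw = "Contact support@example.com or visit https://example.com/docs for help."

emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", raw)
urls = re.findall(r"https?://\S+", raw)
print(emails)  # ['support@example.com']
print(urls)    # ['https://example.com/docs']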

3️⃣ 03_April_NLTK.py - NLTK Fundamentals

Purpose: Comprehensive introduction to Natural Language Toolkit (NLTK)

Techniques Covered:

  • Tokenization: Sentence and word-level splitting
  • Stemming: Reducing words to root form (PorterStemmer)
  • Lemmatization: Dictionary-based word normalization
  • Stopword Removal: Filtering common words
  • POS Tagging: Part-of-speech identification
  • Named Entity Recognition (NER): Extracting entities (people, places, organizations)

Real-World Application: Building preprocessing pipelines for text classification
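
As a taste of the later stages, POS tagging and NER take only a few lines (assumes the NLTK data packages from the installation step):

import nltk

sentence = "Barack Obama visited Paris in 2015."
tokens = nltk.word_tokenize(sentence)
tags = nltk.pos_tag(tokens)    # [('Barack', 'NNP'), ('Obama', 'NNP'), ...]
tree = nltk.ne_chunk(tags)     # wraps named entities in PERSON/GPE subtrees
print(tree)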


📊 Feature Engineering

4️⃣ 04_BOW.py - Bag of Words (BoW)

Purpose: Convert text into numerical features using word frequency

Algorithm:

  1. Tokenize text into words
  2. Build vocabulary of unique words
  3. Create feature vectors based on word presence/frequency
  4. Generate sparse matrix representation

Mathematical Foundation:

Vector(document) = [count(word1), count(word2), ..., count(wordN)]

Limitations: Ignores word order and semantics
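
A compact sketch of this vectorization with scikit-learn's CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the cat sat on the mat"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)          # sparse document-term matrix
print(vectorizer.get_feature_names_out())   # ['cat' 'mat' 'on' 'sat' 'the']
print(X.toarray())                          # [[1 0 0 1 1] [1 1 1 1 2]]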


5️⃣ 05_TF_IDF.py - Term Frequency-Inverse Document Frequency

Purpose: Advanced text vectorization weighing word importance

Algorithm:

  • TF (Term Frequency): TF(t,d) = count(t in d) / total_words(d)
  • IDF (Inverse Document Frequency): IDF(t) = log(N / df(t))
  • TF-IDF: TF-IDF(t,d) = TF(t,d) × IDF(t)

Advantages over BoW:

  • Down-weights common words (e.g., "the", "is")
  • Up-weights rare, distinctive terms
  • Better for document similarity and classification

Implementation: built from scratch, then compared against scikit-learn's TfidfVectorizer
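
A from-scratch sketch of the formulas above (note that scikit-learn's TfidfVectorizer defaults to a smoothed variant, log((1+N)/(1+df)) + 1, so its numbers will differ):

import math
from collections import Counter

docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]
N = len(docs)
df = Counter(word for doc in docs for word in set(doc))   # document frequency

def tf_idf(term, doc):
    tf = doc.count(term) / len(doc)   # TF(t,d)
    idf = math.log(N / df[term])      # IDF(t)
    return tf * idf

print(tf_idf("cat", docs[0]))  # ~0.135: 'cat' appears in only 2 of 3 docs
print(tf_idf("the", docs[0]))  # 0.0: 'the' is in every doc, log(3/3) = 0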


6️⃣ 06_NGram.py - N-Gram Language Models

Purpose: Statistical language modeling and text generation

Concept:

  • Unigram (N=1): Single words
  • Bigram (N=2): Two-word sequences
  • Trigram (N=3): Three-word sequences

Algorithm:

  1. Build dictionary of N-word sequences → next word
  2. Use Markov assumption (next word depends only on previous N-1 words)
  3. Generate text by probabilistically selecting next words

Applications: Autocomplete, text generation, speech recognition

Limitations: Data sparsity, no long-range dependencies
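
A bigram (N=2) generator following this recipe (toy corpus; real models need far more data):

import random
from collections import defaultdict

corpus = "the cat sat on the mat the cat ran to the door".split()

# Step 1: map each word to the words observed immediately after it
model = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    model[prev].append(nxt)

# Steps 2-3: generate by repeatedly sampling a successor (Markov assumption)
word, output = "the", ["the"]
for _ in range(8):
    successors = model.get(word)
    if not successors:              # dead end: word never had a successor
        break
    word = random.choice(successors)
    output.append(word)
print(" ".join(output))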


🤖 Machine Learning Applications

7️⃣ 07_textClassification_pickle.py - Text Classification Pipeline

Purpose: End-to-end text classification with model persistence

Pipeline:

  1. Text preprocessing (cleaning, tokenization)
  2. Feature extraction (TF-IDF)
  3. Model training (Naive Bayes, SVM, Random Forest)
  4. Model serialization with pickle
  5. Prediction on new data

Key Techniques:

  • Train-test split
  • Cross-validation
  • Hyperparameter tuning
  • Model evaluation (accuracy, precision, recall, F1-score)
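
The persistence step (step 4 above) boils down to something like this sketch; the file name and toy data are illustrative:

import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

X_text = ["win a free prize", "meeting moved to 3pm"]
y = ["spam", "ham"]

vectorizer = TfidfVectorizer()
model = MultinomialNB().fit(vectorizer.fit_transform(X_text), y)

# Serialize the fitted vectorizer together with the model
with open("model.pkl", "wb") as f:
    pickle.dump((vectorizer, model), f)

# Later: reload and predict on unseen text
with open("model.pkl", "rb") as f:
    vectorizer, model = pickle.load(f)
print(model.predict(vectorizer.transform(["free prize inside"])))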

8️⃣ 08_Pipeline_LoadFiles.py - Scikit-Learn Pipelines

Purpose: Building reusable ML pipelines for text processing

Benefits:

  • Encapsulates preprocessing + model training
  • Prevents data leakage
  • Simplifies cross-validation
  • Easy deployment

Example Pipeline:

Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('classifier', MultinomialNB())
])

9️⃣ 09_summarization.py - Extractive Text Summarization

Purpose: Automatic summarization using frequency-based ranking

Algorithm:

  1. Web Scraping: Fetch article using BeautifulSoup
  2. Preprocessing: Remove citations, clean text
  3. Word Frequency: Calculate normalized word counts
  4. Sentence Scoring: Rank sentences by sum of word frequencies
  5. Selection: Extract top N sentences

Approach: Extractive (selects existing sentences, no generation)

Use Cases: News summarization, document digests, content curation
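
The scoring core of such a summarizer might look like the following sketch, with a plain string standing in for a scraped article:

import heapq
import nltk
from nltk.corpus import stopwords

text = ("NLP is a field of AI. NLP powers search engines. "
        "Cats are popular pets. Search engines rely on NLP every day.")

stop = set(stopwords.words("english"))
words = [w.lower() for w in nltk.word_tokenize(text)
         if w.isalpha() and w.lower() not in stop]

# Step 3: normalized word frequencies
freq = {w: words.count(w) for w in set(words)}
max_f = max(freq.values())
freq = {w: f / max_f for w, f in freq.items()}

# Step 4: score each sentence by the sum of its word frequencies
scores = {sent: sum(freq.get(w.lower(), 0) for w in nltk.word_tokenize(sent))
          for sent in nltk.sent_tokenize(text)}

# Step 5: keep the top 2 sentences
print(" ".join(heapq.nlargest(2, scores, key=scores.get)))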


🔟 10_smsClassification.py - Imbalanced Data Handling

Purpose: SMS spam detection with class imbalance solutions

Challenge: Imbalanced datasets (95% ham, 5% spam) → biased models

Solutions Implemented:

  • ADASYN: Adaptive Synthetic Sampling (focuses on hard-to-learn examples)
  • SMOTE: Synthetic Minority Over-sampling Technique

Pipeline:

Text → Custom Preprocessing → BoW → TF-IDF → ADASYN/SMOTE → Random Forest

Evaluation Metrics:

  • Confusion Matrix
  • Precision, Recall, F1-Score
  • ROC-AUC
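
Resampling with imbalanced-learn is a single fit_resample call once features are extracted. A sketch with synthetic numeric features standing in for vectorized SMS text:

from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic stand-in: roughly 95% ham (0), 5% spam (1)
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)
print(Counter(y))                          # e.g. Counter({0: 947, 1: 53})

X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print(Counter(y_res))                      # balanced: both classes equal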

1️⃣1️⃣ 11_Stock_Classification.py - Financial Text Analysis

Purpose: Classify stock market news sentiment

Domain: Financial NLP
Techniques: Sentiment analysis, domain-specific preprocessing
Applications: Algorithmic trading, market sentiment analysis


🧠 Deep Learning & Embeddings

1️⃣2️⃣ 12_preProcessing_NLP_DL_TF.py - Deep Learning Preprocessing

Purpose: Prepare text data for neural networks using TensorFlow/Keras

Key Techniques:

  • Tokenization: Convert text to integer sequences
  • Padding: Ensure uniform sequence length
  • Vocabulary Management: Handle out-of-vocabulary (OOV) words
  • Embedding Preparation: Set up for embedding layers

Configuration:

  • max_length: Maximum sequence length
  • vocab_size: Vocabulary size
  • oov_token: Token for unknown words
  • padding_type: 'pre' or 'post'
  • truncating_type: 'pre' or 'post'
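
These settings plug into Keras roughly as follows (tensorflow.keras.preprocessing ships with TensorFlow 2.13; the sentences are illustrative):

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = ["I love NLP", "deep learning is fun", "transformers changed NLP"]

tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)

sequences = tokenizer.texts_to_sequences(sentences)   # lists of word indices
padded = pad_sequences(sequences, maxlen=5, padding="post", truncating="post")
print(padded)   # uniform 3x5 integer matrix, zero-padded on the right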

1️⃣3️⃣ 13_gensim_word_Embedding.py - Word2Vec Embeddings

Purpose: Learn dense vector representations of words

Concept: Words with similar meanings have similar vector representations

Algorithm: Word2Vec (Skip-Gram model)

  • Predicts context words given a target word
  • Learns embeddings through neural network training

Mathematical Magic:

king - man + woman ≈ queen

Implementation:

  1. Web scraping (Wikipedia article)
  2. Text preprocessing
  3. Sentence tokenization
  4. Stopword removal
  5. Word2Vec training with Gensim
  6. Similarity queries

Parameters:

  • vector_size: Embedding dimensionality (e.g., 100)
  • window: Context window size
  • min_count: Minimum word frequency
  • workers: Parallel processing threads

Applications: Semantic search, document similarity, transfer learning
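
Training and querying with Gensim, using the parameters above (toy corpus for illustration; meaningful similarities need far more text):

from gensim.models import Word2Vec

# Pre-tokenized sentences (in the script these come from a scraped article)
sentences = [
    ["king", "rules", "kingdom"],
    ["queen", "rules", "kingdom"],
    ["dog", "chases", "cat"],
]

model = Word2Vec(sentences, vector_size=100, window=5, min_count=1,
                 workers=4, sg=1)               # sg=1 selects Skip-Gram

print(model.wv["king"].shape)                   # (100,) dense vector
print(model.wv.most_similar("king", topn=2))    # nearest neighbors by cosine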


1️⃣4️⃣ 14_Sarcastic_json_IMP.py - Sarcasm Detection with Deep Learning

Purpose: Binary classification using neural networks

Architecture:

Input → Embedding → GlobalAveragePooling1D → Dense(24, ReLU) → Dense(1, Sigmoid)

Dataset: JSON format with sarcastic/non-sarcastic headlines

Training:

  • Loss: Binary Crossentropy
  • Optimizer: Adam
  • Metrics: Accuracy
  • Epochs: 30

Visualization: Training/validation accuracy and loss curves

Key Insight: GlobalAveragePooling1D averages the embeddings across the sequence length, producing a fixed-size representation regardless of input length
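
The architecture above translates to Keras roughly as follows (a sketch; vocab_size, embedding_dim, and max_length are illustrative values):

import tensorflow as tf

vocab_size, embedding_dim, max_length = 10000, 16, 100

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.GlobalAveragePooling1D(),   # average over the sequence axis
    tf.keras.layers.Dense(24, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
model.summary()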


💡 Usage Examples

Example 1: Text Preprocessing Pipeline

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

# Your text
text = "Hello! This is a sample text for NLP preprocessing."

# Tokenize
sentences = nltk.sent_tokenize(text)
words = nltk.word_tokenize(text)

# Lemmatize
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(word) for word in words]

# Remove stopwords
stop_words = set(stopwords.words('english'))
filtered = [w for w in lemmatized if w.lower() not in stop_words]

print(filtered)

Example 2: TF-IDF Vectorization

from sklearn.feature_extraction.text import TfidfVectorizer

# Documents
docs = [
    "Machine learning is amazing",
    "Deep learning is a subset of machine learning",
    "Natural language processing uses machine learning"
]

# Create TF-IDF vectors
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)

# View feature names
print(vectorizer.get_feature_names_out())
print(tfidf_matrix.toarray())

Example 3: Text Classification

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Build pipeline
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB())
])

# Toy training data (illustrative; substitute your own corpus and labels)
X_train = ["win a free prize now", "meeting at 10am tomorrow",
           "claim your cash reward", "please review the attached report"]
y_train = ["spam", "ham", "spam", "ham"]

# Train
pipeline.fit(X_train, y_train)

# Predict on unseen text
X_test = ["free cash reward", "see you at the meeting"]
predictions = pipeline.predict(X_test)
print(predictions)  # likely ['spam' 'ham']

🏗️ Technical Architecture

Data Flow Diagram

Raw Text
    ↓
[Preprocessing]
    ├── Regex Cleaning (01, 02)
    ├── Tokenization (03)
    ├── Normalization (03)
    └── Stopword Removal (03)
    ↓
[Feature Engineering]
    ├── BoW (04)
    ├── TF-IDF (05)
    ├── N-Grams (06)
    └── Word Embeddings (13)
    ↓
[Machine Learning]
    ├── Classical ML (07, 08, 10, 11)
    └── Deep Learning (12, 14)
    ↓
[Applications]
    ├── Classification
    ├── Summarization (09)
    └── Text Generation (06)

Technology Stack

Layer           | Technologies
----------------|--------------------------------
Core Language   | Python 3.8+
NLP Libraries   | NLTK, Gensim, spaCy
ML Frameworks   | Scikit-Learn, Imbalanced-Learn
Deep Learning   | TensorFlow 2.x, Keras
Data Processing | NumPy, Pandas
Web Scraping    | BeautifulSoup4, urllib
Visualization   | Matplotlib

📁 Project Structure

NLP/
│
├── 01_reModule.py                    # Regular expressions fundamentals
├── 02_regex.py                       # Advanced regex patterns
├── 03_April_NLTK.py                  # NLTK comprehensive tutorial
├── 04_BOW.py                         # Bag of Words implementation
├── 05_TF_IDF.py                      # TF-IDF from scratch
├── 06_NGram.py                       # N-Gram language models
├── 07_textClassification_pickle.py   # Text classification + model persistence
├── 08_Pipeline_LoadFiles.py          # Scikit-Learn pipelines
├── 09_summarization.py               # Extractive text summarization
├── 10_smsClassification.py           # SMS spam detection (imbalanced data)
├── 11_Stock_Classification.py        # Financial sentiment analysis
├── 12_preProcessing_NLP_DL_TF.py     # Deep learning preprocessing
├── 13_gensim_word_Embedding.py       # Word2Vec embeddings
├── 14_Sarcastic_json_IMP.py          # Sarcasm detection (deep learning)
├── README.md                         # This file
└── requirements.txt                  # Python dependencies

🎓 Learning Path

Beginner Track (Weeks 1-2)

  1. Start with 01_reModule.py - Learn regex basics
  2. Move to 03_April_NLTK.py - Master NLTK fundamentals
  3. Understand 04_BOW.py - Grasp text vectorization

Intermediate Track (Weeks 3-4)

  1. Study 05_TF_IDF.py - Advanced feature engineering
  2. Explore 06_NGram.py - Language modeling
  3. Practice 07_textClassification_pickle.py - Build classifiers

Advanced Track (Weeks 5-6)

  1. Master 08_Pipeline_LoadFiles.py - Production pipelines
  2. Implement 09_summarization.py - Real-world application
  3. Tackle 10_smsClassification.py - Handle imbalanced data

Expert Track (Weeks 7-8)

  1. Deep dive into 13_gensim_word_Embedding.py - Word embeddings
  2. Build 14_Sarcastic_json_IMP.py - Neural networks
  3. Integrate 12_preProcessing_NLP_DL_TF.py - DL preprocessing

🤝 Contributing

Contributions are welcome! Here's how you can help:

Ways to Contribute

  • 🐛 Report bugs or issues
  • 💡 Suggest new features or scripts
  • 📝 Improve documentation
  • 🔧 Submit pull requests

Contribution Guidelines

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

Code Standards

  • Follow PEP 8 style guide
  • Add comprehensive comments
  • Include docstrings for functions
  • Write unit tests for new features

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

MIT License Summary

  • ✅ Commercial use
  • ✅ Modification
  • ✅ Distribution
  • ✅ Private use
  • ❌ Liability
  • ❌ Warranty

📧 Contact

Your Name


🌟 Acknowledgments

  • NLTK Team - For the comprehensive NLP toolkit
  • Scikit-Learn Contributors - For robust ML algorithms
  • TensorFlow Team - For deep learning framework
  • Gensim Developers - For word embedding implementations
  • Stack Overflow Community - For invaluable debugging help



⭐ If you found this helpful, please star the repository! ⭐

Made with ❤️ for the NLP Community
