An automated, real-time Machine Learning Operations (MLOps) pipeline designed to extract, analyze, and visualize sentiment from YouTube comments. This system bridges the gap between raw unstructured text and actionable insights by providing a browser-based dashboard for content creators and brands.

TubePulse

TubePulse is a production-grade Machine Learning system that performs real-time sentiment analysis on YouTube comment streams, delivered as a Chrome extension.

Unlike simple analysis tools, TubePulse demonstrates a complete MLOps lifecycle, integrating automated data pipelines, rigorous experiment tracking, and containerized deployment. The system turns raw, unstructured text into actionable insights using a highly optimized LightGBM classifier, served via a scalable Flask API and consumed through a bespoke Chrome extension interface.

🏗️ MLOps Architecture

TubePulse is built on a robust MLOps foundation to ensure reproducibility, scalability, and observability:

  • Data Version Control (DVC): Manages datasets and preprocessing pipelines, ensuring that every model version can be traced back to the exact data snapshot used for training.
  • Experiment Tracking (MLflow): Logs metrics, hyperparameters, and artifacts across 8+ experimental iterations, facilitating data-driven model selection.
  • Model Registry: Centralized repository for versioning trained models, enabling seamless rollbacks and stage transitions (Staging → Production).
  • Containerized Serving: The inference engine is Dockerized for consistent deployment across environments.
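As an illustration, a DVC pipeline over this layout might be declared along these lines. This is a sketch only: the stage names, dependencies, and outputs are inferred from the scripts under src/, not copied from the repository's actual dvc.yaml.

```yaml
# Hypothetical dvc.yaml sketch; stages inferred from the project layout
stages:
  data_ingestion:
    cmd: python src/data/data_ingestion.py
    deps:
      - src/data/data_ingestion.py
    outs:
      - data/raw
  data_preprocessing:
    cmd: python src/data/data_preprocessing.py
    deps:
      - data/raw
      - src/data/data_preprocessing.py
    outs:
      - data/interim
  model_building:
    cmd: python src/model/model_building.py
    deps:
      - data/interim
      - src/model/model_building.py
```

Declaring stages this way lets `dvc repro` re-run only the steps whose inputs changed, which is what ties each model version back to its exact data snapshot.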

🌟 Key Technical Features

Machine Learning Core

  • High-Performance Classification: Utilizes a LightGBM Gradient Boosting framework optimized for multi-class classification (Positive, Neutral, Negative).
  • Advanced NLP Pipeline:
    • Vectorization: Custom TF-IDF implementation with 1-3 n-grams for capturing context.
    • Preprocessing: NLTK-based pipeline including Lemmatization, negation handling, and noise reduction.
    • Imbalance Handling: Class weighting strategies to handle skewed sentiment distributions.
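A minimal sketch of the negation-aware cleaning step, using only the stdlib. The real pipeline uses NLTK's WordNetLemmatizer and full English stopword list (omitted here); the tiny stopword set and regexes below are illustrative stand-ins.

```python
import re

# Illustrative subset; the real pipeline uses NLTK's English stopword list.
STOPWORDS = {"the", "a", "an", "is", "are", "this", "of"}
# Negation/contrast words the pipeline deliberately keeps (see Feature Engineering).
KEEP = {"not", "but", "however", "no", "yet"}

def clean_comment(text: str) -> str:
    """Lowercase, strip URLs/noise, drop stopwords while preserving negations."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)         # remove links
    text = re.sub(r"[^a-z0-9\s!?.,]", " ", text)      # keep sentiment punctuation
    tokens = [t for t in text.split() if t not in STOPWORDS or t in KEEP]
    return " ".join(tokens)

print(clean_comment("This is NOT a good video! https://youtu.be/x"))
# -> not good video!
```

Keeping "not", "no", "but" and similar tokens matters because dropping them flips the apparent polarity of phrases like "not good".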

Engineering & Operations

  • Low-Latency Inference: Real-time prediction endpoint optimized for high-throughput comment processing.
  • Scalable Backend: Flask-based REST API with CORS support, ready for microservices orchestration.
  • Visual Analytics: Frontend dashboard rendering temporal sentiment trends and engagement distributions.

📋 Project Structure

Tubepulse/
├── backend/
│   └── main.py                 # Flask API server with sentiment prediction endpoints
├── src/
│   ├── data/
│   │   ├── data_ingestion.py   # Data loading and splitting from Reddit dataset
│   │   └── data_preprocessing.py # Text preprocessing pipeline
│   ├── model/
│   │   ├── model_building.py   # LightGBM model training with hyperparameters
│   │   ├── model_evaluation.py # Model evaluation metrics
│   │   └── register_model.py   # MLflow model registration
├── Frontend/
│   ├── popup.html              # Chrome extension UI
│   ├── popup.js                # Extension logic and API communication
│   └── manifest.json           # Chrome extension configuration
├── notebooks/
│   ├── 1_Preprocessing_&_EDA.ipynb
│   ├── 2_experiment_1_baseline_model.ipynb
│   ├── 3_experiment_2_bow_tfidf.ipynb
│   ├── 4_experiment_3_tfidf_(1,3)_max_features.ipynb
│   ├── 5_experiment_4_handling_imbalanced_data.ipynb
│   ├── 6_experiment_5_xgboost_with_hpt.ipynb
│   ├── 7_experiment_6_lightgbm_detailed_hpt.ipynb
│   └── 8_stacking.ipynb
├── data/
│   ├── raw/
│   │   ├── train.csv           # Training dataset from Reddit
│   │   └── test.csv            # Test dataset
│   └── interim/
│       ├── train_processed.csv # Preprocessed training data
│       └── test_processed.csv  # Preprocessed test data
├── params.yaml                 # Model hyperparameters configuration
├── dvc.yaml                    # DVC pipeline configuration
├── requirements.txt            # Python dependencies
├── Dockerfile                  # Docker containerization
├── setup.py                    # Package setup configuration
└── README.md                   # Project documentation

🛠️ Tech Stack

Backend

  • Framework: Flask 3.0.3
  • Model: LightGBM 4.5.0
  • Vectorization: TF-IDF (Scikit-learn)
  • Text Processing: NLTK 3.9.1
  • ML Tracking: MLflow 2.17.0
  • Data Processing: Pandas 2.2.3, NumPy 2.1.2

Frontend

  • Chrome Extension API
  • YouTube Data API v3
  • Chart.js: For data visualization
  • Vanilla JavaScript

DevOps & Deployment

  • Containerization: Docker
  • Version Control: Git, DVC (Data Version Control)
  • Cloud Storage: AWS S3 (boto3)

🚀 Installation

Prerequisites

  • Python 3.11+
  • Chrome browser
  • Git

Backend Setup

  1. Clone the repository

     git clone https://github.com/Jaswanth-006/Tubepulse.git
     cd Tubepulse

  2. Create a virtual environment

     python -m venv venv
     source venv/bin/activate  # On Windows: venv\Scripts\activate

  3. Install dependencies

     pip install -r requirements.txt
     python -m nltk.downloader stopwords wordnet

  4. Configure environment variables

     export MLFLOW_TRACKING_URI="http://your-mlflow-server:5000"
     export MODEL_VERSION="1"

  5. Start the Flask server

     python backend/main.py

The API will be available at http://localhost:5000

Chrome Extension Setup

  1. Navigate to chrome://extensions/ in your Chrome browser
  2. Enable "Developer mode" (top-right corner)
  3. Click "Load unpacked"
  4. Select the Frontend directory
  5. The TubePulse extension will appear in your extension list

📊 Model Details

Architecture

  • Algorithm: LightGBM Classifier
  • Task: Multi-class sentiment classification
  • Classes:
    • 1: Positive sentiment
    • 0: Neutral sentiment
    • -1: Negative sentiment
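The sample API response later in this README suggests the probability vector is ordered [negative, neutral, positive]; under that assumption (it is inferred, not documented), decoding a score vector into one of these labels is a one-liner:

```python
# Assumed class order (-1, 0, 1), inferred from the sample API response.
LABELS = (-1, 0, 1)

def decode(scores):
    """Map a 3-way probability vector to a sentiment label."""
    return LABELS[max(range(len(scores)), key=scores.__getitem__)]

print(decode([0.1, 0.2, 0.7]))  # -> 1  (positive)
print(decode([0.6, 0.3, 0.1]))  # -> -1 (negative)
```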

Hyperparameters

learning_rate: 0.09
max_depth: 20
n_estimators: 367
num_class: 3
metric: multi_logloss
class_weight: balanced
reg_alpha: 0.1
reg_lambda: 0.1

Feature Engineering

  • Vectorization: TF-IDF with n-grams (1-3)
  • Max Features: 1000
  • Text Preprocessing:
    • Lowercase conversion
    • Special character removal (preserving punctuation for sentiment)
    • Lemmatization using WordNetLemmatizer
    • Stopword removal (excluding: not, but, however, no, yet)
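Assuming the scikit-learn vectorizer named in the Tech Stack, the configuration described above corresponds roughly to:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# (1,3) n-grams capped at 1000 features, per the settings above
vectorizer = TfidfVectorizer(ngram_range=(1, 3), max_features=1000)

docs = ["amazing video not clickbait", "not impressed, bad audio"]
X = vectorizer.fit_transform(docs)
print(X.shape)  # (2, n_features) with n_features <= 1000
```

The same fitted vectorizer must be reused at inference time (via `transform`, never `fit_transform`), otherwise the feature columns will not line up with what the model was trained on.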

Dataset

  • Source: Reddit sentiment dataset
  • Train/Test Split: 80-20
  • Preprocessing: Handles missing values, duplicates, and empty strings

🔄 Data Pipeline

Ingestion

Raw Reddit Data → Load → Preprocess → Train/Test Split → Save
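The 80-20 split in the ingestion stage can be sketched with the stdlib (the actual code works on pandas DataFrames; this just shows the idea, with a fixed seed for reproducibility):

```python
import random

def train_test_split(rows, test_ratio=0.2, seed=42):
    """Shuffle deterministically, then carve off the last test_ratio fraction."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_ratio))
    return rows[:cut], rows[cut:]

train, test = train_test_split(range(100))
print(len(train), len(test))  # -> 80 20
```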

Training

Processed Data → TF-IDF Vectorization → LightGBM Training → Model Registry (MLflow)

Inference

YouTube Comments → Preprocess → TF-IDF Transform → LightGBM Prediction → Sentiment Labels + Scores

📡 API Endpoints

POST /predict_with_timestamps

Analyze sentiment for YouTube comments with timestamps.

Request:

{
  "comments": [
    {
      "text": "Amazing video!",
      "timestamp": "2:30"
    },
    {
      "text": "Not impressed",
      "timestamp": "5:15"
    }
  ]
}

Response:

{
  "predictions": [
    {
      "sentiment": 1,
      "timestamp": "2:30",
      "score": [0.1, 0.2, 0.7]
    },
    {
      "sentiment": -1,
      "timestamp": "5:15",
      "score": [0.6, 0.3, 0.1]
    }
  ]
}
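A minimal client for this endpoint using only the stdlib. The URL assumes the local Flask server from the Installation section is running; the actual request is left commented out so the snippet stands alone.

```python
import json
import urllib.request

payload = {"comments": [{"text": "Amazing video!", "timestamp": "2:30"}]}

req = urllib.request.Request(
    "http://localhost:5000/predict_with_timestamps",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# With the server running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["predictions"])
```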

GET /

Health check endpoint.

🧪 Experiments & Model Evolution

The project includes 8 progressive experiments documented in Jupyter notebooks:

  1. Preprocessing & EDA: Data exploration and analysis
  2. Baseline Model: Initial benchmark with basic features
  3. Bag of Words + TF-IDF: Feature engineering comparison
  4. TF-IDF (1,3) with Max Features: N-gram optimization
  5. Handling Imbalanced Data: Class weight balancing strategies
  6. XGBoost with HPT: Hyperparameter tuning with XGBoost
  7. LightGBM with Detailed HPT: Advanced hyperparameter optimization
  8. Stacking: Ensemble methods for improved predictions

📦 Docker Deployment

Build and run with Docker:

# Build the Docker image
docker build -t tubepulse:latest .

# Run the container
docker run -p 5000:5000 \
  -e MLFLOW_TRACKING_URI="http://your-mlflow:5000" \
  -e MODEL_VERSION="1" \
  tubepulse:latest

🔐 Environment Variables

Required for production deployment:

MLFLOW_TRACKING_URI    # MLflow server URI for model loading
MODEL_VERSION          # Model version to load (default: "1")
YOUTUBE_API_KEY        # YouTube Data API key (in Frontend config)

📈 Performance Metrics

The model evaluation includes:

  • Accuracy, Precision, Recall, F1-Score
  • Confusion matrices per class
  • ROC-AUC curves
  • Class-wise performance analysis

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

🤝 Contributing

Contributions are welcome! Please follow these steps:

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

🐛 Known Issues & Future Work

  • Multi-language support beyond English
  • Real-time comment streaming optimization
  • Mobile app version
  • Advanced visualization features
  • User feedback loop for model improvement
