An automated, real-time Machine Learning Operations (MLOps) pipeline designed to extract, analyze, and visualize sentiment from YouTube comments. This system bridges the gap between raw unstructured text and actionable insights by providing a browser-based dashboard for content creators and brands.

TubePulse

TubePulse is a production-grade Machine Learning system that performs real-time sentiment analysis on YouTube comment streams, delivered as a Chrome extension.

Unlike simple analysis tools, TubePulse demonstrates a complete MLOps lifecycle, integrating automated data pipelines, rigorous experiment tracking, and containerized deployment. The system turns raw, unstructured text into actionable insights using a highly optimized LightGBM classifier, served via a scalable Flask API and consumed through a bespoke Chrome extension interface.

🏗️ MLOps Architecture

TubePulse is built on a robust MLOps foundation to ensure reproducibility, scalability, and observability:

  • Data Version Control (DVC): Manages datasets and preprocessing pipelines, ensuring that every model version can be traced back to the exact data snapshot used for training.
  • Experiment Tracking (MLflow): Logs metrics, hyperparameters, and artifacts across 8+ experimental iterations, facilitating data-driven model selection.
  • Model Registry: Centralized repository for versioning trained models, enabling seamless rollbacks and stage transitions (Staging → Production).
  • Containerized Serving: The inference engine is Dockerized for consistent deployment across environments.
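As an illustration, a DVC pipeline over this layout might be declared along these lines. This is a sketch only: the stage names, dependencies, and outputs are inferred from the scripts under src/, not copied from the repository's actual dvc.yaml.

```yaml
# Hypothetical dvc.yaml sketch; stages inferred from the project layout
stages:
  data_ingestion:
    cmd: python src/data/data_ingestion.py
    deps:
      - src/data/data_ingestion.py
    outs:
      - data/raw
  data_preprocessing:
    cmd: python src/data/data_preprocessing.py
    deps:
      - data/raw
      - src/data/data_preprocessing.py
    outs:
      - data/interim
  model_building:
    cmd: python src/model/model_building.py
    deps:
      - data/interim
      - src/model/model_building.py
```

Declaring stages this way lets `dvc repro` re-run only the steps whose inputs changed, which is what ties each model version back to its exact data snapshot.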

🌟 Key Technical Features

Machine Learning Core

  • High-Performance Classification: Utilizes a LightGBM Gradient Boosting framework optimized for multi-class classification (Positive, Neutral, Negative).
  • Advanced NLP Pipeline:
    • Vectorization: Custom TF-IDF implementation with 1-3 n-grams for capturing context.
    • Preprocessing: NLTK-based pipeline including Lemmatization, negation handling, and noise reduction.
    • Imbalance Handling: Class weighting strategies to handle skewed sentiment distributions.
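A minimal sketch of the negation-aware cleaning step, using only the stdlib. The real pipeline uses NLTK's WordNetLemmatizer and full English stopword list (omitted here); the tiny stopword set and regexes below are illustrative stand-ins.

```python
import re

# Illustrative subset; the real pipeline uses NLTK's English stopword list.
STOPWORDS = {"the", "a", "an", "is", "are", "this", "of"}
# Negation/contrast words the pipeline deliberately keeps (see Feature Engineering).
KEEP = {"not", "but", "however", "no", "yet"}

def clean_comment(text: str) -> str:
    """Lowercase, strip URLs/noise, drop stopwords while preserving negations."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)         # remove links
    text = re.sub(r"[^a-z0-9\s!?.,]", " ", text)      # keep sentiment punctuation
    tokens = [t for t in text.split() if t not in STOPWORDS or t in KEEP]
    return " ".join(tokens)

print(clean_comment("This is NOT a good video! https://youtu.be/x"))
# -> not good video!
```

Keeping "not", "no", "but" and similar tokens matters because dropping them flips the apparent polarity of phrases like "not good".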

Engineering & Operations

  • Low-Latency Inference: Real-time prediction endpoint optimized for high-throughput comment processing.
  • Scalable Backend: Flask-based REST API with CORS support, ready for microservices orchestration.
  • Visual Analytics: Frontend dashboard rendering temporal sentiment trends and engagement distributions.

📋 Project Structure

Tubepulse/
├── backend/
│   └── main.py                 # Flask API server with sentiment prediction endpoints
├── src/
│   ├── data/
│   │   ├── data_ingestion.py   # Data loading and splitting from Reddit dataset
│   │   └── data_preprocessing.py # Text preprocessing pipeline
│   ├── model/
│   │   ├── model_building.py   # LightGBM model training with hyperparameters
│   │   ├── model_evaluation.py # Model evaluation metrics
│   │   └── register_model.py   # MLflow model registration
├── Frontend/
│   ├── popup.html              # Chrome extension UI
│   ├── popup.js                # Extension logic and API communication
│   └── manifest.json           # Chrome extension configuration
├── notebooks/
│   ├── 1_Preprocessing_&_EDA.ipynb
│   ├── 2_experiment_1_baseline_model.ipynb
│   ├── 3_experiment_2_bow_tfidf.ipynb
│   ├── 4_experiment_3_tfidf_(1,3)_max_features.ipynb
│   ├── 5_experiment_4_handling_imbalanced_data.ipynb
│   ├── 6_experiment_5_xgboost_with_hpt.ipynb
│   ├── 7_experiment_6_lightgbm_detailed_hpt.ipynb
│   └── 8_stacking.ipynb
├── data/
│   ├── raw/
│   │   ├── train.csv           # Training dataset from Reddit
│   │   └── test.csv            # Test dataset
│   └── interim/
│       ├── train_processed.csv # Preprocessed training data
│       └── test_processed.csv  # Preprocessed test data
├── params.yaml                 # Model hyperparameters configuration
├── dvc.yaml                    # DVC pipeline configuration
├── requirements.txt            # Python dependencies
├── Dockerfile                  # Docker containerization
├── setup.py                    # Package setup configuration
└── README.md                   # Project documentation

🛠️ Tech Stack

Backend

  • Framework: Flask 3.0.3
  • Model: LightGBM 4.5.0
  • Vectorization: TF-IDF (Scikit-learn)
  • Text Processing: NLTK 3.9.1
  • ML Tracking: MLflow 2.17.0
  • Data Processing: Pandas 2.2.3, NumPy 2.1.2

Frontend

  • Chrome Extension API
  • YouTube Data API v3
  • Chart.js: For data visualization
  • Vanilla JavaScript

DevOps & Deployment

  • Containerization: Docker
  • Version Control: Git, DVC (Data Version Control)
  • Cloud Storage: AWS S3 (boto3)

🚀 Installation

Prerequisites

  • Python 3.11+
  • Chrome browser
  • Git

Backend Setup

  1. Clone the repository

     git clone https://github.com/Jaswanth-006/Tubepulse.git
     cd Tubepulse

  2. Create a virtual environment

     python -m venv venv
     source venv/bin/activate  # On Windows: venv\Scripts\activate

  3. Install dependencies

     pip install -r requirements.txt
     python -m nltk.downloader stopwords wordnet

  4. Configure environment variables

     export MLFLOW_TRACKING_URI="http://your-mlflow-server:5000"
     export MODEL_VERSION="1"

  5. Start the Flask server

     python backend/main.py

The API will be available at http://localhost:5000

Chrome Extension Setup

  1. Navigate to chrome://extensions/ in your Chrome browser
  2. Enable "Developer mode" (top-right corner)
  3. Click "Load unpacked"
  4. Select the Frontend directory
  5. The TubePulse extension will appear in your extension list

📊 Model Details

Architecture

  • Algorithm: LightGBM Classifier
  • Task: Multi-class sentiment classification
  • Classes:
    • 1: Positive sentiment
    • 0: Neutral sentiment
    • -1: Negative sentiment
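The sample API response later in this README suggests the probability vector is ordered [negative, neutral, positive]; under that assumption (it is inferred, not documented), decoding a score vector into one of these labels is a one-liner:

```python
# Assumed class order (-1, 0, 1), inferred from the sample API response.
LABELS = (-1, 0, 1)

def decode(scores):
    """Map a 3-way probability vector to a sentiment label."""
    return LABELS[max(range(len(scores)), key=scores.__getitem__)]

print(decode([0.1, 0.2, 0.7]))  # -> 1  (positive)
print(decode([0.6, 0.3, 0.1]))  # -> -1 (negative)
```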

Hyperparameters

learning_rate: 0.09
max_depth: 20
n_estimators: 367
num_class: 3
metric: multi_logloss
class_weight: balanced
reg_alpha: 0.1
reg_lambda: 0.1

Feature Engineering

  • Vectorization: TF-IDF with n-grams (1-3)
  • Max Features: 1000
  • Text Preprocessing:
    • Lowercase conversion
    • Special character removal (preserving punctuation for sentiment)
    • Lemmatization using WordNetLemmatizer
    • Stopword removal (excluding: not, but, however, no, yet)
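Assuming the scikit-learn vectorizer named in the Tech Stack, the configuration described above corresponds roughly to:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# (1,3) n-grams capped at 1000 features, per the settings above
vectorizer = TfidfVectorizer(ngram_range=(1, 3), max_features=1000)

docs = ["amazing video not clickbait", "not impressed, bad audio"]
X = vectorizer.fit_transform(docs)
print(X.shape)  # (2, n_features) with n_features <= 1000
```

The same fitted vectorizer must be reused at inference time (via `transform`, never `fit_transform`), otherwise the feature columns will not line up with what the model was trained on.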

Dataset

  • Source: Reddit sentiment dataset
  • Train/Test Split: 80-20
  • Preprocessing: Handles missing values, duplicates, and empty strings

🔄 Data Pipeline

Ingestion

Raw Reddit Data → Load → Preprocess → Train/Test Split → Save
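The 80-20 split in the ingestion stage can be sketched with the stdlib (the actual code works on pandas DataFrames; this just shows the idea, with a fixed seed for reproducibility):

```python
import random

def train_test_split(rows, test_ratio=0.2, seed=42):
    """Shuffle deterministically, then carve off the last test_ratio fraction."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_ratio))
    return rows[:cut], rows[cut:]

train, test = train_test_split(range(100))
print(len(train), len(test))  # -> 80 20
```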

Training

Processed Data → TF-IDF Vectorization → LightGBM Training → Model Registry (MLflow)

Inference

YouTube Comments → Preprocess → TF-IDF Transform → LightGBM Prediction → Sentiment Labels + Scores

📡 API Endpoints

POST /predict_with_timestamps

Analyze sentiment for YouTube comments with timestamps.

Request:

{
  "comments": [
    {
      "text": "Amazing video!",
      "timestamp": "2:30"
    },
    {
      "text": "Not impressed",
      "timestamp": "5:15"
    }
  ]
}

Response:

{
  "predictions": [
    {
      "sentiment": 1,
      "timestamp": "2:30",
      "score": [0.1, 0.2, 0.7]
    },
    {
      "sentiment": -1,
      "timestamp": "5:15",
      "score": [0.6, 0.3, 0.1]
    }
  ]
}
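A minimal client for this endpoint using only the stdlib. The URL assumes the local Flask server from the Installation section is running; the actual request is left commented out so the snippet stands alone.

```python
import json
import urllib.request

payload = {"comments": [{"text": "Amazing video!", "timestamp": "2:30"}]}

req = urllib.request.Request(
    "http://localhost:5000/predict_with_timestamps",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# With the server running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["predictions"])
```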

GET /

Health check endpoint.

🧪 Experiments & Model Evolution

The project includes 8 progressive experiments documented in Jupyter notebooks:

  1. Preprocessing & EDA: Data exploration and analysis
  2. Baseline Model: Initial benchmark with basic features
  3. Bag of Words + TF-IDF: Feature engineering comparison
  4. TF-IDF (1,3) with Max Features: N-gram optimization
  5. Handling Imbalanced Data: Class weight balancing strategies
  6. XGBoost with HPT: Hyperparameter tuning with XGBoost
  7. LightGBM with Detailed HPT: Advanced hyperparameter optimization
  8. Stacking: Ensemble methods for improved predictions

📦 Docker Deployment

Build and run with Docker:

# Build the Docker image
docker build -t tubepulse:latest .

# Run the container
docker run -p 5000:5000 \
  -e MLFLOW_TRACKING_URI="http://your-mlflow:5000" \
  -e MODEL_VERSION="1" \
  tubepulse:latest

🔐 Environment Variables

Required for production deployment:

MLFLOW_TRACKING_URI    # MLflow server URI for model loading
MODEL_VERSION          # Model version to load (default: "1")
YOUTUBE_API_KEY        # YouTube Data API key (in Frontend config)

📈 Performance Metrics

The model evaluation includes:

  • Accuracy, Precision, Recall, F1-Score
  • Confusion matrices per class
  • ROC-AUC curves
  • Class-wise performance analysis

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

🤝 Contributing

Contributions are welcome! Please follow these steps:

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

🐛 Known Issues & Future Work

  • Multi-language support beyond English
  • Real-time comment streaming optimization
  • Mobile app version
  • Advanced visualization features
  • User feedback loop for model improvement
