TubePulse is a production-grade machine learning system, packaged with a Chrome extension, that performs real-time sentiment analysis on YouTube comment streams.
Unlike simple analysis tools, TubePulse demonstrates a complete MLOps lifecycle, integrating automated data pipelines, rigorous experiment tracking, and containerized deployment. The system turns raw, unstructured text into actionable insights using a tuned LightGBM classifier, served via a scalable Flask API and consumed through a bespoke Chrome extension interface.
TubePulse is built on a robust MLOps foundation to ensure reproducibility, scalability, and observability:
- Data Version Control (DVC): Manages datasets and preprocessing pipelines, ensuring that every model version can be traced back to the exact data snapshot used for training.
- Experiment Tracking (MLflow): Logs metrics, hyperparameters, and artifacts across 8+ experimental iterations, facilitating data-driven model selection.
- Model Registry: Centralized repository for versioning trained models, enabling seamless rollbacks and stage transitions (Staging -> Production).
- Containerized Serving: The inference engine is Dockerized for consistent deployment across environments.
- High-Performance Classification: Utilizes a LightGBM Gradient Boosting framework optimized for multi-class classification (Positive, Neutral, Negative).
- Advanced NLP Pipeline:
  - Vectorization: custom TF-IDF implementation with 1-3 n-grams for capturing context.
  - Preprocessing: NLTK-based pipeline including lemmatization, negation handling, and noise reduction.
  - Imbalance Handling: class weighting strategies to handle skewed sentiment distributions.
- Low-Latency Inference: Real-time prediction endpoint optimized for high-throughput comment processing.
- Scalable Backend: Flask-based REST API with CORS support, ready for microservices orchestration.
- Visual Analytics: Frontend dashboard rendering temporal sentiment trends and engagement distributions.
```
Tubepulse/
├── backend/
│   └── main.py                  # Flask API server with sentiment prediction endpoints
├── src/
│   ├── data/
│   │   ├── data_ingestion.py       # Data loading and splitting from Reddit dataset
│   │   └── data_preprocessing.py   # Text preprocessing pipeline
│   └── model/
│       ├── model_building.py       # LightGBM model training with hyperparameters
│       ├── model_evaluation.py     # Model evaluation metrics
│       └── register_model.py       # MLflow model registration
├── Frontend/
│   ├── popup.html               # Chrome extension UI
│   ├── popup.js                 # Extension logic and API communication
│   └── manifest.json            # Chrome extension configuration
├── notebooks/
│   ├── 1_Preprocessing_&_EDA.ipynb
│   ├── 2_experiment_1_baseline_model.ipynb
│   ├── 3_experiment_2_bow_tfidf.ipynb
│   ├── 4_experiment_3_tfidf_(1,3)_max_features.ipynb
│   ├── 5_experiment_4_handling_imbalanced_data.ipynb
│   ├── 6_experiment_5_xgboost_with_hpt.ipynb
│   ├── 7_experiment_6_lightgbm_detailed_hpt.ipynb
│   └── 8_stacking.ipynb
├── data/
│   ├── raw/
│   │   ├── train.csv            # Training dataset from Reddit
│   │   └── test.csv             # Test dataset
│   └── interim/
│       ├── train_processed.csv  # Preprocessed training data
│       └── test_processed.csv   # Preprocessed test data
├── params.yaml                  # Model hyperparameters configuration
├── dvc.yaml                     # DVC pipeline configuration
├── requirements.txt             # Python dependencies
├── Dockerfile                   # Docker containerization
├── setup.py                     # Package setup configuration
└── README.md                    # Project documentation
```
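The DVC pipeline wiring can be sketched from the files listed above. This is a hypothetical `dvc.yaml` fragment: the stage names, `deps`, `outs`, and parameter keys are inferred assumptions, not the project's actual pipeline definition.

```yaml
# Hypothetical dvc.yaml sketch; stage names, deps, and outs are assumptions.
stages:
  data_ingestion:
    cmd: python src/data/data_ingestion.py
    outs:
      - data/raw
  data_preprocessing:
    cmd: python src/data/data_preprocessing.py
    deps:
      - data/raw
    outs:
      - data/interim
  model_building:
    cmd: python src/model/model_building.py
    deps:
      - data/interim
    params:
      - model_building
```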
- Framework: Flask 3.0.3
- Model: LightGBM 4.5.0
- Vectorization: TF-IDF (Scikit-learn)
- Text Processing: NLTK 3.9.1
- ML Tracking: MLflow 2.17.0
- Data Processing: Pandas 2.2.3, NumPy 2.1.2
- Chrome Extension API
- YouTube Data API v3
- Chart.js: For data visualization
- Vanilla JavaScript
- Containerization: Docker
- Version Control: Git, DVC (Data Version Control)
- Cloud Storage: AWS S3 (boto3)
- Python 3.11+
- Chrome browser
- Git
- Clone the repository

  ```bash
  git clone https://github.com/yourusername/tubepulse.git
  cd tubepulse
  ```

- Create a virtual environment

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install dependencies

  ```bash
  pip install -r requirements.txt
  python -m nltk.downloader stopwords wordnet
  ```

- Configure environment variables

  ```bash
  export MLFLOW_TRACKING_URI="http://your-mlflow-server:5000"
  export MODEL_VERSION="1"
  ```

- Start the Flask server

  ```bash
  python backend/main.py
  ```

  The API will be available at http://localhost:5000
- Navigate to chrome://extensions/ in your Chrome browser
- Enable "Developer mode" (top-right corner)
- Click "Load unpacked"
- Select the Frontend directory
- The TubePulse extension will appear in your extension list
- Algorithm: LightGBM Classifier
- Task: Multi-class sentiment classification
- Classes:
- 1: Positive sentiment
- 0: Neutral sentiment
- -1: Negative sentiment
```yaml
learning_rate: 0.09
max_depth: 20
n_estimators: 367
num_class: 3
metric: multi_logloss
class_weight: balanced
reg_alpha: 0.1
reg_lambda: 0.1
```

- Vectorization: TF-IDF with n-grams (1-3)
- Max Features: 1000
- Text Preprocessing:
  - Lowercase conversion
  - Special character removal (preserving punctuation for sentiment)
  - Lemmatization using WordNetLemmatizer
  - Stopword removal (excluding: not, but, however, no, yet)
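The cleaning steps above can be sketched as a simplified function. This is illustrative only: the real pipeline uses NLTK's WordNetLemmatizer and full stopword list, whereas the tiny stopword set here is an assumption kept small so the sketch stays self-contained. Note that negation words are deliberately preserved.

```python
import re

# Simplified sketch of the preprocessing steps; the small stopword set is
# illustrative (the real pipeline uses NLTK's list minus the negation words).
STOPWORDS = {"the", "a", "an", "is", "this", "and", "i"}
KEEP = {"not", "but", "however", "no", "yet"}  # negations preserved for sentiment

def preprocess(text: str) -> str:
    text = text.lower()
    # Drop special characters but keep sentiment-bearing punctuation (! and ?)
    text = re.sub(r"[^a-z0-9\s!?]", "", text)
    tokens = [t for t in text.split() if t in KEEP or t not in STOPWORDS]
    return " ".join(tokens)

print(preprocess("This is NOT a good video!"))  # → "not good video!"
```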
- Source: Reddit sentiment dataset
- Train/Test Split: 80-20
- Preprocessing: Handles missing values, duplicates, and empty strings
Raw Reddit Data → Load → Preprocess → Train/Test Split → Save
Processed Data → TF-IDF Vectorization → LightGBM Training → Model Registry (MLflow)
YouTube Comments → Preprocess → TF-IDF Transform → LightGBM Prediction → Sentiment Labels + Scores
Analyze sentiment for YouTube comments with timestamps.
Request:

```json
{
  "comments": [
    { "text": "Amazing video!", "timestamp": "2:30" },
    { "text": "Not impressed", "timestamp": "5:15" }
  ]
}
```

Response:

```json
{
  "predictions": [
    { "sentiment": 1, "timestamp": "2:30", "score": [0.1, 0.2, 0.7] },
    { "sentiment": -1, "timestamp": "5:15", "score": [0.6, 0.3, 0.1] }
  ]
}
```

Health check endpoint.
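A client of this endpoint, such as the extension's popup.js, only needs to build the request body and decode the response. A minimal Python sketch of that shape, assuming the label mapping listed earlier (the helper names are illustrative, not from the project):

```python
import json

# Label mapping from the model configuration above.
LABELS = {1: "Positive", 0: "Neutral", -1: "Negative"}

def build_payload(comments):
    # comments: iterable of (text, timestamp) pairs
    return json.dumps({"comments": [{"text": t, "timestamp": ts}
                                    for t, ts in comments]})

def decode_response(body: str):
    preds = json.loads(body)["predictions"]
    return [(LABELS[p["sentiment"]], p["timestamp"]) for p in preds]

body = '{"predictions": [{"sentiment": 1, "timestamp": "2:30", "score": [0.1, 0.2, 0.7]}]}'
print(decode_response(body))  # → [('Positive', '2:30')]
```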
The project includes 8 progressive experiments documented in Jupyter notebooks:
- Preprocessing & EDA: Data exploration and analysis
- Baseline Model: Initial benchmark with basic features
- Bag of Words + TF-IDF: Feature engineering comparison
- TF-IDF (1,3) with Max Features: N-gram optimization
- Handling Imbalanced Data: Class weight balancing strategies
- XGBoost with HPT: Hyperparameter tuning with XGBoost
- LightGBM with Detailed HPT: Advanced hyperparameter optimization
- Stacking: Ensemble methods for improved predictions
Build and run with Docker:

```bash
# Build the Docker image
docker build -t tubepulse:latest .

# Run the container
docker run -p 5000:5000 \
  -e MLFLOW_TRACKING_URI="http://your-mlflow:5000" \
  -e MODEL_VERSION="1" \
  tubepulse:latest
```

Required for production deployment:

```
MLFLOW_TRACKING_URI   # MLflow server URI for model loading
MODEL_VERSION         # Model version to load (default: "1")
YOUTUBE_API_KEY       # YouTube Data API key (in Frontend config)
```

The model evaluation includes:
- Accuracy, Precision, Recall, F1-Score
- Confusion matrices per class
- ROC-AUC curves
- Class-wise performance analysis
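The metrics listed above can be computed with scikit-learn. A sketch on dummy predictions; in the project these come from the held-out test set:

```python
# Sketch of the evaluation metrics on dummy labels; the project computes
# these on the held-out 20% test split.
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

y_true = [1, 0, -1, 1, 0, -1]
y_pred = [1, 0, -1, 1, -1, -1]

acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, average="macro")
# Rows/columns ordered Negative, Neutral, Positive
cm = confusion_matrix(y_true, y_pred, labels=[-1, 0, 1])
```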
This project is licensed under the MIT License - see the LICENSE file for details.
Contributions are welcome! Please follow these steps:
- Fork the repository
- Create your feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
- Multi-language support beyond English
- Real-time comment streaming optimization
- Mobile app version
- Advanced visualization features
- User feedback loop for model improvement