This repository contains the training pipeline for the sentiment analysis model used in our REMLA project.
- It uses the lib-ml library for data preprocessing and saves the trained model (`sentiment_model_*.pkl`) as a release artifact.
- The training dataset can be found in `data/raw/a1_RestaurantReviews_HistoricDump.tsv`.
- The project now uses DVC (Data Version Control) to track data, models, and metrics.

TL;DR:
- Clone the repository:

      git clone https://github.com/remla25-team21/model-training.git

- Install the required dependencies:

      pip install -r requirements.txt

- (Optional) Configure DVC remote storage. This is only needed if you want to push changes to the remote storage or if `dvc pull` does not work without authentication:

      dvc remote modify storage --local gdrive_use_service_account true
      dvc remote modify storage --local gdrive_service_account_json_file_path <path/to/file.json>  # Replace with your Google Drive service account JSON file path

- Pull the data from remote storage or download it directly (see the Troubleshooting section if you run into issues):

      dvc pull

- Run the pipeline:

      dvc repro

- Run the tests:

      pytest

- Generate the coverage report:

      coverage run -m pytest
      coverage report  # Prints a summary in the terminal
      coverage xml     # Generates coverage.xml in the root directory

Install the required dependencies:

    pip install -r requirements.txt

The training process is now divided into three stages using DVC:
- Preprocessing: Data preparation and feature extraction
- Training: Model training with hyperparameter tuning
- Evaluation: Model evaluation and metrics generation
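
The three stages above can also be driven one at a time from Python. The following is a minimal sketch, not part of the repository: it only wraps `dvc repro <stage_name>`. The stage name `preprocess` appears in this README; `train` and `evaluate` are assumptions based on the script names and may differ from what `dvc.yaml` defines. The `runner` parameter is a hypothetical hook so the function can be exercised without DVC installed.

```python
import subprocess
from typing import Callable, List, Sequence

# "preprocess" is named in the README; "train" and "evaluate" are assumed
# stage names and should be checked against dvc.yaml.
STAGES = ("preprocess", "train", "evaluate")


def repro_commands(stages: Sequence[str] = STAGES) -> List[List[str]]:
    """Build one `dvc repro <stage>` command per stage, in pipeline order."""
    return [["dvc", "repro", stage] for stage in stages]


def run_stages(stages: Sequence[str] = STAGES,
               runner: Callable = subprocess.run) -> None:
    """Run each stage in order; check=True stops at the first failing stage."""
    for cmd in repro_commands(stages):
        runner(cmd, check=True)
```

Calling `run_stages()` from the repository root would execute the stages sequentially. In most cases a plain `dvc repro` is simpler, since DVC resolves the stage order itself.
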

To configure the DVC remote storage, run:

    dvc remote modify storage --local gdrive_use_service_account true
    dvc remote modify storage --local gdrive_service_account_json_file_path <path/to/file.json>  # Replace with your Google Drive service account JSON file path

To pull the data from the remote storage:

    dvc pull

To run the complete pipeline:

    dvc repro

To run a specific stage:

    dvc repro <stage_name>  # e.g., dvc repro preprocess

To view metrics:

    dvc metrics show

To view all experiments:

    dvc exp show

For more details on collaborating with DVC, refer to ./docs/dvc-ref.md.
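
Besides `dvc metrics show`, the metrics files can be read directly. The sketch below assumes the `metrics_*.json` files are plain JSON documents and that you know which directory they are written to; the `directory` argument and the `load_metrics` helper are hypothetical and not part of the repository.

```python
import json
from pathlib import Path
from typing import Dict


def load_metrics(directory: str = ".") -> Dict[str, dict]:
    """Map each metrics_*.json file name in `directory` to its parsed content."""
    return {
        path.name: json.loads(path.read_text(encoding="utf-8"))
        for path in sorted(Path(directory).glob("metrics_*.json"))
    }
```

For example, `load_metrics("path/to/metrics")` returns an empty dictionary when no matching file exists, which makes it easy to detect that the pipeline has not been run yet.
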
If you encounter a "This app is blocked" error during Google authentication when using DVC with Google Drive, you can download the dataset directly using one of these methods:

- With `wget`:

      wget --no-check-certificate 'https://drive.google.com/uc?export=download&id=1mrWUgJlRCf_n_TbxPuuthJ9YsTBwGuRh' -O ./data/raw/a1_RestaurantReviews_HistoricDump.tsv

- With PowerShell:

      Invoke-WebRequest -Uri "https://drive.google.com/uc?export=download&id=1mrWUgJlRCf_n_TbxPuuthJ9YsTBwGuRh" -OutFile "./data/raw/a1_RestaurantReviews_HistoricDump.tsv"

After downloading the dataset directly, you can proceed with the pipeline by running:

    dvc repro

If you prefer to run each stage manually:

    # Preprocessing
    python src/preprocess.py
    # Training
    python src/train.py
    # Evaluation
    python src/evaluate.py

The pipeline produces the following artifacts:
- `preprocessed_data_*.pkl`: Preprocessed data (features and labels)
- `c1_BoW_Sentiment_Model_*.pkl`: Text vectorizer model
- `trained_model_*.pkl`: Trained ML model before evaluation
- `sentiment_model_*.pkl`: Final ML model after evaluation
- `metrics_*.json`: Model performance metrics

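
Because the artifact names contain a wildcard part, a small helper can pick the most recent file for a given pattern. This is a sketch under two assumptions that the README does not confirm: that the newest file by modification time is the one you want, and that the `.pkl` files were written with Python's `pickle` module (if they were saved with another library such as joblib, load them with that library instead). Both function names are hypothetical.

```python
import pickle
from pathlib import Path
from typing import Any, Optional


def latest_artifact(directory: str,
                    pattern: str = "sentiment_model_*.pkl") -> Optional[Path]:
    """Return the most recently modified file matching `pattern`, or None."""
    candidates = list(Path(directory).glob(pattern))
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


def load_artifact(path: Path) -> Any:
    """Unpickle an artifact. Assumes pickle format; only load trusted files."""
    with path.open("rb") as fh:
        return pickle.load(fh)
```

The same helper works for the other artifacts by changing `pattern`, for example `latest_artifact(some_dir, "trained_model_*.pkl")`.
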
Linters help improve code quality by identifying errors, enforcing style rules, and spotting security issues without running the code.
- Pylint: Checks for coding errors and enforces standards.
- Flake8: Checks code style and complexity.
- Bandit: Scans for security vulnerabilities in Python code.
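
The three linters listed above can also be invoked from Python. The sketch below is a hypothetical equivalent for illustration and does not reproduce what `lint.sh` actually does: it assumes the code lives in `src` (inferred from the script paths in this README) and uses only the basic invocations `pylint <target>`, `flake8 <target>`, and `bandit -r <target>`. The `runner` parameter is an illustrative hook for testing.

```python
import subprocess
from typing import Callable, Dict, List


def linter_commands(target: str = "src") -> Dict[str, List[str]]:
    """Basic command line for each linter; `target` is an assumed source dir."""
    return {
        "pylint": ["pylint", target],
        "flake8": ["flake8", target],
        "bandit": ["bandit", "-r", target],  # -r scans the directory recursively
    }


def run_linters(target: str = "src",
                runner: Callable = subprocess.run) -> Dict[str, int]:
    """Run every linter and collect exit codes without stopping on failure."""
    return {
        name: runner(cmd).returncode
        for name, cmd in linter_commands(target).items()
    }
```

A non-zero exit code means the linter reported findings or failed to run, so a caller could fail a CI job with `any(run_linters().values())`.
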

To run all linters and generate reports:

    bash lint.sh

On Windows, use Git Bash as your terminal:
1. `chmod +x lint.sh`
2. `./lint.sh`

| Category | Test Count | Automated? |
|---|---|---|
| Feature & Data | ✅ 5 | ✅ |
| Model Development | ✅ 5 | ✅ |
| ML Infrastructure | ✅ 2 | ✅ |
| Monitoring | ✅ 2 | ✅ |
| Mutamorphic Testing | ✅ 3 | ✅ |
| Preprocessing Module | ✅ 2 | ✅ |
| Training Module | ✅ 5 | ✅ |
| Evaluation Module | ✅ 4 | ✅ |

Final Score: 12/12