This repository was archived by the owner on Sep 3, 2025. It is now read-only.

remla25-team21/model-training


model-training

This repository contains the training pipeline for the sentiment analysis model used in our REMLA project.

  • It uses the lib-ml library for data preprocessing and saves the trained model (sentiment_model_*.pkl) as a release artifact.
  • The training dataset can be found in data/raw/a1_RestaurantReviews_HistoricDump.tsv.
  • The project now uses DVC (Data Version Control) to track data, models, and metrics.

Note

TL;DR:

  1. Clone the repository and enter it
git clone https://github.com/remla25-team21/model-training.git
cd model-training
  2. Install the required dependencies
pip install -r requirements.txt
  3. (Optional) Configure DVC remote storage. This is only needed if you want to push changes to the remote storage, or if dvc pull does not work without authentication.
dvc remote modify storage --local gdrive_use_service_account true
dvc remote modify storage --local gdrive_service_account_json_file_path <path/to/file.json> # Replace with the path to your Google Drive service account JSON file
  4. Pull the data from remote storage, or download it directly (see the Troubleshooting section if you run into issues)
dvc pull
  5. Run the pipeline
dvc repro
  6. Run the tests
pytest
  7. Generate the coverage report
coverage run -m pytest
coverage report # Prints a summary in the terminal
coverage xml # Generates coverage.xml in the root directory
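
The coverage.xml file produced in the last step can also be read programmatically, for example to print the overall percentage in a CI job. The following is a minimal sketch, assuming the default Cobertura-style report that coverage.py writes, where the root element carries a line-rate attribute between 0 and 1. The function name read_line_coverage is hypothetical and not part of this repository.

```python
import xml.etree.ElementTree as ET


def read_line_coverage(xml_path="coverage.xml"):
    """Return the overall line coverage as a percentage (0 to 100)."""
    root = ET.parse(xml_path).getroot()
    # The root <coverage> element stores the overall ratio in "line-rate".
    return float(root.attrib["line-rate"]) * 100


# Example usage, after running `coverage xml` in the repository root:
#   print(f"Line coverage: {read_line_coverage():.1f}%")
```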

Dependencies

Install the required dependencies:

pip install -r requirements.txt

DVC Pipeline

The training process is now divided into three stages using DVC:

  1. Preprocessing: Data preparation and feature extraction
  2. Training: Model training with hyperparameter tuning
  3. Evaluation: Model evaluation and metrics generation

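To configure the DVC remote storage (only needed if you want to push changes, or if dvc pull does not work without authentication), run: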

dvc remote modify storage --local gdrive_use_service_account true
dvc remote modify storage --local gdrive_service_account_json_file_path <path/to/file.json>  # Replace with your Google Drive service account JSON file path

To pull the data from the remote storage:

dvc pull

To run the complete pipeline:

dvc repro

To run a specific stage:

dvc repro <stage_name>  # e.g., dvc repro preprocess
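
As background on why dvc repro does not always re-run every stage: DVC records hashes of each stage's dependencies and outputs in dvc.lock and skips stages whose dependencies are unchanged. The sketch below illustrates that idea only. It is not DVC's implementation, and the helper names file_hash and stage_is_stale, as well as the use of SHA-256, are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def file_hash(path):
    """Fingerprint a file by hashing its contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def stage_is_stale(deps, recorded):
    """Return True if the stage needs to run again.

    deps: list of file paths the stage depends on.
    recorded: dict mapping path -> hash saved after the last successful run
              (a stand-in for what a lock file would hold).
    """
    # A missing entry or a changed hash both mark the stage as stale.
    return any(recorded.get(str(dep)) != file_hash(dep) for dep in deps)
```

With this model, editing the raw dataset would mark the preprocessing stage as stale, and every stage that consumes its outputs would follow.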

To view metrics:

dvc metrics show

To view all experiments:

dvc exp show

For more details on collaborating with DVC, refer to ./docs/dvc-ref.md.

Troubleshooting

Google Authentication Issues

If you encounter a "This app is blocked" error during Google authentication when using DVC with Google Drive, you can download the dataset directly using one of these methods:

Linux/macOS

wget --no-check-certificate 'https://drive.google.com/uc?export=download&id=1mrWUgJlRCf_n_TbxPuuthJ9YsTBwGuRh' -O ./data/raw/a1_RestaurantReviews_HistoricDump.tsv

Windows (PowerShell)

Invoke-WebRequest -Uri "https://drive.google.com/uc?export=download&id=1mrWUgJlRCf_n_TbxPuuthJ9YsTBwGuRh" -OutFile "./data/raw/a1_RestaurantReviews_HistoricDump.tsv"
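
If neither tool is available, the same download can be done from Python on either platform. This is a minimal sketch using only the standard library. The URL and destination path are copied from the commands above, while the function name download_dataset is hypothetical. It assumes Google Drive serves the file directly; if Drive returns an HTML confirmation page instead, the saved file will not be a valid TSV and one of the commands above should be used.

```python
import shutil
import urllib.request
from pathlib import Path

# URL and destination copied from the wget / Invoke-WebRequest commands.
DATASET_URL = "https://drive.google.com/uc?export=download&id=1mrWUgJlRCf_n_TbxPuuthJ9YsTBwGuRh"
DATASET_PATH = "data/raw/a1_RestaurantReviews_HistoricDump.tsv"


def download_dataset(url=DATASET_URL, dest=DATASET_PATH):
    """Download url to dest, creating parent directories, and return the path."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Stream the response to disk instead of holding it all in memory.
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest


# Example usage from the repository root:
#   download_dataset()
```

Unlike the wget command with --no-check-certificate, urllib verifies TLS certificates by default.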

After downloading the dataset directly, you can proceed with the pipeline by running:

dvc repro

Manual Training

If you prefer to run each stage manually:

# Preprocessing
python src/preprocess.py

# Training
python src/train.py

# Evaluation
python src/evaluate.py
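
The three commands above must run in order, and a later stage is not meaningful if an earlier one failed. A small wrapper can enforce that. This is a sketch under the assumption that each script signals failure through a non-zero exit code; the function name run_stages is hypothetical and not part of this repository.

```python
import subprocess
import sys

# Script paths taken from the manual commands above.
STAGES = ["src/preprocess.py", "src/train.py", "src/evaluate.py"]


def run_stages(scripts=STAGES, python=sys.executable):
    """Run each script in order and stop at the first failure.

    Returns the list of scripts that completed successfully.
    """
    completed = []
    for script in scripts:
        result = subprocess.run([python, script])
        if result.returncode != 0:
            print(f"{script} failed with exit code {result.returncode}", file=sys.stderr)
            break
        completed.append(script)
    return completed


# Example usage from the repository root:
#   run_stages()
```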

Pipeline Outputs

The pipeline produces the following artifacts:

  • preprocessed_data_*.pkl: Preprocessed data (features and labels)
  • c1_BoW_Sentiment_Model_*.pkl: Text vectorizer model
  • trained_model_*.pkl: Trained ML model before evaluation
  • sentiment_model_*.pkl: Final ML model after evaluation
  • metrics_*.json: Model performance metrics
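
To check which artifacts a run actually produced, the file name patterns above can be matched with a short script. This is a minimal sketch: the output directory is not stated in this README, so the search is recursive from a starting directory, and the helper names find_artifacts and load_metrics are hypothetical. No particular keys are assumed inside the metrics files.

```python
import json
from pathlib import Path

# File name patterns from the list of pipeline outputs above.
PATTERNS = [
    "preprocessed_data_*.pkl",
    "c1_BoW_Sentiment_Model_*.pkl",
    "trained_model_*.pkl",
    "sentiment_model_*.pkl",
    "metrics_*.json",
]


def find_artifacts(directory="."):
    """Map each pattern to the sorted list of matching paths under directory."""
    base = Path(directory)
    return {pattern: sorted(base.rglob(pattern)) for pattern in PATTERNS}


def load_metrics(directory="."):
    """Load every metrics_*.json under directory into a {file name: content} dict."""
    paths = sorted(Path(directory).rglob("metrics_*.json"))
    return {path.name: json.loads(path.read_text(encoding="utf-8")) for path in paths}
```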

Linters

Linters help improve code quality by identifying errors, enforcing style rules, and spotting security issues without running the code.

Linters Used

  • Pylint: Checks for coding errors and enforces standards.
  • Flake8: Checks code style and complexity.
  • Bandit: Scans for security vulnerabilities in Python code.

How to Run

To run all linters and generate reports:

For Mac/Linux

bash lint.sh

For Windows

Use Git Bash as your terminal:

1. chmod +x lint.sh
2. ./lint.sh
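
Where Git Bash is not available, the three linters can also be started from Python. This sketch does not reproduce lint.sh, whose contents are not shown here. The target directory src and the options below are assumptions, and run_linters is a hypothetical name. Each tool is started through python -m so that the linters installed in the active environment are used.

```python
import subprocess
import sys

# Assumed invocations; lint.sh may use different targets or options.
LINT_COMMANDS = {
    "pylint": [sys.executable, "-m", "pylint", "src"],
    "flake8": [sys.executable, "-m", "flake8", "src"],
    "bandit": [sys.executable, "-m", "bandit", "-r", "src"],
}


def run_linters(commands=LINT_COMMANDS):
    """Run every linter, even if an earlier one reports problems.

    Returns a dict mapping linter name to its exit code
    (0 means the linter exited cleanly).
    """
    return {name: subprocess.run(cmd).returncode for name, cmd in commands.items()}


# Example usage from the repository root:
#   print(run_linters())
```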

ML Test Score

Category               Test Count   Automated?
Feature & Data         5            ✅
Model Development      5            ✅
ML Infrastructure      2            ✅
Monitoring             2            ✅
Mutamorphic Testing    3            ✅
Preprocessing Module   2            ✅
Training Module        5            ✅
Evaluation Module      4            ✅

Final Score: 12/12
