model-training

This repository contains the training pipeline for the sentiment analysis model used in our REMLA project.

It uses the lib-ml library for data preprocessing and saves the trained model (sentiment_model_*.pkl) as a release artifact.
The training dataset can be found in data/raw/a1_RestaurantReviews_HistoricDump.tsv.
The project now uses DVC (Data Version Control) to track data, models, and metrics.

Note (TL;DR):

1. Clone the repository

git clone https://github.com/remla25-team21/model-training.git

2. Install the required dependencies

pip install -r requirements.txt

3. (Optional) Configure DVC remote storage. This is only needed if you want to push changes to the remote storage or if dvc pull fails without authentication.

dvc remote modify storage --local gdrive_use_service_account true
dvc remote modify storage --local gdrive_service_account_json_file_path <path/to/file.json> # Replace with your Google Drive service account JSON file path

4. Pull the data from remote storage, or download it directly (see the Troubleshooting section if you run into issues)

dvc pull

5. Run the pipeline

dvc repro

6. Run the tests

pytest

7. Generate the coverage report

coverage run -m pytest
coverage report # Prints a summary in the terminal
coverage xml # Generates coverage.xml in the root directory
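
To check the overall coverage figure programmatically (for example in a CI step), the coverage.xml file generated above can be parsed with the standard library. This is a minimal sketch, assuming the Cobertura-style layout that `coverage xml` writes, where the root element carries a `line-rate` attribute. The function name `read_line_rate` is illustrative and not part of this repository.

```python
import xml.etree.ElementTree as ET


def read_line_rate(xml_source):
    """Return overall line coverage as a percentage (0-100).

    xml_source may be a path to coverage.xml or an open file object.
    Assumes the root element stores the fraction of covered lines
    in its `line-rate` attribute (Cobertura-style report).
    """
    root = ET.parse(xml_source).getroot()
    return float(root.attrib["line-rate"]) * 100
```

Calling `read_line_rate("coverage.xml")` from the repository root after `coverage xml` returns the overall percentage, which a script could compare against a threshold of your choosing.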

Dependencies

Install the required dependencies:

pip install -r requirements.txt

DVC Pipeline

The training process is now divided into three stages using DVC:

Preprocessing: Data preparation and feature extraction
Training: Model training with hyperparameter tuning
Evaluation: Model evaluation and metrics generation

To configure the DVC remote storage (using a Google Drive service account), run:

dvc remote modify storage --local gdrive_use_service_account true
dvc remote modify storage --local gdrive_service_account_json_file_path <path/to/file.json>  # Replace with your Google Drive service account JSON file path

To pull the data from the remote storage:

dvc pull

To run the complete pipeline:

dvc repro

To run a specific stage:

dvc repro <stage_name>  # e.g., dvc repro preprocess

To view metrics:

dvc metrics show

To view all experiments:

dvc exp show

For more details on collaborating with DVC, refer to ./docs/dvc-ref.md.

Troubleshooting

Google Authentication Issues

If you encounter a "This app is blocked" error during Google authentication when using DVC with Google Drive, download the dataset directly using one of these methods:

Linux/macOS

wget --no-check-certificate 'https://drive.google.com/uc?export=download&id=1mrWUgJlRCf_n_TbxPuuthJ9YsTBwGuRh' -O ./data/raw/a1_RestaurantReviews_HistoricDump.tsv

Windows (PowerShell)

Invoke-WebRequest -Uri "https://drive.google.com/uc?export=download&id=1mrWUgJlRCf_n_TbxPuuthJ9YsTBwGuRh" -OutFile "./data/raw/a1_RestaurantReviews_HistoricDump.tsv"

After downloading the dataset directly, you can proceed with the pipeline by running:

dvc repro
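
If neither wget nor PowerShell is available, the same download can be done from Python. The following is a minimal sketch using only the standard library. The file ID and output path are copied from the commands above; the helper names `build_download_url` and `download_dataset` are illustrative. Unlike the wget call, it leaves certificate verification enabled.

```python
import urllib.request
from pathlib import Path
from urllib.parse import urlencode

# File ID and destination taken from the wget/PowerShell commands above.
FILE_ID = "1mrWUgJlRCf_n_TbxPuuthJ9YsTBwGuRh"
DEST = Path("data/raw/a1_RestaurantReviews_HistoricDump.tsv")


def build_download_url(file_id):
    """Build the Google Drive direct-download URL for a file ID."""
    query = urlencode({"export": "download", "id": file_id})
    return f"https://drive.google.com/uc?{query}"


def download_dataset(file_id=FILE_ID, dest=DEST):
    """Download the dataset to dest, creating parent folders if needed."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(build_download_url(file_id), str(dest))
    return dest
```

Run `download_dataset()` from the repository root, then continue with `dvc repro` as described above.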

Manual Training

If you prefer to run each stage manually:

# Preprocessing
python src/preprocess.py

# Training
python src/train.py

# Evaluation
python src/evaluate.py
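
The three commands can also be chained from a small Python driver that stops at the first failing stage. This is a sketch, not a script shipped in this repository: the stage list mirrors the commands above, and `run_stages` is a hypothetical helper.

```python
import subprocess
import sys

# Script paths taken from the manual commands above, in execution order.
STAGES = ["src/preprocess.py", "src/train.py", "src/evaluate.py"]


def run_stages(scripts=STAGES, python=sys.executable):
    """Run each script in order and return the scripts that completed.

    Raises subprocess.CalledProcessError as soon as a stage exits with a
    non-zero status, so later stages are not run after a failure.
    """
    completed = []
    for script in scripts:
        subprocess.run([python, script], check=True)
        completed.append(script)
    return completed
```

Using `sys.executable` keeps every stage on the same interpreter (and virtual environment) as the driver. Note that, unlike `dvc repro`, this always reruns all stages and does not update dvc.lock.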

Pipeline Outputs

The pipeline produces the following artifacts:

preprocessed_data_*.pkl: Preprocessed data (features and labels)
c1_BoW_Sentiment_Model_*.pkl: Text vectorizer model
trained_model_*.pkl: Trained ML model before evaluation
sentiment_model_*.pkl: Final ML model after evaluation
metrics_*.json: Model performance metrics
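
Because the artifact names carry a wildcard suffix, downstream code has to pick a concrete file. The sketch below selects the most recently modified match for a pattern and loads the metrics from it. It assumes the artifacts sit in a single directory that the caller passes in (this README does not state where they are written) and that the metrics file is plain JSON. `latest_artifact` and `load_latest_metrics` are hypothetical helper names.

```python
import json
from pathlib import Path


def latest_artifact(directory, pattern):
    """Return the most recently modified file in directory matching pattern.

    Raises FileNotFoundError if nothing matches.
    """
    matches = sorted(Path(directory).glob(pattern), key=lambda p: p.stat().st_mtime)
    if not matches:
        raise FileNotFoundError(f"no file matching {pattern!r} in {directory}")
    return matches[-1]


def load_latest_metrics(directory):
    """Load and return the parsed contents of the newest metrics_*.json."""
    with latest_artifact(directory, "metrics_*.json").open() as fh:
        return json.load(fh)
```

The same `latest_artifact` helper can locate `sentiment_model_*.pkl` before loading it with `pickle`; only unpickle files from a source you trust.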

Linters

Linters help improve code quality by identifying errors, enforcing style rules, and spotting security issues without running the code.

Linters Used

Pylint: Checks for coding errors and enforces standards.
Flake8: Checks code style and complexity.
Bandit: Scans for security vulnerabilities in Python code.

How to Run

To run all linters and generate reports:

For Mac/Linux

bash lint.sh

For Windows

Use Git Bash as your terminal:

1. chmod +x lint.sh

2. ./lint.sh

ML Test Score

Category	Test Count	Automated?
Feature & Data	✅ 5	✅
Model Development	✅ 5	✅
ML Infrastructure	✅ 2	✅
Monitoring	✅ 2	✅
Mutamorphic Testing	✅ 3	✅
Preprocessing Module	✅ 2	✅
Training Module	✅ 5	✅
Evaluation Module	✅ 4	✅

Final Score: 12/12
