ML DAG pipeline for predicting water potability based on various chemical properties

This project is a machine learning DAG pipeline that predicts whether water is potable. It covers the machine learning lifecycle from data extraction and cleaning through model training and evaluation. The pipeline is built with DVC (Data Version Control), which tracks the data, models, and metrics produced by each stage.

DAG Pipeline
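
As a rough illustration of what "DAG" means here, the sketch below models the five stages listed under Pipeline Stages as a dependency graph and derives a valid run order. The stage identifiers and the dependency edges are assumptions inferred from the stage descriptions; they are not read from this repository's dvc.yaml, and DVC resolves the real graph itself.

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages it depends on.
# Names and edges are assumptions inferred from the stage descriptions,
# not copied from the project's dvc.yaml.
STAGE_DEPENDENCIES = {
    "data_collection": set(),
    "data_preparation": {"data_collection"},
    "data_splitting": {"data_preparation"},
    "model_training": {"data_splitting"},
    # Evaluation needs both the test split and the trained model.
    "model_evaluation": {"data_splitting", "model_training"},
}


def execution_order(dependencies):
    """Return a run order in which every stage follows its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for position, stage in enumerate(execution_order(STAGE_DEPENDENCIES), start=1):
        print(f"{position}. {stage}")
```

If a cycle were introduced, for example by making data_collection depend on model_evaluation, `TopologicalSorter` would raise `graphlib.CycleError`, which is the property that makes the pipeline a DAG.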

Project Organization

├── LICENSE
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.py           <- makes project pip installable (pip install -e .) so src can be imported
├── src                <- Source code for use in this project.
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
└── tox.ini            <- tox file with settings for running tox; see tox.readthedocs.io

Pipeline Stages

Data Collection
- Command: python src/extract_dataset.py
- Description: Extracts the raw data from zip files into the data/raw/extracted directory.
Data Preparation
- Command: python src/data_preparation.py
- Description: Cleans and preprocesses the extracted data, removing outliers and imputing missing values. Outputs cleaned data to data/interim/cleaned_data.csv.
Data Splitting
- Command: python src/data_splitting.py
- Description: Splits the cleaned data into training and testing sets. Saves the splits to data/processed/.
Model Training
- Command: python src/model_training.py
- Description: Trains a LightGBM model using the training data and saves the model to models/model.pkl.
Model Evaluation
- Command: python src/model_evaluation.py
- Description: Evaluates the trained model on the test data and saves the evaluation metrics to metrics/metrics.json.
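
The evaluation stage above can be sketched roughly as follows. This is a minimal standard-library illustration, not the contents of src/model_evaluation.py: the choice of metrics, the label encoding (1 = potable), and the function names `classification_metrics` and `save_metrics` are all assumptions. Only the output path metrics/metrics.json comes from the stage description.

```python
import json
import tempfile
from pathlib import Path


def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels.

    Assumes labels are 0/1 with 1 meaning "potable" (an assumption).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one label is required")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    correct = sum(1 for t, p in pairs if t == p)

    # Fall back to 0.0 when a denominator is zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def save_metrics(metrics, path="metrics/metrics.json"):
    """Write the metrics as JSON, creating the parent directory if needed."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(metrics, indent=2))
    return target


if __name__ == "__main__":
    # Toy labels for demonstration only; written to a temporary directory
    # so the demo does not touch the project's real metrics file.
    metrics = classification_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
    with tempfile.TemporaryDirectory() as tmp:
        written = save_metrics(metrics, Path(tmp) / "metrics" / "metrics.json")
        print(written.read_text())
```

Writing the metrics to a fixed JSON path is what lets a tool such as DVC compare them across runs; in the real stage the predictions would come from the model saved at models/model.pkl rather than from hard-coded lists.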

Project based on the cookiecutter data science project template. #cookiecutterdatascience

