Provider Segmentation

About

An end-to-end data clustering internship project that uses K-Means, Hierarchical Clustering, DBSCAN, Spectral Clustering, and Gaussian Mixture Model (GMM) clustering to segment service providers into groups based on selected features, and compares the best performance achieved by each model.

Key Principles

Data clustering is an unsupervised machine learning technique that groups objects, data points, or observations into clusters based on similarities or patterns. Unlike supervised learning, clustering does not rely on labeled data; it aims to find natural groupings within the data.

Clustering is used to identify underlying trends, patterns, and outliers in a dataset. It can be applied in scenarios such as exploratory data analysis, preprocessing, and anomaly detection. Grouping similar data points together also reduces the complexity of large datasets, which simplifies further analysis and visualization.
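As a minimal illustration of finding groups without labels, the sketch below runs K-Means on synthetic 2-D points. The data, the choice of three clusters, and all variable names are assumptions made for this example; none of it comes from the project's notebooks or dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic, unlabeled 2-D points standing in for provider features (hypothetical data).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# Fit K-Means without using any labels; n_clusters=3 is an assumption
# chosen to match how the synthetic data was generated.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# Cluster sizes and centroids summarize the groups that were found.
cluster_sizes = np.bincount(labels)
print("cluster sizes:", cluster_sizes)
print("centroids:\n", kmeans.cluster_centers_)
```

On real data the number of clusters is not known in advance, which is why the pipeline includes hyperparameter sweeps and validation metrics.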

Clustering Pipeline

2nd Phase - Data Modeling & Analysis

Step 1 | Config
Step 2 | Setup
Step 3 | Data Load (mock with numeric + categorical, or external)
- Step 3.1 | Dataset Description
Step 4 | Feature Mapping (with optional one-hot encoding)
Step 5 | EDA
- Step 5.1 | Feature Visualization
Step 6 | Missing Values & Casting
- Imputation Methods Theory
Step 7 | Feature Engineering
Step 8 | Feature Selection
Step 9 | Outlier Handling
- Outlier Handling Theory
Step 10 | Scaling
- Step 10.1 | Compare Clustering Across Two Scalers
Step 11 | PCA
- Principal Component Analysis Theory
- Step 11.1 | 3D PCA Visualization
Step 12 | Clustering Models
Step 13 | Hyperparameter Sweeps
Step 14 | Validation Metrics
- Validation Metrics Theory
Step 15 | Visualization (scatter + centroids)
Step 16 | Cluster Profiling
- Step 16.1 | Cluster Characteristic Visualization
- Step 16.2 | Cluster Naming (business interpretation)
Step 17 | Export Artifacts
- Export Artifacts Theory
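Several of the steps above can be sketched in a few lines with scikit-learn. The example below is only an illustration under stated assumptions: the mock columns (`monthly_jobs`, `avg_rating`, `region`), median imputation, `StandardScaler`, two principal components, and the range of `k` are all hypothetical choices, not the settings used in the notebooks.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.metrics import silhouette_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 200

# Step 3 (mock data): hypothetical numeric and categorical provider features.
df = pd.DataFrame({
    "monthly_jobs": rng.normal(50, 15, n),
    "avg_rating": rng.uniform(1, 5, n),
    "region": rng.choice(["north", "south", "east"], n),
})
# Inject a few missing values so the imputation step has work to do.
df.loc[rng.choice(n, 10, replace=False), "monthly_jobs"] = np.nan

numeric_cols = ["monthly_jobs", "avg_rating"]
categorical_cols = ["region"]

# Steps 4, 6, 10: one-hot encoding, median imputation, and standard scaling.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
preprocess = ColumnTransformer(
    [
        ("num", numeric_steps, numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)
X = preprocess.fit_transform(df)

# Step 11: project onto two principal components.
X_pca = PCA(n_components=2, random_state=0).fit_transform(X)

# Steps 12-14: sweep k for K-Means and score each run with the silhouette score.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_pca)
    scores[k] = silhouette_score(X_pca, labels)
best_k = max(scores, key=scores.get)

# Step 16: profile clusters by the mean of each numeric feature.
final_labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X_pca)
profile = df.assign(cluster=final_labels).groupby("cluster")[numeric_cols].mean()
print("best k:", best_k)
print(profile)
```

Because the mock data is random, the chosen `k` here carries no meaning; the point is only the order of operations from preprocessing to profiling.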

Main Process

Exploratory Data Analysis (EDA)
Data Preprocessing (Data Cleaning & Transformation)
Feature Engineering (Feature Extraction & Selection)
Data Preparation
K-Means Clustering
Hierarchical Clustering
DBSCAN
Spectral Clustering
Gaussian Mixture Model Clustering
Silhouette Score
Davies-Bouldin Index
Calinski-Harabasz Index
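The five models and three validation metrics listed above could be compared along the following lines. This is a sketch on synthetic data with assumed hyperparameters (for example `eps=0.3` for DBSCAN and three clusters elsewhere); the notebooks may tune and evaluate the models differently.

```python
from sklearn.cluster import AgglomerativeClustering, DBSCAN, KMeans, SpectralClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Synthetic, well-separated blobs stand in for the (unavailable) provider data.
X, _ = make_blobs(
    n_samples=300,
    centers=[(-5, -5), (0, 5), (5, -5)],
    cluster_std=0.7,
    random_state=7,
)
X = StandardScaler().fit_transform(X)

# Hyperparameters here are illustrative assumptions, not tuned values.
models = {
    "K-Means": KMeans(n_clusters=3, n_init=10, random_state=7),
    "Hierarchical": AgglomerativeClustering(n_clusters=3, linkage="ward"),
    "DBSCAN": DBSCAN(eps=0.3, min_samples=5),
    "Spectral": SpectralClustering(n_clusters=3, random_state=7),
    "GMM": GaussianMixture(n_components=3, random_state=7),
}

results = {}
for name, model in models.items():
    labels = model.fit_predict(X)
    # DBSCAN marks noise points with -1; exclude them before scoring.
    mask = labels != -1
    if len(set(labels[mask].tolist())) < 2:
        results[name] = None  # the metrics need at least two clusters
        continue
    results[name] = {
        "silhouette": silhouette_score(X[mask], labels[mask]),
        "davies_bouldin": davies_bouldin_score(X[mask], labels[mask]),
        "calinski_harabasz": calinski_harabasz_score(X[mask], labels[mask]),
    }

for name, metrics in results.items():
    print(name, metrics)
```

When reading the results, a higher Silhouette Score (range -1 to 1) and a higher Calinski-Harabasz Index indicate better-separated clusters, while a lower Davies-Bouldin Index is better.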

Getting Started

This project contains two Jupyter notebooks that document the process and results of the internship work.
Since the dataset may be confidential, it is not included in this repository.

provider_segmentation_eda.ipynb – Exploratory Data Analysis (EDA) of provider-related data.
provider_segmentation_clustering.ipynb – Clustering process and resulting segmentation.

You can open these notebooks in Jupyter Notebook, JupyterLab, or Google Colab to review the workflow and outputs.

License

This project is licensed under the Apache-2.0 License - see the LICENSE file for details.


FearlessFrench/provider-segmentation
