An end-to-end data clustering internship project. It uses K-Means, Hierarchical Clustering, DBSCAN, Spectral Clustering, and Gaussian Mixture Model (GMM) clustering to segment service providers into groups based on selected features, and compares the best performance achieved by each model.
Data clustering is an unsupervised machine learning technique that organizes and classifies different objects, data points, or observations into groups or clusters based on similarities or patterns. Unlike supervised learning, clustering does not rely on labeled data and instead aims to find natural groupings within the data.
Clustering is used to identify underlying trends, patterns, and outliers in a dataset. It can be applied in various scenarios, such as exploratory data analysis, preprocessing, and anomaly detection. Clustering helps in reducing the complexity of large datasets by grouping similar data points together, which can simplify further analysis and visualization.
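As a minimal illustration of finding groupings without labels, the sketch below runs K-Means on synthetic data. It assumes scikit-learn is installed; the data comes from `make_blobs` and is purely illustrative, not the provider dataset used in this project.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic, unlabeled data (assumption): 300 points drawn around 3 centers.
# The true labels returned by make_blobs are discarded on purpose.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)

# Fit K-Means and assign each point to one of 3 clusters.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# Count how many points landed in each cluster.
cluster_sizes = np.bincount(labels)
print("Cluster sizes:", cluster_sizes)
print("Centroids:\n", kmeans.cluster_centers_)
```

The number of clusters is fixed at 3 here only because the synthetic data was generated that way; on real data it would need to be chosen, for example with a validation metric.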
1st Phase - Exploratory Data Analysis (Data Cleaning & Transformation + Feature Engineering)
- Step 1 | Setup and Initialization
- Step 2 | Exploratory Data Analysis
- Step 3 | Data Cleaning & Transformation
- Step 4 | Feature Engineering
- Step 5 | Data Preparation for Clustering Models
2nd Phase - Data Modeling & Analysis
- Step 1 | Config
- Step 2 | Setup
- Step 3 | Data Load (mock with numeric + categorical, or external)
- Step 4 | Feature Mapping (with optional one-hot encoding)
- Step 5 | EDA
- Step 6 | Missing Values & Casting
- Step 7 | Feature Engineering
- Step 8 | Feature Selection
- Step 9 | Outlier Handling
- Step 10 | Scaling
- Step 11 | PCA
- Step 12 | Clustering Models
- Step 13 | Hyperparameter Sweeps
- Step 14 | Validation Metrics
- Step 15 | Visualization (scatter + centroids)
- Step 16 | Cluster Profiling
- Step 17 | Export artifacts
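A condensed sketch of the scaling, PCA, hyperparameter sweep, and validation steps listed above might look like the following. It assumes scikit-learn and uses mock numeric data from `make_blobs`; the feature count, the 95% variance threshold, and the range of `k` are illustrative choices, not the settings used in the notebooks.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Mock numeric features (assumption) standing in for the provider data.
X, _ = make_blobs(n_samples=400, n_features=6, centers=4, random_state=0)

# Scaling: standardize each feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)

# PCA: keep enough components to explain 95% of the variance.
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X_scaled)

# Hyperparameter sweep: try several cluster counts and score each one.
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_reduced)
    scores[k] = silhouette_score(X_reduced, labels)

# Validation: pick the k with the highest silhouette score.
best_k = max(scores, key=scores.get)
print("Components kept:", pca.n_components_)
print("Best k by silhouette:", best_k)
```

The same loop structure could sweep other hyperparameters, such as `eps` for DBSCAN or `n_components` for a GMM, with the scoring step unchanged.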
- Exploratory Data Analysis (EDA)
- Data Preprocessing (Data Cleaning & Transformation)
- Feature Engineering (Feature Extraction & Selection)
- Data Preparation
- K-Means Clustering
- Hierarchical Clustering
- DBSCAN
- Spectral Clustering
- Gaussian Mixture Model Clustering
- Silhouette Score
- Davies-Bouldin Index
- Calinski-Harabasz Index
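To show how the five models and three metrics listed above can be compared side by side, here is a hedged sketch using scikit-learn on synthetic data. The hyperparameters (`n_clusters=3`, `eps=0.3`, and so on) and the helper `evaluate` are assumptions made for illustration and do not reflect the tuned values from the project.

```python
from sklearn.cluster import AgglomerativeClustering, DBSCAN, KMeans, SpectralClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption), standardized before clustering.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=7)
X = StandardScaler().fit_transform(X)

# Illustrative hyperparameters only; real values would come from a sweep.
models = {
    "K-Means": KMeans(n_clusters=3, n_init=10, random_state=7),
    "Hierarchical": AgglomerativeClustering(n_clusters=3),
    "DBSCAN": DBSCAN(eps=0.3, min_samples=5),
    "Spectral": SpectralClustering(n_clusters=3, random_state=7),
    "GMM": GaussianMixture(n_components=3, random_state=7),
}


def evaluate(X, labels):
    """Hypothetical helper: score a labeling, ignoring DBSCAN noise (-1)."""
    mask = labels != -1
    if len(set(labels[mask].tolist())) < 2:
        return None  # the metrics are undefined for fewer than 2 clusters
    Xc, lc = X[mask], labels[mask]
    return {
        "silhouette": silhouette_score(Xc, lc),                # higher is better
        "davies_bouldin": davies_bouldin_score(Xc, lc),        # lower is better
        "calinski_harabasz": calinski_harabasz_score(Xc, lc),  # higher is better
    }


results = {}
for name, model in models.items():
    labels = model.fit_predict(X)
    results[name] = evaluate(X, labels)

for name, metrics in results.items():
    print(name, metrics)
```

Dropping DBSCAN noise points before scoring is one possible convention; it tends to flatter DBSCAN relative to the other models, so the fraction of points labeled as noise is worth reporting alongside the scores.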
This project contains two Jupyter Notebooks that document the process and results of the internship work.
Since the dataset may be confidential, it is not included in this repository.
- provider_segmentation_eda.ipynb: Exploratory Data Analysis (EDA) of provider-related data.
- provider_segmentation_clustering.ipynb: Clustering process and resulting segmentation.
You can open these notebooks in Jupyter Notebook, JupyterLab, or Google Colab to review the workflow and outputs.
This project is licensed under the Apache-2.0 License. See the LICENSE file for details.