A complete end-to-end AI-based customer behavior segmentation system built on real-world web analytics data from the Google Analytics 4 (GA4) BigQuery public dataset.
This project segments 269,542 web users of the Google Merchandise Store into distinct behavioral groups. It uses unsupervised clustering to find the groups and a supervised Random Forest classifier to predict, in real time, which segment a new web visitor belongs to, with 98.61% accuracy.
| Metric | Value |
|---|---|
| Total Users Segmented | 269,542 |
| Final Segments | 3 |
| Silhouette Score | 0.4622 |
| Classifier Accuracy | 98.61% |
| PCA Variance Retained | 91.53% |
| Clustering Algorithms | 4 (K-Means, DBSCAN, Hierarchical, GMM) |
AI-Customer-Behavior-Segmentation/
│
├── AI_Customer_Behavior_Segmentation_GA4_COLAB.ipynb   ← Main notebook
├── requirements.txt                                    ← Dependencies
├── README.md                                           ← This file
├── LICENSE                                             ← MIT License
├── .gitignore                                          ← Git ignore rules
│
├── docs/
│   ├── project_report.pdf                              ← Full project report
│   └── architecture.png                                ← Pipeline diagram
│
└── sample_outputs/
    ├── eda_distributions.png
    ├── eda_funnel.png
    ├── eda_correlation.png
    ├── optimal_k.png
    ├── silhouette.png
    ├── umap.png
    ├── tsne.png
    ├── radar.png
    ├── segment_dashboard.png
    ├── feature_importance.png
    ├── model_evaluation.png
    └── gmm_clustering.png
| Property | Details |
|---|---|
| Name | GA4 Obfuscated Sample Ecommerce |
| Source | bigquery-public-data.ga4_obfuscated_sample_ecommerce |
| Period | November 1, 2020 to January 31, 2021 (92 days) |
| Store | Google Merchandise Store |
| Events | page_view, scroll, view_item, add_to_cart, begin_checkout, purchase |
| Access | Free via Google BigQuery Sandbox |
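
As a rough illustration of how per-user event counts could be pulled from this dataset, the sketch below only builds the SQL text. The `events_*` wildcard tables, `_TABLE_SUFFIX`, `user_pseudo_id`, and `event_name` are part of the GA4 export schema. The function name and the `*_count` column aliases are assumptions made for this example, not names taken from the notebook.

```python
# Sketch only: builds the SQL text. Running it needs a GCP project and credentials.
DATASET = "bigquery-public-data.ga4_obfuscated_sample_ecommerce"
EVENTS = ("page_view", "scroll", "view_item", "add_to_cart", "begin_checkout", "purchase")


def build_event_counts_query(start: str = "20201101", end: str = "20210131") -> str:
    """Return SQL that counts each listed event per user over the date range."""
    # One COUNTIF column per event name from the dataset table.
    # The "<event>_count" aliases are hypothetical.
    count_cols = ",\n  ".join(
        f"COUNTIF(event_name = '{name}') AS {name}_count" for name in EVENTS
    )
    return (
        "SELECT\n"
        "  user_pseudo_id,\n"
        f"  {count_cols}\n"
        f"FROM `{DATASET}.events_*`\n"
        f"WHERE _TABLE_SUFFIX BETWEEN '{start}' AND '{end}'\n"
        "GROUP BY user_pseudo_id"
    )


sql = build_event_counts_query()
print(sql)

# To execute (assumes google-cloud-bigquery is installed and PROJECT_ID is defined):
# from google.cloud import bigquery
# df = bigquery.Client(project=PROJECT_ID).query(sql).to_dataframe()
```

Restricting `_TABLE_SUFFIX` to the dataset's date range keeps the scan limited to the daily tables that actually exist.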
| # | Technique | Type | Purpose |
|---|---|---|---|
| 1 | PCA | Dimensionality Reduction | 21 → 11 features, 91.53% variance retained |
| 2 | K-Means | Unsupervised Clustering | Primary segmentation algorithm |
| 3 | DBSCAN | Density-Based Clustering | Outlier detection + cluster confirmation |
| 4 | Agglomerative | Hierarchical Clustering | Dendrogram + cluster validation |
| 5 | GMM | Probabilistic Clustering | Soft assignments + uncertainty mapping |
| 6 | UMAP | Non-linear Reduction | 2D cluster visualization |
| 7 | t-SNE | Non-linear Reduction | Cluster separation validation |
| 8 | Random Forest | Supervised Classification | Real-time segment prediction |
| 9 | RFM Analysis | Statistical Scoring | Recency, Frequency, Monetary scoring |
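
For readers unfamiliar with RFM scoring, here is a minimal sketch on a made-up six-user table. The column names, the three-bin scale, and the `quantile_score` helper are assumptions for illustration. The notebook's own binning may differ.

```python
import pandas as pd

# Hypothetical per-user summary (NOT taken from the GA4 data).
users = pd.DataFrame(
    {
        "days_since_last_visit": [1, 30, 5, 80, 12, 60],
        "session_count": [9, 1, 4, 1, 6, 2],
        "total_revenue": [120.0, 0.0, 35.0, 0.0, 60.0, 0.0],
    },
    index=["a", "b", "c", "d", "e", "f"],
)


def quantile_score(series: pd.Series, higher_is_better: bool, bins: int = 3) -> pd.Series:
    """Rank the values, then cut the ranks into equal-sized bins scored 1..bins."""
    # rank(method="first") breaks ties, so qcut never sees duplicate bin edges.
    ranks = series.rank(method="first", ascending=higher_is_better)
    return pd.qcut(ranks, q=bins, labels=list(range(1, bins + 1))).astype(int)


# Recency: fewer days since the last visit is better, so the ranking is reversed.
users["R"] = quantile_score(users["days_since_last_visit"], higher_is_better=False)
users["F"] = quantile_score(users["session_count"], higher_is_better=True)
users["M"] = quantile_score(users["total_revenue"], higher_is_better=True)
users["rfm_total"] = users[["R", "F", "M"]].sum(axis=1)

print(users.sort_values("rfm_total", ascending=False))
```

Ranking before binning is a deliberate choice here: with many zero-revenue users, plain quantile cuts on the raw values would produce duplicate bin edges.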

Engaged Non-Converters:
- Regular visitors who never complete a purchase
- High session count and page views
- Strong add-to-cart behavior but zero conversions
- Strategy: Cart abandonment emails, 10% discount triggers, trust signals

High-Value Buyers:
- Premium segment driving 100% of revenue
- Average revenue $61.87 per user
- 55%+ conversion rate, 17+ pages per session
- Strategy: VIP loyalty program, personalized recommendations, early access

Casual Shoppers:
- Passive majority with minimal engagement
- Balanced but low values across all behavioral features
- 95%+ are new users with single sessions
- Strategy: Welcome popup, best-seller showcase, brand awareness campaigns

GA4 BigQuery Data (270,154 users)
        ↓
Feature Engineering (21 behavioral features)
        ↓
Log Transform + StandardScaler
        ↓
PCA (21 → 11 dimensions, 91.53% variance)
        ↓
┌─────────────────────────────────────┐
│ 4 Clustering Algorithms             │
│ ├── K-Means      (silhouette 0.462) │
│ ├── DBSCAN       (3 clusters found) │
│ ├── Hierarchical (silhouette 0.521) │
│ └── GMM          (probabilistic)    │
└─────────────────────────────────────┘
        ↓
3 Segments confirmed by all algorithms
        ↓
UMAP + t-SNE Visualization
        ↓
Random Forest Classifier (98.61% accuracy)
        ↓
┌─────────────────────────────────────┐
│ Prediction Engine                   │
│ ├── Single user prediction          │
│ ├── Batch prediction (CSV)          │
│ ├── Confidence scores               │
│ └── What-If simulator               │
└─────────────────────────────────────┘
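
The preprocessing and K-Means stages of the pipeline above can be sketched with scikit-learn as follows. The data is synthetic (three groups of right-skewed values standing in for the 21 behavioral features), and the 0.90 variance threshold passed to `PCA` is an assumption: the notebook reports 91.53% retained variance but the exact setting is not stated here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the feature matrix: three groups of non-negative,
# right-skewed values with 21 columns each (NOT the GA4 data).
X = np.vstack(
    [rng.lognormal(mean=m, sigma=0.3, size=(200, 21)) for m in (1.0, 3.0, 5.0)]
)

# 1. Log transform to tame the skew, then standardize each feature.
X_scaled = StandardScaler().fit_transform(np.log1p(X))

# 2. Keep as many principal components as needed to retain 90% of the variance
#    (threshold is an assumption for this sketch).
pca = PCA(n_components=0.90, random_state=0)
X_pca = pca.fit_transform(X_scaled)

# 3. Cluster in the reduced space and measure cluster separation.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_pca)
score = silhouette_score(X_pca, labels)

print(f"components kept: {pca.n_components_}")
print(f"variance retained: {pca.explained_variance_ratio_.sum():.2%}")
print(f"silhouette: {score:.3f}")
```

Because the synthetic groups are far better separated than real web traffic, the silhouette printed here will be much higher than the 0.462 reported for the actual data.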
| Section | Description |
|---|---|
| 1 | Install Libraries |
| 2 | Google Authentication |
| 3 | Project Configuration |
| 4 | Data Extraction from BigQuery |
| 5 | Data Overview & Schema |
| 6 | Exploratory Data Analysis (4 charts) |
| 7 | Data Cleaning & Preprocessing |
| 8 | Feature Engineering (21 features) |
| 9 | RFM Analysis |
| 10 | PCA Dimensionality Reduction |
| 11 | Finding Optimal K (Elbow + Silhouette + DB + CH) |
| 12 | K-Means Clustering |
| 13 | DBSCAN Clustering |
| 14 | Agglomerative Hierarchical Clustering |
| 15 | UMAP & t-SNE Visualization |
| 16 | Cluster Profiling & Segment Labeling |
| 17 | Segment Dashboard (6 charts) |
| 18 | Random Forest Classifier |
| 19 | Model Evaluation |
| 20 | Feature Importance |
| 21 | Final Report & Marketing Recommendations |
| + | GMM Clustering |
| + | K=3 vs K=4 vs K=5 Comparison |
| + | Prediction Engine (5 cells) |
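
Section 11 compares several criteria when choosing K. A minimal sketch of that kind of comparison is shown below on synthetic blobs with hand-picked centers. The data and the rule of picking K by best silhouette are assumptions for illustration, while the metric functions are standard scikit-learn calls.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Three well-separated synthetic clusters (NOT the GA4 data).
X, _ = make_blobs(
    n_samples=600,
    centers=[[0, 0], [8, 8], [-8, 8]],
    cluster_std=1.0,
    random_state=42,
)

results = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    results[k] = {
        "inertia": km.inertia_,  # elbow method: look for the bend
        "silhouette": silhouette_score(X, km.labels_),  # higher is better
        "davies_bouldin": davies_bouldin_score(X, km.labels_),  # lower is better
        "calinski_harabasz": calinski_harabasz_score(X, km.labels_),  # higher is better
    }

for k, row in results.items():
    print(k, {name: round(value, 3) for name, value in row.items()})

# One simple selection rule: take the K with the best silhouette.
best_k = max(results, key=lambda k: results[k]["silhouette"])
print("best K by silhouette:", best_k)
```

On real data the four criteria often disagree, which is presumably why the notebook also includes a direct K=3 vs K=4 vs K=5 comparison.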
- Google Account
- Google Cloud Platform project (free)
- BigQuery API enabled
Click the badge above or go to colab.research.google.com and upload the notebook.
When Section 2 runs, click the Google login link and sign in with your GCP account.
In Section 3, replace:
PROJECT_ID = "YOUR_GCP_PROJECT_ID"

Find your project ID at console.cloud.google.com.

Runtime → Run All
The notebook handles everything automatically. BigQuery results are cached after the first run.
After completion, download from the Colab Files panel:
/content/rf_segment_classifier.pkl
/content/scaler.pkl
/content/pca.pkl
/content/ga4_segmentation_results.csv
/content/segment_report.csv
/content/batch_predictions.csv
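
Once downloaded, the saved models could be reused outside Colab along these lines. This is a sketch under several assumptions: that the `.pkl` files were written with `joblib`, that the classifier expects scaled features (pass `pca` as well if it was trained on PCA components), and that `predict_segment` is a hypothetical helper, not a function from the notebook. Stand-in models are trained on synthetic data so the example runs without the real files.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler


def predict_segment(features, scaler, clf, pca=None):
    """Return (predicted label, confidence) for one user's feature vector."""
    x = scaler.transform(np.asarray(features, dtype=float).reshape(1, -1))
    if pca is not None:  # only if the classifier was trained on PCA components
        x = pca.transform(x)
    proba = clf.predict_proba(x)[0]
    best = int(np.argmax(proba))
    return clf.classes_[best], float(proba[best])


# Stand-in artifacts: three synthetic groups with 4 features (NOT the real models).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.5, size=(50, 4)) for loc in (0.0, 5.0, 10.0)])
y = np.repeat([0, 1, 2], 50)
scaler = StandardScaler().fit(X)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(scaler.transform(X), y)

with tempfile.TemporaryDirectory() as tmp:
    joblib.dump(scaler, str(Path(tmp) / "scaler.pkl"))
    joblib.dump(clf, str(Path(tmp) / "rf_segment_classifier.pkl"))
    # In practice these paths would point at the files downloaded from /content/.
    loaded_scaler = joblib.load(str(Path(tmp) / "scaler.pkl"))
    loaded_clf = joblib.load(str(Path(tmp) / "rf_segment_classifier.pkl"))

label, confidence = predict_segment([5.1, 4.9, 5.0, 5.2], loaded_scaler, loaded_clf)
print(f"segment={label}, confidence={confidence:.2f}")
```

The confidence value is the highest class probability from `predict_proba`, which is one common way to produce the kind of confidence score the prediction engine reports.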
# Install in Colab (only these 3 are needed; the rest are pre-installed)
umap-learn==0.5.7
xgboost==2.1.3
plotly==5.24.1
# Pre-installed on Colab
pandas, numpy, scikit-learn, matplotlib, seaborn
scipy, joblib, pyarrow, google-cloud-bigquery

Sessions → Page Views         : 0.1% drop
Page Views → Product Views    : 77.3% drop ← Critical gap
Product Views → Add to Cart   : 79.5% drop ← Critical gap
Add to Cart → Checkout        : 22.6% drop
Checkout → Purchase           : 54.5% drop
Only 1.6% of users (4,419 out of 270,154) completed a purchase.
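
As a quick arithmetic check, compounding the stage-by-stage drop-offs should land close to the overall purchase rate. The sketch below uses only the figures reported above. Both routes come out at roughly 1.6%.

```python
# Stage-to-stage drop-off rates as reported in the funnel.
drops = {
    "Sessions -> Page Views": 0.001,
    "Page Views -> Product Views": 0.773,
    "Product Views -> Add to Cart": 0.795,
    "Add to Cart -> Checkout": 0.226,
    "Checkout -> Purchase": 0.545,
}

# Multiply the survival rate (1 - drop) of every stage.
surviving = 1.0
for stage, drop in drops.items():
    surviving *= 1.0 - drop
    print(f"{stage:<30} {surviving:7.2%} of the starting population remains")

# Direct calculation from the reported user counts.
direct = 4_419 / 270_154

print(f"compounded funnel rate: {surviving:.2%}")
print(f"purchasers / users:     {direct:.2%}")
```

The two largest losses sit between page views and add-to-cart, so small improvements there move the final rate far more than changes at checkout.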
1. conversion_rate (0.1375): strongest behavioral signal
2. total_items_purchased (0.1370): purchase depth
3. total_pageviews (0.1133): browsing volume
4. purchase_count (0.1053): buying frequency
5. total_revenue (0.0975): monetary value
- Device type has near-zero importance for segmentation
- Traffic channel has near-zero importance for segmentation
- What users DO on the site matters far more than how they arrived
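
Rankings like the one above are typically read from a fitted Random Forest's `feature_importances_` attribute. The sketch below shows the mechanics on synthetic data in which two features depend on the segment label and one is pure noise. The feature names are borrowed from this README for readability only, and the values are made up.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
n = 600
segment = rng.integers(0, 3, size=n)

# Synthetic features (NOT the GA4 data): the first two carry signal about the
# segment, the third is random noise standing in for an uninformative attribute.
X = np.column_stack(
    [
        segment + rng.normal(0, 0.3, n),
        2 * segment + rng.normal(0, 1.0, n),
        rng.normal(0, 1.0, n),
    ]
)
names = ["conversion_rate", "total_pageviews", "device_type"]

clf = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, segment)

# Impurity-based importances are normalized to sum to 1 across features.
ranked = sorted(zip(names, clf.feature_importances_), key=lambda p: p[1], reverse=True)
for name, importance in ranked:
    print(f"{name:<18} {importance:.4f}")
```

Impurity-based importances can overstate high-cardinality or correlated features, so permutation importance is a common cross-check when the ranking drives business decisions.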
| Segment | Priority | Key Actions |
|---|---|---|
| High-Value Buyers | 🟢 RETAIN | VIP loyalty program, personalized recommendations |
| Engaged Non-Converters | 🔵 CONVERT | Cart abandonment emails, trust signals, retargeting |
| Casual Shoppers | ⚪ NURTURE | Welcome popup, best-sellers, brand awareness |
This project is licensed under the MIT License. See the LICENSE file for details.
- Google for the GA4 BigQuery public dataset
- Scikit-learn for clustering and classification algorithms
- UMAP-learn for dimensionality reduction visualization
- Plotly for interactive visualizations
Built with ❤️ using Google Colab + BigQuery + Python