This project analyzes the Telco Customer Churn dataset to uncover why customers leave and how churn can be predicted.
The objective was to move from reactive customer service to a proactive retention strategy by identifying key churn drivers and enabling targeted interventions.
The workflow includes:
- Data cleaning and preparation
- Exploratory Data Analysis (EDA)
- Statistical validation and feature selection
- Model development and evaluation
- Business insights and recommendations
The dataset contains ~7,000 customer records with features across:
- Demographics: Gender, SeniorCitizen, Partner, Dependents
- Account Information: Tenure, Contract, PaperlessBilling, PaymentMethod
- Services: PhoneService, MultipleLines, InternetService, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies
- Billing: MonthlyCharges, TotalCharges
- Target Variable: Churn (Yes/No)
- Removed 11 missing values, 22 duplicates, and 10 outliers to maintain data integrity.
- Applied Label Encoding for binary variables and One-Hot Encoding for multi-class variables.
- Standardized numerical variables (Tenure, MonthlyCharges, TotalCharges) using z-score scaling.
- Performed a 70/30 stratified train-test split to preserve churn imbalance (~27% churners).
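The preparation steps listed above could look roughly like the sketch below. It is not the project's actual code: the small DataFrame is synthetic, the column subset is chosen for brevity, and `random_state=42` is an arbitrary assumption.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Tiny synthetic frame standing in for the real dataset (values are made up).
df = pd.DataFrame({
    "Partner": ["Yes", "No"] * 10,
    "Contract": ["Month-to-month", "One year", "Two year", "Month-to-month"] * 5,
    "Tenure": [float(t) for t in range(1, 21)],
    "MonthlyCharges": [20.0 + 4 * i for i in range(20)],
    "Churn": ["Yes", "No", "No", "No"] * 5,
})
df["TotalCharges"] = df["Tenure"] * df["MonthlyCharges"]

# Basic cleaning: drop rows with missing values and exact duplicates.
df = df.dropna().drop_duplicates()

# Binary columns -> 0/1 (what label encoding yields for Yes/No fields).
for col in ["Partner", "Churn"]:
    df[col] = df[col].map({"No": 0, "Yes": 1})

# Multi-class columns -> one-hot indicator columns.
df = pd.get_dummies(df, columns=["Contract"], dtype=int)

X = df.drop(columns="Churn")
y = df["Churn"]

# 70/30 split, stratified on the target so both parts keep the churn rate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

# z-score scaling, fitted on the training rows only (an assumption, see note).
num_cols = ["Tenure", "MonthlyCharges", "TotalCharges"]
scaler = StandardScaler()
X_train = X_train.copy()
X_test = X_test.copy()
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])
```

The list above mentions scaling before the split. The sketch fits the scaler on the training rows only, a common way to keep test-set statistics out of training; whether the project did the same is not stated.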
- Demographics: Senior Citizens churn at 41% vs. 24% for non-seniors; gender showed no impact.
- Services: Fiber optic users churn at 42%, DSL at 19%, no-internet at 7%. Lack of Tech Support (46%) or Online Security (42%) is strongly linked to churn. Streaming services reduce churn.
- Contracts & Tenure: Month-to-month contracts churn at 42.6%, one-year at 11%, two-year at 2.8%. New customers (<6 months) churn the most.
- Billing & Payments: Paperless billing users churn more (34% vs. 20%). Electronic check is the riskiest method (45.1% churn) compared to stable methods (15–19%).
- Charges: Churners pay higher monthly fees (median $74 vs. $61) and have shorter tenure (9 vs. 37 months). TotalCharges adds little insight since it mirrors tenure × monthly charges.
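Segment-level churn rates like those above can be computed with a plain group-by. The following is a minimal sketch on invented rows, so its numbers do not match the figures reported here.

```python
import pandas as pd

# Ten made-up customers; only the column names follow the dataset.
df = pd.DataFrame({
    "Contract": ["Month-to-month"] * 5 + ["One year"] * 3 + ["Two year"] * 2,
    "Tenure": [2, 5, 1, 30, 8, 24, 40, 12, 60, 72],
    "Churn": ["Yes", "Yes", "No", "No", "Yes", "No", "Yes", "No", "No", "No"],
})

churned = df["Churn"].eq("Yes")

# Churn rate (%) per contract type: the mean of a boolean column is the share of True.
churn_rate = churned.groupby(df["Contract"]).mean().mul(100).round(1)

# Median tenure for churners vs. non-churners.
median_tenure = df.groupby("Churn")["Tenure"].median()

print(churn_rate)
print(median_tenure)
```

Swapping `"Contract"` for any other categorical column gives the same breakdown for that feature.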
- Chi-Square Test: Significant categorical predictors include Contract, PaymentMethod, TechSupport, OnlineSecurity (p < 0.001).
- Post-Hoc Tests:
  - MultipleLines: "No" and "No phone service" (~25% churn) merged; "Yes" churns at 28.6%.
  - PaymentMethod: Bank Transfer, Credit Card, Mailed Check (15–19%) grouped as Stable methods; Electronic check (45.1%) kept separate.
- Numerical Feature Tests: Both Welch's t-test and the Mann–Whitney U test showed highly significant differences (p < 0.001) for Tenure, MonthlyCharges, TotalCharges.
- Multicollinearity (VIF):
  - Tenure (5.87): moderate collinearity.
  - MonthlyCharges (3.25): low collinearity, unique signal.
  - TotalCharges (9.57): high collinearity; dropped for Logistic Regression only.
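The tests named above can be reproduced along the following lines. This is a sketch with invented counts and synthetic samples, not the project's data. The `vif` helper is hypothetical and relies on the identity that the VIF of each feature, 1 / (1 - R²), equals the matching diagonal entry of the inverse correlation matrix.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Chi-square test of independence on a 2x2 contingency table.
# Rows: two levels of a categorical feature; columns: [stayed, churned].
# The counts are invented for illustration.
table = np.array([[570, 430],
                  [970, 30]])
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)

# Two-sample tests on a numeric feature (synthetic tenure values).
tenure_churn = rng.exponential(scale=12, size=300)
tenure_stay = rng.exponential(scale=35, size=700)
# equal_var=False selects Welch's t-test (no equal-variance assumption).
t_stat, p_welch = stats.ttest_ind(tenure_churn, tenure_stay, equal_var=False)
u_stat, p_mwu = stats.mannwhitneyu(
    tenure_churn, tenure_stay, alternative="two-sided"
)

def vif(X):
    """Variance inflation factor per column of X (rows = observations)."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

# Synthetic features where total is roughly tenure * monthly, plus noise.
tenure = rng.uniform(1, 72, size=1000)
monthly = rng.uniform(20, 120, size=1000)
total = tenure * monthly * rng.normal(1.0, 0.05, size=1000)
vifs = vif(np.column_stack([tenure, monthly, total]))

print(f"chi2 p={p_chi2:.3g}, Welch p={p_welch:.3g}, MWU p={p_mwu:.3g}")
print("VIF (tenure, monthly, total):", np.round(vifs, 2))
```

Because `total` is built from the other two columns, it should show the largest VIF in this toy setup, which mirrors the reason TotalCharges was dropped for Logistic Regression.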
Implemented and tuned four models using GridSearchCV (3-fold cross-validation):
- Decision Tree (best parameters): criterion=gini, max_depth=6, min_samples_split=2, min_samples_leaf=2.
- Random Forest (best parameters): criterion=entropy, max_depth=8, min_samples_split=2, min_samples_leaf=4.
- KNN (best parameters): algorithm=kd_tree, n_neighbors=8, weights=uniform.
- Logistic Regression: Used class_weight=balanced, max_iter=1000; dropped TotalCharges to reduce multicollinearity.
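A grid search with 3-fold cross-validation, as described above, might be set up as in this sketch for the Random Forest. The data is synthetic, the parameter grid is hypothetical (the searched ranges are not given here), and `scoring="roc_auc"`, `n_estimators=50`, and `random_state=42` are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced binary data (~27% positives) in place of the real features.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    weights=[0.73, 0.27], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

# Hypothetical grid; it simply contains the reported best values.
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [4, 8],
    "min_samples_split": [2],
    "min_samples_leaf": [2, 4],
}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=42),
    param_grid,
    cv=3,                # 3-fold cross-validation
    scoring="roc_auc",   # assumption: the scoring metric is not stated
    n_jobs=1,
)
search.fit(X_train, y_train)

best_params = search.best_params_
test_auc = search.score(X_test, y_test)  # ROC-AUC of the refit best model
print(best_params, round(test_auc, 3))
```

The same pattern applies to the other three models by swapping the estimator and its grid.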
| Model | Accuracy | Precision | Recall | F1-Score | ROC-AUC |
|---|---|---|---|---|---|
| Random Forest | 0.812 | 0.733 | 0.458 | 0.564 | 0.872 |
| KNN | 0.802 | 0.630 | 0.607 | 0.618 | 0.856 |
| Logistic Regression | 0.807 | 0.669 | 0.534 | 0.594 | 0.847 |
| Decision Tree | 0.797 | 0.646 | 0.520 | 0.576 | 0.845 |
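As a quick sanity check on the table, F1 is the harmonic mean of Precision and Recall, so it can be recomputed from the other two columns. The sketch below copies the reported values and uses only the standard library; the helper name `f1_from` is illustrative.

```python
# (precision, recall, reported F1) copied from the table above.
reported = {
    "Random Forest": (0.733, 0.458, 0.564),
    "KNN": (0.630, 0.607, 0.618),
    "Logistic Regression": (0.669, 0.534, 0.594),
    "Decision Tree": (0.646, 0.520, 0.576),
}

def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

recomputed = {
    model: round(f1_from(p, r), 3) for model, (p, r, _) in reported.items()
}

# Largest absolute difference between recomputed and reported F1.
max_gap = max(abs(recomputed[m] - reported[m][2]) for m in reported)

for model, f1 in recomputed.items():
    print(f"{model}: recomputed F1 = {f1:.3f}, reported = {reported[model][2]:.3f}")
```

Small gaps are expected because the inputs are already rounded to three decimals.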
Key Insights & Recommendations:
- Random Forest: Highest Accuracy (0.812), Precision (0.733), and ROC-AUC (0.872), but the lowest Recall (0.458).
- KNN: Highest Recall (0.607) and F1-Score (0.618); effective for identifying churners.
- Logistic Regression: Balanced and interpretable; ideal for explaining churn drivers.
- Decision Tree: Transparent baseline; less robust than ensembles.
Best Use Cases:
- Use Random Forest when overall ranking quality (ROC-AUC) and Precision matter most.
- Use Logistic Regression when interpretability and business communication matter most.
- Use KNN when maximizing Recall (catching churners) is critical.
- Senior Citizens & Family Support: Higher churn among seniors (41%) and single customers (31–32%). Retention should prioritize these groups.
- Service Subscriptions: Lack of Tech Support/Online Security drives churn; bundling these services reduces risk. Streaming services improve retention.
- Contracts & Tenure: Month-to-month plans are unstable (42.6% churn). Incentives for long-term contracts can significantly reduce churn.
- Billing & Payments: Paperless billing users churn more (34% vs. 20%). Electronic check (45.1%) is the riskiest method; auto-pay incentives recommended.
- Charges & Loyalty: High monthly charges drive churn early, but long-tenure, high-spend customers remain loyal.
- Random Forest is the strongest model for churn prediction by Accuracy and ROC-AUC.
- KNN is effective when maximizing Recall is the priority.
- Logistic Regression provides interpretability for business decision-making.
- Insights reveal who is at risk of churn and why, enabling targeted retention strategies.