HaswathaSridharan/Telecom_Customer_Churn_Prediction

Telecom Customer Churn Prediction

📌 Project Overview

This project analyzes the Telco Customer Churn dataset to uncover why customers leave and how churn can be predicted.
The objective was to move from reactive customer service to a proactive retention strategy by identifying key churn drivers and enabling targeted interventions.

The workflow includes:

  • Data cleaning and preparation
  • Exploratory Data Analysis (EDA)
  • Statistical validation and feature selection
  • Model development and evaluation
  • Business insights and recommendations

📊 Dataset

The dataset contains ~7,000 customer records with features across:

  • Demographics: Gender, SeniorCitizen, Partner, Dependents
  • Account Information: Tenure, Contract, PaperlessBilling, PaymentMethod
  • Services: PhoneService, MultipleLines, InternetService, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies
  • Billing: MonthlyCharges, TotalCharges
  • Target Variable: Churn (Yes/No)

🔎 Methodology

1. Data Preparation

  • Removed 11 missing values, 22 duplicates, and 10 outliers to maintain data integrity.
  • Applied Label Encoding for binary variables and One-Hot Encoding for multi-class variables.
  • Standardized numerical variables (Tenure, MonthlyCharges, TotalCharges) using z-score scaling.
  • Performed a 70/30 stratified train-test split to preserve churn imbalance (~27% churners).
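The preparation steps above can be sketched roughly as follows. This is a minimal illustration on a tiny made-up frame, not the project's actual code: the column values are invented, and fitting the scaler on the training part only is an assumption, since the text does not say whether scaling happened before or after the split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Tiny synthetic frame standing in for the Telco data (values are made up).
df = pd.DataFrame({
    "Partner": ["Yes", "No"] * 10,
    "Contract": ["Month-to-month", "One year", "Two year", "Month-to-month"] * 5,
    "Tenure": [float(i) for i in range(1, 21)],
    "MonthlyCharges": [20.0 + 4 * i for i in range(20)],
    "Churn": ["Yes", "No", "No", "No"] * 5,
})

# Binary variables -> 0/1 (the same result label encoding gives for Yes/No).
for col in ["Partner", "Churn"]:
    df[col] = df[col].map({"No": 0, "Yes": 1})

# Multi-class variables -> one-hot indicator columns.
df = pd.get_dummies(df, columns=["Contract"], dtype=int)

X = df.drop(columns="Churn")
y = df["Churn"]

# 70/30 split, stratified on the target so both parts keep the churn ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

# z-score scaling: fit on the training part, then apply to both parts.
num_cols = ["Tenure", "MonthlyCharges"]
scaler = StandardScaler()
X_train = X_train.copy()
X_test = X_test.copy()
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])
```

Fitting the scaler after the split keeps test-set statistics out of the training data; if the original notebook scaled first, the numbers would differ slightly.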

2. Exploratory Data Analysis (EDA)

  • Demographics: Senior Citizens churn at 41% vs. 24% for non-seniors; gender showed no impact.
  • Services: Fiber optic users churn at 42%, DSL at 19%, no-internet at 7%. Customers without Tech Support (46%) or Online Security (42%) churn at markedly higher rates. Streaming subscriptions are associated with lower churn.
  • Contracts & Tenure: Month-to-month contracts churn at 42.6%, one-year at 11%, two-year at 2.8%. New customers (<6 months) churn the most.
  • Billing & Payments: Paperless billing users churn more (34% vs. 20%). Electronic check is the riskiest method (45.1% churn) compared to stable methods (15–19%).
  • Charges: Churners pay higher monthly fees (Median $74 vs. $61) and have shorter tenure (9 vs. 37 months). TotalCharges adds little insight since it mirrors tenure × monthly charges.
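Segment-level churn rates and medians like those listed above are typically computed with a group-by. The sketch below shows one way to do it with pandas; the rows are invented for illustration and do not reproduce the figures in this README.

```python
import pandas as pd

# Made-up rows for illustration; not the real Telco records.
df = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "Month-to-month",
                 "One year", "One year", "Two year", "Two year", "Two year"],
    "Tenure": [2, 5, 30, 14, 40, 60, 48, 70],
    "Churn": ["Yes", "Yes", "No", "No", "Yes", "No", "No", "No"],
})

churned = df["Churn"].eq("Yes")

# Churn rate per category: the mean of a boolean column is the share of True.
rate_by_contract = churned.groupby(df["Contract"]).mean().mul(100).round(1)

# Median tenure for churners vs. non-churners.
median_tenure = df.groupby("Churn")["Tenure"].median()
```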

3. Statistical Validation

  • Chi-Square Test: Significant categorical predictors → Contract, PaymentMethod, TechSupport, OnlineSecurity (p < 0.001).
  • Post-Hoc Tests:
    • MultipleLines: “No” + “No phone service” (~25% churn) merged; “Yes” churns at 28.6%.
    • PaymentMethod: Bank Transfer, Credit Card, Mailed Check (15–19%) grouped as Stable methods; Electronic check (45.1%) kept separate.
  • Numerical Feature Tests: Both Welch’s t-test and Mann–Whitney U-test showed highly significant differences (p < 0.001) for Tenure, MonthlyCharges, TotalCharges.
  • Multicollinearity (VIF):
    • Tenure (5.87) → moderate collinearity.
    • MonthlyCharges (3.25) → low collinearity, unique signal.
    • TotalCharges (9.57) → high collinearity; dropped for Logistic Regression only.
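For readers who want to reproduce this kind of validation, the sketch below runs the same families of tests on synthetic data: a chi-square test of independence, Welch's t-test, the Mann–Whitney U-test, and a variance inflation factor computed as VIF = 1 / (1 − R²). The contingency counts, the sample distributions, and the `vif` helper are all assumptions made for illustration; they are not taken from the project.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Chi-square test of independence on a contingency table.
# Rows: contract type; columns: [stayed, churned]. Counts are illustrative.
table = np.array([[570, 430],
                  [890, 110],
                  [970, 30]])
chi2, p_chi2, dof, _ = stats.chi2_contingency(table)

# Welch's t-test and Mann-Whitney U on a numeric feature.
# Synthetic tenure samples (months) for churners and non-churners.
tenure_churn = rng.exponential(scale=15, size=300)
tenure_stay = rng.exponential(scale=40, size=700)
t_stat, p_welch = stats.ttest_ind(tenure_churn, tenure_stay, equal_var=False)
u_stat, p_mwu = stats.mannwhitneyu(
    tenure_churn, tenure_stay, alternative="two-sided"
)


def vif(X):
    """Return the VIF of each column of X (n_samples x n_features)."""
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(y)), others])  # intercept + others
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)


# Synthetic features where "total" is nearly tenure x monthly, plus noise.
tenure = rng.uniform(1, 72, size=500)
monthly = rng.uniform(20, 120, size=500)
total = tenure * monthly + rng.normal(0, 50, size=500)
vifs = vif(np.column_stack([tenure, monthly, total]))
```

In this synthetic setup the third column (the product of the other two) receives the largest VIF, which mirrors why TotalCharges is the natural candidate to drop.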

4. Model Development

Implemented and tuned four models using GridSearchCV (3-fold cross-validation):

  • Decision Tree → Best: gini, max_depth=6, min_samples_split=2, min_samples_leaf=2.
  • Random Forest → Best: entropy, max_depth=8, min_samples_split=2, min_samples_leaf=4.
  • KNN → Best: kd_tree, n_neighbors=8, weights=uniform.
  • Logistic Regression → Used class_weight=balanced, max_iter=1000; dropped TotalCharges to reduce multicollinearity.
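A grid search of this kind might look like the sketch below, shown for the Decision Tree only. The parameter ranges and the `roc_auc` scoring choice are assumptions: the README reports the winning values but not the grid that was searched or the metric that was optimized. The data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced binary data standing in for the prepared features.
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.73, 0.27], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

# Hypothetical grid; only the best values are reported in the text.
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [4, 6, 8],
    "min_samples_split": [2, 10],
    "min_samples_leaf": [2, 4],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=3,               # 3-fold cross-validation, as stated in the text
    scoring="roc_auc",  # assumption: the scoring metric is not stated
)
search.fit(X_train, y_train)

best_params = search.best_params_
# With scoring="roc_auc", score() returns ROC-AUC on the held-out data.
test_auc = search.score(X_test, y_test)
```

The same pattern applies to the other three models by swapping the estimator and its grid.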

5. Model Evaluation

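| Model               | Accuracy | Precision | Recall | F1-Score | ROC-AUC |
|---------------------|----------|-----------|--------|----------|---------|
| Random Forest       | 0.812    | 0.733     | 0.458  | 0.564    | 0.872   |
| KNN                 | 0.802    | 0.630     | 0.607  | 0.618    | 0.856   |
| Logistic Regression | 0.807    | 0.669     | 0.534  | 0.594    | 0.847   |
| Decision Tree       | 0.797    | 0.646     | 0.520  | 0.576    | 0.845   |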
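As a quick consistency check, F1 is the harmonic mean of Precision and Recall, so it can be recomputed from the other two columns. The short sketch below does this for the four rows of the evaluation table; the `f1_from` helper name is only illustrative.

```python
# (precision, recall, reported F1) copied from the evaluation table.
reported = {
    "Random Forest": (0.733, 0.458, 0.564),
    "KNN": (0.630, 0.607, 0.618),
    "Logistic Regression": (0.669, 0.534, 0.594),
    "Decision Tree": (0.646, 0.520, 0.576),
}


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


recomputed = {
    name: round(f1_from(p, r), 3) for name, (p, r, _) in reported.items()
}
```

The recomputed values should agree with the reported F1-Scores up to rounding, since the inputs are themselves rounded to three decimals.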

Key Insights & Recommendations:

  • Random Forest → Highest Accuracy (0.812), Precision (0.733), and ROC-AUC (0.872), but the lowest Recall (0.458): it catches fewer than half of actual churners.
  • KNN → Highest Recall (0.607) and F1-Score (0.618); effective for identifying churners.
  • Logistic Regression → Balanced and interpretable; ideal for explaining churn drivers.
  • Decision Tree → Transparent baseline; less robust than ensembles.

👉 Best Use Cases:

  • Use Random Forest for predictive performance.
  • Use Logistic Regression when interpretability and business communication matter most.
  • Use KNN when maximizing Recall (catching churners) is critical.

6. Business Insights

  • Senior Citizens & Family Support: Higher churn among seniors (41%) and single customers (31–32%). Retention should prioritize these groups.
  • Service Subscriptions: Customers without Tech Support or Online Security churn far more often; bundling these services may reduce risk. Streaming subscriptions are associated with better retention.
  • Contracts & Tenure: Month-to-month plans are unstable (42.6% churn). Incentives for long-term contracts can significantly reduce churn.
  • Billing & Payments: Paperless billing users churn more (34% vs. 20%). Electronic check (45.1%) is the riskiest method; auto-pay incentives recommended.
  • Charges & Loyalty: High monthly charges drive churn early, but long-tenure, high-spend customers remain loyal.

🚀 Key Takeaways

  • Random Forest leads on Accuracy and ROC-AUC, making it the strongest overall model for ranking churn risk, although its Recall is the lowest of the four.
  • KNN is effective when maximizing Recall is the priority.
  • Logistic Regression provides interpretability for business decision-making.
  • Insights reveal who is at risk of churn and why, enabling targeted retention strategies.
