This project focuses on building a machine learning pipeline to predict whether a customer will file a car insurance claim based on policy, demographic, and vehicle-related features.
The motivation behind this project is to assist insurance companies in:
- Risk Assessment: Identifying high-risk customers more accurately.
- Operational Efficiency: Reducing manual effort in claim risk analysis.
- Customer Retention: Offering personalized services and pricing strategies.
Source: Kaggle Car Insurance Claim Dataset
Size: ~40,000 policyholder records
Target: Claim Status (1 = Claim, 0 = No Claim)
Features include: Policy tenure, age of car, vehicle type, population density, premium amount, income group, region type, etc.
1. Exploratory Data Analysis (EDA)
- Data cleaning (missing values, duplicates, outliers)
- Feature distributions and correlations
- Class imbalance detection
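The EDA checks above can be sketched with pandas. This is a minimal illustration on a toy frame: the column names (`policy_tenure`, `age_of_car`, `claim_status`) and the helper `summarize` are hypothetical and may not match the actual dataset or notebook.

```python
import pandas as pd


def summarize(df: pd.DataFrame, target: str) -> dict:
    """Return basic data-quality and class-balance statistics."""
    return {
        "n_rows": len(df),
        # Rows that exactly repeat an earlier row.
        "n_duplicates": int(df.duplicated().sum()),
        # Count of missing values in each column.
        "missing_per_column": df.isna().sum().to_dict(),
        # Fraction of rows in each target class (reveals imbalance).
        "class_share": df[target].value_counts(normalize=True).to_dict(),
    }


# Toy frame with hypothetical column names, not the real dataset.
toy = pd.DataFrame({
    "policy_tenure": [0.5, 1.2, 1.2, None, 0.9, 0.3],
    "age_of_car": [0.1, 0.4, 0.4, 0.2, 0.0, 0.6],
    "claim_status": [0, 0, 0, 1, 0, 0],
})
report = summarize(toy, "claim_status")
print(report)
```

A class share far from 50/50 in the target column is the signal that imbalance handling is needed later in the pipeline.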
2. Feature Engineering
- Encoding categorical features
- Scaling numerical features
- Handling class imbalance using oversampling (SMOTE), random undersampling, and class weights
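As a rough sketch of the encoding, scaling, and class-weight steps, the snippet below uses scikit-learn's `ColumnTransformer` and `compute_class_weight`. The column names and toy values are assumptions for illustration only. SMOTE and random undersampling are provided by the separate imbalanced-learn package and are left out here to keep the example dependency-light.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.utils.class_weight import compute_class_weight

# Toy data with hypothetical column names, not the real dataset.
toy = pd.DataFrame({
    "policy_tenure": [0.5, 1.2, 0.8, 0.1, 0.9, 0.3, 1.1, 0.7, 0.4, 0.6],
    "age_of_car": [0.1, 0.4, 0.3, 0.2, 0.0, 0.6, 0.5, 0.2, 0.1, 0.3],
    "region_type": ["urban", "rural", "urban", "urban", "rural",
                    "urban", "rural", "urban", "rural", "urban"],
})
y = np.array([0, 0, 0, 1, 0, 0, 0, 0, 1, 0])  # 8 "no claim", 2 "claim"

# Scale numeric columns and one-hot encode the categorical column.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["policy_tenure", "age_of_car"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["region_type"]),
])
X = preprocess.fit_transform(toy)

# "balanced" weights are n_samples / (n_classes * class_count),
# so the rare class receives the larger weight.
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
class_weight = dict(zip([0, 1], weights))
print(X.shape, class_weight)
```

In a real pipeline the transformer would be fitted on the training split only and then applied to the test split, so that scaling statistics do not leak from test data.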
3. Model Training
Trained and compared multiple models:
- Random Forest
- Logistic Regression
- CatBoost
- XGBoost
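A minimal sketch of training several models on one shared split is shown below. It uses synthetic data from `make_classification` rather than the insurance dataset, and covers only the two scikit-learn models. CatBoost and XGBoost ship as separate packages whose classifiers follow a similar `fit`/`predict` pattern and could be added to the same dictionary. All hyperparameters here are illustrative assumptions, not the project's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the real data (assumption).
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.94, 0.06], random_state=0
)

# Stratify so both splits keep the same class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "random_forest": RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=0
    ),
    "logistic_regression": LogisticRegression(
        class_weight="balanced", max_iter=1000
    ),
}

# Fit every model on the same training data and predict on the same test set.
predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions[name] = model.predict(X_test)
```

Keeping the split and random seeds fixed across models is what makes the later comparison fair.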
4. Model Evaluation
- Metrics: Accuracy, Precision, Recall, F1-score
- Compared models on the same test set
- Best-performing model: Random Forest with adjusted class weights
- Macro-averaged F1-score achieved: 51%
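To help interpret a macro-averaged F1-score on imbalanced data, the sketch below computes it by hand for a trivial baseline that always predicts "no claim". The 94/6 class split is a made-up example, not a figure from this project. The helper names are hypothetical; scikit-learn's `f1_score(..., average="macro")` computes the same quantity.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 for one class, treating that class as the positive label."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Hypothetical test set: 6 claims, 94 non-claims.
y_true = [1] * 6 + [0] * 94
# Baseline that never predicts a claim.
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
print(f"accuracy={accuracy:.3f}, macro F1={score:.3f}")
```

Because macro averaging weights both classes equally, this baseline looks strong on accuracy but weak on macro F1, which is why macro F1 is the more informative headline metric for the minority "claim" class.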