This project focuses on predicting two essential outcomes for auto insurance claims:
- The probability of a car crash (`TARGET_FLAG`) - Binary Logistic Regression (BLR)
- The potential claim amount if a crash occurs (`TARGET_AMT`) - Multiple Linear Regression (MLR)
Accurate predictions of accident probability and claim amounts help insurance providers assess risk, set fair premiums, and handle claims efficiently. This project pairs careful data preparation with model selection to maximize prediction accuracy, so that the models hold up on highly variable real-world data.
The project follows a systematic data preparation and modeling workflow to handle this high-dimensional dataset, clean inconsistencies, address class imbalances, and tackle multicollinearity. Below is the comprehensive flowchart of the workflow:
- Full Model: Includes all predictors to assess the overall feature impact.
- Stepwise Model: Refined using stepwise selection to improve simplicity and interpretability.
- Null Model: Serves as a baseline.
- Full Model: Includes all predictors to explore all possible risk factors.
- Stepwise Model: Adds preprocessing steps (removing near-zero variance and correlated features) for a leaner, more focused model.
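
The two preprocessing steps named above (dropping near-zero-variance features and dropping correlated features) can be sketched as follows. This is a minimal illustration in Python, not the project's R code: the column names, the `var_tol` and `corr_cutoff` thresholds, and the greedy "keep the first of a correlated pair" rule are all assumptions made for the example.

```python
import math


def variance(values):
    """Sample variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


def pearson(x, y):
    """Pearson correlation between two equal-length, non-constant lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def filter_features(columns, var_tol=1e-8, corr_cutoff=0.9):
    """Return the names of the columns that survive both filters.

    columns: dict mapping a feature name to its list of numeric values.
    var_tol and corr_cutoff are illustrative thresholds (assumptions).
    """
    # Step 1: drop features whose variance is (near) zero.
    varying = [name for name, vals in columns.items() if variance(vals) > var_tol]

    # Step 2: greedily keep a feature only if it is not highly correlated
    # with any feature that has already been kept.
    selected = []
    for name in varying:
        if all(abs(pearson(columns[name], columns[s])) < corr_cutoff
               for s in selected):
            selected.append(name)
    return selected


# Hypothetical toy data: one constant column and one rescaled duplicate.
columns = {
    "feat_a": [1, 2, 3, 4, 5],
    "feat_b": [2, 1, 4, 3, 5],
    "feat_a_scaled": [10, 20, 30, 40, 50],
    "feat_const": [1, 1, 1, 1, 1],
}
kept = filter_features(columns)
print(kept)
```

Here `feat_const` is removed by the variance filter and `feat_a_scaled` by the correlation filter, because it is a linear rescaling of `feat_a`. `feat_b` stays: its correlation with `feat_a` is 0.8, below the assumed cutoff.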
Each model was evaluated on a set of important metrics to identify the best-performing approach.
MLR models were evaluated on Mean Squared Error (MSE), Root Mean Squared Error (RMSE), R-squared, Adjusted R-squared, and F-statistic:
- MSE & RMSE: MSE is the average of the squared differences between actual and predicted values; RMSE is its square root, which expresses the typical error in the same units as the claim amount.
- R-squared & Adjusted R-squared: R-squared is the proportion of variance in the target variable explained by the model. Adjusted R-squared penalizes each additional predictor, so it rises only when a new predictor improves the fit by more than chance would, which makes it the fairer basis for comparing models of different sizes.
- F-statistic: Tests the overall significance of the model against an intercept-only model; larger values are stronger evidence that the predictors jointly explain variance in the target.
The Stepwise MLR model slightly outperformed the Full Model with a higher Adjusted R-squared and F-statistic, indicating a more parsimonious model with similar predictive power.
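
To make the regression metrics above concrete, here is a minimal Python sketch that computes them from actual and predicted values. It is illustrative only and not taken from the project's R scripts. The function name, the toy numbers, and the choice to divide MSE by `n` are assumptions, and the F-statistic formula assumes an ordinary least squares fit with an intercept.

```python
import math


def regression_metrics(actual, predicted, n_predictors):
    """Compute MSE, RMSE, R-squared, adjusted R-squared and the F-statistic.

    n_predictors is the number of predictors, excluding the intercept.
    """
    n = len(actual)
    p = n_predictors
    mean_actual = sum(actual) / n

    # Sum of squared errors (residual) and total sum of squares.
    sse = sum((a - f) ** 2 for a, f in zip(actual, predicted))
    sst = sum((a - mean_actual) ** 2 for a in actual)

    mse = sse / n                      # average squared error
    rmse = math.sqrt(mse)              # same units as the target
    r2 = 1 - sse / sst                 # share of variance explained
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)   # penalizes extra predictors
    f_stat = (r2 / p) / ((1 - r2) / (n - p - 1))    # overall significance

    return {"mse": mse, "rmse": rmse, "r2": r2, "adj_r2": adj_r2, "f": f_stat}


# Hypothetical values, chosen only to keep the arithmetic easy to follow.
actual = [3.0, 5.0, 7.0, 9.0, 11.0]
predicted = [2.5, 5.5, 7.0, 9.5, 10.5]
metrics = regression_metrics(actual, predicted, n_predictors=1)
print(metrics)
```

With these numbers the squared errors sum to 1.0 and the total sum of squares is 40, so R-squared is 0.975. Adjusted R-squared is lower (about 0.967) because of the penalty for the single predictor.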
BLR models were assessed with Accuracy, Error Rate, Kappa, Precision, Sensitivity, Specificity, F1 Score, and AUC (Area Under the Curve):
- Accuracy & Error Rate: Accuracy is the share of cases classified correctly, and the error rate is its complement (1 - accuracy). Together they give a straightforward performance overview.
- Kappa: Measures agreement between predictions and actual values after correcting for the agreement expected by chance, which makes it a fairer metric than accuracy on imbalanced datasets.
- Precision & Sensitivity: Precision is the share of predicted crashes that were actual crashes; sensitivity (recall) is the share of actual crashes the model caught. Both are essential in risk prediction.
- Specificity: The share of non-crash cases the model correctly classified as non-crash.
- F1 Score & AUC: F1 is the harmonic mean of precision and sensitivity, while AUC reflects the overall ability to discriminate between crash and non-crash cases across all classification thresholds.
The Stepwise BLR model achieved the best AUC, Kappa, and F1 scores, demonstrating balanced predictive power with reduced predictor redundancy.
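
The classification metrics above all derive from the confusion matrix, except AUC, which is computed from the predicted probabilities. The following Python sketch shows one way to compute them. It is not the project's R code: the function names, the 0.5 threshold, and the toy data are assumptions, and it assumes both classes are present and at least one case is predicted positive.

```python
def classification_metrics(actual, predicted):
    """Confusion-matrix metrics for binary labels (1 = crash, 0 = no crash)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    n = tp + fp + tn + fn

    accuracy = (tp + tn) / n
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)

    # Cohen's kappa: observed agreement corrected for chance agreement.
    chance = ((tp + fp) / n) * ((tp + fn) / n) + ((tn + fn) / n) * ((tn + fp) / n)
    kappa = (accuracy - chance) / (1 - chance)

    return {
        "accuracy": accuracy, "error_rate": 1 - accuracy, "kappa": kappa,
        "precision": precision, "sensitivity": sensitivity,
        "specificity": specificity, "f1": f1,
    }


def auc(actual, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for a, s in zip(actual, scores) if a == 1]
    neg = [s for a, s in zip(actual, scores) if a == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical example: 3 crashes among 10 policies.
actual = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.2, 0.1, 0.1, 0.05]
predicted = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

results = classification_metrics(actual, predicted)
results["auc"] = auc(actual, scores)
print(results)
```

In this toy example accuracy is 0.8 but Kappa is only about 0.52, which illustrates why Kappa is the more conservative measure when non-crash cases dominate.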
Based on these metrics, the Stepwise models for both MLR and BLR were chosen because they balance predictive power with simplicity. They were then retrained on the full dataset to produce robust final models for prediction on unseen data.
- Data: Source data files.
- Resources: Supporting images and charts for reference.
- Code: R scripts for data preparation, model building, and evaluation.
- Final Predictions: Exported predictions on test data for easy access.
Happy analyzing! ✨