In financial lending, risk is everything. Every borrower represents a probability—will they repay the loan or default? Poorly predicted defaults can cause massive losses, destabilizing entire financial institutions. Accurate credit risk analysis isn’t just a statistical problem; it’s a survival strategy.
Unlike generic machine learning pipelines, this project was built with deep consideration of finance-specific metrics like Information Value (IV) and Weight of Evidence (WOE), ensuring the models aren’t just accurate but interpretable and actionable.
The dataset used in this project is an anonymized version of the American Express Default Prediction Dataset, comprising borrower-level information such as income, credit limits, previous defaults, and more. Here's a quick look:
- Train Dataset: 45,528 records
- Test Dataset: 11,383 records
- Target Variable: `credit_card_default` (binary classification: 1 for default, 0 for non-default)
Traditional credit risk models often rely on static statistical methods that fail to capture complex, non-linear relationships in data. We wanted to push beyond these limitations by building:
- A robust pipeline that handles data preprocessing, feature selection, scaling, and class imbalance effectively.
- Multiple machine learning models with a custom evaluation framework.
- A solution with high interpretability, making it practical for real-world financial institutions to adopt.
Our workflow is broken down into several key stages:
- Imputation of Missing Values (see the sketch after this list):
  - Categorical features were imputed using their mode.
  - Numerical features were imputed using the median to reduce the impact of outliers.
- Dropping Unnecessary Columns: Features like `customer_id` and `name` were dropped as they do not contribute to the prediction task.
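A minimal pandas sketch of these two steps, assuming the raw data is already loaded into a DataFrame (the column handling here is illustrative, not the project's exact code):

```python
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Impute missing values and drop identifier columns."""
    df = df.copy()

    # Drop identifier columns that carry no predictive signal
    df = df.drop(columns=["customer_id", "name"], errors="ignore")

    # Mode imputation for categorical features
    for col in df.select_dtypes(include="object").columns:
        df[col] = df[col].fillna(df[col].mode()[0])

    # Median imputation for numerical features (robust to outliers)
    for col in df.select_dtypes(include="number").columns:
        df[col] = df[col].fillna(df[col].median())

    return df
```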
Feature engineering was a crucial step in this project, involving both statistical filtering and transformations tailored to financial data.
- We computed the Information Value (IV) for each feature to assess its predictive power.
- Features with IV < 0.02 were dropped, ensuring that only the most relevant features were retained.
IV quantifies the strength of a feature’s relationship with the target variable—higher IV means stronger predictive power.
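For reference, these are the standard definitions both quantities rest on: for each bin *i* of a feature,

$$
\mathrm{WOE}_i = \ln\!\left(\frac{\%\ \text{non-defaults}_i}{\%\ \text{defaults}_i}\right),
\qquad
\mathrm{IV} = \sum_i \left(\%\ \text{non-defaults}_i - \%\ \text{defaults}_i\right)\cdot \mathrm{WOE}_i
$$

so a feature whose distribution barely differs between defaulters and non-defaulters ends up with an IV close to 0.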
- After IV filtering, we applied WOE binning to all remaining features.
- WOE scales features in a way that ensures a monotonic relationship with the target, which is crucial for models like Logistic Regression.
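A simplified sketch of how WOE and IV can be computed for one binned feature (the binning call and column names below are illustrative, not the project's exact implementation):

```python
import numpy as np
import pandas as pd

def woe_iv(feature_bins: pd.Series, target: pd.Series, eps: float = 1e-6):
    """Return a per-bin WOE table and the feature's total IV.

    feature_bins: a binned/categorical feature (e.g. output of pd.qcut)
    target: binary series, 1 = default, 0 = non-default
    """
    df = pd.DataFrame({"bin": feature_bins, "target": target})
    grouped = df.groupby("bin")["target"].agg(["count", "sum"])
    grouped["defaults"] = grouped["sum"]
    grouped["non_defaults"] = grouped["count"] - grouped["sum"]

    # Distribution of defaults / non-defaults across bins (eps avoids log(0))
    pct_def = grouped["defaults"] / max(grouped["defaults"].sum(), 1) + eps
    pct_non = grouped["non_defaults"] / max(grouped["non_defaults"].sum(), 1) + eps

    grouped["woe"] = np.log(pct_non / pct_def)
    grouped["iv"] = (pct_non - pct_def) * grouped["woe"]
    return grouped[["woe", "iv"]], grouped["iv"].sum()

# Example: bin a numeric column into deciles, then compute its IV
# woe_table, iv = woe_iv(pd.qcut(train["credit_limit"], 10, duplicates="drop"),
#                        train["credit_card_default"])
```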
- Scaling: We applied Min-Max scaling to bring all features onto a common [0, 1] range, which matters for scale-sensitive models such as KNN and Logistic Regression (both steps are sketched below).
- Class Imbalance: Since the dataset had significantly fewer defaults than non-defaults, we used SMOTE (Synthetic Minority Over-sampling Technique) to balance the classes, so the models don't become biased toward predicting non-defaults.
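A short sketch of these two steps with scikit-learn and imbalanced-learn (variable names are illustrative; the scaler is fit on the training split only, and SMOTE is applied only to the training data to avoid leakage):

```python
from sklearn.preprocessing import MinMaxScaler
from imblearn.over_sampling import SMOTE

# Scale features to [0, 1]; fit on train only, then reuse the same scaler on test
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Oversample the minority (default) class in the training data only
smote = SMOTE(random_state=42)
X_train_bal, y_train_bal = smote.fit_resample(X_train_scaled, y_train)
```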
We trained the following models:
- Logistic Regression: A baseline model, valued for its simplicity and interpretability.
- Decision Tree: Offers inherent interpretability but prone to overfitting.
- Random Forest: An ensemble of decision trees that reduces overfitting.
- XGBoost: A gradient-boosting model known for its high performance on tabular data.
- CatBoost: Another gradient-boosting model, particularly effective for categorical data.
- LightGBM: A highly efficient gradient-boosting model.
- K-Nearest Neighbors (KNN): Included for comparison, with `k=5` chosen based on error analysis.
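A condensed sketch of how these models can be trained side by side on the balanced training data (the hyperparameters shown are library defaults, not the exact settings used in the project):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "XGBoost": XGBClassifier(eval_metric="logloss", random_state=42),
    "CatBoost": CatBoostClassifier(verbose=0, random_state=42),
    "LightGBM": LGBMClassifier(random_state=42),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}

# Fit every model on the SMOTE-balanced training set
fitted = {name: model.fit(X_train_bal, y_train_bal) for name, model in models.items()}
```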
We created a custom evaluation function to compute and display key metrics:
- Accuracy: Overall correctness of predictions.
- F1-Score: Balances precision and recall, crucial for imbalanced datasets.
- AUC-ROC: Measures a model’s ability to distinguish between defaulters and non-defaulters.
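A minimal sketch of what such an evaluation helper might look like, assuming the models expose `predict` and `predict_proba`:

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

def evaluate(model, X_train, y_train, X_test, y_test):
    """Report accuracy, F1 and AUC-ROC on the train and test splits."""
    results = {}
    for split, (X, y) in {"train": (X_train, y_train), "test": (X_test, y_test)}.items():
        preds = model.predict(X)
        proba = model.predict_proba(X)[:, 1]
        results[f"{split}_accuracy"] = accuracy_score(y, preds)
        results[f"{split}_f1"] = f1_score(y, preds)
        results[f"{split}_auc_roc"] = roc_auc_score(y, proba)
    return results
```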
Here’s the final comparison of all models:
| Model | Train Accuracy | Test Accuracy | Train F1 Score | Test F1 Score | AUC-ROC |
|---|---|---|---|---|---|
| Decision Tree | 95.40% | 95.63% | 95.47% | 95.70% | 98.98% |
| CatBoost | 95.37% | 95.63% | 95.44% | 95.70% | 99.05% |
| Random Forest | 95.40% | 95.63% | 95.47% | 95.69% | 98.99% |
| LightGBM | 95.37% | 95.63% | 95.42% | 95.68% | 99.04% |
| XGBoost | 95.29% | 95.48% | 95.36% | 95.55% | 99.03% |
| KNN | 95.03% | 95.16% | 95.08% | 95.20% | 98.27% |
| Logistic Regression | 94.30% | 94.44% | 94.40% | 94.52% | 98.78% |
CatBoost emerged as the best-performing model, with the highest AUC-ROC (99.05%) and test accuracy and F1-score (95.63% and 95.70%) matching the best of the other models.
After selecting CatBoost as the best model, we trained it on the entire balanced train dataset and generated predictions on the test dataset. The predictions were saved in the file:
/reports/test_predictions_catboost.csv
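A hedged sketch of this final step (assuming `X_train_bal`, `y_train_bal`, and `X_test_scaled` from the earlier steps, plus a `test_ids` series held aside before `customer_id` was dropped; the output path mirrors the file above, written relative to the project root):

```python
import pandas as pd
from catboost import CatBoostClassifier

# Fit the selected model on the full balanced training set and score the test set
best_model = CatBoostClassifier(verbose=0, random_state=42)
best_model.fit(X_train_bal, y_train_bal)

test_preds = best_model.predict(X_test_scaled)
submission = pd.DataFrame({
    "customer_id": test_ids,                    # identifiers kept aside before dropping the column
    "credit_card_default": test_preds.ravel(),  # 1 = predicted default, 0 = non-default
})
submission.to_csv("reports/test_predictions_catboost.csv", index=False)
```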
- CatBoost outperformed all other models, making it the ideal choice for deployment in real-world scenarios.
- IV filtering and WOE binning significantly improved model interpretability, which is crucial for financial decision-making.
- SMOTE balanced the dataset effectively, ensuring that the models didn’t become biased toward predicting non-defaults.
This project reimagines credit risk analysis by integrating advanced machine learning techniques with carefully crafted, finance-specific feature engineering. We present a solution that doesn’t just predict credit defaults with high accuracy but does so in a way that’s both insightful and actionable for real-world financial decision-making.
- Hyperparameter Tuning: Fine-tune the hyperparameters of the best-performing models to squeeze out even better performance.
- Explainability Tools: Integrate tools like SHAP or LIME to provide detailed explanations of individual predictions (see the sketch after this list).
- Deployment: Deploy the final model as a Flask API or Streamlit app for real-time credit risk assessment.
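As a possible starting point for the explainability item, SHAP's tree explainer works directly with a fitted CatBoost model. This is only a sketch, not part of the current pipeline; `best_model` and `X_test_scaled` are assumed from the earlier sketches:

```python
import shap

# Explain individual predictions of the fitted CatBoost model
explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(X_test_scaled)

# Visualize which features push a single applicant toward default or non-default
shap.force_plot(explainer.expected_value, shap_values[0], X_test_scaled[0], matplotlib=True)
```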
If you’re curious about the project or want to collaborate, feel free to connect: