A financial company offers consumer credit for people with little or no loan history, and wishes to implement a “credit scoring” tool to decide whether to accept or refuse credit.
This project aims to:
- Develop a scoring model to predict the probability of payment default for these customers, based mostly on financial data sources
- Develop an interactive dashboard for customer relationship managers, to explain credit granting decisions as transparently as possible
This is project 7 for the Master in Data Science (in French, BAC+5) from OpenClassrooms.
The project demonstrates separation of concerns: model code, API and dashboard:
- code : Handling imbalanced data for a binary classification model
- API : Creation of an application programming interface to serve the saved model (to any number of dashboards)
- dashboard : Visualisation of the data from the api: predicted scores and their interpretation
Data : The dataset (~700 MB) and descriptions can be downloaded from https://www.kaggle.com/c/home-credit-default-risk/data. It consists of financial data for 307511 anonymized customers, provided in seven tables, with a target column 'TARGET' indicating whether the client repaid the loan (0) or defaulted (1).
Python libraries : This project is composed of 3 phases :
- the modelling code : data integration, cleaning and creation of the classification model
- the scoring model api : a backend for serving model predictions
- the interactive dashboard : a frontend for visualising model scores and their interpretation for a selected client
The python requirements for each phase are similar (see requirements.txt), but not identical.
- code :
imbalanced-learn, numpy, pandas, matplotlib, seaborn, scikit-learn, lightgbm, yellowbrick, shap
- api :
flask, gunicorn, numpy, pandas, scikit-learn, lightgbm, shap
- dashboard :
streamlit, ipython, pandas, matplotlib, scikit-learn, lightgbm, shap
For maintenance and reduced deployment dependencies, each of these 3 phases should have its own requirements.txt, in a separate version-controlled git submodule.
Notes : Files are in French. Open https://nbviewer.org/ and paste the notebook's GitHub URL if GitHub takes too long to render.
The main files are:
- code/P7_eda_nettoyage.ipynb: Exploratory Data Analysis (EDA) and data cleaning notebook (joining and aggregating data from 8 tables).
- code/P7_modelisation.ipynb: Development of the credit scoring model, handling imbalanced data and using a custom scoring threshold.
- Note Méthodologique.pdf: Model training methodology, business cost function, evaluation metric, global and local interpretability.
- P7_presentation.pdf: Presentation slides.
Code for the model, API and dashboard are in the "code", "api" and "dashboard" folders respectively.
The financial tables (loan request, repayment history, previous loans, external data) were merged with a JOIN on the customer key (SK_ID_CURR), using a slightly adapted script that had already produced good classification results. The merge results in 602 numeric columns for 307507 customers.
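As a rough illustration of this join-and-aggregate pattern (the full script is in code/P7_eda_nettoyage.ipynb), here is a minimal pandas sketch; the file names are those of the Kaggle download, and the aggregation choices are illustrative only:

```python
import pandas as pd

# Aggregate each secondary table to one row per customer, then LEFT JOIN it
# onto the main application table on the customer key SK_ID_CURR.
app = pd.read_csv("application_train.csv")   # main table, one row per customer
bureau = pd.read_csv("bureau.csv")           # credits reported by other institutions

num_cols = bureau.select_dtypes("number").columns.drop("SK_ID_CURR")
bureau_agg = bureau.groupby("SK_ID_CURR")[num_cols].agg(["mean", "max", "sum"])
bureau_agg.columns = ["BURO_" + "_".join(col) for col in bureau_agg.columns]

# The same pattern is repeated for the other secondary tables
df = app.merge(bureau_agg, how="left", left_on="SK_ID_CURR", right_index=True)
```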
The data has already been explored in detail during a Kaggle competition. This exploration shows that the distribution of the target is very imbalanced: less than 8% of customers are in default. If we predict that every customer is a good payer, we obtain an accuracy of about 92% simply by choosing the majority class, but we identify no defaulting customers.
The cleaned dataset was split into train (80%) and test (20%) sets. A pre-processing pipeline was set up to avoid data leakage. Missing values were replaced by the column median (all columns are already numeric). For feature selection and modeling, where needed, the data was scaled with StandardScaler.
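A minimal sketch of this split and pre-processing pipeline, assuming df holds the cleaned, fully numeric dataset with its TARGET column:

```python
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# df is assumed to be the cleaned, fully numeric dataset with a TARGET column
X = df.drop(columns="TARGET")
y = df["TARGET"]

# Stratified 80/20 split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Fit the preprocessing on the training set only, to avoid data leakage
preprocess = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="median")),  # median imputation of missing values
    ("scale", StandardScaler()),                   # scaling, used where an estimator needs it
])
X_train_prep = preprocess.fit_transform(X_train)
X_test_prep = preprocess.transform(X_test)         # transform only: no refitting on the test set
```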
Most of the 600 columns have very little correlation with the target, and simply add noise to the model.
To improve modelling time, interpretability and model performance, the top 100 features were selected by combining several feature selection methods (https://www.kaggle.com/code/sz8416/6-ways-for-feature-selection/): Filter (SelectKBest, chi2), Wrapper (RFE), Embedded (SelectFromModel with LogisticRegression, RandomForest and LightGBM).
Highly collinear columns (VIF > 5) were eliminated (https://www.researchgate.net/publication/350525439_Feature_Selection_in_a_Credit_Scoring_Model_Mathematics_ISSN_2227-7390 )
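A hedged sketch of this voting-style feature selection followed by a VIF filter, reusing X_train_prep and y_train from the pre-processing sketch above (f_classif stands in for chi2, which requires non-negative inputs; the exact selectors and thresholds used in the notebook may differ):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif, RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Each selector votes for the features it would keep; the most-voted features win.
selectors = [
    SelectKBest(f_classif, k=100),
    RFE(LogisticRegression(max_iter=1000), n_features_to_select=100),
    SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear"),
                    max_features=100, threshold=-np.inf),
    SelectFromModel(RandomForestClassifier(n_estimators=100, n_jobs=-1),
                    max_features=100, threshold=-np.inf),
]
votes = np.zeros(X_train_prep.shape[1], dtype=int)
for sel in selectors:
    votes += sel.fit(X_train_prep, y_train).get_support().astype(int)

top_idx = np.argsort(votes)[::-1][:100]            # the 100 most-voted features
X_sel = X_train_prep[:, top_idx]

# Collinearity filter: drop features whose variance inflation factor exceeds 5
vif = np.array([variance_inflation_factor(X_sel, i) for i in range(X_sel.shape[1])])
X_final = X_sel[:, vif <= 5]
```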
The final dataset consisted of 79 features for 307507 customers.
For many of the classifiers, the hyperparameter class_weight='balanced' allows the imbalance in the target classes to be taken into account (cost-sensitive learning). Several strategies from the imbalanced-learn library were also tested to rebalance the target classes: Random undersampling (of the majority class); Random oversampling (of the minority class); Synthetic Minority Oversampling Technique (SMOTE); SMOTE + Tomek Links (undersampling of the majority class).
To compare the influence of the sampling strategy on model performance in an acceptable time, a sample of 10000 customers was used. Once the sampling strategy, the hyperparameters and the model were chosen, the final model was trained and optimized on the full dataset. The classifiers tested were: Dummy (baseline), RidgeClassifier, LogisticRegression, RandomForest, and LightGBM (gradient boosting).
An imblearn pipeline allows the choice of preprocessing, sampling and classifier to be tuned together, while ensuring that cross-validation scores are computed on validation folds that have not been rebalanced.
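A minimal imblearn pipeline sketch, assuming X_train and y_train from the split above (the sampler and classifier shown are placeholders; the notebook compares several):

```python
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# With imblearn's Pipeline, the sampler is applied only when fitting on the training
# folds; the validation folds keep their original, imbalanced class distribution.
pipe = ImbPipeline(steps=[
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("resample", SMOTE(random_state=42)),      # swap for RandomUnderSampler, SMOTETomek, ...
    ("clf", LogisticRegression(max_iter=1000)),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
auc_scores = cross_val_score(pipe, X_train, y_train, cv=cv, scoring="roc_auc")
```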
Several evaluation metrics were calculated: precision, recall, F1-score, ROC_AUC, the aim being to minimize false positives (maximum precision) and false negatives (maximum recall)
The choice of the best model was made by retaining the model with the best ROC_AUC score on the test set.
The ROC_AUC measures the Area Under the Receiver Operating Characteristic (ROC) Curve. It shows the trade-off between specificity and sensitivity (https://en.wikipedia.org/wiki/Sensitivity_and_specificity).
The closer the curve approaches the upper left corner, the better the specificity and sensitivity (and therefore precision and recall)
For the decision-tree methods, SMOTE seems to cause overfitting on the training set, because on the test set we see a significant drop in predictive ability.
The LightGBM model without resampling, but with parameters {class_weight='balanced', max_depth=6}, performs best (high ROC_AUC score on the test data, faster to compute) and is therefore chosen as the best model.
For the bank, the cost of granting a loan to a customer who does not repay it (false negative (FN), type II error) is greater than the loss from refusing a loan to a customer who would have repaid (false positive (FP), type I error).
- Recall = TP / (TP + FN) : maximising recall minimises false negatives
- Precision = TP / (TP + FP) : maximising precision minimises false positives
- F1 score is a balance between precision and recall : F1 = 2 * precision * recall / (precision + recall)
To place more weight on recall, we can use the F(beta>1) score. An approximation of the cost for the bank is the F(beta=2) score:
f2_score = 5*TP / (5*TP + 4*FN + FP)
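For reference, scikit-learn provides this metric directly (y_test and y_pred are assumed to come from the fitted model):

```python
from sklearn.metrics import fbeta_score, make_scorer

# F2 weights recall more heavily than precision (beta = 2)
f2 = fbeta_score(y_test, y_pred, beta=2)

# The same metric packaged as a scorer for cross_val_score / GridSearchCV
f2_scorer = make_scorer(fbeta_score, beta=2)
```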
A function which estimates the cost for the bank (normalized to stay between 0 and 1, as for the other scorers):
profit = TN * value_per_loan_to_good_customer + TP * value_of_refusal_of_loan_to_bad_payer
loss = FP * cost_per_loan_refused_to_good_customer + FN * cost_of_giving_a_loan_to_bad_payer
custom_credit_score = (profit + loss) / (max_profit - max_loss)
where:
max_profit = (TN + FP) * tn_profit + (FN + TP) * tp_profit (give loans only to good payers)
max_loss = (TP + FN) * fn_loss + (TN + FP) * fp_loss (give loans only to bad payers)
For this model, we suppose: tn_profit = 1, tp_profit = 0.2, fp_loss = -0.5, fn_loss = -10. So:
custom_credit_score = (TN + 0.2*TP - 10*FN - 0.5*FP) / (max_profit - max_loss)
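A sketch of this business scorer as a Python function, implementing exactly the formula above (the unit gains/losses are the assumed values stated in the text):

```python
from sklearn.metrics import confusion_matrix, make_scorer

def custom_credit_score(y_true, y_pred,
                        tn_profit=1.0, tp_profit=0.2, fp_loss=-0.5, fn_loss=-10.0):
    """Normalized business cost function for the bank."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    profit = tn * tn_profit + tp * tp_profit
    loss = fp * fp_loss + fn * fn_loss
    max_profit = (tn + fp) * tn_profit + (fn + tp) * tp_profit  # give loans only to good payers
    max_loss = (tp + fn) * fn_loss + (tn + fp) * fp_loss        # give loans only to bad payers
    return (profit + loss) / (max_profit - max_loss)

# Packaged as a scorer, usable in cross-validation or hyperparameter searches
credit_scorer = make_scorer(custom_credit_score)
```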
The optimization algorithm: the model provides probability values ('pred_proba') that a customer will be a good payer (0) or a defaulter (1).
- If y_pred = (pred_proba[:, 1] > threshold) == 1 (True), we consider that the customer will default.
- Metrics are calculated by comparing y_pred with the true values (y_true). The false positive and false negative counts are retrieved from the confusion matrix: (TN, FP, FN, TP) = metrics.confusion_matrix(y_test, y_pred).ravel()
By varying the discrimination threshold (solvency threshold), we can calculate the business cost function and find the optimal threshold for a given business function. For the chosen model, the optimal threshold is 0.520, coincidentally close to the default threshold of 0.5. We optimize the model on AUC, then predict whether the loan is accepted or refused using the optimal threshold.
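A sketch of this threshold sweep, assuming model is the fitted LightGBM classifier and custom_credit_score is the function defined above:

```python
import numpy as np

# Sweep the discrimination threshold and keep the value that maximizes the business score
proba = model.predict_proba(X_test_prep)[:, 1]            # probability of default (class 1)
thresholds = np.linspace(0.01, 0.99, 99)
scores = [custom_credit_score(y_test, (proba > t).astype(int)) for t in thresholds]
best_threshold = thresholds[int(np.argmax(scores))]       # ~0.52 for the chosen model
y_pred = (proba > best_threshold).astype(int)             # 1 = predicted default -> refuse the loan
```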
The model provides the (impurity-based) feature weights via model.feature_importances_, computed from the training data.
We can also use sklearn.inspection.permutation_importance to estimate permutation-based feature importances, measured as the drop in score when the values of each feature are shuffled in the test data.
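A short sketch of both importance measures, assuming model is the fitted LightGBM classifier:

```python
from sklearn.inspection import permutation_importance

# Impurity-based importances, computed on the training data when the model was fitted
impurity_importances = model.feature_importances_

# Permutation importances: drop in test-set ROC_AUC when a feature's values are shuffled
perm = permutation_importance(model, X_test_prep, y_test,
                              scoring="roc_auc", n_repeats=10, random_state=42)
permutation_importances = perm.importances_mean
```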
The SHAP method (https://shap.readthedocs.io/ ) calculates the shap_values: the impact of each variable on the prediction for each row of data. SHAP values are additive: values in red increase the predicted value (risk of default), values in blue reduce it.
If we take the mean of the absolute SHAP values for each feature, we obtain the importance of the features for the prediction.
We can visualize the distribution of the SHAP values for the most important features via a 'summary plot', in the form of a beeswarm or violin plot.
Negative contributions have an effect of reducing the value of the prediction.
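A minimal SHAP sketch for the fitted model (the handling of the per-class output depends on the shap version):

```python
import numpy as np
import shap

# TreeExplainer is suited to tree ensembles such as the chosen LightGBM model
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test_prep)
if isinstance(shap_values, list):       # some shap versions return one array per class
    shap_values = shap_values[1]        # keep the values for the default class (1)

# Global importance: mean absolute SHAP value per feature
global_importance = np.abs(shap_values).mean(axis=0)

# Beeswarm / violin summary plot of the SHAP value distributions
shap.summary_plot(shap_values, X_test_prep)
```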
Example force plots: a low-risk customer (prob = 0.03) and a high-risk customer (prob = 0.95).
The prediction is made by a Flask application, written in Python, with the following routes (see the sketch after the list):
- List of client ids: /clients/
- Customer data: /customer/
- Prediction (default probability): /predict/
- Client SHAP explanation: /explain/
- Global SHAP explanation: /explain/
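A minimal Flask sketch of such an API; the artefact file names and exact route signatures are hypothetical (the real ones are defined in the api folder):

```python
from flask import Flask, jsonify
import joblib
import pandas as pd

app = Flask(__name__)
model = joblib.load("model.joblib")                            # hypothetical artefact names:
clients = pd.read_csv("clients.csv", index_col="SK_ID_CURR")   # the real ones live in api/

@app.route("/clients/")
def client_ids():
    # List of client ids known to the API
    return jsonify([int(i) for i in clients.index])

@app.route("/predict/<int:client_id>")                         # hypothetical route shape
def predict(client_id):
    # Default probability for one client
    proba = model.predict_proba(clients.loc[[client_id]])[0, 1]
    return jsonify({"client_id": client_id, "default_probability": float(proba)})

if __name__ == "__main__":
    app.run()
```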
The API is deployed on Heroku at https://mc-oc-7.herokuapp.com
The API source code can be found in the api folder (see api/README.md for instructions).
The dashboard makes requests to the API, since it does not hold the data or the model itself. It is written in Python with Streamlit, and deployed on share.streamlit.io at: https://mrcreasey-oc-ds-p7-scoring-dashboard-dashboardmain-70agjx.streamlitapp.com/
The source code for the dashboard can be found in the dashboard folder (see dashboard/README.md for instructions).
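A minimal Streamlit sketch of how the dashboard queries the API; the route shapes and JSON keys shown are assumptions for illustration:

```python
import requests
import streamlit as st

API_URL = "https://mc-oc-7.herokuapp.com"   # the deployed API

# Fetch the list of client ids, let the user pick one, then request that client's score
client_ids = requests.get(f"{API_URL}/clients/").json()
client_id = st.sidebar.selectbox("Client", client_ids)

score = requests.get(f"{API_URL}/predict/{client_id}").json()
st.metric("Default probability", f"{score['default_probability']:.2%}")
```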
- The models were calculated on part of the data: the effect of sample size on the results needs to be analyzed (e.g. via learning curves)
- We cannot completely separate the good payers from the defaulting customers (the ROC_AUC of the training data remains between 0.7 and 0.8)
- The application of SMOTE improves training scores, but not validation scores, for the sample size used.
- SMOTE quickly becomes too heavy to apply on the entire dataset: the generation of synthetic points is very slow, and models created with SMOTE (via imblearn.Pipeline) are too large to be saved
- A trade-off must be made between type I errors (precision) and type II errors (recall)
- For the bank, the recall is the most important
- Review feature creation with industry experts: the cleaning, aggregation, merging and feature engineering script used appears to have been written without business knowledge; many of the variables it creates are irrelevant or duplicated
- Review the strategy for dealing with missing values (default median)
- Improve the selection of features to be adapted to each model (Wrapper/Embedded)
- Make learning curves to optimize sample size for models
- Broaden the search for the best model hyperparameters
- Change from the Flask API to FastAPI (https://fastapi.tiangolo.com/): faster, automatic request documentation, fewer lines of code, built-in authentication and security
- Add authentication to access the dashboard
- Add encryption of customer data
- Store customer data separately from the API, for example in an S3 bucket on AWS (this requires caching it in the API memory, otherwise requests become too slow)
- Visualize the distribution of each of the most important features for a given client, to show where the client stands relative to other clients
- Supervised classification, stratified k-fold, cross-validation
- Handling imbalanced data: cost-sensitive, imbalanced-learn, SMOTE, Tomek Links, undersampling, oversampling
- Feature selection : Filter, wrapper, embedded methods
- Performance metrics: Precision, Recall (sensitivity), specificity, Area Under Curve(AUC), Receiver Operating Characteristic (ROC), ROC_AUC, F1-score, F(beta)-score
- Performance evaluation : Custom cost function, Discrimination threshold
- Interpretability : Feature importance (impurity-based vs. permutation-based), SHAP (global and local interpretability)
- REST API : Flask, FastAPI, Heroku
- Dashboard : Streamlit
- Handling imbalanced data
- Using version control software to ensure model integration
- Deployment of a model via an API on the web
- Creation of an interactive dashboard to present model predictions
- Communication of modelling approach in a methodological note