This is a machine learning project based on a Kaggle dataset: https://www.kaggle.com/arkapravasen/bank-loan-default
Data Science Techniques:
- Exploratory Data Analysis
- Point Biserial Correlation
- Pearson's Correlation
- Cramér's V Correlation
- Logistic Regression
- Random Forest Classifier
- Gradient Boosting Tree Algorithms (XGBoost)
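
Two of the association measures listed above can be sketched in plain Python. This is a minimal illustration, not the project's actual code: the function names `point_biserial` and `cramers_v` and the toy inputs are assumptions made for the example, and the Cramér's V shown is the uncorrected version.

```python
from collections import Counter
from math import sqrt


def point_biserial(binary, values):
    """Point-biserial correlation between a 0/1 variable and a numeric one.

    Assumes both groups are non-empty and the numeric values are not all equal.
    """
    n = len(values)
    ones = [v for b, v in zip(binary, values) if b == 1]
    zeros = [v for b, v in zip(binary, values) if b == 0]
    mean_all = sum(values) / n
    # Population standard deviation of the numeric variable
    std = sqrt(sum((v - mean_all) ** 2 for v in values) / n)
    mean_1 = sum(ones) / len(ones)
    mean_0 = sum(zeros) / len(zeros)
    return (mean_1 - mean_0) / std * sqrt(len(ones) * len(zeros) / n ** 2)


def cramers_v(xs, ys):
    """Cramér's V between two categorical variables (no bias correction)."""
    n = len(xs)
    joint = Counter(zip(xs, ys))  # observed cell counts
    row = Counter(xs)             # row totals
    col = Counter(ys)             # column totals
    chi2 = 0.0
    for r in row:
        for c in col:
            expected = row[r] * col[c] / n
            chi2 += (joint[(r, c)] - expected) ** 2 / expected
    k = min(len(row), len(col)) - 1
    return sqrt(chi2 / (n * k))


# Hypothetical toy data: a default flag against a numeric and a categorical feature
default_flag = [0, 0, 1, 1]
loan_amount = [1, 2, 3, 4]
print(round(point_biserial(default_flag, loan_amount), 4))
print(cramers_v(["a", "a", "b", "b"], ["x", "x", "y", "y"]))
```

Point-biserial correlation suits a binary target against a numeric feature, while Cramér's V suits a pair of categorical variables. For real data, `scipy.stats.pointbiserialr` and `scipy.stats.chi2_contingency` cover the same ground.
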
- To reject the individuals who are at very high risk of defaulting
- To ensure that individuals capable of repaying are given the loan
- To scale down the loans given to individuals at higher default risk to reduce potential losses while enabling potential gains
In short, banks treat an individual who was given a loan but defaulted as a false positive, and an individual who was denied a loan but could have repaid on time as a false negative.
In both cases the bank misses an opportunity to profit, so it is interested in reducing the number of false positives and false negatives.
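
The convention described above, where "positive" means the loan was granted, can be made concrete with a small counting sketch. The function name `count_outcomes` and the sample decisions are hypothetical and serve only to illustrate the definitions.

```python
def count_outcomes(granted, repaid):
    """Count confusion-matrix cells where 'positive' means the loan was granted.

    granted[i]: True if applicant i was given the loan (predicted positive).
    repaid[i]:  True if applicant i repaid, or could have repaid, on time.
    """
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for was_granted, did_repay in zip(granted, repaid):
        if was_granted and did_repay:
            counts["TP"] += 1  # granted and repaid
        elif was_granted and not did_repay:
            counts["FP"] += 1  # granted yet defaulted
        elif not was_granted and did_repay:
            counts["FN"] += 1  # rejected yet could have repaid
        else:
            counts["TN"] += 1  # rejected and would have defaulted
    return counts


# Hypothetical decisions for five applicants
granted = [True, True, False, False, True]
repaid = [True, False, True, False, True]
print(count_outcomes(granted, repaid))
```
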
After several iterations, we chose the XGBoost model, which gave the best scores across the metrics below.
| Metric | Score |
|---|---|
| F1 | 74.6% |
| Recall | 77.6% |
| Precision | 71.8% |
| Accuracy | 73.5% |
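
As a quick consistency check on the table above, F1 is the harmonic mean of precision and recall, so it can be recomputed from the two reported values. The helper name `f1_from` is an assumption made for this sketch.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the table
f1 = f1_from(0.718, 0.776)
print(f"F1 = {f1:.1%}")  # F1 = 74.6%
```

The recomputed value agrees with the reported F1 of 74.6%.
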
With this, we can potentially lower losses by $762,597,495.00 (or 18.23%). Clients who are granted a smaller loan still have to pay interest, so the bank still earns some revenue. This is preferable to losing the potential income entirely by rejecting those clients upfront.
With this, we expect an increase of $393,106,511.23 (or 8.03%) in interest income gained from customers below the age of 25.
- Lua Jun An
- Keith Tay Xiang Rui
- Tu Zhehao
- Timothy Wong Hoey Pheen
- Ahmad As-Shodiqqul Amin
- Dione Lim Yee Sze
- Kellie Chin Shu Wen
- Ni Hui Ling