
Loan Approval Prediction App using Logistic Regression

An end-to-end loan approval prediction application using Logistic Regression to model borrower default risk, with SHAP-based explainability and a Streamlit app for interactive what-if analysis and decision support.

Please click here for a video demo.


Skills Demonstrated

✔ Built a full pipeline from raw loan data to a deployed, interactive credit scoring interface.

✔ Supervised learning for binary classification

✔ Credit risk modelling and loan approval decisioning using Logistic Regression

✔ Feature engineering for financial datasets (Debt-to-Income (DTI) ratio, categorical encoding)

✔ Model evaluation and interpretation (recall, PR-AUC, ROC-AUC, accuracy, precision, F1-score; SHAP explainability)

✔ Handling class imbalance with class_weight="balanced"

✔ Persisting models and scalers (pickle)


Problem Statement

Financial institutions must assess borrower creditworthiness to minimize default risk, but manual or rule-based approaches can be inconsistent and suboptimal. This project aims to build an interpretable, data-driven model that predicts whether a borrower will default on a loan and to operationalize that model in a way that supports consistent, transparent loan approval decisions.


Overview

Note:
This Streamlit application is hosted on the free tier of Streamlit Community Cloud. If the app has been idle for more than 12 hours, it may take some time to reactivate. In such cases, please click the “Yes, get this app back up!” button to relaunch the application. Thank you for your patience.

A loan dataset from Kaggle is used to model borrower default behavior.

  1. Load and clean the dataset (drop non-predictive identifiers and redundant columns such as Client_ID and Gender).
  2. Engineer risk-relevant features, notably the Debt-to-Income (DTI) ratio derived from monthly income and repayment amounts.
  3. Encode employment status as dummy variables (Employed, Self-Employed, Unemployed) to represent employment types numerically.
  4. Use Logistic Regression to predict the binary Default_Flag (default vs non-default) based on features including age, DTI, credit history, and employment.
  5. Evaluate performance with multiple classification metrics, emphasizing recall for defaulters and precision–recall/ROC curves.
  6. Build a SHAP explainer to provide local (per-applicant) feature attribution.
  7. Deploy the final model, scaler, and SHAP explainer in a Streamlit app that accepts user inputs, returns predicted default/repayment probabilities, applies an explicit decision threshold, and visualizes the drivers of each decision through a SHAP waterfall plot.
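
The sketch below illustrates steps 6–7: building a SHAP explainer for the fitted model and producing a local waterfall plot for one applicant. It assumes a fitted Logistic Regression `model`, the scaled training matrix `X_train_scaled`, and one scaled applicant row `applicant_scaled`; these variable names (and the choice of `LinearExplainer`) are illustrative, not taken from the repository.

```python
import shap

# A linear explainer matches Logistic Regression: attributions are exact for
# linear models given a background dataset (here, the scaled training data).
explainer = shap.LinearExplainer(model, X_train_scaled)

# Local attribution for one applicant (a single scaled feature row).
shap_values = explainer(applicant_scaled)

# Waterfall plot: how each feature pushes this applicant's score toward
# default or repayment, relative to the baseline expectation.
shap.plots.waterfall(shap_values[0])
```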


Key Values & Impacts

Deploying an automated, explainable loan approval and credit scoring application delivers tangible business value across lending operations:

  • Improved Credit Decision Consistency: Standardized risk scoring removes subjective variations, producing repeatable and defensible credit decisions that align with internal credit policy.

  • Risk Reduction Through Early Default Detection: Higher recall on defaulters helps reduce credit losses by catching high-risk applicants before origination rather than through collections or charge-offs.

  • Operational Efficiency & Reduced Cycle Times: Automated assessment shortens decision-making from minutes/hours to milliseconds, increasing application throughput and reducing the need for manual underwriting for straightforward cases.

  • Portfolio-Level Risk Control via Threshold Adjustment: The default probability threshold offers a tunable risk lever, allowing risk teams to balance approval volume versus risk appetite depending on market conditions and strategic objectives.

  • Enhanced Transparency & Explainability for Stakeholders: SHAP waterfall plots make each approval or decline auditable and interpretable, supporting compliance requirements, model governance, and fair-lending discussions.


Key Technical Decisions

Algorithm Choice

Logistic Regression was chosen for the following reasons:

  • Interpretability and suitability in credit risk settings.
  • The model’s coefficients map directly to the direction and strength of each feature’s influence on default vs repayment, which is important for explainability and potential regulatory scrutiny.
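
As a hedged illustration of that interpretability argument (not code from the repository), the fitted coefficients can be read directly as log-odds effects and converted to odds ratios; `model` and `feature_names` are assumed to exist from training.

```python
import numpy as np
import pandas as pd

# Each coefficient is the change in the log-odds of default for a one
# standard-deviation increase in the (scaled) feature; exponentiating gives
# an odds ratio that is easier to communicate to credit and compliance teams.
coefficients = pd.DataFrame({
    "feature": feature_names,           # list of training column names (assumed)
    "log_odds": model.coef_[0],
    "odds_ratio": np.exp(model.coef_[0]),
}).sort_values("log_odds", ascending=False)

print(coefficients)
```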

Feature Engineering

  • Created DTI from income and repayment to capture leverage and repayment burden.
  • One-hot encoded the Employment categorical variable, then converted booleans (True/False) into numeric form (1/0).
  • Dropped redundant raw columns (Monthly_Income, Monthly_Repayment, original Employment) once the engineered variables were in place.
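
A minimal pandas sketch of these steps follows. The column names (`Monthly_Income`, `Monthly_Repayment`, `Employment`, `Client_ID`, `Gender`) follow the description above, but the exact spellings in the Kaggle dataset, and the file name, may differ.

```python
import pandas as pd

df = pd.read_csv("loan_data.csv")   # hypothetical file name

# Debt-to-Income (DTI): monthly repayment burden relative to monthly income.
df["DTI"] = df["Monthly_Repayment"] / df["Monthly_Income"]

# One-hot encode employment status; dtype=int yields 1/0 rather than True/False.
df = pd.get_dummies(df, columns=["Employment"], dtype=int)

# Drop identifiers and raw columns made redundant by the engineered features.
df = df.drop(columns=["Client_ID", "Gender", "Monthly_Income", "Monthly_Repayment"])
```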

Scaling Strategy

  • Used StandardScaler to standardize features before training Logistic Regression.
  • Logistic Regression benefits from standardized feature variance: scaling features to zero mean and unit variance improves solver stability and makes coefficients more directly comparable across features. This is more suitable here than MinMax scaling, which mainly rescales to a fixed range and is less convenient for interpreting linear model coefficients.
  • The fitted scaler is persisted and reused in the application to ensure consistent preprocessing between training and inference.
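
A minimal sketch of the scaling and persistence steps, assuming a 70/30 split into `X_train`/`X_test` already exists; the artifact file name is hypothetical.

```python
import pickle
from sklearn.preprocessing import StandardScaler

# Fit on the training split only, then reuse the same transformation on the
# test split (and later at inference) to avoid information leakage.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Persist the fitted scaler so the Streamlit app applies identical preprocessing.
with open("scaler.pkl", "wb") as f:   # hypothetical file name
    pickle.dump(scaler, f)
```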

Class Imbalance Handling

  • Set class_weight="balanced" in Logistic Regression to give additional weight to the minority class (defaulters), reducing the risk of a high-accuracy but low-recall model on defaults.
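
A short training sketch under the same assumptions (scaled training data and labels `y_train`); `max_iter` is an illustrative setting, not necessarily the repository's.

```python
from sklearn.linear_model import LogisticRegression

# class_weight="balanced" reweights each class inversely to its frequency, so
# defaulters (the minority class) carry more weight in the loss and the model
# is penalized for simply predicting "non-default" for everyone.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train_scaled, y_train)
```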

Evaluation Focus

  • Evaluated the model using recall, PR-AUC, ROC-AUC, accuracy, precision, and F1-score.
  • Particular emphasis on recall for defaulters and PR-AUC to ensure genuine defaults are captured with acceptable levels of false positives.
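
For reference, these metrics can be computed as sketched below (assuming the fitted `model`, scaled test data, and labels `y_test`); `predict` here uses the library's default 0.5 cutoff, while the deployed app applies the operational threshold described later.

```python
from sklearn.metrics import (
    accuracy_score, average_precision_score, confusion_matrix,
    f1_score, precision_score, recall_score, roc_auc_score,
)

y_pred = model.predict(X_test_scaled)                # labels at the default 0.5 cutoff
y_proba = model.predict_proba(X_test_scaled)[:, 1]   # P(default)

print("Recall (defaulters):", recall_score(y_test, y_pred))
print("Precision:          ", precision_score(y_test, y_pred))
print("F1-score:           ", f1_score(y_test, y_pred))
print("Accuracy:           ", accuracy_score(y_test, y_pred))
print("ROC-AUC:            ", roc_auc_score(y_test, y_proba))
print("PR-AUC:             ", average_precision_score(y_test, y_proba))
print(confusion_matrix(y_test, y_pred))
```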

Prudent Operational Threshold

Instead of using a naïve 50% default probability cutoff, a more conservative decision threshold of 35% default probability is used in the Streamlit app:

  • Default probability ≤ 35% → “APPROVED”
  • Default probability > 35% → “DECLINED”

This reflects the lender’s risk tolerance and aligns the model with business policy.
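
A minimal sketch of how the app can apply this threshold at inference time, assuming the persisted `model` and `scaler` have been loaded and `applicant_row` is a single-row DataFrame with the engineered features in training order.

```python
DEFAULT_THRESHOLD = 0.35  # lender's risk tolerance, per business policy

def score_applicant(applicant_row):
    """Return (decision, default probability) for one preprocessed applicant."""
    scaled = scaler.transform(applicant_row)        # reuse the persisted scaler
    p_default = model.predict_proba(scaled)[0, 1]   # probability of default
    decision = "APPROVED" if p_default <= DEFAULT_THRESHOLD else "DECLINED"
    return decision, p_default
```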


Development Pipeline

  1. Data Preparation: Loaded Kaggle loan dataset; removed non-informative identifiers; handled duplicates and missing values.

  2. Feature Engineering: Computed DTI; one-hot encoded employment categories; removed redundant raw columns.

  3. Modeling: Split data (70/30), standardized features, and trained LogisticRegression(class_weight="balanced").

  4. Evaluation: Assessed via recall on defaulters, PR-AUC, ROC-AUC, precision, F1, and confusion matrix.

  5. Explainability: Integrated SHAP for local model attribution and decision transparency.

  6. Artifact Persistence: Serialized model, scaler, and SHAP explainer for deployment.

  7. Application Deployment: Built Streamlit app enabling real-time scoring, explainability, and loan approval decisions.
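
To make step 7 concrete, here is a heavily abbreviated Streamlit sketch, not the repository's actual code: it assumes pickled artifacts named model.pkl and scaler.pkl and, purely for brevity, a model trained on just Age and DTI. The deployed app collects all model inputs and also renders the SHAP waterfall for each decision.

```python
# app.py -- abbreviated sketch
import pickle
import pandas as pd
import streamlit as st

with open("model.pkl", "rb") as f:    # hypothetical artifact names
    model = pickle.load(f)
with open("scaler.pkl", "rb") as f:
    scaler = pickle.load(f)

st.title("Loan Approval Prediction")
age = st.number_input("Age", min_value=18, max_value=100, value=35)
income = st.number_input("Monthly income", min_value=1.0, value=4000.0)
repayment = st.number_input("Monthly repayment", min_value=0.0, value=1000.0)

if st.button("Score applicant"):
    # Rebuild the engineered features exactly as in training (here only Age and DTI).
    features = pd.DataFrame([{"Age": age, "DTI": repayment / income}])
    p_default = model.predict_proba(scaler.transform(features))[0, 1]
    decision = "APPROVED" if p_default <= 0.35 else "DECLINED"
    st.metric("Predicted default probability", f"{p_default:.1%}")
    st.subheader(f"Decision: {decision}")
```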


Author

Carmen Wong
