This project aims to develop a predictive model that determines the likelihood of loan approval based on various features of loan applicants. By leveraging machine learning techniques, we analyze historical loan data to predict outcomes for new loan applications.
```
Loan-approval-prediction/
│
├── 📓 LoanApprovalEDA.ipynb          # Jupyter notebook for exploratory data analysis
├── 📦 final_model.pkl                # Final trained model (loaded in the usage example below)
├── 📦 loan_approval_model.pkl        # Trained model for loan approval predictions
├── 📊 loan_approval_predictions.csv  # Predictions made on new data
├── 🧪 test_Y3wMUE5_7gLdaTN.csv       # Test dataset
└── 📚 train_u6lujuX_CVtuZ9i.csv      # Training dataset
```
The training dataset consists of 614 entries and the following columns:
| Column | Description |
|---|---|
| `Loan_ID` | Unique identifier for each loan |
| `Gender` | Gender of the applicant |
| `Married` | Marital status of the applicant |
| `Dependents` | Number of dependents |
| `Education` | Educational qualification of the applicant |
| `Self_Employed` | Whether the applicant is self-employed |
| `ApplicantIncome` | Income of the applicant |
| `CoapplicantIncome` | Income of the co-applicant |
| `LoanAmount` | Loan amount applied for (in thousands) |
| `Loan_Amount_Term` | Duration of the loan in months |
| `Credit_History` | Credit history of the applicant (1 = meets guidelines, 0 = does not) |
| `Property_Area` | Area of the property (Urban / Semiurban / Rural) |
| `Loan_Status` | Approval status (Y/N) |
The test dataset has the same structure as the training dataset but does not contain the Loan_Status column.
- Conducted statistical analysis to understand the distribution of features.
- Visualized the loan approval status distribution, highlighting the class imbalance (a sketch follows this list).
- Identified categorical and numerical columns for further processing.
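For reference, a minimal sketch of the class-balance check, assuming the training file name from the repository layout:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the training data (file name taken from the repository layout)
train = pd.read_csv('train_u6lujuX_CVtuZ9i.csv')

# Bar plot of approved (Y) vs. rejected (N) applications,
# making the class imbalance visible
sns.countplot(x='Loan_Status', data=train)
plt.title('Loan Approval Status Distribution')
plt.show()

# Exact class proportions
print(train['Loan_Status'].value_counts(normalize=True))
```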
- **Missing Value Imputation**: Imputed missing values in `LoanAmount`, `Loan_Amount_Term`, and `Credit_History` using appropriate strategies.
- **Encoding Categorical Variables**: Converted categorical features into numerical representations using one-hot encoding.
- **Scaling**: Scaled numerical features to normalize their distributions.
- **Feature Selection**: Dropped unnecessary columns such as `Loan_ID` before model training.
- **Train-Test Split**: Split the training data into training and validation sets.
- **Model Selection**: Evaluated various models, including Logistic Regression, Decision Tree, and Random Forest.
- **Hyperparameter Tuning**: Used GridSearchCV to tune the Random Forest model.
- **Model Evaluation**: Evaluated models based on accuracy, precision, recall, and F1-score.

Illustrative sketches of these steps appear below.
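The notebook holds the authoritative preprocessing code; the following is a minimal scikit-learn sketch of the imputation, encoding, scaling, and column-dropping steps. The median/most-frequent imputation strategies and the pipeline layout are assumptions, not necessarily what the notebook does:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

train = pd.read_csv('train_u6lujuX_CVtuZ9i.csv')

# Drop the identifier column and separate the target
X = train.drop(columns=['Loan_ID', 'Loan_Status'])
y = train['Loan_Status'].map({'Y': 1, 'N': 0})

numeric_cols = ['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
                'Loan_Amount_Term', 'Credit_History']
categorical_cols = ['Gender', 'Married', 'Dependents', 'Education',
                    'Self_Employed', 'Property_Area']

# Median imputation + scaling for numeric columns; most-frequent
# imputation + one-hot encoding for categorical columns
preprocessor = ColumnTransformer([
    ('num', Pipeline([('impute', SimpleImputer(strategy='median')),
                      ('scale', StandardScaler())]), numeric_cols),
    ('cat', Pipeline([('impute', SimpleImputer(strategy='most_frequent')),
                      ('encode', OneHotEncoder(handle_unknown='ignore'))]),
     categorical_cols),
])

X_processed = preprocessor.fit_transform(X)
```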
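Continuing the sketch, a hold-out split and a quick comparison of the three candidate models. The 80/20 ratio and the random seed are assumptions; `classification_report` prints the accuracy, precision, recall, and F1-score used for evaluation:

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Hold out a validation set from the training data
X_train, X_val, y_train, y_val = train_test_split(
    X_processed, y, test_size=0.2, random_state=42, stratify=y)

# Fit each candidate model and report validation metrics
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42),
}
for name, clf in models.items():
    clf.fit(X_train, y_train)
    print(name)
    print(classification_report(y_val, clf.predict(X_val)))
```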
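And a GridSearchCV sketch for the Random Forest. The grid is an assumption, chosen so that it contains the best parameters reported in the results table below; the notebook's actual grid may differ:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10],
    'max_features': ['sqrt', 'log2'],
    'min_samples_leaf': [1, 2, 4],
    'min_samples_split': [2, 5, 10],
}

# 5-fold cross-validated grid search over the Random Forest
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=5, scoring='accuracy', n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```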
| Model | Accuracy | Key Parameters |
|---|---|---|
| Logistic Regression | 79% | - |
| Decision Tree | 68% | - |
| Random Forest | 78% | `max_depth=None`, `max_features='sqrt'`, `min_samples_leaf=1`, `min_samples_split=10`, `n_estimators=50` |
Final Model Accuracy: 78.86%
- Address Class Imbalance: Apply techniques such as class reweighting or resampling to improve model performance (see the sketch after this list).
- Feature Engineering: Explore additional features that could enhance prediction accuracy.
- Deployment: Create a web application for real-time predictions of loan approvals.
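One low-effort option for the class-imbalance item (alternatives include resampling, e.g. SMOTE from `imbalanced-learn`) is class reweighting. This is an untested sketch, not a confirmed improvement:

```python
from sklearn.ensemble import RandomForestClassifier

# Weight each class inversely to its frequency so the rarer class
# contributes proportionally more to the training objective
balanced_rf = RandomForestClassifier(class_weight='balanced', random_state=42)
```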
To use the trained model for making predictions, load `final_model.pkl` and preprocess your input data with the same steps applied to the training data:
```python
import pandas as pd
import joblib

# Load the trained model
model = joblib.load('final_model.pkl')

# Load new applications and apply the same preprocessing used at
# training time (imputation, one-hot encoding, scaling, dropping Loan_ID)
new_data = pd.read_csv('test_Y3wMUE5_7gLdaTN.csv')
new_data_processed = preprocess(new_data)  # placeholder for your preprocessing routine

# Make predictions
predictions = model.predict(new_data_processed)
```
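The `preprocess` call above is a placeholder for your own routine. A more robust pattern, sketched here assuming a fitted transformer like the `ColumnTransformer` above is available (`preprocessor.pkl` is a hypothetical file name, not part of the repository), is to persist the fitted preprocessor alongside the model:

```python
import joblib
import pandas as pd

# At training time: persist the fitted preprocessor next to the model
# ('preprocessor.pkl' is a hypothetical file name)
joblib.dump(preprocessor, 'preprocessor.pkl')

# At prediction time: load both artifacts and apply the identical transform
preprocessor = joblib.load('preprocessor.pkl')
model = joblib.load('final_model.pkl')

new_data = pd.read_csv('test_Y3wMUE5_7gLdaTN.csv')
X_new = preprocessor.transform(new_data.drop(columns=['Loan_ID']))
predictions = model.predict(X_new)
```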
- Pandas for data manipulation
- Scikit-learn for machine learning algorithms
- Matplotlib and Seaborn for data visualization