Bank Churn Prediction ML App

The main goal of this project is to predict whether a customer continues with their account or closes it (e.g., churns). It is the 1st episode of Kaggle Playground Series - Season 4.

App Link: Bank Churn Prediction ML App
Evaluation Metric: The evaluation metric is area under the ROC curve between the predicted probability and the observed target
App Demo Screenshot:

Tech Stack

Data Visualization & Manipulation: matplotlib, seaborn, numpy & pandas.
Model Building: scikit-learn, xgboost, lightgbm & catboost.
Deployment: joblib & streamlit.
Versionning & Others: Git, Github, VSCode, os & argparse.

Data sources

Competition Dataset: The dataset for this competition (both train and test) was generated from a deep learning model trained on the Bank Customer Churn Prediction dataset. Feature distributions are close to, but not exactly the same, as the original.
Original Dataset: The bank customer churn dataset is a commonly used dataset for predicting customer churn in the banking industry. It contains information on bank customers who either left the bank or continue to be a customer. The dataset includes the following attributes:

Column name	Description
Customer ID	A unique identifier for each customer
Surname	The customer's surname or last name
Credit Score	A numerical value representing the customer's credit score
Geography	The country where the customer resides (France, Spain or Germany)
Gender	The customer's gender (Male or Female)
Age	The customer's age
Tenure	The number of years the customer has been with the bank
Balance	The customer's account balance
NumOfProducts	The number of bank products the customer uses (e.g., savings account, credit card)
HasCrCard	Whether the customer has a credit card (1 = yes, 0 = no)
IsActiveMember	Whether the customer is an active member (1 = yes, 0 = no)
EstimatedSalary	The estimated salary of the customer
Exited	Whether the customer has churned (1 = yes, 0 = no)

Code Structure

Screenshot:

Description:

folders	Description
input/	consists of all the input files and data for the project (train, test, original, sample_submission)
media/	only contains images for the README.md file
models/	keeps all the trained models saved with joblib.
notebooks/	all jupyter notebooks (i.e. any *.ipynb file) are stored in the notebooks folder
output/	contains all the models predictions(i.e submissions to the competition)
src/	contains all the python scripts(i.e. any *.py file) as well as a bash script (i.e. run.sh file) to automate model training process
README.md	markdown file where I described the project. It's what you're reading right now
app.py	python file for streamlit deployement
requirements.txt	contains all the packages neede by the project to run and their versions

Results and evaluation

My Approach: Firstly I setted up a local 5 fold cross validation using the create_folds.py file. Then, I tried five tree-based models (eg. xgboost, lightgbm, catboost, gbm and histgbm) and then I ensemble the predictions on a weigthwed averave. I tried different Feature Engineering methods but looked like they didn't improve too much the score. As I didn't have too much time for this competition I only tuned some hyperparameters of the histgbm model using RandomizedSearchCV.

What's more, it seems that the winner actually used only CatBoost, and I'm happy about that (I can even brag about it) because I was probably the first to publish a public notebook on Catboost baseline at the beginning of the competition.

Here is the link to my first notebook about Catboost baselise[PS4-EP1] Bank Churn | EDA📊 + CatBoost baseline🦾 I also added it to the notebooks/ folder.
First place solution wrappup:
1. It was important to recognize that CustomerId and Surname are high-cardinality categorical variables and encode them correctly.
2. Catboost was enough to win this competition. My final submission was based on one Catboost model averaged across 20 folds.
3. If you tuned Catboost parameters well you would be in Top 3. There're a lot of parameters to tune and there are many ways to encode categorical features. I now realize that my 1st place solution can be improved even further.
My Kaggle Leaderboard results:
- Public LB AUC score : 0.88666
- Private LB AUC score : 0.88938
- LB rank : Although I didn't commit 100% to this competition, I finished 1101 out of 3633 participants (Top 30%). I'm sure I'll do better in future Playgroud series :)

How to Run this Project Locally?

First fork the repo to your GitHub account
Clone the project

  git clone https://github.com/ANYANTUDRE/Binary-Classification-with-a-Bank-Churn-Dataset

Go to the project directory

  cd Binary-Classification-with-a-Bank-Churn-Dataset

Install dependencies

  pip intall requirements.txt

Start the server

  streamlit run app.py

Contributing

All kind of contributions are always welcome!

Authors

@anyantudre

License

Competition Dataset:
@misc{playground-series-s4e1,
author = {Walter Reade, Ashley Chow},
title = {Binary Classification with a Bank Churn Dataset },
publisher = {Kaggle},
year = {2024},
url = {https://kaggle.com/competitions/playground-series-s4e1}
}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Bank Churn Prediction ML App

Tech Stack

Data sources

Code Structure

Results and evaluation

How to Run this Project Locally?

Contributing

Authors

License

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
input		input
media		media
models		models
notebooks		notebooks
output		output
src		src
README.md		README.md
app.py		app.py
requirements.txt		requirements.txt

ANYANTUDRE/Binary-Classification-with-a-Bank-Churn-Dataset

Folders and files

Latest commit

History

Repository files navigation

Bank Churn Prediction ML App

Tech Stack

Data sources

Code Structure

Results and evaluation

How to Run this Project Locally?

Contributing

Authors

License

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages