This is a graduate course-level research project completed by Emily Au, Alex Mak, and Zheng En Than in MATH 509 (Data Structures and Platforms) at the University of Alberta. The project aims to predict whether bank clients will subscribe to a term deposit using tree-based machine-learning classifiers (Decision Tree, Random Forest, and XGBoost).
- Utilize tree-based machine-learning models to predict whether a client will subscribe to a term deposit through direct marketing campaigns.
- Identify the significant factors influencing a potential client's decision to subscribe to a term deposit.
- Determine the predictive accuracy of our classifier models in forecasting subscription outcomes.
- Observe the impact of bagging and boosting techniques on the predictive performance of tree-based machine-learning models.
- Entire codebase of the project (including data preprocessing, feature engineering, predictive modeling, model evaluation, and data visualization).
- The previous versions of the codebase are also stored.
- The dataset used in this project, both the raw and processed dataset.
- Bank Marketing dataset from UCI (UC Irvine) machine learning repository (https://archive.ics.uci.edu/dataset/222/bank+marketing).
- The fitted models and their corresponding parameters after training in this project.
- The finalized report of our project.
- The legacy version of the report is also stored.
- The visualizations generated with Python (matplotlib and seaborn) and Tableau.
- An engaging presentation conveying our findings and insights.
We have conducted the following steps in our project:
- Data Preprocessing (data cleaning and transformation, anomaly detection analysis, exploratory data analysis)
- Feature Engineering (feature importance, feature selection)
- Statistical Machine Learning Model Development (model training and fitting, model evaluation, model optimization, model prediction)
- Data Visualization (within and between models)
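The preprocessing and feature-importance steps above can be sketched as follows. This is a minimal illustration on a toy stand-in for the data, not the project's actual pipeline; the column names (`duration`, `poutcome`) follow the UCI Bank Marketing dataset, while the values are made up.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the UCI Bank Marketing data (the real dataset has ~45k rows).
df = pd.DataFrame({
    "duration": [120, 300, 45, 600, 210, 90, 400, 30],   # last contact duration (seconds)
    "poutcome": ["success", "failure", "nonexistent", "success",
                 "failure", "nonexistent", "success", "failure"],
    "age": [34, 51, 23, 45, 38, 29, 60, 41],
    "y": [1, 1, 0, 1, 0, 0, 1, 0],                       # subscribed to a term deposit?
})

# One-hot encode the categorical feature, then split into train/test sets.
X = pd.get_dummies(df.drop(columns="y"), columns=["poutcome"])
y = df["y"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

clf = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)

# Impurity-based feature importances rank the most influential inputs.
importances = pd.Series(clf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances)
```

On the real data, the same `feature_importances_` attribute (and its Random Forest / XGBoost counterparts) is what surfaces features such as last contact duration and previous campaign outcome.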
- The most important features are: last contact duration, outcome of the previous marketing campaign, and day of year.
- Bagging and boosting each improve performance over the baseline Decision Tree for this specific problem and dataset.
- Numerical Results:
| Model | Training Accuracy | Testing Accuracy | Tuning Combinations | Computation Time |
|---|---|---|---|---|
| Decision Tree | 86.76% | 89.04% | 2592 | ~10 minutes |
| Random Forest | 91.49% | 90.22% | 1024 | ~20 minutes |
| XGBoost | 92.38% | 91.00% | 576 | ~40 minutes |
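The comparison behind the table can be sketched as below. This is a hedged illustration on synthetic data: scikit-learn's `GradientBoostingClassifier` stands in for XGBoost to keep the sketch dependency-light, and the accuracies it prints are not the tuned numbers reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data standing in for the bank dataset.
X, y = make_classification(n_samples=2000, n_features=15, n_informative=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "Random Forest (bagging)": RandomForestClassifier(n_estimators=200, random_state=0),
    # Stand-in for XGBoost: gradient boosting over shallow trees.
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=200, random_state=0),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: train={model.score(X_train, y_train):.3f}, "
          f"test={model.score(X_test, y_test):.3f}")
```

In the project itself, each model was additionally tuned over the hyperparameter grids whose sizes appear in the "Tuning Combinations" column.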
- The optimized models implemented in this project are deployed in a Streamlit web application!
- Please clone this repo, go to Code --> Model_Deployment, and enter the following command:

  ```shell
  streamlit run Deployment_Codebase.py
  ```

The following screenshots show what the app looks like when it is deployed.
- Initialization
- Successful Prediction
- Failed Prediction
- Ensemble methods (Random Forest and XGBoost) can be more complex than a single Decision Tree, making it challenging to interpret the reasoning behind each prediction.
- Limited generalizability as the dataset consists of data from a Portuguese bank and its specific marketing approach.
- We would like to re-examine this project with a different dataset, for example one from another bank with a different telemarketing campaign.
- We are interested in further optimizing our tree-based machine learning models, but that also comes with the drawback of consuming additional computational resources.
- We are looking forward to implementing a gradient-boosted random forest (GBRF), which incorporates both bagging and boosting in one tree-based model. We could then analyze the impact of combining bagging and boosting, compared with using only one of them as in Random Forest and Decision Tree.
- We would conduct more in-depth analysis, such as exploring any temporal patterns or clustering the data based on client demographics to provide deeper insights into customer behavior, ultimately helping banks devise more effective targeted marketing strategies.
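The GBRF idea mentioned in the future work above can be prototyped by using a small random forest as the base learner inside a gradient-boosting loop. The sketch below is a hypothetical hand-rolled version (binary log-loss, forests fitted to pseudo-residuals), not the implementation we would ultimately use; libraries such as XGBoost offer built-in support for boosting forests.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic binary-classification data standing in for the bank dataset.
X, y = make_classification(n_samples=1500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Gradient boosting with a *random forest* (bagging) as each boosting stage:
# every round fits a small forest to the pseudo-residuals of the log-loss.
n_rounds, lr = 20, 0.3
F_tr = np.zeros(len(y_tr))   # raw scores (log-odds) on the training data
forests = []
for _ in range(n_rounds):
    residuals = y_tr - sigmoid(F_tr)   # negative gradient of binary log-loss
    rf = RandomForestRegressor(n_estimators=25, max_depth=3, random_state=0)
    rf.fit(X_tr, residuals)
    forests.append(rf)
    F_tr += lr * rf.predict(X_tr)

# Predict by summing the scaled forest outputs and thresholding at p = 0.5.
F_te = sum(lr * rf.predict(X_te) for rf in forests)
accuracy = ((sigmoid(F_te) > 0.5) == y_te).mean()
print(f"GBRF sketch test accuracy: {accuracy:.3f}")
```

Comparing this combined approach against pure bagging (Random Forest) and pure boosting (XGBoost) on the bank data would directly address the question raised above.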