As part of the Big Data and AI Engineering Onsite Bootcamp, we were asked to deliver a data science solution for the Saudi market. The project has to have an impact and address a real-world problem using Saudi datasets.
Table of Contents
This is an overview of the project's structure and files for easier navigation. Some notebooks and datasets could not be uploaded, either to protect the company's confidentiality or because of size limits:
├── README.md
├── CapestoneProject_Dashboard_Desert_Ninjas.pdf
├── CapstoneProject_Presentation_Desert_Ninjas.pdf
├── Notebooks
│ ├── CapstoneProject_Pre_Preprocessing_Notebook_ComanyNameEncryption.ipynb
│ ├── CapstoneProject_Preprocessing_Notebook_Desert_Ninjas.ipynb
│ ├── CapstoneProject_EDA_Notebook_Desert_Ninjas.ipynb
│ └── CapstoneProject_ML_Notebook_Desert_Ninjas.ipynb
└── Datasets
  ├── Encrypted_full_dataset.csv (output of the pre-preprocessing notebook)
  ├── Encrypted_exported_raw_data.csv (output of the pre-preprocessing notebook)
  ├── Preprocessed_full_dataset.csv (output of the preprocessing notebook)
  └── Final_extracted_dataset.csv (used for the EDA, Dashboard, and Machine Learning models)
Note: We started with two datasets that have different schemas (Encrypted_full_dataset and Encrypted_exported_raw_data).
The purpose of this project is to predict potential customers for a FinTech startup using its visitors' activity logs. These potential investors would then be targeted with marketing strategies.
- Preprocessing raw data
- Feature Engineering
- Feature Selection
- Labeling and classifying the data
- Exploratory Data Analysis
- Data Visualization
- Machine Learning
- Oversampling
- Python, Jupyter
- Pandas
- Plotly
- Sklearn
- Imbalanced-learn
- Power BI
A FinTech startup, referred to here as X, wants to understand its customers' behavior and whether they are going to invest, based on their activity logs. The problem is challenging because we lack the following data to support our analysis:
- The number of visitors to the website
- The demographics of these visitors
The analysis will help the company create a new marketing strategy for attracting more customers, increasing its revenues, and learning the patterns of customers who reach the investment pages but do not commit to the full transaction. Lucky for the FinTech company, we say, challenge accepted!
At the beginning of our analysis, we raised some questions that we intend to answer using our EDA, dashboard visualization, and modeling. The questions are:
- What kind of data does their website collect from users?
- Which path do users usually visit? And how much time do they spend on this path?
- Does the average time spent on a page differ based on the user type?
- Which path has the maximum time? Is this the path that leads to a successful transaction (investment)?

We hope to answer all of these questions in our analysis.
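As a rough illustration of how questions like these can be approached with Pandas, here is a minimal sketch. The column names (`user_type`, `path`, `time_spent_sec`) and the sample values are hypothetical and are not taken from the company's dataset; they only show the shape of the aggregation.

```python
import pandas as pd

# Hypothetical activity log: column names and values are made up for illustration.
logs = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3, 3],
    "user_type": ["visitor", "visitor", "investor", "investor", "visitor", "visitor"],
    "path": ["/home", "/invest", "/home", "/invest", "/home", "/about"],
    "time_spent_sec": [10.0, 30.0, 20.0, 120.0, 5.0, 15.0],
})

# Which path gets visited most often?
visits_per_path = logs["path"].value_counts()
most_visited_path = visits_per_path.idxmax()

# Does the average time spent on a page differ by user type?
avg_time_by_user_type = logs.groupby("user_type")["time_spent_sec"].mean()

# Which path accumulates the most time overall?
total_time_by_path = logs.groupby("path")["time_spent_sec"].sum()
longest_path = total_time_by_path.idxmax()

print(most_visited_path)
print(avg_time_by_user_type)
print(longest_path)
```

On real logs, the same `groupby` calls would run against the actual timestamp-derived duration column rather than this toy one.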
Preprocessing is the essence of this project. This README gives an overview of each step. For a more detailed description, visit our Medium blog post.
The dataset before and after the preprocessing:
Preprocessing steps:
Feature engineering steps:
Features before removing data leakage:
Selecting the features after removing the data leakage:
Based on our EDA, we found that 80% of our users are regular visitors, while only 17% are investors. We therefore created two dashboards, one for each of these user types.
Visitors dashboard:
Investors dashboard:
As mentioned above, you can visit our Medium blog post for a detailed analysis of the project.
All of these models were evaluated in order to choose the best one.
Because our dataset is imbalanced, we use recall as our evaluation metric. We also want to focus on identifying the potential customers class, so we chose the model that identifies this class best compared to our baseline, which is XGBoost.
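To make the reasoning about recall on imbalanced data concrete, here is a minimal sketch on synthetic data. It is not our notebook code: the data comes from `make_classification`, the baseline is assumed to be a majority-class `DummyClassifier`, the oversampling is plain random oversampling done with `sklearn.utils.resample`, and `GradientBoostingClassifier` stands in for XGBoost so that the example needs only Sklearn. With Imbalanced-learn and XGBoost installed, `RandomOverSampler` and `XGBClassifier` would be natural substitutes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic imbalanced data (assumption): class 1 plays the role of "potential customer".
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.8, 0.2], class_sep=1.5, random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Randomly oversample the minority class, on the training split only,
# until both classes have the same number of rows.
minority = y_train == 1
X_min_up, y_min_up = resample(
    X_train[minority], y_train[minority],
    replace=True, n_samples=int((~minority).sum()), random_state=42,
)
X_bal = np.vstack([X_train[~minority], X_min_up])
y_bal = np.concatenate([y_train[~minority], y_min_up])

# Baseline (assumption): always predict the majority class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
# Stand-in for XGBoost, trained on the balanced data.
model = GradientBoostingClassifier(random_state=42).fit(X_bal, y_bal)

# Recall on the class of interest: the share of true positives that were found.
baseline_recall = recall_score(y_test, baseline.predict(X_test), pos_label=1)
model_recall = recall_score(y_test, model.predict(X_test), pos_label=1)
print(f"baseline recall: {baseline_recall:.2f}, model recall: {model_recall:.2f}")
```

A majority-class baseline can reach high accuracy on imbalanced data while never finding a single member of the minority class, which is why recall on that class is the more informative number here. The test split is left untouched by the oversampling so that the reported recall reflects the original class distribution.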
XGBoost results:
Baseline Distribution:
Team Leader: Reema Alaswad (Reema's LinkedIn)
Name | LinkedIn |
---|---|
Raghad Aleisa | Raghad's LinkedIn |
AlJohara Alkanhal | AlJohara's LinkedIn |
Maha AlHazzani | Maha's LinkedIn |
Eman Aldosari | Eman's LinkedIn |