Data source : https://www.kaggle.com/prachi13/customer-analytics
Table of Contents
• Stage 1: Data Exploration, Exploratory Data Analysis, Business Insight and Visualization
• Stage 2: Data Cleansing and Feature Engineering
• Stage 3: Modeling and Evaluation
Overall Project :
• Sought insight from the dataset with Exploratory Data Analysis
• Performed data cleansing, data processing, and data engineering to prepare the data for modeling
• Built a model to predict whether a delivery will reach the customer late or on time
• Developed recommendations & benefit analysis based on the insights and model predictions
An international e-commerce company that sells electronic products wants to discover key insights from its customer database. Currently, most of its deliveries arrive late.
Variable | Type | Definition | Example |
---|---|---|---|
ID | Nominal | Customer ID Number | 10, 15, 10995, 10996 |
Warehouse_block | Nominal | Warehouse to Store the Product | A, B, C, D, F |
Mode_of_Shipment | Nominal | Mode of Product Shipping | Flight, Road, Ship |
Customer_care_calls | Discrete | Number of Calls Made | 1, 2, 5, 6 |
Customer_rating | Ordinal | Company Rating by Customers | 5: Best - 4: Better - 3: Neutral - 2: Bad - 1: Worst |
Cost_of_the_Product | Discrete | Cost of Product in US Dollars | 177, 216, 236, 182 |
Prior_purchases | Discrete | Number of Prior Purchases | 3, 2, 6 |
Product_importance | Ordinal | Product Importance Parameter | Low, Medium, High |
Gender | Nominal | Customer Gender | Male, Female |
Discount_offered | Discrete | Product Discount in US Dollars | 65, 10, 16 |
Weight_in_gms | Continuous | Product Weight in grams | 4953, 5676, 2171 |
Reached.on.Time_Y.N | Nominal | Target Variable, 1: NOT reached on time - 0: REACHED on time | 1, 0 |
• 59.7% of deliveries are received late by customers (6,563 of 10,999).
• Ship (as a shipment mode) and Warehouse F have the highest delivery counts, but the share of late deliveries looks almost the same across categories. This suggests that lateness is driven by other factors.
• Every product with a discount above 10 is late. One hypothesis is that this happens in specific months, but it needs further checking.
• Deliveries are late whenever the product weight is between 2 and 4 kg.
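The headline late share can be recomputed from the counts quoted above, and the same ratio can be broken down by category to check whether shipment mode or warehouse block changes it. A minimal sketch in plain Python; the helper `late_rate_by_group` and the toy rows are hypothetical illustrations, not the project's code or data:

```python
# Counts quoted in the insight above.
late_deliveries = 6563
total_deliveries = 10999

late_rate = late_deliveries / total_deliveries
print(f"Late share: {late_rate:.1%}")


def late_rate_by_group(rows, key, target="Reached.on.Time_Y.N"):
    """Return {group value: share of rows where target == 1 (late)}."""
    totals, late = {}, {}
    for row in rows:
        group = row[key]
        totals[group] = totals.get(group, 0) + 1
        late[group] = late.get(group, 0) + row[target]
    return {group: late[group] / totals[group] for group in totals}


# Hypothetical toy rows for illustration only (not taken from the dataset).
toy_rows = [
    {"Mode_of_Shipment": "Ship", "Reached.on.Time_Y.N": 1},
    {"Mode_of_Shipment": "Ship", "Reached.on.Time_Y.N": 0},
    {"Mode_of_Shipment": "Flight", "Reached.on.Time_Y.N": 1},
]
print(late_rate_by_group(toy_rows, "Mode_of_Shipment"))
```

Comparing shares rather than raw counts is what separates "this category ships the most" from "this category is late the most".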
• Check missing & duplicate values
• Remove outliers with z-score
• Ordinal encoding for the Product_importance column & feature encoding for the remaining categorical columns
• Select best features for modeling
• Normalize & standardize all selected features
• Split features & target
• Split data into data train & data test
• Train models with 6 different algorithms: Decision Tree, Logistic Regression, Random Forest, XGBoost, KNN, and LightGBM
• Evaluate models with Accuracy, Precision, Recall, F1-Score, and AUC, focusing on the AUC score
• Hyperparameter tuning
• Select the best model
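The preprocessing steps listed above (z-score outlier removal, ordinal encoding, standardization, feature/target split, train/test split) could be sketched as follows. This is a plain-Python illustration under stated assumptions, not the project's actual pipeline: the threshold of 3, the `low < medium < high` mapping, the 80/20 split, the helper names, and the toy rows are all hypothetical.

```python
import random
import statistics

# Assumed ordinal mapping for Product_importance (not stated in the source).
IMPORTANCE_ORDER = {"low": 0, "medium": 1, "high": 2}
TARGET = "Reached.on.Time_Y.N"


def zscores(values):
    """Population z-scores; all zeros if the column is constant."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def drop_outliers(rows, column, threshold=3.0):
    """Keep rows whose |z-score| in `column` is at most `threshold`."""
    scores = zscores([row[column] for row in rows])
    return [row for row, z in zip(rows, scores) if abs(z) <= threshold]


def encode_importance(rows):
    """Replace the importance label with its ordinal rank."""
    return [
        {**row, "Product_importance": IMPORTANCE_ORDER[row["Product_importance"].lower()]}
        for row in rows
    ]


def standardize(rows, column):
    """Rescale `column` to zero mean and unit variance."""
    scores = zscores([row[column] for row in rows])
    return [{**row, column: z} for row, z in zip(rows, scores)]


def split_features_target(rows):
    """Separate the target column from the feature columns."""
    features = [{k: v for k, v in row.items() if k != TARGET} for row in rows]
    target = [row[TARGET] for row in rows]
    return features, target


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle reproducibly and return (train_rows, test_rows)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical toy rows: 19 ordinary discounts plus one extreme value.
toy_rows = [
    {"Product_importance": ["low", "medium", "high"][i % 3],
     "Discount_offered": 5 + i % 4,
     TARGET: i % 2}
    for i in range(19)
]
toy_rows.append({"Product_importance": "high", "Discount_offered": 500, TARGET: 1})

cleaned = drop_outliers(toy_rows, "Discount_offered")
encoded = encode_importance(cleaned)
scaled = standardize(encoded, "Discount_offered")
train_rows, test_rows = train_test_split(scaled)
X_train, y_train = split_features_target(train_rows)
print(len(toy_rows), len(cleaned), len(train_rows), len(test_rows))
```

The sketch follows the order of the list and scales before splitting. A common alternative is to compute the scaling statistics on the training split only, so that test rows do not influence them.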
Model | Accuracy | Precision | Recall | F1-Score | AUC |
---|---|---|---|---|---|
Decision Tree | 0.65 | 0.72 | 0.66 | 0.69 | 0.65 |
Logistic Regression | 0.58 | 0.58 | 1.00 | 0.73 | 0.50 |
LightGBM | 0.66 | 0.76 | 0.60 | 0.67 | 0.739 |
KNN | 0.66 | 0.78 | 0.56 | 0.65 | 0.67 |
Random Forest | 0.68 | 0.82 | 0.56 | 0.67 | 0.70 |
XGBoost | 0.65 | 0.71 | 0.67 | 0.69 | 0.65 |
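As a sanity check on the table, F1 is the harmonic mean of precision and recall, so each F1 value can be recomputed from the two columns next to it, and the stated focus on AUC can be applied by taking the maximum of the last column. A minimal sketch; the numbers are copied from the table, while the `results` dictionary and the `f1_score` helper are illustrative names rather than the project's code:

```python
# model: (precision, recall, reported F1, AUC), copied from the table above
results = {
    "Decision Tree": (0.72, 0.66, 0.69, 0.65),
    "Logistic Regression": (0.58, 1.00, 0.73, 0.50),
    "LightGBM": (0.76, 0.60, 0.67, 0.739),
    "KNN": (0.78, 0.56, 0.65, 0.67),
    "Random Forest": (0.82, 0.56, 0.67, 0.70),
    "XGBoost": (0.71, 0.67, 0.69, 0.65),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Recompute F1 for every row and show it next to the reported value.
for name, (precision, recall, reported_f1, auc) in results.items():
    recomputed = f1_score(precision, recall)
    print(f"{name}: F1 = {recomputed:.2f} (table: {reported_f1:.2f})")

# Rank by AUC, the metric the evaluation focuses on.
best_by_auc = max(results, key=lambda name: results[name][3])
print("Highest AUC:", best_by_auc)
```

The Logistic Regression row is worth a second look: a recall of 1.00 together with an AUC of 0.50 is what a model produces when it labels every delivery as late, which would explain why its F1 looks competitive while its AUC is no better than chance.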
• Add an estimated arrival time to help ensure packages arrive on time
• Give credit points as compensation to retain customer loyalty
• Add more features to give more specific & accurate insights
• Perform an operational audit based on the insights