Masterschool's capstone project integrating skills and tools for data analysis.
- Identify the top performers: top customer, top product, and top category
- Identify the customer segments based on RFM
- Did higher-priced products contribute more to sales than lower-priced products?
- Compared to other months, were sales higher during the Christmas season (December)?
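
As a rough illustration of how the December question could be approached, the sketch below totals sales per calendar month and compares the December average with the average of the other months. It is a descriptive comparison only, not the statistical test used in the notebooks. The sample rows and the function names `monthly_sales` and `december_vs_rest` are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sample rows: (InvoiceDate, Quantity, UnitPrice)
rows = [
    (datetime(2019, 10, 3, 9, 15), 4, 2.50),
    (datetime(2019, 11, 12, 14, 0), 10, 1.00),
    (datetime(2019, 12, 2, 11, 30), 8, 3.00),
    (datetime(2019, 12, 5, 16, 45), 2, 5.00),
]

def monthly_sales(rows):
    """Sum Quantity * UnitPrice per (year, month)."""
    totals = defaultdict(float)
    for invoice_date, quantity, unit_price in rows:
        totals[(invoice_date.year, invoice_date.month)] += quantity * unit_price
    return dict(totals)

def december_vs_rest(totals):
    """Return (mean December sales, mean sales of all other months)."""
    december = [v for (_, month), v in totals.items() if month == 12]
    others = [v for (_, month), v in totals.items() if month != 12]
    return sum(december) / len(december), sum(others) / len(others)

totals = monthly_sales(rows)
dec_mean, other_mean = december_vs_rest(totals)
print(totals)
print(dec_mean, other_mean)
```

Because the data ends on 2019-12-07, December 2019 is only partially covered, which any month-level comparison would need to account for.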
- Data Preprocessing
- Exploratory Data Analysis
- Customer Segmentation based on RFM Metrics (using percentile ranking and K-means clustering)
- Product Categorization & Product Category Analysis
- Statistical Hypotheses
- Insights
- Dashboard (Tableau)
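
The RFM step in the list above can be sketched as follows. This is a minimal, standard-library illustration of computing recency, frequency, and monetary values per customer and turning one metric into a percentile rank. It does not cover the K-means part and is not the notebooks' actual code. The sample rows, the snapshot date, and the function names `rfm_table` and `percentile_rank` are assumptions made for the example.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sample rows: (InvoiceNo, CustomerID, InvoiceDate, Quantity, UnitPrice)
rows = [
    ("536365", 17850, date(2019, 11, 30), 6, 2.55),
    ("536366", 17850, date(2019, 12, 5), 2, 3.39),
    ("536367", 13047, date(2019, 6, 1), 10, 1.25),
    ("536368", 12583, date(2019, 1, 15), 1, 4.00),
]

# Assumed reference date: one day after the last transaction in the dataset.
snapshot = date(2019, 12, 8)

def rfm_table(rows, snapshot):
    """Recency (days since last purchase), frequency (distinct invoices),
    and monetary value (total spend) per customer."""
    last_purchase = {}
    invoices = defaultdict(set)
    monetary = defaultdict(float)
    for invoice_no, customer_id, invoice_date, quantity, unit_price in rows:
        previous = last_purchase.get(customer_id, invoice_date)
        last_purchase[customer_id] = max(previous, invoice_date)
        invoices[customer_id].add(invoice_no)
        monetary[customer_id] += quantity * unit_price
    return {
        c: {
            "recency": (snapshot - last_purchase[c]).days,
            "frequency": len(invoices[c]),
            "monetary": round(monetary[c], 2),
        }
        for c in last_purchase
    }

def percentile_rank(values, higher_is_better=True):
    """Share of customers that each value matches or beats (between 0 and 1)."""
    n = len(values)
    if higher_is_better:
        return [sum(other <= v for other in values) / n for v in values]
    return [sum(other >= v for other in values) / n for v in values]

rfm = rfm_table(rows, snapshot)
customers = sorted(rfm)
# For recency, a smaller number of days is better.
r_rank = dict(zip(
    customers,
    percentile_rank([rfm[c]["recency"] for c in customers], higher_is_better=False),
))
print(rfm)
print(r_rank)
```

The same `percentile_rank` call with `higher_is_better=True` would apply to frequency and monetary value, and the three ranks could then be combined into segment labels.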
The project is composed of five Jupyter notebooks:
- 1_data_prep_and_EDA.ipynb
- 2_customer_rfm_segmentation.ipynb
- 3_product_categorization.ipynb
- 4_product_category_analysis.ipynb
- 5_statistical_hypotheses.ipynb
The summary of observations, insights, conclusions, and resources can be found in the 5_statistical_hypotheses.ipynb notebook.
The dataset is a modified version of the Online Retail dataset from the UCI Machine Learning Repository. It contains 541,909 transaction records dated from 2018-11-29 to 2019-12-07, each with seven attributes:
- InvoiceNo: Invoice reference number uniquely assigned to each transaction. An InvoiceNo starting with 'C' indicates a cancellation
- StockCode: Product or item code uniquely assigned to each distinct product
- Description: Product or item name
- Quantity: Quantity of each product or item per transaction
- InvoiceDate: Invoice date and time, i.e. when the transaction was generated by the system
- UnitPrice: Product price per unit
- CustomerID: Customer reference number uniquely assigned to each customer
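
To make the attribute definitions concrete, here is a small sketch of one possible preprocessing step: separating cancellations (InvoiceNo starting with 'C') from regular sales and adding a line total from Quantity and UnitPrice. The sample records, the `LineTotal` field, and the helper names are hypothetical and are not taken from the notebooks.

```python
# Hypothetical sample records using the seven attributes described above.
records = [
    {"InvoiceNo": "536365", "StockCode": "A100", "Description": "SAMPLE ITEM A",
     "Quantity": 6, "InvoiceDate": "2018-11-29 08:26", "UnitPrice": 2.55,
     "CustomerID": "17850"},
    {"InvoiceNo": "C536379", "StockCode": "B200", "Description": "SAMPLE ITEM B",
     "Quantity": -1, "InvoiceDate": "2018-11-29 09:41", "UnitPrice": 27.50,
     "CustomerID": "14527"},
    {"InvoiceNo": "536380", "StockCode": "C300", "Description": "SAMPLE ITEM C",
     "Quantity": 3, "InvoiceDate": "2018-11-29 10:03", "UnitPrice": 4.25,
     "CustomerID": "13047"},
]

def is_cancellation(invoice_no):
    """An InvoiceNo starting with 'C' marks a cancelled transaction."""
    return str(invoice_no).startswith("C")

def with_line_total(record):
    """Return a copy of the record with an added LineTotal field (assumed name)."""
    enriched = dict(record)
    enriched["LineTotal"] = round(record["Quantity"] * record["UnitPrice"], 2)
    return enriched

sales = [with_line_total(r) for r in records if not is_cancellation(r["InvoiceNo"])]
cancellations = [r for r in records if is_cancellation(r["InvoiceNo"])]

print(len(sales), "sales rows,", len(cancellations), "cancellation rows")
```

Whether cancellations are dropped or analysed separately is a design choice. The sketch only shows how the 'C' prefix can be used to tell them apart.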