This project analyzes a wholesale customer dataset that records annual spending (in monetary units) across several product categories. The primary objective is to segment customers by purchasing behavior and gain insight into their preferences.
- Project Description
- Getting Started
- Exploratory Data Analysis (EDA)
- Clustering Analysis
- Principal Component Analysis (PCA)
- Results
- Contributing
The project involves the following key steps:
- Exploratory Data Analysis (EDA): This phase focuses on understanding the dataset, cleaning and preprocessing the data, and generating insights through various visualizations and statistical summaries.
- Clustering Analysis: The dataset is clustered using unsupervised machine learning techniques, such as K-means and hierarchical clustering, to group similar customers together based on their spending behavior.
- Principal Component Analysis (PCA): PCA is applied to identify the principal components that best describe the variance in the data and reduce dimensionality.
- Results: The findings from the analysis, including customer segments and insights gained, are presented in the README and in the project's documentation.
Before running the project, ensure you have the following prerequisites:
- Python
- Jupyter Notebook
- Clone this repository.
- Install the required Python packages. To run this project, you need the following packages:
  - numpy
  - pandas
  - matplotlib
  - seaborn
  - scikit-learn

You can install these packages using pip: `pip install numpy pandas matplotlib seaborn scikit-learn`
The EDA phase involves data cleaning, visualization, and summary statistics to gain insights into the dataset. Key visualizations and observations include:
- Histograms and box plots to understand the distribution of spending in each product category.
- Correlation analysis to identify relationships between variables.
- Outlier detection and handling.
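The EDA steps listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual notebook code: the data is synthetic, the column names (`fresh`, `milk`, `grocery`, `frozen`) are hypothetical, and the 1.5 × IQR rule and row-dropping strategy are only one common way to detect and handle outliers.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset. Column names and values are
# assumptions made for illustration only.
rng = np.random.default_rng(0)
categories = ["fresh", "milk", "grocery", "frozen"]  # hypothetical names
df = pd.DataFrame(
    rng.lognormal(mean=8.0, sigma=1.0, size=(200, len(categories))),
    columns=categories,
)

# Summary statistics for each product category.
summary = df.describe()

# Pairwise Pearson correlations between categories.
corr = df.corr()

# Flag outliers per category with the 1.5 * IQR rule (an assumed choice).
q1 = df.quantile(0.25)
q3 = df.quantile(0.75)
iqr = q3 - q1
is_outlier = (df < q1 - 1.5 * iqr) | (df > q3 + 1.5 * iqr)

# One possible handling strategy: drop rows flagged in any category.
df_clean = df[~is_outlier.any(axis=1)]

print(summary.loc[["mean", "50%", "max"]])
print(corr.round(2))
print(f"rows kept: {len(df_clean)} of {len(df)}")
```

With matplotlib installed, `df.hist(bins=30)` and `df.plot.box()` would produce the histograms and box plots for the same DataFrame.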
The dataset is segmented into clusters using the following methods:
- K-means Clustering: The optimal number of clusters is determined using the Elbow Method, and customers are grouped accordingly.
- Hierarchical Clustering: Clusters are formed based on hierarchical relationships between data points.
- Cluster Interpretation: Each cluster is described, and insights into customer behavior are provided.
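As a rough sketch of the three clustering steps above, the example below uses scikit-learn on synthetic data. Several things here are assumptions rather than details from the project: the generated data, standardizing features before clustering, Ward linkage for the hierarchical step, and the value of `chosen_k`, which in practice would be read off an elbow plot.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: three customer groups over four hypothetical categories.
rng = np.random.default_rng(0)
centers = np.array(
    [[2.0, 8.0, 9.0, 1.0], [9.0, 2.0, 2.0, 6.0], [5.0, 5.0, 5.0, 5.0]]
)
X = np.vstack([c + rng.normal(scale=0.5, size=(60, 4)) for c in centers])

# Standardize so no single category dominates the distances (an assumption).
X_scaled = StandardScaler().fit_transform(X)

# Elbow Method: record the K-means inertia for a range of k values.
inertias = {}
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    inertias[k] = km.inertia_

# Assumed value: normally chosen where the inertia curve flattens out.
chosen_k = 3
kmeans_labels = KMeans(
    n_clusters=chosen_k, n_init=10, random_state=0
).fit_predict(X_scaled)

# Hierarchical (agglomerative) clustering with the same number of clusters.
hier_labels = AgglomerativeClustering(
    n_clusters=chosen_k, linkage="ward"
).fit_predict(X_scaled)

# Cluster interpretation: mean spend per category, in the original units.
for label in range(chosen_k):
    print(label, X[kmeans_labels == label].mean(axis=0).round(2))
```

Plotting `inertias` against k gives the elbow curve. Comparing the per-cluster means is one simple way to describe what distinguishes each segment.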
PCA is applied to understand the underlying structure of the data and reduce dimensionality. Key components are identified and interpreted.
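The PCA step might look something like the sketch below. The synthetic data, the choice of two components, and the standardization step are assumptions for illustration and are not taken from the project itself.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: six hypothetical spending categories driven by two
# latent factors plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 6))

# Standardize first so each category contributes on the same scale
# (an assumption; PCA is sensitive to feature scale).
X_scaled = StandardScaler().fit_transform(X)

# Keep two components (an assumed choice) and project the customers onto them.
pca = PCA(n_components=2)
scores = pca.fit_transform(X_scaled)

# Share of total variance captured by each component.
print(pca.explained_variance_ratio_.round(3))

# Loadings: each row shows how strongly every category weighs on a component.
print(pca.components_.round(2))
```

The loadings in `pca.components_` are what one would inspect to interpret each component, and `scores` can be plotted in two dimensions to visualize the customers.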
The key findings and insights from the analysis are presented, including:
- Customer segments and their characteristics.
- Optimal number of clusters.
- Principal components and their interpretations.
Contributions to this project are welcome. If you have suggestions, bug reports, or feature requests, please create an issue or submit a pull request.
Feel free to reach out if you have any questions or feedback about the project.