This project focuses on segmenting credit card customers based on their spending patterns and behaviors. The goal is to help a financial institution develop targeted marketing strategies and better understand customer needs using unsupervised machine learning techniques.
We performed clustering using K-Means after reducing the dimensionality of the dataset with Principal Component Analysis (PCA). The project involved multiple steps, including data preprocessing, feature engineering, model building, and validation.
- Data Overview
- Data Preprocessing
- Dimensionality Reduction (PCA)
- Clustering with K-Means
- Model Validation
- Results and Interpretation
- Technical Skills
The dataset contains behavioral data on 9,000 active credit card holders over six months. The key columns include:
- CUST_ID: Unique identifier for each customer.
- BALANCE: Current balance on the credit card.
- PURCHASES: Total amount spent using the card.
- ONEOFF_PURCHASES: Largest single purchase made.
- INSTALLMENTS_PURCHASES: Total amount spent on installment-based purchases.
- CASH_ADVANCE: Total cash withdrawn using the card.
- PURCHASES_FREQUENCY: Frequency of purchases.
- CREDIT_LIMIT: Maximum limit on the credit card.
- PAYMENTS: Total amount paid by the customer.
- TENURE: Duration of the customer’s relationship with the credit card company.
To ensure the quality of the input data, the following preprocessing steps were implemented:
We identified missing values in the MINIMUM_PAYMENTS and CREDIT_LIMIT columns. The missing data was imputed using:
- KNN Imputer: This method imputed missing values based on the closest neighbors, ensuring the imputed values were aligned with the overall customer profiles.
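The imputation step might look like the following minimal sketch. The data is synthetic and the `n_neighbors=5` setting is simply the scikit-learn default; neither is taken from the project itself.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Synthetic stand-in for the real dataset; all values are made up.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "BALANCE": rng.gamma(2.0, 800.0, size=200),
    "CREDIT_LIMIT": rng.gamma(2.0, 2000.0, size=200),
    "MINIMUM_PAYMENTS": rng.gamma(2.0, 300.0, size=200),
})

# Knock out a few values to mimic the gaps described above.
df.loc[df.sample(10, random_state=0).index, "MINIMUM_PAYMENTS"] = np.nan
df.loc[df.sample(2, random_state=1).index, "CREDIT_LIMIT"] = np.nan

# Assumption: n_neighbors=5 is the library default, not a project setting.
imputer = KNNImputer(n_neighbors=5)
df_imputed = pd.DataFrame(
    imputer.fit_transform(df), columns=df.columns, index=df.index
)

print(df_imputed.isna().sum())
```

Because `KNNImputer` finds neighbors by distance on the raw columns, features with large ranges dominate the neighbor search unless the data is scaled first.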
The features had different units and scales, which could negatively impact the clustering model. To standardize the features:
- StandardScaler: We applied standardization to ensure each feature had a mean of 0 and a standard deviation of 1, making them comparable.
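As a rough illustration of standardization, the sketch below scales two synthetic columns with very different ranges (a dollar amount and a frequency between 0 and 1). The data is invented for the example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.gamma(2.0, 800.0, size=500),  # dollar amounts in the thousands
    rng.uniform(0.0, 1.0, size=500),  # frequencies in [0, 1]
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# After scaling, each column has mean ~0 and standard deviation ~1.
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```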
Certain features, such as CASH_ADVANCE and BALANCE, were right-skewed. We applied a logarithmic transformation to reduce the skewness and improve model performance.
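The effect of a log transform on skewness can be sketched as follows. The data is synthetic, and the use of `np.log1p` (that is, log(1 + x)) is an assumption made here so that zero values stay defined; the project only states that a logarithmic transformation was used. The sketch applies the transform to raw non-negative values, since a logarithm is undefined for the negative values that standardization produces.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed columns; values are made up.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "CASH_ADVANCE": rng.lognormal(6.0, 1.5, size=1000),
    "BALANCE": rng.lognormal(7.0, 1.2, size=1000),
})
# Many customers never take a cash advance, so include some zeros.
df.loc[:99, "CASH_ADVANCE"] = 0.0

skew_before = df.skew()

# Assumption: log1p is used so that zeros map to zero instead of -inf.
df_log = np.log1p(df)
skew_after = df_log.skew()

print(pd.DataFrame({"before": skew_before, "after": skew_after}))
```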
With 18 features, there was a need to reduce dimensionality for better visualization and model performance. We applied Principal Component Analysis (PCA):
- Explained Variance: We selected the number of components based on the explained variance ratio. The first 3 components explained around 51.33% of the variance.
- Scree Plot: The scree plot helped us decide on the optimal number of components.
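The explained-variance check behind a scree plot can be sketched as below. The matrix is synthetic (18 columns to match the feature count quoted above), so the printed ratios will not reproduce the 51.33% figure.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic correlated data: a few latent factors drive 18 observed features.
rng = np.random.default_rng(0)
n_samples, n_features = 1000, 18
latent = rng.normal(size=(n_samples, 3))
loadings = rng.normal(size=(3, n_features))
X = latent @ loadings + rng.normal(size=(n_samples, n_features))
X_scaled = StandardScaler().fit_transform(X)

# Fit with all components to inspect the full variance profile.
pca = PCA()
pca.fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# These per-component ratios are what a scree plot displays.
for i, (ratio, total) in enumerate(
    zip(pca.explained_variance_ratio_, cumulative), start=1
):
    print(f"PC{i}: {ratio:.3f} (cumulative {total:.3f})")

# Keep the first three components, as in the text.
X_reduced = PCA(n_components=3).fit_transform(X_scaled)
```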
To enhance the model's ability to segment customers, we created derived features:
- Balance to Credit Ratio: Indicates how much of the credit limit is being utilized.
- Purchase to Payment Ratio: Provides insights into customer repayment behavior.
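One plausible way to compute these ratios is sketched below. The exact formulas, the new column names, and the choice to map a zero denominator to 0 are all assumptions made for illustration; the project does not spell them out.

```python
import numpy as np
import pandas as pd

# Tiny hand-made example; values are illustrative only.
df = pd.DataFrame({
    "BALANCE": [500.0, 2500.0, 0.0],
    "CREDIT_LIMIT": [1000.0, 5000.0, 3000.0],
    "PURCHASES": [200.0, 0.0, 900.0],
    "PAYMENTS": [400.0, 0.0, 300.0],
})

def safe_ratio(numerator: pd.Series, denominator: pd.Series) -> pd.Series:
    """Divide two columns, returning 0 where the denominator is 0."""
    return (numerator / denominator.replace(0, np.nan)).fillna(0.0)

# Hypothetical column names for the derived features.
df["BALANCE_TO_CREDIT_RATIO"] = safe_ratio(df["BALANCE"], df["CREDIT_LIMIT"])
df["PURCHASE_TO_PAYMENT_RATIO"] = safe_ratio(df["PURCHASES"], df["PAYMENTS"])

print(df[["BALANCE_TO_CREDIT_RATIO", "PURCHASE_TO_PAYMENT_RATIO"]])
```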
To segment customers, we used K-Means Clustering, an unsupervised machine learning algorithm.
- Elbow Method: We used the elbow method to determine the optimal number of clusters by plotting the sum of squared distances (inertia) against the number of clusters. The elbow point was found at 7 clusters.
- Silhouette Score: The average silhouette score was used to evaluate how well-separated the clusters were. A silhouette score of 0.42 indicated reasonable separation between clusters, signifying that the clustering was meaningful.
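The two diagnostics above can be computed in a single loop, as in this sketch. It runs on synthetic blobs, so the elbow and silhouette values it prints will differ from the project's results; the range of k and the `n_init` setting are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data standing in for the PCA-reduced customer features.
X, _ = make_blobs(n_samples=600, centers=4, cluster_std=1.0, random_state=0)

inertias = {}
silhouettes = {}
for k in range(2, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels = model.fit_predict(X)
    inertias[k] = model.inertia_  # sum of squared distances to centroids
    silhouettes[k] = silhouette_score(X, labels)  # mean over all samples
    print(f"k={k}: inertia={inertias[k]:.1f}, silhouette={silhouettes[k]:.3f}")

# The elbow is read off the inertia curve by eye; the silhouette
# score gives a second, numeric opinion on the same question.
best_k = max(silhouettes, key=silhouettes.get)
print("k with highest silhouette:", best_k)
```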
We trained the K-Means model with k=7 clusters based on the elbow method. The clusters were assigned based on customer spending patterns across various features.
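A compact way to chain scaling, PCA, and K-Means with k=7 is a scikit-learn pipeline, sketched here on synthetic data. The feature names, the three-component PCA inside the pipeline, and the `n_init` and `random_state` settings are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic feature table; column names are hypothetical.
rng = np.random.default_rng(0)
features = pd.DataFrame(
    rng.gamma(2.0, 1.0, size=(700, 10)),
    columns=[f"FEATURE_{i}" for i in range(10)],
)

pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=3),
    KMeans(n_clusters=7, n_init=10, random_state=0),
)

# fit_predict runs every step and returns one cluster label per row.
labels = pipeline.fit_predict(features)
segmented = features.assign(CLUSTER=labels)

print(segmented["CLUSTER"].value_counts().sort_index())
```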
In addition to K-Means, we experimented with DBSCAN (Density-Based Spatial Clustering of Applications with Noise), a clustering algorithm suited to identifying clusters of arbitrary shape and size, as well as outliers. DBSCAN groups points based on the density of their neighborhood. It requires two key parameters: epsilon (the maximum distance between two points for them to be considered neighbors) and min_samples (the minimum number of points within that distance for a point to count as a core point of a cluster).
Through several trials of tuning epsilon and min_samples, we found that DBSCAN was useful for detecting dense regions and outliers. However, because the customer data is fairly evenly spread, it labeled many points as noise.
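The kind of parameter sweep described here can be sketched as follows. The data is synthetic and the `eps` and `min_samples` values are arbitrary, so the cluster counts and noise shares are purely illustrative.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic blobs, standardized so eps is on a comparable scale.
X, _ = make_blobs(n_samples=500, centers=3, cluster_std=0.5, random_state=0)
X = StandardScaler().fit_transform(X)

results = []
for eps in (0.05, 0.2, 0.5):
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    # DBSCAN marks noise points with the label -1.
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    noise_share = float(np.mean(labels == -1))
    results.append((eps, n_clusters, noise_share))
    print(f"eps={eps}: clusters={n_clusters}, noise={noise_share:.1%}")
```

With `min_samples` held fixed, a smaller `eps` can only increase the share of points labeled as noise, which is why tuning `eps` is usually the first lever to try.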
We also explored Agglomerative Clustering, a hierarchical clustering technique that builds clusters in a bottom-up manner. Each point starts as its own cluster, and pairs of clusters are merged based on proximity, using linkage criteria such as ward linkage, average linkage, or complete linkage.
The strength of Agglomerative Clustering lies in the ability to visualize clustering results using dendrograms, which allowed us to observe how data points merged into clusters at different distances. Though it is more computationally expensive than K-Means, Agglomerative Clustering provided insight into the hierarchical structure of the data and relationships between the clusters, offering a unique perspective for segmentation.
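A minimal sketch of the hierarchical approach is shown below, using SciPy to build the merge tree and scikit-learn to cut it into flat clusters. The data is synthetic, and Ward linkage with three clusters is chosen only for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Small synthetic dataset; hierarchical methods scale poorly with size.
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# Each row of Z records one merge: the two clusters joined,
# the distance at which they merged, and the new cluster's size.
Z = linkage(X, method="ward")

# no_plot=True returns the tree layout without drawing it;
# drop that argument (with matplotlib installed) to render the dendrogram.
tree = dendrogram(Z, no_plot=True, truncate_mode="lastp", p=12)

# Flat cluster labels from the same bottom-up procedure.
labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)

print("merges:", Z.shape[0], "leaves shown:", len(tree["ivl"]))
print("cluster sizes:", np.bincount(labels))
```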
The model was validated using the following techniques:
- Silhouette Score: Provided an average score of 0.42, showing reasonable separation between clusters.
- Cluster Visualization: We used 2D scatter plots of the principal components to visualize the clusters and their separation.
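A scatter plot of this kind could be produced roughly as follows. The data is synthetic, the plot styling is arbitrary, and the `Agg` backend is selected only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove for interactive use
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled customer features.
X, _ = make_blobs(n_samples=700, n_features=8, centers=7, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Project to two components for plotting, then cluster.
components = PCA(n_components=2).fit_transform(X_scaled)
labels = KMeans(n_clusters=7, n_init=10, random_state=0).fit_predict(components)

fig, ax = plt.subplots(figsize=(7, 5))
scatter = ax.scatter(
    components[:, 0], components[:, 1], c=labels, cmap="tab10", s=12
)
ax.set_xlabel("Principal component 1")
ax.set_ylabel("Principal component 2")
ax.set_title("K-Means clusters in PCA space (synthetic data)")
ax.legend(*scatter.legend_elements(), title="Cluster", loc="upper right")
# Call fig.savefig(...) or plt.show() to render the figure.
```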
The K-Means clustering algorithm segmented customers into 7 clusters, each representing distinct spending behaviors.
- These customers prefer frequent installment payments and rarely make large one-time purchases.
- Strategy: Offer personalized installment plans and loyalty programs to retain them.
- Customers who favor large one-time purchases and seldom use installment options.
- Strategy: Incentivize upfront payments through limited-time offers and discounts.
- A balanced approach between one-time and installment payments.
- Strategy: Educate customers on the benefits of installment options.
- Customers who frequently withdraw cash advances and prefer installment payments.
- Strategy: Introduce cash-back incentives for installment purchases to reduce cash dependency.
- A mixture of one-time and installment payments with moderate cash advances.
- Strategy: Offer flexible payment plans and product bundles.
- Heavy reliance on cash withdrawals with occasional installment purchases.
- Strategy: Promote financial literacy and responsible spending habits.
- Frequent installment users with a moderate preference for cash withdrawals.
- Strategy: Implement exclusive membership programs with targeted promotions.
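Segment descriptions like those above are typically read off per-cluster averages. The sketch below shows one way to build such a profile table; the data is a tiny hand-made example with two clusters, and the project does not state that this exact method was used.

```python
import pandas as pd

# Hand-made example: four customers already assigned to two clusters.
segmented = pd.DataFrame({
    "ONEOFF_PURCHASES": [50.0, 30.0, 900.0, 1100.0],
    "INSTALLMENTS_PURCHASES": [400.0, 600.0, 20.0, 0.0],
    "CASH_ADVANCE": [0.0, 0.0, 100.0, 300.0],
    "CLUSTER": [0, 0, 1, 1],
})

# Mean of every feature within each cluster.
profile = segmented.groupby("CLUSTER").mean()
# Number of customers per cluster.
sizes = segmented["CLUSTER"].value_counts().sort_index()

print(profile)
print(sizes)
```

In this toy table, cluster 0 leans on installments and cluster 1 on one-off purchases, which is the kind of contrast the segment descriptions summarize.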
- Python: For implementing the entire workflow, including data preprocessing and modeling.
- Pandas & NumPy: For data manipulation, feature engineering, and numerical operations.
- Scikit-Learn: For applying PCA, K-Means, and evaluation metrics like the silhouette score.
- Matplotlib & Seaborn: For visualizing the scree plot, elbow plot, and cluster separations.
- KNN Imputer: For imputing missing values in critical features.
- StandardScaler: For feature standardization.
- K-Means Clustering: For customer segmentation.
- Principal Component Analysis (PCA): For dimensionality reduction.
This project successfully segmented credit card customers based on their spending patterns, providing actionable insights for targeted marketing. The combination of PCA and K-Means Clustering helped in identifying distinct customer behaviors, enabling personalized marketing strategies. This segmentation can help financial institutions better serve their customers by aligning their products with customer needs and preferences.