Hello everyone, in this project, I'll develop a customer segmentation model using BigQuery and other GCP technologies like Google Cloud Storage, Vertex AI and Google Looker Studio.
Customer segmentation is the process of categorizing a business's customers into groups with similar characteristics. These characteristics may include demographic information, shopping habits, or regional location. Through the use of data science tools, this analysis allows businesses to create targeted marketing strategies and personalize services for specific customer groups. This, in turn, can enhance customer satisfaction and provide a competitive advantage for the business.
Let's start.
First of all, we should create a Google Cloud Storage bucket to upload and store the dataset. There are different kinds of storage classes in GCP, but we will use the Standard class because of its pricing. After that, we upload the data from our local PC to the Cloud Storage bucket. You can find the open-source dataset I used in this project here: https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis/data
Our data is now in Google Cloud Storage; in the next steps we will pull it from there into BigQuery and Vertex AI Workbench.
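If you prefer to do this step in code instead of the Cloud Console, here is a minimal sketch using the google-cloud-storage Python client (the project ID, bucket name, and file name below are placeholders, not the ones from my project):

```python
from google.cloud import storage

# Placeholder project ID; assumes credentials are already configured
client = storage.Client(project="your-project-id")

# The Standard storage class is the default, which is what we want here
bucket = client.create_bucket("customer-segmentation-bucket", location="us-central1")

# Upload the Kaggle CSV from the local PC to the bucket
blob = bucket.blob("marketing_campaign.csv")
blob.upload_from_filename("marketing_campaign.csv")
```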
In this step we pull the data from the bucket into BigQuery. We load it into BigQuery because, when we work in the Workbench notebook, we'll be querying BigQuery rather than reading from the bucket directly.
We have loaded the data into BigQuery from the bucket. We could create a K-Means clustering model directly in BigQuery at this point. However, to show how to use BigQuery and SQL commands on Vertex AI Workbench, and because we want to do the data preprocessing steps in Python, we will perform all these operations in the Workbench notebook rather than in the BigQuery console. Only the model-building phase will be done with BigQuery.
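For reference, this load job can also be started from Python. The sketch below uses hypothetical dataset and table names and assumes a comma-separated file (the Kaggle export is actually tab-separated, so adjust the delimiter if needed):

```python
from google.cloud import bigquery

bq = bigquery.Client(project="your-project-id")

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,     # skip the header row
    autodetect=True,         # let BigQuery infer the schema
    # field_delimiter="\t",  # uncomment if your file is tab-separated
)

uri = "gs://customer-segmentation-bucket/marketing_campaign.csv"
load_job = bq.load_table_from_uri(
    uri, "your-project-id.customer_data.marketing_campaign", job_config=job_config
)
load_job.result()  # wait for the load to finish
```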
STEP 3: Instance Creation, Loading and Describing the Data from BigQuery in the Workbench Notebook, and Data Preprocessing
This dataset contains data on the demographic and spending habits of a grocery chain's customers.
Yes, now that we have the necessary information about the dataset, we can move on to the next step.
We will use Vertex AI Workbench for the computations because of the high computing capacity Google offers. Now we need to create an instance to work on. Since our dataset is not very big, we don't need much processing power, so an instance with 2 vCPUs and 8 GB RAM is enough for us.
Yes, we created an instance with 2 vCPUs and 8 GB RAM; now let's create a notebook in it.
Now we can start writing code.
First of all we need to import the necessary libraries.
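These are the kinds of libraries the rest of the notebook relies on; the plotting libraries are my assumption about a typical setup:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from google.cloud import bigquery
```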
Then we create a BigQuery client object to pull the data into the Workbench environment; just as in the BigQuery console, we query our data with SQL commands and bring it into a dataframe. Let's look at the data.
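A minimal sketch of that step, reusing the placeholder project, dataset, and table names from earlier:

```python
client = bigquery.Client(project="your-project-id")

sql = """
SELECT *
FROM `your-project-id.customer_data.marketing_campaign`
"""
df = client.query(sql).to_dataframe()

df.head()
```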
We have about 2,200 rows and 29 features. Let's take a look at our numeric variables.
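In pandas this is a one-liner:

```python
# Summary statistics (count, mean, std, quartiles) for the numeric columns
df.describe()
```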
At first glance, it seems that there are outliers in some columns of the data set, and the columns "Z_CostContact" and "Z_Revenue" have a standard deviation of 0 (zero), which means that all values in these columns are equal.
Now we write a simple function to inspect the missing values; it returns the number and percentage of missing values per column.
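My exact function isn't reproduced here, but a sketch of such a helper could look like this:

```python
def missing_values(data):
    """Return the count and percentage of missing values per column."""
    counts = data.isnull().sum()
    percents = 100 * counts / len(data)
    report = pd.DataFrame(
        {"missing_count": counts, "missing_percent": percents.round(2)}
    )
    # Only show columns that actually have missing values
    return report[report["missing_count"] > 0]

missing_values(df)
```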
Let's see the results.
The 24 missing values in the Income column make up only 1.07% of the data. This is a small percentage (not 50%, for example), so we don't need to get rid of the Income column; we can replace the missing data with the column's mean. However, I won't fill the missing values in the "Income" column with the mean just yet, because we haven't identified and removed the outliers from our dataset. We need to do that first before we can determine whether the values in the "Income" column are symmetrically distributed. After removing the outliers, we will fill in the missing data with the mean. We will also drop the columns "Z_CostContact" and "Z_Revenue": since their standard deviation is 0, they carry no information for our model.
After that, we should check for outliers. There are different methods to identify outliers; in this project I will use the IQR method. I have written a simple function to inspect them, which takes the Q1 and Q3 quantile values as parameters.
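A sketch of such a function, with the quantile values passed in as parameters:

```python
def find_outliers_iqr(data, column, q1=0.25, q3=0.75):
    """Return the rows whose values fall outside the IQR fences for one column."""
    Q1 = data[column].quantile(q1)
    Q3 = data[column].quantile(q3)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    return data[(data[column] < lower) | (data[column] > upper)]

# Widen the acceptable range by using the 0.10 and 0.90 quantiles
find_outliers_iqr(df, "Income", q1=0.10, q3=0.90)
```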
As seen above, we set the lower and upper quantiles to 0.10 and 0.90 (instead of the usual 0.25 and 0.75) to widen the range of acceptable values. Even so, some outliers remain, and we will drop them.
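Chaining the cleanup steps together might look like this, reusing the hypothetical helper above (remember from earlier that the mean imputation comes after the outliers are gone):

```python
# Drop the remaining outlier rows in Income
outlier_idx = find_outliers_iqr(df, "Income", q1=0.10, q3=0.90).index
df = df.drop(outlier_idx)

# Now it is safe to fill the missing Income values with the column mean
df["Income"] = df["Income"].fillna(df["Income"].mean())

# Drop the zero-variance columns identified earlier
df = df.drop(columns=["Z_CostContact", "Z_Revenue"])
```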
After that, we save a summary of the clean data to a CSV file to review it again later.
You can see the files you have saved in the "file browser" on the left side.
Now let us look at the distribution of the "Income" column.
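For example, with seaborn (my assumption for the plotting library):

```python
plt.figure(figsize=(8, 4))
sns.histplot(df["Income"], kde=True)  # histogram with a density curve overlaid
plt.title("Distribution of Income")
plt.show()
```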
The result:
The Income column shows a distribution similar to the normal distribution, it looks good.
First, let's take a look at how many different values each of the variables in the dataset contains.
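In pandas this is a one-liner:

```python
# Number of distinct values per column, largest first
df.nunique().sort_values(ascending=False)
```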
The results:
As you can see above, some categorical features contain many distinct values, too many to encode directly, so we need to handle them. We will do this by creating some new features, as sketched below.
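The exact features I engineered are shown in the screenshots; as an illustration, derived features for this dataset often look something like the sketch below (column names assume the standard Kaggle schema, and the groupings are examples, not necessarily mine):

```python
# Age from year of birth (replace 2024 with your reference year)
df["Age"] = 2024 - df["Year_Birth"]

# Total number of children at home
df["Children"] = df["Kidhome"] + df["Teenhome"]

# Total spend across all product categories
df["TotalSpent"] = (
    df["MntWines"] + df["MntFruits"] + df["MntMeatProducts"]
    + df["MntFishProducts"] + df["MntSweetProducts"] + df["MntGoldProds"]
)

# Collapse the many marital-status labels into two broader groups
df["LivingWith"] = df["Marital_Status"].replace({
    "Married": "Partner", "Together": "Partner",
    "Single": "Alone", "Divorced": "Alone", "Widow": "Alone",
    "Absurd": "Alone", "YOLO": "Alone",
})
```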
We have created some new features, let's take a look.
After that, we save a summary of the new, clean data to a CSV file to review it again and upload it to Cloud Storage.
In this step we will build a clustering model to segment customers. We will build two different segmentation models: the first will be product-preferences-based and the second transaction-based.
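A sketch of that round trip, with placeholder names again:

```python
# Save a summary of the cleaned, feature-engineered data locally
df.describe().to_csv("new_clean_data_summary.csv")

# Upload the summary back to the Cloud Storage bucket
from google.cloud import storage
storage.Client().bucket("customer-segmentation-bucket") \
    .blob("new_clean_data_summary.csv") \
    .upload_from_filename("new_clean_data_summary.csv")
```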
First, we copy the dataframe to be used in the K-Means model.
Now let's choose our features for the product-preferences-based model.
Then we standardize the product-preferences-based dataset for use in our model.
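Assuming the copy from the previous step is called df_kmeans and the product-preference features are the spend-per-category columns (my assumption), this could look like:

```python
from sklearn.preprocessing import StandardScaler

product_features = [
    "MntWines", "MntFruits", "MntMeatProducts",
    "MntFishProducts", "MntSweetProducts", "MntGoldProds",
]
df_products = df_kmeans[product_features].copy()

# Scale each feature to zero mean and unit variance
scaler = StandardScaler()
df_products_scaled = pd.DataFrame(
    scaler.fit_transform(df_products), columns=product_features
)
```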
Now we can build the K-Means model. In this section, we can either build our model directly in BigQuery (Figure 1), or use the same BigQuery code through a BigQuery client and build our model on Vertex AI Workbench (Figure 2).
Figure 1:
Figure 2:
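The statement in both figures is a BigQuery ML CREATE MODEL query; here is a sketch with placeholder project, dataset, model, and table names (run it in the console, or through the client as below):

```python
create_model_sql = """
CREATE OR REPLACE MODEL `your-project-id.customer_data.product_kmeans`
OPTIONS(
  model_type = 'kmeans',
  num_clusters = 3
) AS
SELECT MntWines, MntFruits, MntMeatProducts,
       MntFishProducts, MntSweetProducts, MntGoldProds
FROM `your-project-id.customer_data.clean_customers`
"""
client.query(create_model_sql).result()  # wait for training to finish
```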
At this point, you may ask why we divided our dataset into 3 segments. The reason is that I first determined the optimal number of clusters for this dataset using the Elbow method, which is why we set the cluster parameter to 3.
You can use whichever approach you prefer; I will continue on Workbench.
Yes, we have created the model. Now let's see the segmentation result for every single customer. For this, we write BigQuery SQL commands again and predict the results. You can run these commands either on Workbench (Figure 3) or in the BigQuery interface (Figure 4).
Figure 3:
Figure 4:
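The prediction query uses ML.PREDICT; for a k-means model it returns a centroid_id column alongside the input columns. A sketch via the client, with the same placeholder names:

```python
predict_sql = """
SELECT * EXCEPT(nearest_centroids_distance)
FROM ML.PREDICT(
  MODEL `your-project-id.customer_data.product_kmeans`,
  (SELECT * FROM `your-project-id.customer_data.clean_customers`)
)
"""
segments = client.query(predict_sql).to_dataframe()
segments.head()
```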
And we save the results to a CSV file so we can review and import them later. Then let's look at the top 5 rows: how many customers are there in each of these categories?
As you can see, the centroid_id column is the segmentation result for each customer. But a raw number per customer doesn't read nicely, so let's replace the numbers with letters.
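For example, assuming the 3 clusters are numbered 1 to 3:

```python
# Map numeric centroid IDs to letter labels
segments["Segment"] = segments["centroid_id"].map({1: "A", 2: "B", 3: "C"})
segments.head()
```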
Yes, we have created the model, and now we know the segment of every single customer.
We will now analyze our results to understand them better and present them in a clearer format, such as a presentation, to our team or wherever needed.
Let's see how many customers we have in each category.
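For instance, continuing the sketch from above:

```python
# Number of customers per segment, as a bar chart
segments["Segment"].value_counts().plot(kind="bar", title="Customers per segment")
plt.show()
```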
Next, let's see which products the customers in each category preferred more in total.
We should remember that the more crowded a cluster is, the higher its total consumption is likely to be, so let's look at which products each segment prefers more on average.
Finally, let's take a look at the consumption of gold and wine by clients and see how they differ from each other.
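A sketch of these last two analyses, reusing the hypothetical product_features list and Segment column from earlier:

```python
# Average spend per product category within each segment
avg_by_segment = segments.groupby("Segment")[product_features].mean().round(1)
print(avg_by_segment)

# Wine vs. gold consumption, colored by segment
sns.scatterplot(data=segments, x="MntWines", y="MntGoldProds", hue="Segment")
plt.title("Wine vs. gold spend by segment")
plt.show()
```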
We can identify the differences between clusters. At this stage, we now know which customers belong to which cluster and are aware of the product preferences within these clusters. From this point on, we should develop specialized sales, advertising, and marketing activities for these segments. For instance, for a segment with high alcohol and meat consumption, we can channel consumption through techniques such as advertisements, coupons, and other promotional strategies tailored to these products. It is essential to consider the characteristics of each segment and design targeted marketing solutions accordingly.