
Customer Segmentation using BigQuery and Vertex AI Workbench on Google Cloud

Hello everyone! In this project, I'll develop a customer segmentation model using BigQuery and other GCP technologies such as Google Cloud Storage, Vertex AI, and Google Looker Studio.

Customer segmentation is the process of categorizing a business's customers into groups with similar characteristics. These characteristics may include demographic information, shopping habits, or regional location. Through the use of data science tools, this analysis allows businesses to create targeted marketing strategies and personalize services for specific customer groups. This, in turn, can enhance customer satisfaction and provide a competitive advantage for the business.

Let's start.

Customer-Segmentation

STEP 1: Upload the Data to Google Cloud Storage

First of all, we need to create a Google Cloud Storage bucket to upload and store the dataset. GCP offers several storage classes, but we'll use the Standard class because of its pricing. After that, we upload the data from our local PC to Google Cloud Storage. You can find the open-source dataset I used in this project here: https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis/data
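The console upload shown in the screenshot below works fine; if you'd rather script this step, a minimal sketch with the google-cloud-storage client could look like the following (the project, bucket, and file names here are placeholders, not the ones used in this project):

```python
# Minimal sketch: upload the dataset CSV to a bucket with the
# google-cloud-storage client. All names here are placeholders.
from google.cloud import storage

client = storage.Client(project="your-gcp-project")   # uses default credentials
bucket = client.bucket("your-segmentation-bucket")    # existing Standard-class bucket
blob = bucket.blob("marketing_campaign.csv")          # object name in the bucket
blob.upload_from_filename("marketing_campaign.csv")   # local file path
print(f"Uploaded to gs://{bucket.name}/{blob.name}")
```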

Screenshot 2023-11-30 173507

Our data is now in Google Cloud Storage; in the next steps we will pull it from here into BigQuery and Vertex AI Workbench.

STEP 2: Load the Data from the Bucket to BigQuery

In this step we load the data from the bucket into BigQuery. We do this because, when we work in the Workbench notebook, we'll be querying BigQuery rather than reading from the bucket.
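If you prefer to do this step in code rather than in the console, a sketch with the BigQuery Python client might look like this (dataset, table, and bucket names are placeholders, and I'm assuming a comma-separated file with a header row):

```python
# Minimal sketch: load the CSV from the bucket into a BigQuery table.
# Project, dataset, table, and bucket names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="your-gcp-project")
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # skip the header row
    autodetect=True,       # let BigQuery infer the schema
)
load_job = client.load_table_from_uri(
    "gs://your-segmentation-bucket/marketing_campaign.csv",
    "your-gcp-project.customer_data.marketing_campaign",
    job_config=job_config,
)
load_job.result()  # block until the load job finishes
```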

Screenshot 2023-11-30 173949

We've loaded the data into BigQuery from the bucket. At this point we could create a K-Means clustering model directly in BigQuery. However, to show how to use BigQuery and SQL commands on Vertex AI Workbench, and because we want to do the data preprocessing in Python, we will perform all these operations in the Workbench notebook rather than in the BigQuery console. Only the model-building phase will be done with BigQuery.

STEP 3: Create an Instance, Load and Describe the Data from BigQuery in the Workbench Notebook, and Preprocess the Data

A Summary of The Dataset:

This dataset contains information on the demographics and spending habits of a grocery chain's customers.

Description of Variables:

Variables About Customers

Screenshot 2023-11-30 211504

Variables About Products and Promotions

Screenshot 2023-11-30 211519

Other Variables

Screenshot 2023-11-30 211527

Yes, now that we have the necessary information about the dataset, we can move on to the next step.

We will use Vertex AI Workbench for the computations because of the compute capacity Google offers. First we need to create an instance to work on. Since our dataset is not very large, we don't need much processing power, so an instance with 2 vCPUs and 8 GB of RAM is enough for us.

Screenshot 2023-11-30 173818

We've created an instance with 2 vCPUs and 8 GB of RAM; now let's create a notebook inside it.

Screenshot 2023-11-30 195414

Now we can start writing code.

Starting to Code...

First of all we need to import the necessary libraries.

Screenshot 2023-11-30 195639
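In case the screenshot is hard to read, a typical import cell for this kind of workflow might look like the following (an assumed list of the usual suspects, not a transcription of the screenshot):

```python
# Typical imports for this workflow (an assumption, not the exact list used here).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from google.cloud import bigquery
```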

Then we create a BigQuery client object to pull the data into the Workbench environment. Just as in the BigQuery console, we query the data with SQL commands and load the result into a DataFrame. Let's look at the data.

Screenshot 2023-11-30 195701
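The pattern is roughly this (the project, dataset, and table names are placeholders):

```python
# Sketch: query the BigQuery table into a pandas DataFrame.
from google.cloud import bigquery

client = bigquery.Client(project="your-gcp-project")
query = """
    SELECT *
    FROM `your-gcp-project.customer_data.marketing_campaign`
"""
df = client.query(query).to_dataframe()
df.head()
```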

We have about 2,200 rows and 29 features. Let's take a look at our numeric variables.

Screenshot 2023-11-30 200301 Screenshot 2023-11-30 200332

At first glance, it seems that there are outliers in some columns of the dataset, and the columns "Z_CostContact" and "Z_Revenue" have a standard deviation of 0 (zero), which means that all values in these columns are equal.

Now we write a simple function to inspect the missing values; it returns the number of missing values and the percentage of missing values for each column.

Screenshot 2023-11-30 200623
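A function along these lines does the job (a sketch; the exact implementation is in the screenshot above):

```python
import pandas as pd

def missing_values_table(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    n_missing = df.isnull().sum()
    pct_missing = (100 * n_missing / len(df)).round(2)
    table = pd.DataFrame({"n_missing": n_missing, "pct_missing": pct_missing})
    # Keep only the columns that actually have missing values
    return table[table["n_missing"] > 0].sort_values("n_missing", ascending=False)

missing_values_table(df)  # df from the query sketch above
```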

Let's see the results.

Screenshot 2023-11-30 200638 Screenshot 2023-11-30 200707

The 24 missing values in the Income column make up only 1.07% of the data. That's a small share (not 50%, for example), so we don't need to get rid of the Income column; we can impute the missing entries instead. However, I won't fill them with the column mean just yet: we haven't identified and removed the outliers, and we need to do that first before we can judge whether the Income values are symmetrically distributed. After removing the outliers, we will fill in the missing data with the mean. We will also drop the "Z_CostContact" and "Z_Revenue" columns; since their standard deviation is 0, they carry no information for our model.

After that, we should check for outliers. There are several methods to identify them; in this project I'll use the IQR method. I've written a simple function that reports the outliers; it takes the quantile levels Q1 and Q3 as parameters.

Screenshot 2023-11-30 201502
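As a sketch, an IQR-based helper could look like this (the screenshot above shows the actual implementation; the quantile levels are passed in as q1 and q3):

```python
def outlier_thresholds(df, column, q1=0.10, q3=0.90):
    """Return IQR-based lower/upper bounds for a numeric column."""
    quartile1 = df[column].quantile(q1)
    quartile3 = df[column].quantile(q3)
    iqr = quartile3 - quartile1
    return quartile1 - 1.5 * iqr, quartile3 + 1.5 * iqr

def count_outliers(df, column, q1=0.10, q3=0.90):
    """Print how many values fall outside the IQR bounds."""
    low, up = outlier_thresholds(df, column, q1, q3)
    mask = (df[column] < low) | (df[column] > up)
    print(f"{column}: {mask.sum()} outliers outside [{low:.2f}, {up:.2f}]")

for col in ["Income", "Year_Birth"]:  # example numeric columns
    count_outliers(df, col)
```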

As seen above, we set the lower and upper quantiles (Q1 and Q3) to 0.10 and 0.90 rather than the usual 0.25 and 0.75, which widens the accepted range so that only the most extreme values count as outliers. Even so, we still have some outliers, and we will drop them.

Screenshot 2023-11-30 205749

After that, we save a summary of the cleaned data to a CSV file so we can review it again.

Screenshot 2023-11-30 215049

You can see the files you have saved in the "file browser" on the left side.

Screenshot 2023-11-30 215323

Now let us look at the distribution of the "Income" column.

Screenshot 2023-11-30 215152
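A histogram with a density curve is enough here; for example:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Sketch: histogram of Income with a kernel density estimate overlaid.
sns.histplot(df["Income"], kde=True)
plt.title("Distribution of Income")
plt.show()
```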

The result:

Screenshot 2023-11-30 215618

The Income column shows a distribution close to the normal distribution, which looks good; this supports our plan to fill its missing values with the mean.

Feature Engineering:

First, let's take a look at how many different values each of the variables in the dataset contains.

Screenshot 2023-11-30 220051
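In pandas this is essentially a one-liner:

```python
# Number of distinct values per column, largest first
df.nunique().sort_values(ascending=False)
```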

The results:

Screenshot 2023-11-30 220110

Screenshot 2023-11-30 220143

As you can see above, some categorical features take a lot of different values, which is too many to encode directly, so we need to handle them. We will now create some new features.

Screenshot 2023-12-05 152929 Screenshot 2023-12-05 152947
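If the screenshots are hard to read, derived features along these lines are typical for this dataset (an illustration of the idea, not necessarily the exact features created here):

```python
# Sketch of typical derived features for this dataset; the exact
# features used in the project are shown in the screenshots above.
df["Age"] = 2023 - df["Year_Birth"]                      # customer age
df["Children"] = df["Kidhome"] + df["Teenhome"]          # kids + teens at home
df["Total_Spent"] = (df["MntWines"] + df["MntFruits"] + df["MntMeatProducts"]
                     + df["MntFishProducts"] + df["MntSweetProducts"]
                     + df["MntGoldProds"])               # total spend across products
df["Total_Purchases"] = (df["NumWebPurchases"] + df["NumCatalogPurchases"]
                         + df["NumStorePurchases"])      # purchases across channels
```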

We have created some new features; let's take a look.

Screenshot 2023-12-05 153047

After that, we save the new cleaned data to a CSV file to review it again, and upload it to Cloud Storage.

Screenshot 2023-12-05 153028
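A sketch of that save-and-upload step (bucket and file names are placeholders):

```python
from google.cloud import storage

# Save the cleaned data locally, then push it to the bucket.
df.to_csv("clean_customer_data.csv", index=False)

client = storage.Client(project="your-gcp-project")
bucket = client.bucket("your-segmentation-bucket")
bucket.blob("clean_customer_data.csv").upload_from_filename("clean_customer_data.csv")
```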

STEP 4: Build a K-Means Clustering Model using BigQuery on Workbench

In this step we will build a clustering model to segment customers. We will build two different segmentation models: the first will be based on product preferences and the second on transactions.

First, we copy the dataframe to be used in the K-Means model.

Screenshot 2023-12-05 153400

Now let's choose our features for the product-preferences-based model.

Screenshot 2023-12-05 153644

Then we standardize the product-preferences dataset for use in our model.

Screenshot 2023-12-05 154446
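Together, the two steps above might look like this (assuming the six spending columns are the features, which matches the product-preference framing):

```python
from sklearn.preprocessing import StandardScaler

# Product-preference features: the six spending columns.
mnt_cols = ["MntWines", "MntFruits", "MntMeatProducts",
            "MntFishProducts", "MntSweetProducts", "MntGoldProds"]

df_products = df[mnt_cols].copy()
# Standardize so each feature has mean 0 and unit variance.
df_products[mnt_cols] = StandardScaler().fit_transform(df_products[mnt_cols])
```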

Now we can build the K-Means model. In this section, we can either build the model directly in BigQuery (Figure 1) or run the same BigQuery SQL through a BigQuery client on Vertex AI Workbench (Figure 2).

Figure 1:

Screenshot 2023-12-05 160228

Figure 2:

Screenshot 2023-12-05 154534
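The BigQuery ML statement behind both figures is the same; a sketch (dataset, table, and model names are placeholders):

```python
# Sketch: create a K-Means model with BigQuery ML from the Workbench notebook.
# Dataset, table, and model names are placeholders.
create_model_sql = """
CREATE OR REPLACE MODEL `customer_data.product_segments`
OPTIONS (
    model_type = 'KMEANS',
    num_clusters = 3      -- chosen with the elbow method (see below)
) AS
SELECT
    MntWines, MntFruits, MntMeatProducts,
    MntFishProducts, MntSweetProducts, MntGoldProds
FROM `customer_data.clean_customers`
"""
client.query(create_model_sql).result()
```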

At this point, you may ask why we divided our dataset into 3 segments. Before building the final model, I determined the optimal number of clusters for this dataset with the elbow method, which pointed to 3, so that is what we set the cluster parameter to.

Screenshot 2023-12-05 171436 Screenshot 2023-12-05 171453
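One common way to produce that elbow plot is with scikit-learn, along these lines (the screenshots may use a different implementation):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Fit K-Means for k = 1..10 and record the inertia (within-cluster SSE).
inertias = []
ks = range(1, 11)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(df_products)  # standardized features from the earlier sketch
    inertias.append(km.inertia_)

plt.plot(ks, inertias, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Inertia")
plt.title("Elbow method")
plt.show()
```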

You can use whichever approach you prefer; I will continue on Workbench.

Yes, we have created the model. Now let's see the segmentation result for every single customer. For this, we write BigQuery SQL commands again and predict the results. You can run the commands either on Workbench (Figure 3) or in the BigQuery interface (Figure 4).

Figure 3:

Screenshot 2023-12-05 154553

Figure 4:

Screenshot 2023-12-05 161326
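The prediction query is an ML.PREDICT call over the same feature columns; roughly (placeholder names again):

```python
# Sketch: assign each customer to its nearest centroid with ML.PREDICT.
predict_sql = """
SELECT *
FROM ML.PREDICT(
    MODEL `customer_data.product_segments`,
    (SELECT MntWines, MntFruits, MntMeatProducts,
            MntFishProducts, MntSweetProducts, MntGoldProds
     FROM `customer_data.clean_customers`)
)
"""
segments = client.query(predict_sql).to_dataframe()
segments.head()
```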

We save the results to a CSV file so we can review and re-import them later. Then let's look at the first 5 rows and see how many customers there are in each of these categories.

Screenshot 2023-12-05 160835

As you can see, the centroid_id column holds the segmentation result for each customer. But plain numbers for each customer don't read well, so let's replace the numbers with letters.

Screenshot 2023-12-05 161608
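A simple mapping does it; for example (BigQuery ML returns the prediction column as CENTROID_ID):

```python
# Sketch: replace numeric centroid ids with letters for readability.
segments["segment"] = segments["CENTROID_ID"].map({1: "A", 2: "B", 3: "C"})
```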

Yes, we have created the model, and now we know the segment for every single customer.

We will now analyze our results to understand them better and present them in a clearer format, such as a presentation for our team.

Let's see how many customers we have in each category.

Screenshot 2023-12-05 162426

Next, let's see which products the customers in each category preferred most in total.

Screenshot 2023-12-05 162450

We should remember that the more crowded a cluster is, the higher its total consumption is likely to be, so let's look at which products each segment prefers more on average.

Screenshot 2023-12-05 162708
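These three views are straightforward pandas aggregations over the prediction output; for example:

```python
# Sketch: the three summary views above as pandas aggregations.
mnt_cols = ["MntWines", "MntFruits", "MntMeatProducts",
            "MntFishProducts", "MntSweetProducts", "MntGoldProds"]

segments["segment"].value_counts()                      # customers per segment
segments.groupby("segment")[mnt_cols].sum()             # total spend per product, per segment
segments.groupby("segment")[mnt_cols].mean().round(1)   # average spend per product, per segment
```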

Finally, let's take a look at the consumption of gold and wine by clients and see how they differ from each other.

Screenshot 2023-12-05 163004

We can identify the differences between clusters. At this stage, we now know which customers belong to which cluster and are aware of the product preferences within these clusters. From this point on, we should develop specialized sales, advertising, and marketing activities for these segments. For instance, for a segment with high alcohol and meat consumption, we can channel consumption through techniques such as advertisements, coupons, and other promotional strategies tailored to these products. It is essential to consider the characteristics of each segment and design targeted marketing solutions accordingly.
