Batch Pipeline On Docker To Easily Know Customer Purchasing Behaviors

Business Case

Our customers (subscribers) want to build the skills to deploy simple, viable batch pipelines entirely on Docker, using the following relational and NoSQL databases:

  • Cassandra
  • MySQL
  • Redis

Results

I engineered 3 batch data processing pipelines with PySpark, with the databases running entirely on Docker.

I ingested, pre-processed and visualized the data in these databases to validate their successful deployment.

I also analyzed customer purchasing behavior.

Deployment

I plan to write a blog post soon about how to deploy these 3 batch pipelines on Docker. Stay tuned!

Data

I chose the "eCommerce behavior data from multi category store" dataset, available on Kaggle, so that I could focus on implementing the 3 batch pipelines.

Real business data would require more pre-processing than the transformations I performed on this dataset.

Properties of data

The data file contains customer behavior data from a large multi-category online store's website for 1 month (November 2019).

Each row in the file represents an event.

  • All events are related to products and users

  • There are 3 types of events: view, cart and purchase

The 2 purchase funnels are:

  • view → cart → purchase
  • view → purchase
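The two funnels above can be sketched in plain Python as a check on each session's ordered events. This is only an illustration, not the PySpark code used in the pipelines: the sample events, the session IDs and the function names are all hypothetical, and it assumes the events of a session are already sorted by time.

```python
from collections import defaultdict

# Hypothetical sample events as (session_id, event_type), assumed to be in time order.
events = [
    ("s1", "view"), ("s1", "cart"), ("s1", "purchase"),
    ("s2", "view"), ("s2", "purchase"),
    ("s3", "view"), ("s3", "cart"),
    ("s4", "view"),
]


def contains_in_order(event_types, steps):
    """Return True if `steps` occur in `event_types` in this order (gaps allowed)."""
    remaining = iter(event_types)
    # `step in remaining` advances the iterator, so each step must come after the previous one.
    return all(step in remaining for step in steps)


def classify_funnel(event_types):
    """Label one session with the funnel it completed, checking the longer funnel first."""
    if contains_in_order(event_types, ["view", "cart", "purchase"]):
        return "view_cart_purchase"
    if contains_in_order(event_types, ["view", "purchase"]):
        return "view_purchase"
    return "none"


def funnels_by_session(events):
    """Group events by session and classify each session."""
    sessions = defaultdict(list)
    for session_id, event_type in events:
        sessions[session_id].append(event_type)
    return {sid: classify_funnel(types) for sid, types in sessions.items()}


result = funnels_by_session(events)
```

With the sample events, `s1` completes the longer funnel, `s2` the shorter one, and `s3` and `s4` end without a purchase.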

Here's the distribution of events in the data:

[Figure: Event Types]

Batch Pipelines on Docker

Implementation

[Figure: Batch Pipelines Implementation]

Storage

Cassandra

[Figure: Cassandra]

MySQL

[Figure: MySQL]

Redis

[Figure: Redis]

Analysis

I performed the following analyses on the pre-processed (transformed) data in storage:

  • Views by category

[Figure: Views By Category]

  • Purchase category vs Volume

[Figure: Purchase category vs Volume]

  • Top 20 brands purchased

[Figure: Top 20 Brands Purchased]

  • Purchase conversion volume

[Figure: Purchase Conversion Volume]
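As a rough illustration of the first three analyses, the counts can be sketched in plain Python with `collections.Counter`. This is not the PySpark code behind the charts: the sample rows are made up, the field names `event_type`, `category_code` and `brand` are assumptions about the dataset's columns, and taking the part of `category_code` before the first dot as the category is also an assumption.

```python
from collections import Counter

# Hypothetical sample rows; the field names are assumed, not taken from the pipelines.
rows = [
    {"event_type": "view", "category_code": "electronics.smartphone", "brand": "samsung"},
    {"event_type": "view", "category_code": "electronics.smartphone", "brand": "apple"},
    {"event_type": "view", "category_code": "appliances.kitchen.kettle", "brand": "bosch"},
    {"event_type": "cart", "category_code": "electronics.smartphone", "brand": "apple"},
    {"event_type": "purchase", "category_code": "electronics.smartphone", "brand": "apple"},
    {"event_type": "purchase", "category_code": "appliances.kitchen.kettle", "brand": "bosch"},
    {"event_type": "purchase", "category_code": "electronics.smartphone", "brand": "apple"},
]


def top_level_category(row):
    """Assumed convention: the category is the text before the first dot."""
    return row["category_code"].split(".")[0]


def count_by(rows, event_type, key):
    """Count rows of one event type, grouped by the value that `key` extracts."""
    return Counter(key(row) for row in rows if row["event_type"] == event_type)


views_by_category = count_by(rows, "view", top_level_category)
purchases_by_category = count_by(rows, "purchase", top_level_category)
# most_common(20) keeps at most the 20 most purchased brands, highest count first.
top_brands = count_by(rows, "purchase", lambda row: row["brand"]).most_common(20)
```

On a full month of events the same group-and-count logic would normally run as a PySpark aggregation over the data in storage rather than over an in-memory list.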

Acknowledgement

All the data I based my analysis on was collected by, and belongs to, the Open CDP project.

Connect with me

Prakash Dontaraju: LinkedIn, Twitter, Medium