Marketing Campaign Data Analysis Using Apache Spark 💥🐝

📝 Gain hands-on skills with the languages and tools used in this project:

Languages and Tools:

Cloud:

GCP

Version Control System:

Git

Programming Language - Python:

Python

Big Data Tools and Software:

Hadoop, Apache Hive, Apache Spark, Linux


📙 Project Structure:

  • Project Introduction:

  • So, I had this project where I wanted to analyze marketing campaign data. I decided to use Apache Spark, specifically PySpark, running in the cloud on Google Cloud Platform (GCP).

  • Data Loading:

  • Loading Data into HDFS:

    • To get started, I needed to bring in the data, so I loaded three JSON files (ad_campaigns_data.json, user_profile_data.json, and store_data.json) into HDFS on GCP, as sketched below.
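
Here's a minimal sketch of this step in PySpark, reading the three files back out of HDFS once they have been copied in; the HDFS paths and the app name are my assumptions, not the repo's actual layout:

```python
from pyspark.sql import SparkSession

# Start a Spark session (app name is illustrative).
spark = (
    SparkSession.builder
    .appName("marketing-campaign-analysis")
    .getOrCreate()
)

# The three input files were first copied into HDFS, e.g. with:
#   hdfs dfs -mkdir -p /user/marketing/input
#   hdfs dfs -put ad_campaigns_data.json /user/marketing/input/
# The /user/marketing/input path is an assumption about the layout.
ad_campaigns  = spark.read.json("hdfs:///user/marketing/input/ad_campaigns_data.json")
user_profiles = spark.read.json("hdfs:///user/marketing/input/user_profile_data.json")
stores        = spark.read.json("hdfs:///user/marketing/input/store_data.json")

ad_campaigns.printSchema()  # quick sanity check on the inferred schema
```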
  • PySpark Data Analysis:

  • Analyzing Data with PySpark:

    • This was the exciting part. In a Jupyter notebook, I wrote PySpark code to tackle three specific analytical problems (sketched below):
      • I grouped the events by campaign_id, date, hour, os_type, and value, and counted the events in each group.
      • I repeated the same aggregation for campaign_id, date, hour, store_name, and value.
      • I did the same again for campaign_id, date, hour, gender_type, and value.
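
A minimal sketch of these aggregations, building on the DataFrames loaded above; the join keys (user_id, store_id) and the exact column names are assumptions about the schema, not confirmed by the repo:

```python
from pyspark.sql import functions as F

# Hypothetical join to attach user and store attributes to the raw campaign
# events; user_id and store_id are assumed join keys.
events = (
    ad_campaigns
    .join(user_profiles, "user_id", "left")
    .join(stores, "store_id", "left")
)

def event_counts(df, dimension):
    """Count events per campaign_id, date, hour, <dimension>, and value."""
    return (
        df.groupBy("campaign_id", "date", "hour", dimension, "value")
          .agg(F.count("*").alias("event_count"))
    )

# One aggregation per analytical problem.
os_counts     = event_counts(events, "os_type")
store_counts  = event_counts(events, "store_name")
gender_counts = event_counts(events, "gender_type")
```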
  • Data Storage:

  • Storing Processed Data:

    • To keep things organized, I stored the processed JSON output from each analytical problem in its own HDFS output directory (see the sketch below).
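
For example, each result DataFrame can be written as JSON to its own directory; the output paths below are illustrative:

```python
# One HDFS output directory per analytical problem.
os_counts.write.mode("overwrite").json("hdfs:///user/marketing/output/os_type_counts")
store_counts.write.mode("overwrite").json("hdfs:///user/marketing/output/store_name_counts")
gender_counts.write.mode("overwrite").json("hdfs:///user/marketing/output/gender_type_counts")
```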
  • Hive Table Creation:

  • Creating Hive Tables:

    • Once the output data was sitting comfortably in HDFS, I took the next step: I created external Hive tables over it, which let me run SQL-like queries on the data. I used the JSON SerDe (a serializer/deserializer) so Hive could parse the JSON files. A sketch follows below.
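Here's a sketch of one such table, with the DDL issued through a Hive-enabled SparkSession; the table name, column types, and location are assumptions, and the JsonSerDe class requires the hive-hcatalog-core jar on the Hive classpath:

```python
from pyspark.sql import SparkSession

# Hive-enabled session so the DDL below goes through the Hive metastore.
spark = (
    SparkSession.builder
    .appName("marketing-campaign-hive")
    .enableHiveSupport()
    .getOrCreate()
)

# External table over one of the JSON output directories; Hive reads the
# files in place via the JSON SerDe, so dropping the table keeps the data.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS os_type_counts (
        campaign_id STRING,
        `date`      STRING,
        `hour`      INT,
        os_type     STRING,
        `value`     STRING,
        event_count BIGINT
    )
    ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
    LOCATION 'hdfs:///user/marketing/output/os_type_counts'
""")
```

The same DDL can also be run directly from the Hive CLI or Beeline.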

So, that's the breakdown of my project. It involved data loading, PySpark analysis, data storage, and creating Hive tables for convenient querying.

  • Key Takeaway:
  • This project demonstrated the practical application of Apache Spark (PySpark) in a cloud environment, using Google Cloud Platform (GCP) to analyze marketing campaign data. It walked through the key stages of data loading, PySpark analysis, organized data storage, and the creation of external Hive tables for convenient querying, showcasing the power of big data tools and cloud computing in solving real-world analytical challenges.
