Cloud: Google Cloud Platform (GCP)
Version Control System:
Programming Language: Python
Big Data Tools and Software: Apache Spark (PySpark), HDFS, Hive
Project Introduction:
So, I had this project where I wanted to analyze marketing campaign data. I decided to use Apache Spark, specifically PySpark, and I ran everything in the cloud on Google Cloud Platform (GCP).
Data Loading:
Loading Data into HDFS:
- To get started, I needed to bring in the data. So, I loaded three JSON files - ad_campaigns_data.json, user_profile_data.json, and store_data.json - into HDFS on GCP.
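Here's a minimal sketch of this step: after pushing the files into HDFS (for example with `hdfs dfs -put`), reading them into PySpark DataFrames can look like the following. The HDFS base path and session setup are placeholders, not the exact ones I used:

```python
from pyspark.sql import SparkSession

# On a GCP Dataproc cluster a SparkSession is usually preconfigured;
# enableHiveSupport() is assumed here so the Hive steps later work too.
spark = SparkSession.builder \
    .appName("MarketingCampaignAnalysis") \
    .enableHiveSupport() \
    .getOrCreate()

# Assumed HDFS base path - adjust to wherever the files were put.
base = "hdfs:///user/data"
ad_campaigns = spark.read.json(f"{base}/ad_campaigns_data.json")
user_profiles = spark.read.json(f"{base}/user_profile_data.json")
stores = spark.read.json(f"{base}/store_data.json")

ad_campaigns.printSchema()  # sanity-check the inferred schema
```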
PySpark Data Analysis:
Analyzing Data with PySpark:
- This was the exciting part. I used a Jupyter notebook and wrote PySpark code to tackle some specific analytical challenges (sketched after this list). Here's what I did:
- I grouped the events by campaign_id, date, hour, os_type, and value, gathering and counting all the events in each group.
- I repeated the same aggregation for campaign_id, date, hour, store_name, and value.
- I did it once more for campaign_id, date, hour, gender_type, and value, again producing event counts.
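Here's a minimal sketch of those three aggregations. The `events` DataFrame is an assumption - a view of the campaign events joined with the user-profile and store attributes so that all five grouping columns are available in one place:

```python
from pyspark.sql import functions as F

# Assumed: `events` holds campaign events joined with user/store attributes,
# so all of the grouping columns below exist on one DataFrame.
groupings = [
    ["campaign_id", "date", "hour", "os_type", "value"],
    ["campaign_id", "date", "hour", "store_name", "value"],
    ["campaign_id", "date", "hour", "gender_type", "value"],
]

results = []
for cols in groupings:
    # Gather all events per combination of the grouping columns and count them.
    counts = events.groupBy(*cols).agg(F.count("*").alias("event_count"))
    results.append(counts)

results[0].show(5)  # peek at the os_type breakdown
```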
Data Storage:
Storing Processed Data:
- To keep things organized, I stored the processed JSON data from each of these analytical problems in separate HDFS output directories.
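Continuing the sketch above, writing each result set to its own HDFS output directory as JSON could look like this; the directory names are placeholders:

```python
# One output directory per analytical problem - names are placeholders.
output_dirs = [
    "hdfs:///user/output/events_by_os_type",
    "hdfs:///user/output/events_by_store_name",
    "hdfs:///user/output/events_by_gender_type",
]

for counts, out_dir in zip(results, output_dirs):
    # Each directory ends up holding the JSON part-files for one result set.
    counts.write.mode("overwrite").json(out_dir)
```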
Hive Table Creation:
Creating Hive Tables:
- Once I had the output data comfortably sitting in HDFS, I took the next step: I created external Hive tables on top of those output directories. These tables let me run SQL-like queries directly on the data. To parse the JSON, I used a JSON SerDe (serializer/deserializer), as sketched below.
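As a sketch, the DDL for one of those external tables might look like this, issued through spark.sql (this relies on the SparkSession's Hive support from earlier, and the built-in Hive JsonSerDe needs the hive-hcatalog-core jar on the classpath). The table name, columns, and location are assumptions matching the sketches above:

```python
# Assumed table name, schema, and location - adjust to the actual output.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS events_by_os_type (
        campaign_id STRING,
        `date`      STRING,
        `hour`      INT,
        os_type     STRING,
        value       STRING,
        event_count BIGINT
    )
    ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
    LOCATION 'hdfs:///user/output/events_by_os_type'
""")

# The JSON files under that location are now queryable with SQL.
spark.sql("""
    SELECT os_type, SUM(event_count) AS total_events
    FROM events_by_os_type
    GROUP BY os_type
""").show()
```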
So, that's the breakdown of my project. It involved data loading, PySpark analysis, data storage, and creating Hive tables for convenient querying.
- Key Takeaway:
- This project demonstrated a practical application of Apache Spark (PySpark) in a cloud environment, using Google Cloud Platform (GCP) to analyze marketing campaign data. It walked through the crucial stages of data loading, PySpark analysis, organized data storage, and the creation of external Hive tables for effective querying - showcasing the power of big data tools and cloud computing in solving real-world analytical challenges.