This project demonstrates big data analysis on AWS by filtering and analyzing the Wikiticker dataset. Using Amazon EMR, S3, Glue, and Athena, it walks through an end-to-end pipeline: processing the data with Spark, storing the results, cataloging them, and querying them.
AWS Big Data Processing
├── Code/
│ └── filter.py # Spark job script for processing the dataset
├── Data/
│ ├── datatypes.json # Schema definition for AWS Glue catalog table
│ └── wikiticker-2015-09-12-sampled.json # Sampled Wikiticker dataset for analysis
└── Project Documentation.pdf # Detailed project documentation
- AWS account with access to EMR, S3, Glue, and Athena services.
- AWS CLI installed and configured.
- Prepare the Data: Upload the `wikiticker-2015-09-12-sampled.json` file to your S3 bucket.
- Launch an EMR Cluster: Refer to `Project Documentation.pdf` for detailed instructions on setting up the EMR cluster.
- Run the Spark Job:
  - SSH into the EMR master node.
  - Use `vi` to create and edit `filter.py` directly on the node: `vi filter.py`
  - Insert the Spark script content into `filter.py`. Save and exit by typing `:wq!`.
  - Execute the script using spark-submit: `spark-submit filter.py`
- Catalog the Data: Use the provided `datatypes.json` to create a schema in AWS Glue for the filtered dataset.
- Query with Athena: Following the setup in Glue, use Athena to execute queries against your data.
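
The upload and query steps above can also be scripted. The following is a minimal sketch, not this project's own tooling: it builds AWS CLI commands from Python using only the standard library, relying on the AWS CLI listed in the prerequisites. The bucket name `my-wikiticker-bucket`, the `raw` and `athena-results/` prefixes, the database `wikiticker_db`, and the table `wikiticker_filtered` are placeholders assumed for illustration; substitute the names from your own setup.

```python
from __future__ import annotations

import subprocess
from pathlib import Path

# Hypothetical names for illustration only; replace with your own values.
BUCKET = "my-wikiticker-bucket"
DATASET = "Data/wikiticker-2015-09-12-sampled.json"


def upload_command(local_path: str, bucket: str, prefix: str = "raw") -> list[str]:
    """Build the `aws s3 cp` command that uploads one local file to S3."""
    filename = Path(local_path).name
    return ["aws", "s3", "cp", local_path, f"s3://{bucket}/{prefix}/{filename}"]


def athena_query_command(sql: str, database: str, bucket: str) -> list[str]:
    """Build the `aws athena start-query-execution` command for one SQL statement."""
    return [
        "aws", "athena", "start-query-execution",
        "--query-string", sql,
        # Database that holds the Glue catalog table (assumed name).
        "--query-execution-context", f"Database={database}",
        # Athena writes query results to this S3 location (assumed prefix).
        "--result-configuration", f"OutputLocation=s3://{bucket}/athena-results/",
    ]


def run(command: list[str]) -> str:
    """Run a command, raise CalledProcessError on failure, return its stdout."""
    result = subprocess.run(command, check=True, capture_output=True, text=True)
    return result.stdout
```

Calling `run(upload_command(DATASET, BUCKET))` performs the upload, and `run(athena_query_command("SELECT COUNT(*) FROM wikiticker_filtered", "wikiticker_db", BUCKET))` starts a query and returns the CLI's JSON response, which includes the `QueryExecutionId`. Keeping command construction separate from execution lets you print and review each command before it touches your account.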
When you are finished, terminate the EMR cluster and delete any unused resources in S3 to avoid unnecessary charges.
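
As a rough sketch of that cleanup, assuming the AWS CLI from the prerequisites is configured, the commands can be built and run from Python as follows. The cluster ID, bucket name, and prefix shown in the usage note are hypothetical placeholders. Note that `--dryrun` applies only to the S3 deletion; terminating the cluster takes effect immediately.

```python
from __future__ import annotations

import subprocess


def terminate_cluster_command(cluster_id: str) -> list[str]:
    """Build the `aws emr terminate-clusters` command for one cluster."""
    return ["aws", "emr", "terminate-clusters", "--cluster-ids", cluster_id]


def delete_prefix_command(bucket: str, prefix: str, dry_run: bool = True) -> list[str]:
    """Build the `aws s3 rm` command that deletes every object under a prefix.

    With dry_run=True (the default) the CLI only lists what it would delete.
    """
    command = ["aws", "s3", "rm", f"s3://{bucket}/{prefix}", "--recursive"]
    if dry_run:
        command.append("--dryrun")
    return command


def run(command: list[str]) -> str:
    """Run a command, raise CalledProcessError on failure, return its stdout."""
    result = subprocess.run(command, check=True, capture_output=True, text=True)
    return result.stdout
```

For example, `print(run(delete_prefix_command("my-wikiticker-bucket", "raw/")))` previews the deletion, and passing `dry_run=False` carries it out. `run(terminate_cluster_command("j-XXXXXXXXXXXXX"))` terminates the cluster with that (placeholder) ID.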
For detailed instructions, configuration options, and best practices, refer to `Project Documentation.pdf`, included in this repository.
The following resources provide foundational lab exercises that inspired the tasks and structure of this project:
- Spark Job for Filtering and Processing Wikiticker Data: Details the tasks in developing a Spark job for data filtering, similar to the approach taken in this project.
- Create Glue Catalog Table and Query Data in AWS Athena: Details the process of creating a Glue catalog table and using Athena for querying, as implemented in the workflow of this project.