Explore the capabilities of Amazon EMR Serverless by processing semi-structured review data with Apache Spark, showcasing efficient big data analysis without managing clusters.

kevinndungu-source/Amazon_EMR_Serverless_Demonstration

Amazon Elastic MapReduce (EMR) Serverless Demonstration


This project runs a sample Spark job on Amazon EMR Serverless to process semi-structured review data. The goal is to demonstrate how EMR Serverless handles big data processing and analysis workloads.

Overview

Amazon EMR (Elastic MapReduce) Serverless is a serverless option for big data processing that lets you run Apache Spark applications without managing clusters. In this demonstration, EMR Serverless processes semi-structured review data stored in JSON format, and the processed output is then analyzed to derive insights.


Project Structure

1. Scripts:

  • reviews.py: Python script for processing the review data.
  • script_arguments: Additional script arguments used during the EMR Serverless application setup.

2. Sample Dataset:

  • dataset_en_dev.json: Semi-structured review data in JSON format.
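
To make the role of the two scripts concrete, here is a minimal sketch of what a job like reviews.py could look like. It is not the repository's actual script: the `--input` and `--output` flags, the function names, and the decision to write the records back unchanged are assumptions for illustration. The only parts taken from this README are that the input is JSON review data and that the processed output is Parquet.

```python
import argparse


def parse_args(argv=None):
    """Parse the script arguments (hypothetical flags, not the repo's own)."""
    parser = argparse.ArgumentParser(description="Process review data with Spark")
    parser.add_argument("--input", required=True, help="Location of the JSON review data")
    parser.add_argument("--output", required=True, help="Location for the Parquet output")
    return parser.parse_args(argv)


def run(input_path, output_path):
    """Read the JSON reviews and write them out as Parquet."""
    # Imported here so the argument parsing works without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("reviews").getOrCreate()
    # Spark infers the schema from the JSON records (one record per line by default).
    reviews = spark.read.json(input_path)
    reviews.printSchema()
    # Overwrite any previous output so the job can be re-run safely.
    reviews.write.mode("overwrite").parquet(output_path)
    spark.stop()


def main(argv=None):
    """Entry point: parse arguments, then run the Spark job."""
    args = parse_args(argv)
    run(args.input, args.output)
```

A real script would call `main()` at the bottom of the file, and the values supplied as script arguments during job setup would arrive through `sys.argv`.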

How to Use

1. Setup Amazon EMR Serverless:

  • Configure an S3 bucket to store output files and logs.
  • Create an IAM role with appropriate permissions for EMR Serverless.

2. Run Spark Job:

  • Execute the sample Spark job using Amazon EMR Serverless.
  • Provide necessary script arguments during application setup.

3. Analyze Data with Amazon Athena:

  • Link Amazon Athena to the output folder in the S3 bucket containing processed Parquet data.
  • Run SQL queries in Amazon Athena to analyze the processed data and derive insights.
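
The three steps above can be sketched in Python as follows. This is an illustrative outline under stated assumptions, not the procedure used in this repository: the bucket layout (`scripts/`, `input/`, `output/`, `logs/`), the `--input`/`--output` script arguments, the table name, and the column list are all hypothetical. The request shape follows the boto3 `emr-serverless` client's `start_job_run` call, and the trust policy names the `emr-serverless.amazonaws.com` service principal.

```python
def build_trust_policy():
    """Step 1: trust policy letting EMR Serverless assume the job role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "emr-serverless.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def build_job_run_request(application_id, role_arn, bucket):
    """Step 2: keyword arguments for the boto3 start_job_run call."""
    return {
        "applicationId": application_id,
        "executionRoleArn": role_arn,
        "jobDriver": {
            "sparkSubmit": {
                # Hypothetical S3 layout; adjust to where the script was uploaded.
                "entryPoint": f"s3://{bucket}/scripts/reviews.py",
                # Hypothetical script arguments; use the ones the script expects.
                "entryPointArguments": [
                    "--input", f"s3://{bucket}/input/dataset_en_dev.json",
                    "--output", f"s3://{bucket}/output/",
                ],
            }
        },
        "configurationOverrides": {
            "monitoringConfiguration": {
                "s3MonitoringConfiguration": {"logUri": f"s3://{bucket}/logs/"}
            }
        },
    }


def submit(request, region):
    """Submit the job run; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builders work without boto3

    client = boto3.client("emr-serverless", region_name=region)
    return client.start_job_run(**request)["jobRunId"]


def build_athena_ddl(table, bucket, columns):
    """Step 3: DDL for an Athena external table over the Parquet output."""
    cols = ",\n  ".join(f"`{name}` {type_}" for name, type_ in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        "STORED AS PARQUET\n"
        f"LOCATION 's3://{bucket}/output/'"
    )
```

For example, `build_athena_ddl("reviews", "demo-bucket", [("review_id", "string"), ("stars", "bigint")])` produces a statement that can be pasted into the Athena query editor. The column names and types must match the Parquet schema the job actually wrote, so treat the pair shown here as placeholders.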

Additional Resources

  1. For detailed documentation and insights, refer to this project's documentation document.
  2. To replicate the project or explore the code, refer to the code section of this GitHub repository.
