This project is a basic demo of how to run big data analytics using PySpark on the AWS Elastic MapReduce (EMR) Serverless service.
IMPORTANT: before using the project, make sure you have completed all the Prerequisites.
To run analytics, just run:
my\path\to\pyspark-emr-serverless> python main.py
The `main.py` script will perform the following tasks:
- create an AWS EMR Serverless application (EMR app)
- upload the Spark entrypoint `spark_job.py` from your computer to the AWS S3 bucket (`my-output-bucket`)
- upload the required Python files (listed in the configuration file) from your computer to the AWS S3 bucket
- submit the PySpark job on the EMR app
- the PySpark job reads input data from the AWS S3 bucket `my-input-bucket`
- the PySpark job writes logs to the AWS S3 bucket `my-output-bucket`
- the PySpark job writes output data to the AWS S3 bucket `my-output-bucket`
- wait for the PySpark job to finish
- fetch the PySpark logs from AWS S3 and display them on the console
- close and delete the EMR app
When the job is completed, the analytics output data is available in the output S3 bucket `my-output-bucket`.
You can control Spark job configuration by editing the `conf.yaml` file.
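For reference, here is a minimal, hedged sketch of the lifecycle that `main.py` automates, using the `boto3` `emr-serverless` client. The application name, role ARN, S3 paths, and polling interval below are placeholders, not the script's actual values, which come from `conf.yaml`:

```python
# Hedged sketch of the job lifecycle automated by main.py (placeholder values).
import time
import boto3

emr = boto3.client("emr-serverless")

# Create the EMR Serverless application.
app = emr.create_application(name="pyspark-demo", releaseLabel="emr-6.6.0", type="SPARK")
app_id = app["applicationId"]

# Submit the PySpark job with its entrypoint, arguments and S3 log location.
job = emr.start_job_run(
    applicationId=app_id,
    executionRoleArn="arn:aws:iam::123456789012:role/EMRServerlessS3RuntimeRole",
    jobDriver={
        "sparkSubmit": {
            "entryPoint": "s3://my-output-bucket/spark_job.py",
            "entryPointArguments": ["s3://my-input-bucket/", "s3://my-output-bucket/output/"],
        }
    },
    configurationOverrides={
        "monitoringConfiguration": {
            "s3MonitoringConfiguration": {"logUri": "s3://my-output-bucket/logs/"}
        }
    },
)

# Poll until the job reaches a terminal state, then tear the application down.
while True:
    state = emr.get_job_run(applicationId=app_id, jobRunId=job["jobRunId"])["jobRun"]["state"]
    if state in ("SUCCESS", "FAILED", "CANCELLED"):
        break
    time.sleep(30)

emr.stop_application(applicationId=app_id)
emr.delete_application(applicationId=app_id)
```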
Prerequisites:
- `python==3.9` and `pip` are installed on your computer.
- `git` is installed on your computer.
Steps
- clone the content of this Git repository:
> git clone git@gitlab.com:data-terrae/projects/captchacracker.git
- open a terminal, and move to the project's root folder:
> cd my\path\to\pyspark-emr-serverless
- create and activate a virtual environment for the project:
my\path\to\pyspark-emr-serverless> python -m venv myenv
my\path\to\pyspark-emr-serverless> myenv\Scripts\activate
(myenv) my\path\to\pyspark-emr-serverless>
- install Python libraries that are required to run the project:
(myenv) my\path\to\pyspark-emr-serverless> pip install -r requirements.txt
This project uses AWS cloud services to store data and run analytics. In order to run the project, you need to have an active AWS account. See here to create one if needed.
To use EMR Serverless, you need an AWS IAM user with an attached policy that grants permissions for EMR Serverless. To create a user and attach the appropriate policy to that user, follow the instructions in Create a user and grant permissions.
The project uses AWS S3 to store input data as well as analytics outputs and logs. Before running the project, you must:
- Upload input data to AWS S3:
  - create an AWS S3 bucket with a custom name (e.g., `my-input-bucket`)
  - manually upload the input dataset to this S3 bucket
- Create a second AWS S3 bucket (e.g., `my-output-bucket`), that will later be used to:
  - upload Python files from your computer
  - store PySpark logs
  - store analytics output data
To create a bucket, follow the instructions in Creating a bucket in the Amazon Simple Storage Service Console User Guide.
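If you prefer scripting the setup, here is a hedged boto3 equivalent of the console steps. The bucket names are the examples used throughout this README (real bucket names must be globally unique), and the region is a placeholder:

```python
# Hedged alternative to the console: create the two example buckets with boto3.
import boto3

region = "eu-west-1"  # placeholder region
s3 = boto3.client("s3", region_name=region)

for bucket in ("my-input-bucket", "my-output-bucket"):
    s3.create_bucket(
        Bucket=bucket,
        CreateBucketConfiguration={"LocationConstraint": region},
    )
```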
Once you have created the two AWS S3 buckets, you can edit the project configuration file `conf.yaml` with the appropriate paths:
- `spark_job.arguments`: full S3 URIs to analytics input/output data folders
- `spark_job.s3_bucket_name`: name of the second S3 bucket (for logging and outputs)
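As an illustration only, a hedged sketch of reading those two settings with PyYAML; the `emr -> spark_job` key layout is an assumption based on the `conf.yaml` excerpt shown later in this README, not a documented schema:

```python
# Hedged sketch: load conf.yaml and read the two settings described above.
import yaml

with open("conf.yaml") as f:
    cfg = yaml.safe_load(f)

spark_job = cfg["emr"]["spark_job"]        # assumed nesting
print(spark_job["arguments"])              # full S3 URIs to input/output data folders
print(spark_job["s3_bucket_name"])         # name of the second S3 bucket
```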
The project uses the AWS Elastic MapReduce (EMR) Serverless service to parallelize data processing with PySpark. In order to run the project, you need to have:
- an AWS IAM user with appropriate permissions:
  - full access to the AWS EMR service API
  - read/write access to the AWS S3 buckets used in the project
- an AWS credentials file stored on your computer (`~/.aws/credentials`) that contains the user's access keys, for example:

  [default]
  aws_access_key_id=AKIAIOSFODNN7EXAMPLE
  aws_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
The `boto3` Python library will read these AWS credentials to access your AWS resources. See the boto3 documentation for further guidance on how to manage AWS credentials.
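As a quick sanity check (not part of the project itself), you can verify that boto3 finds your credentials with a call to STS:

```python
# Hedged sanity check: confirm boto3 can load the credentials file.
# Prints the ARN of the IAM identity behind the configured access keys.
import boto3

print(boto3.client("sts").get_caller_identity()["Arn"])
```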
Job runs in AWS EMR Serverless use a runtime role that provides granular permissions to specific AWS services and resources at runtime.
In this tutorial, a first S3 bucket `my-input-bucket` hosts the input data. The bucket `my-output-bucket` stores the job's Python files, logs, and output data.
To set up a job runtime role, first create a runtime role with a trust policy so that EMR Serverless can use the new role. Next, attach the required S3 access policy to that role. The following steps guide you through the process:
- Navigate to the IAM console at https://console.aws.amazon.com/iam/
- In the left navigation pane, choose Roles.
- Choose Create role.
- For role type, choose Custom trust policy and paste the following trust policy. This allows jobs submitted to your Amazon EMR Serverless applications to access other AWS services on your behalf.
{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "Service": "emr-serverless.amazonaws.com" }, "Action": "sts:AssumeRole" } ] }
- Choose Next to navigate to the Add permissions page, then choose Create policy.
- The Create policy page opens on a new tab. Paste the policy JSON below.
{ "Version": "2012-10-17", "Statement": [ { "Sid": "ReadAccessForEMRSamples", "Effect": "Allow", "Action": [ "s3:GetObject", "s3:ListBucket" ], "Resource": [ "arn:aws:s3:::*.elasticmapreduce", "arn:aws:s3:::*.elasticmapreduce/*" ] }, { "Sid": "FullAccessToOutputBucket", "Effect": "Allow", "Action": [ "s3:PutObject", "s3:GetObject", "s3:ListBucket", "s3:DeleteObject" ], "Resource": [ "arn:aws:s3:::my-input-bucket", "arn:aws:s3:::my-input-bucket/*", "arn:aws:s3:::my-output-bucket", "arn:aws:s3:::my-output-bucket/*", ] }, { "Sid": "GlueCreateAndReadDataCatalog", "Effect": "Allow", "Action": [ "glue:GetDatabase", "glue:CreateDatabase", "glue:GetDataBases", "glue:CreateTable", "glue:GetTable", "glue:UpdateTable", "glue:DeleteTable", "glue:GetTables", "glue:GetPartition", "glue:GetPartitions", "glue:CreatePartition", "glue:BatchCreatePartition", "glue:GetUserDefinedFunctions" ], "Resource": ["*"] } ] }
- On the Review policy page, enter a name for your policy, such as EMRServerlessS3AndGlueAccessPolicy.
- Refresh the Attach permissions policy page, and choose EMRServerlessS3AndGlueAccessPolicy.
- In the Name, review, and create page, for Role name, enter a name for your role, for example, EMRServerlessS3RuntimeRole. To create this IAM role, choose Create role.
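If you prefer scripting the role setup, here is a hedged boto3 sketch of the same steps. The role and policy names mirror the console walkthrough, and `emr_runtime_policy.json` is a hypothetical local file containing the S3/Glue policy JSON shown above:

```python
# Hedged boto3 equivalent of the console steps above.
# Assumes the S3/Glue permissions policy shown earlier has been saved locally
# as emr_runtime_policy.json (hypothetical file name).
import json
import boto3

iam = boto3.client("iam")

# Trust policy allowing EMR Serverless to assume the runtime role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "emr-serverless.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

iam.create_role(
    RoleName="EMRServerlessS3RuntimeRole",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Create the S3/Glue permissions policy and attach it to the role.
with open("emr_runtime_policy.json") as f:
    permissions_policy = f.read()

policy = iam.create_policy(
    PolicyName="EMRServerlessS3AndGlueAccessPolicy",
    PolicyDocument=permissions_policy,
)

iam.attach_role_policy(
    RoleName="EMRServerlessS3RuntimeRole",
    PolicyArn=policy["Policy"]["Arn"],
)
```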
When you launch a PySpark job on the AWS EMR Serverless service, your Python code runs on AWS EC2 instances with the Amazon Linux operating system and a number of pre-installed applications, including Python and the PySpark library.
AWS gives the list of application versions for each release of Amazon EMR.
The AWS EMR release version is given in the project configuration file `conf.yaml`: `emr.app.release_label: "emr-6.6.0"`.
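If you want to check which release labels exist before editing `conf.yaml`, one option (an assumption on my part, not something the project does) is to query the EMR `ListReleaseLabels` API with boto3; EMR Serverless release labels follow the same `emr-X.Y.Z` naming:

```python
# Hedged helper: list available EMR release labels starting with "emr-6".
# This calls the EMR API and is only used here to illustrate the label format
# referenced by emr.app.release_label in conf.yaml.
import boto3

emr = boto3.client("emr")
for label in emr.list_release_labels(Filters={"Prefix": "emr-6"})["ReleaseLabels"]:
    print(label)
```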
If you need to use extra Python libraries to run your Spark job, one option is to pass a Python Virtual Environment (Venv) to your PySpark job. Here are the steps to follow:
- Create and pack a Python Virtual Environment
We use `venv` and `venv-pack` to create and pack a virtual environment with the required libraries.
To be sure that the virtual environment will run on the AWS EC2 instances, we use Docker to create the virtual environment on top of an `amazonlinux` base image:
- open a terminal, and move to the `pyspark_env` folder:
> cd my\path\to\pyspark-emr-serverless\pyspark_env
my\path\to\pyspark-emr-serverless\pyspark_env>
- edit the `Dockerfile` and add the extra Python libraries that are required by your PySpark job
- build and pack the Python virtual environment using the Docker container:
my\path\to\pyspark-emr-serverless\pyspark_env> docker build --output . .
--> The packed virtual environment is written to the `pyspark_venv.tar.gz` archive file.
- Upload the packed Python Virtual Environment to AWS S3
The packed virtual environment must be uploaded to AWS S3, so that your PySpark job can access it.
Manually upload the `pyspark_venv.tar.gz` archive file to AWS S3 storage, for example to `s3://my-output-bucket/pyspark_env/pyspark_venv.tar.gz`.
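If you prefer not to use the console, a hedged alternative is to upload the archive with boto3; the bucket and key simply mirror the example path above:

```python
# Hedged alternative to a manual upload: push the packed environment to S3.
import boto3

boto3.client("s3").upload_file(
    "pyspark_venv.tar.gz",              # local archive produced by docker build
    "my-output-bucket",                 # destination bucket
    "pyspark_env/pyspark_venv.tar.gz",  # destination key
)
```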
- Edit Spark job configuration
You now need to tell your PySpark job to use the Python environment you just created.
To do so, you can edit the `conf.yaml` project configuration file and add the following configuration:
emr:
  spark_job:
    # Python virtual environment to be used by Spark job
    spark.archives: "s3://my-output-bucket/pyspark_env/pyspark_venv.tar.gz#environment"
    spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON: "./environment/bin/python"
    spark.emr-serverless.driverEnv.PYSPARK_PYTHON: "./environment/bin/python"
    spark.emr-serverless.executorEnv.PYSPARK_PYTHON: "./environment/bin/python"
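For context, these properties end up as Spark configuration on the job run. Here is a hedged sketch of how they might be passed through boto3's `start_job_run`; the application ID, role ARN, and entrypoint path are placeholders:

```python
# Hedged sketch: how the properties above map onto a start_job_run call.
import boto3

emr = boto3.client("emr-serverless")
emr.start_job_run(
    applicationId="00f1abcdexample",  # placeholder application ID
    executionRoleArn="arn:aws:iam::123456789012:role/EMRServerlessS3RuntimeRole",
    jobDriver={
        "sparkSubmit": {
            "entryPoint": "s3://my-output-bucket/spark_job.py",
            "sparkSubmitParameters": (
                "--conf spark.archives=s3://my-output-bucket/pyspark_env/pyspark_venv.tar.gz#environment "
                "--conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python "
                "--conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python "
                "--conf spark.emr-serverless.executorEnv.PYSPARK_PYTHON=./environment/bin/python"
            ),
        }
    },
)
```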