AWS Glue provides a serverless Spark execution environment. The Spark UI via. a Spark history server can help with performance tuning jobs.
Glue does not provide a Spark history server. However, Glue jobs can produce Spark UI logs in an S3 bucket. You can subsequently host your own Spark history server that visualizes the Spark UI logs directly from the S3 bucket Glue logs in.
The instructions below are for Windows. It uses a docker container to run the Spark history server and directly consumes Spark UI logs from an S3 bucket in real time.
-
Enable Spark UI logs for a Glue job if it is not already enabled. Instructions for how to enable the Spark web UI for glue jobs are provided in the AWS documentation
-
Download the required docker files that correspond to the version of Glue
- Build the Docker image by running the following from the command line in the folder you downloaded the docker files in
docker build -t glue/sparkui:latest .
- Set the following environment variables from the command line
- Replace the <S3_BUCKET_PATH_TO_SPARK_UI_LOGS> with the name of the S3 bucket and path to the folder that contains the Spark UI logs from the Glue job(s)
set LOG_DIR="s3a://<S3_BUCKET_PATH_TO_SPARK_UI_LOGS>"
- Replace <AWS_ACCESS_KEY_ID> and <AWS_SECRET_ACCESS_KEY> with the access key id and secret access key for a user that has access to read the Spark UI files from the S3 bucket
set AWS_ACCESS_KEY_ID="<AWS_ACCESS_KEY_ID>"
set AWS_SECRET_ACCESS_KEY="<AWS_SECRET_ACCESS_KEY>"
- Create the docker container running the following from the command line
docker run -itd -e SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=%LOG_DIR% -Dspark.hadoop.fs.s3a.access.key=%AWS_ACCESS_KEY_ID% -Dspark.hadoop.fs.s3a.secret.key=%AWS_SECRET_ACCESS_KEY%" -p 18080:18080 --name sparkui glue/spark:latest "/opt/spark/bin/spark-class org.apache.spark.deploy.history.HistoryServer"
- The Spark UI will be avaiable at http://localhost:18080/