This repository contains libraries used in the AWS Glue service. These libraries extend Apache Spark with additional data types and operations for ETL workflows. They are used in code generated by the AWS Glue service and can be used in scripts submitted with Glue jobs.
- awsglue -- This Python package includes the Python interfaces to the AWS Glue ETL library.
The Glue ETL jars are now available through the Maven build system in an S3-backed Maven repository. We use Maven's copy-dependencies goal to fetch all the dependencies that Glue needs into a local directory.
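As a rough sketch of the copy-dependencies step described above, the helper below wraps the `dependency:copy-dependencies` goal of the standard Maven dependency plugin. The function name `copy_glue_deps`, the default `pom.xml` path, the `jars` output directory, and the `MVN` override variable are all illustrative assumptions, not names taken from this repository.

```shell
# Hypothetical helper: copy every dependency declared in a pom.xml into a
# local directory using the maven-dependency-plugin.
#   $1 = path to the pom file (assumed default: pom.xml)
#   $2 = output directory    (assumed default: jars)
# MVN is an assumed override for the Maven executable; setting MVN=echo
# prints the arguments instead of running Maven.
copy_glue_deps() {
  pom="${1:-pom.xml}"
  out="${2:-jars}"
  "${MVN:-mvn}" -f "$pom" -DoutputDirectory="$out" dependency:copy-dependencies
}
```

Calling `copy_glue_deps pom.xml jars` would then resolve the dependencies and place them in `jars`, provided `mvn` is on the PATH and the pom file points at the repository that hosts the Glue artifacts.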
Install Apache Maven from the following location: https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-common/apache-maven-3.6.0-bin.tar.gz
Install the Spark distribution from the following location based on the Glue version:
- Glue version 0.9: https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-0.9/spark-2.2.1-bin-hadoop2.7.tgz
- Glue version 1.0: https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-1.0/spark-2.4.3-bin-hadoop2.8.tgz
Export the SPARK_HOME environment variable to the extracted location of the above Spark archive:
- Glue version 0.9: export SPARK_HOME=/home/$USER/spark-2.2.1-bin-hadoop2.7
- Glue version 1.0: export SPARK_HOME=/home/$USER/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8
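The version-specific paths above can be wrapped in a small helper so that a setup script selects the right SPARK_HOME from a single variable. This is only a sketch: the function name `glue_spark_home` and the `GLUE_VERSION` variable are hypothetical, and the paths assume the archive was extracted under /home/$USER exactly as in the export lines above.

```shell
# Hypothetical helper (not part of this repository): print the SPARK_HOME
# path listed above for a given Glue version.
glue_spark_home() {
  case "$1" in
    0.9) echo "/home/$USER/spark-2.2.1-bin-hadoop2.7" ;;
    1.0) echo "/home/$USER/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8" ;;
    *)   echo "unsupported Glue version: $1" >&2; return 1 ;;
  esac
}

# Example: use GLUE_VERSION (an assumed variable) if set, otherwise Glue 1.0.
SPARK_HOME="$(glue_spark_home "${GLUE_VERSION:-1.0}")" && export SPARK_HOME
```

Returning a non-zero status for unknown versions keeps a typo in the version string from silently exporting an empty SPARK_HOME.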
The gluepytest script assumes that the pytest module is installed and available on the PATH.
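One way to verify that assumption before running the script is a quick PATH check. The sketch below uses a hypothetical helper, `require_cmd`, which is not part of this repository; the `pip install pytest` hint is one common way to install pytest and may differ in your environment.

```shell
# Hypothetical helper: succeed only if the named command is found on the PATH.
require_cmd() {
  command -v "$1" >/dev/null 2>&1 || {
    echo "$1 not found on PATH" >&2
    return 1
  }
}

# Report whether pytest is available; this does not install anything.
if require_cmd pytest; then
  echo "pytest found: $(command -v pytest)"
else
  echo "install pytest first, for example: pip install pytest" >&2
fi
```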
- Glue shell: ./bin/gluepyspark
- Glue submit: ./bin/gluesparksubmit
- pytest: ./bin/gluepytest
The libraries in this repository are licensed under the Amazon Software License (the "License"). They may not be used except in compliance with the License, a copy of which is included here in the LICENSE file.