Hi everyone! Welcome to the official documentation page for terraglue, an open source Terraform module that provides an easy way to deploy a Glue job in any AWS account.
- Are you using Glue for the first time and want to see an end to end ETL example in AWS?
- Do you already have a Spark application and want to deploy it as a Glue job in AWS?
- Do you want to automate the Glue job setup using an IaC tool such as Terraform?
- Have you ever wanted to take your Glue job development to the next level?
Note: The terraglue project now has official documentation on Read the Docs! Visit it to check out usage notes, technical details, practical examples and more!
- ✌️ Available in two different operation modes: "learning" and "production"
- 🤖 Possibility to deploy a preconfigured Glue job with a complete end-to-end ETL example when using "learning" mode
- 🚀 Possibility to deploy a custom Glue job according to user needs when using "production" mode
- 👉 Have your Glue job ready and running at the touch of a Terraform module call
When the terraglue module is called in a Terraform project, an operation mode must be chosen. There are two options: "learning" mode and "production" mode. Depending on this choice, the module deploys different resources in the target AWS account.
The learning mode helps users understand more about Glue jobs on AWS by providing a complete example with all the resources needed to start exploring Glue. It works as follows:
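As a rough illustration of how a single mode setting can change what Terraform deploys, the sketch below uses a validated input variable and a conditional `count`. It is not taken from the terraglue source: the variable name `mode`, the resource label `learning_example` and the choice of gating a KMS key are all assumptions made for the example.

```hcl
# Hypothetical input variable; the real terraglue variable name may differ.
variable "mode" {
  type        = string
  description = "Operation mode: learning or production."

  validation {
    # Reject any value other than the two documented modes.
    condition     = contains(["learning", "production"], var.mode)
    error_message = "The mode must be either \"learning\" or \"production\"."
  }
}

# Illustrative resource that exists only when the learning mode is selected.
resource "aws_kms_key" "learning_example" {
  count       = var.mode == "learning" ? 1 : 0
  description = "Example key created only in learning mode (illustration)"
}
```

With this pattern, switching the variable value is enough to add or remove the gated resources on the next `terraform apply`.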
🤖 Learning mode
- A sample PySpark application is uploaded to a given S3 bucket to be the main script for the Glue job
- An auxiliary Python file is also uploaded to S3 with useful transformation functions for the job
- An IAM role is created with basic permissions to run a Glue job
- A KMS key is created to be used in the job security configuration
- Finally, a preconfigured Glue job is deployed to give users an example of creating a SoT table using Brazilian E-Commerce data from datadelivery
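A module call for the learning mode could look roughly like the sketch below. Treat it as a minimal example under assumptions: the `source` value is a placeholder to be replaced with the actual module source, and the variable names `mode` and `job_bucket_name` are hypothetical, so check the official documentation for the real ones.

```hcl
module "terraglue" {
  # Placeholder: replace with the real terraglue module source.
  source = "REPLACE_WITH_TERRAGLUE_MODULE_SOURCE"

  # Hypothetical variable name for choosing the operation mode.
  mode = "learning"

  # Hypothetical variable name for the S3 bucket that receives
  # the sample PySpark script and the auxiliary Python file.
  job_bucket_name = "my-glue-assets-bucket"
}
```

After filling in the real values, the usual `terraform init`, `terraform plan` and `terraform apply` sequence deploys the resources listed above.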
On the other hand, the production mode enables users to configure and deploy their own Glue jobs in AWS. What happens under the hood depends on how users configure the variables in the module call. In summary, it works as follows:
🚀 Production mode
- In this mode, users can use all the terraglue module variables to customize the deployment
- A custom Glue job is deployed in the target AWS account using the variables passed by users in the module call
The terraglue Terraform module isn't alone. There are other complementary open source solutions that can be put together to enable the full power of learning analytics on AWS. Check them out if you think they could be useful for you!
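For comparison, a production mode call might be sketched as follows. Everything beyond the mode value itself is an assumption: the `source` placeholder, the variable name `mode`, and the customization variables `glue_job_name` and `glue_script_s3_path` are hypothetical names used only to show the shape of a customized call.

```hcl
module "terraglue" {
  # Placeholder: replace with the real terraglue module source.
  source = "REPLACE_WITH_TERRAGLUE_MODULE_SOURCE"

  # Hypothetical variable name for choosing the operation mode.
  mode = "production"

  # Hypothetical customization variables; the real module
  # exposes its own set of variables for this purpose.
  glue_job_name       = "my-custom-job"
  glue_script_s3_path = "s3://my-glue-assets-bucket/jobs/main.py"
}
```

The idea is that the caller, not the module, decides the job's name, script and other settings in this mode.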
AWS Glue
- AWS - Glue Official Page
- AWS - Jobs Parameters Used by AWS Glue
- AWS - GlueContext Class
- AWS - DynamicFrame Class
- Stack Overflow - Job Failing by Job Bookmark Issue - Empty DataFrame
- AWS - Calling AWS Glue APIs in Python
- AWS - Using Python Libraries with AWS Glue
- Spark Temporary Tables in Glue Jobs
- Medium - Understanding All AWS Glue Import Statements and Why We Need Them
- AWS - Develop and test AWS Glue jobs Locally using Docker
- AWS - Creating OpenID Connect (OIDC) identity providers
Terraform
- Terraform - Hashicorp Terraform
- Terraform - Conditional Expressions
- Stack Overflow - combine "count" and "for_each" on Terraform
Apache Spark
- SparkByExamples - Pyspark Date Functions
- Spark - Configuration Properties
- Stack Overflow - repartition() vs coalesce()
GitHub
- Conventional Commits
- Semantic Release
- GitHub - Angular Commit Message Format
- GitHub - commitlint
- shields.io
- Codecoverage - docs
- GitHub Actions Marketplace
- Continuous Integration with GitHub Actions
- GitHub - About security hardening with OpenID Connect
- GitHub - Securing deployments to AWS from GitHub Actions with OpenID Connect
- GitHub - Workflow syntax for GitHub Actions
- Eduardo Mendes - Live de Python #170 - GitHub Actions
Docker
- GitHub Docker Run Action
- Using Docker Run inside of GitHub Actions
- Stack Overflow - Unable to find region when running docker locally
Tests
- Eduardo Mendes - Live de Python #167 - Pytest: Uma Introdução
- Eduardo Mendes - Live de Python #168 - Pytest Fixtures
- Databricks - Data + AI Summit 2022 - Learn to Efficiently Test ETL Pipelines
- Real Python - Getting Started with Testing in Python
- Inspired Python - Five Advanced Pytest Fixture Patterns
- getmoto/moto - mock inputs
- Codecov - Do test files belong in code coverage calculations?
- Jenkins Issue: Endpoint does not contain a valid host name
Others