- Learn how to provision computing resources for running Big Data analyses using the Infrastructure as Code (IaC) approach.
- Learn how to set up opinionated CI/CD pipelines to deploy cloud infrastructure.
- Learn how to utilize linters for detecting security vulnerabilities in cloud infrastructure.
- Learn how to run Apache Spark code in a distributed way on a Hadoop cluster using Vertex AI notebooks and Dataproc services on GCP.
- Learn how to use Workload Identity Federation for secure authentication from GitHub Actions to Google Cloud.
- Google Cloud SDK
- gsutil
- pre-commit (optional)
- Terraform (see Requirements)
- Python ~>3.8
- Linux/macOS
- pre-commit-terraform dependencies (optional)
- Redeem a GCP coupon to create a billing account
- Authenticate to GCP to obtain the default credentials used for running the code
# first remove the stored credentials, if they exist
gcloud auth application-default revoke
# log in and get the new application credentials
gcloud auth application-default login
- Export shared environment variables
export TF_VAR_tbd_semester=2023Z
# format: 20xx for teachers, student ID number for students
export TF_VAR_user_id=9900
# use your own billing account id
export TF_VAR_billing_account=01D435-06DD59-9A00B5
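Terraform reads every environment variable prefixed with `TF_VAR_` as an input variable, so a forgotten export tends to surface only later, as an interactive prompt or an error. The following is a minimal preflight sketch in Python, assuming the three variables exported above are the only ones required at this stage; the helper name `missing_tf_vars` is hypothetical and not part of the repository.

```python
import os

# Variable names taken from the export step above.
REQUIRED_TF_VARS = (
    "TF_VAR_tbd_semester",
    "TF_VAR_user_id",
    "TF_VAR_billing_account",
)


def missing_tf_vars(env=None):
    """Return the required TF_VAR_* names that are unset or blank in env.

    env defaults to the current process environment; pass a dict to check
    a different set of values.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_TF_VARS if not env.get(name, "").strip()]
```

Calling `missing_tf_vars()` with no arguments, in the same shell session where the exports were made, returns an empty list when everything is set and the names of the missing variables otherwise.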
- Enter the `bootstrap` folder, then initialize the project and the Terraform state bucket
cd bootstrap
terraform init
terraform apply
cd ..
- CI/CD (GitHub Actions setup using Workload Identity Federation):
- Edit the `env/backend.tfvars` file and set the `bucket` variable to the Terraform state bucket
- Edit the `env/project.tfvars` file and set the `project_name` and `iac_service_account` variables using the output from the `bootstrap` phase
- Edit `cicd_bootstrap/conf/github_actions.tfvars` to set `github_org` and `github_repo`, e.g.:
github_org = "mwiewior"
github_repo = "tbd-2023z-phase1"
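Because three separate `.tfvars` files are edited by hand, it can help to confirm that none of the expected keys was left empty before running Terraform. The sketch below is an illustration only: it understands flat `key = "value"` string assignments like the ones shown above, not full HCL syntax, and the function names `parse_flat_tfvars` and `unset_keys` are hypothetical.

```python
import re

# Matches flat string assignments such as: github_org = "mwiewior"
_ASSIGNMENT = re.compile(
    r'^\s*([A-Za-z_][A-Za-z0-9_-]*)\s*=\s*"([^"]*)"\s*(?:#.*)?$'
)


def parse_flat_tfvars(text):
    """Parse `key = "value"` lines; maps, lists and numbers are ignored."""
    values = {}
    for line in text.splitlines():
        match = _ASSIGNMENT.match(line)
        if match:
            values[match.group(1)] = match.group(2)
    return values


def unset_keys(text, required):
    """Return the required keys that are absent or empty in the tfvars text."""
    values = parse_flat_tfvars(text)
    return [key for key in required if not values.get(key)]


example = 'github_org  = "mwiewior"\ngithub_repo = "tbd-2023z-phase1"\n'
print(unset_keys(example, ["github_org", "github_repo"]))  # prints []
```

The same check could be applied to the contents of `env/backend.tfvars` (key `bucket`) and `env/project.tfvars` (keys `project_name` and `iac_service_account`).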
- Init state file and set env variables
cd cicd_bootstrap
terraform init -backend-config=../env/backend.tfvars
- Apply
# authenticate Docker backend with GCP
gcloud auth configure-docker
# create CI/CD integration using Workload Identity
terraform apply -var-file ../env/project.tfvars -var-file conf/github_actions.tfvars -compact-warnings
cd ..
- Use the output variables to configure the GitHub Actions workflow `.github/workflows/pull-request.yml`. Do not hardcode these values in the YAML file; set them as GitHub Actions secrets instead, preserving the secret names `GCP_WORKLOAD_IDENTITY_PROVIDER_NAME` and `GCP_WORKLOAD_IDENTITY_SA_EMAIL`.
- Install and configure `pre-commit` (optional)
pre-commit install
- Commit your changes, push them to a branch and open a PR against the main/master branch of YOUR repository. If GitHub warns that workflows are not enabled for the repository, enable them and push your changes again.
Once all pull request checks have passed, merge your PR and wait until the release job finishes.
- Navigate to the Vertex AI Workbench menu item, find your notebook on the list, press CONNECT and follow the instructions
- Check whether the `pyspark` kernel exists; if not, add a Python 3.8 kernel in your JupyterLab environment:
python3.8 -m ipykernel install --user --name pyspark
- Run a `Hello-world` PySpark application in YARN client mode.
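A minimal sketch of such a Hello-world application is given here, assuming the notebook kernel has `pyspark` installed and its Hadoop/YARN client configuration already points at the Dataproc cluster. The function names and the sample data are illustrative and not taken from the workshop repository.

```python
def yarn_client_conf(app_name="hello-world"):
    """Spark settings that keep the driver in the notebook and run executors on YARN."""
    return {
        "spark.app.name": app_name,
        "spark.master": "yarn",
        "spark.submit.deployMode": "client",
    }


def run_hello_world():
    """Start a session on YARN, show a tiny DataFrame, and return its row count."""
    # Imported lazily so this module can be loaded where pyspark is not installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder
    for key, value in yarn_client_conf().items():
        builder = builder.config(key, value)
    spark = builder.getOrCreate()
    try:
        df = spark.createDataFrame([("Hello",), ("world",)], ["word"])
        df.show()
        return df.count()
    finally:
        # Release the YARN containers once the example has finished.
        spark.stop()
```

In a notebook cell using the `pyspark` kernel, calling `run_hello_world()` starts the session, prints the DataFrame and returns its row count. In client mode the driver runs inside the notebook process, so the output appears directly in the cell.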
IMPORTANT ❗ ❗ ❗ Remember to destroy all resources once you have finished working:
terraform init -backend-config=env/backend.tfvars
terraform destroy -no-color -var-file env/project.tfvars
Name | Version |
---|---|
terraform | ~> 1.5.0 |
docker | 3.0.2 |
google | ~> 4.84.0 |
kubernetes | 2.24.0 |
Name | Version |
---|---|
google | 4.84.0 |
kubernetes | 2.24.0 |
Name | Source | Version |
---|---|---|
composer | github.com/bdg-tbd/tbd-workshop-1.git | v1.0.36/modules/composer |
data-pipelines | github.com/bdg-tbd/tbd-workshop-1.git | v1.0.36/modules/data-pipeline |
dataproc | github.com/bdg-tbd/tbd-workshop-1.git | v1.0.36/modules/dataproc |
dbt_docker_image | github.com/bdg-tbd/tbd-workshop-1.git | v1.0.36/modules/dbt_docker_image |
gcr | github.com/bdg-tbd/tbd-workshop-1.git | v1.0.36/modules/gcr |
jupyter_docker_image | github.com/bdg-tbd/tbd-workshop-1.git | v1.0.36/modules/jupyter_docker_image |
vertex_ai_workbench | github.com/bdg-tbd/tbd-workshop-1.git | v1.0.36/modules/vertex-ai-workbench |
vpc | github.com/bdg-tbd/tbd-workshop-1.git | v1.0.36/modules/vpc |
Name | Type |
---|---|
google_compute_firewall.allow-all-internal | resource |
kubernetes_service.dbt-task-service | resource |
google_client_config.provider | data source |
google_container_cluster.composer-gke-cluster | data source |
Name | Description | Type | Default | Required |
---|---|---|---|---|
ai_notebook_instance_owner | Vertex AI workbench owner | `string` | n/a | yes |
project_name | Project name | `string` | n/a | yes |
region | GCP region | `string` | `"europe-west1"` | no |
No outputs.