IMPORTANT ❗ ❗ ❗ Remember to destroy all resources after each work session. You can recreate the infrastructure by creating a new PR and merging it to master.
-
Authors:
enter your group number
link to your forked repo
-
Fork https://github.com/bdg-tbd/tbd-2023z-phase1 and follow all steps in README.md.
-
Select your project and set budget alerts at 5%, 25%, 50%, and 80% of $50 (in Cloud Console -> Billing -> Budgets & alerts -> Create budget; untick "Discounts" and "Promotions and others" while creating the budget).
-
From the available GitHub Actions, select and run destroy on the main branch.
-
Create a new git branch and add the following two resources in /modules/data-pipeline/main.tf:
resource "google_storage_bucket" "tbd-data-bucket" -> the bucket to store data. Set the following properties (a hedged sketch of both resources follows these lists):
- project // look for variable in variables.tf
- name // look for variable in variables.tf
- location // look for variable in variables.tf
- uniform_bucket_level_access = false #tfsec:ignore:google-storage-enable-ubla
- force_destroy = true
- public_access_prevention = "enforced"
- if Checkov returns an error, add other properties as needed
-
resource "google_storage_bucket_iam_member" "tbd-data-bucket-iam-editor" -> assign the role roles/storage.objectUser to the data service account. Set the following properties:
- bucket // refere to bucket name from tbd-data-bucket
- role // follow the instruction above
- member = "serviceAccount:${var.data_service_account}"
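A minimal sketch of both resources under the assumptions above; the variable names (project_name, data_bucket_name, region) are placeholders, so substitute the actual names from variables.tf:

```hcl
resource "google_storage_bucket" "tbd-data-bucket" {
  project                     = var.project_name     # placeholder; see variables.tf
  name                        = var.data_bucket_name # placeholder; see variables.tf
  location                    = var.region           # placeholder; see variables.tf
  uniform_bucket_level_access = false #tfsec:ignore:google-storage-enable-ubla
  force_destroy               = true
  public_access_prevention    = "enforced"
}

resource "google_storage_bucket_iam_member" "tbd-data-bucket-iam-editor" {
  bucket = google_storage_bucket.tbd-data-bucket.name
  role   = "roles/storage.objectUser"
  member = "serviceAccount:${var.data_service_account}"
}
```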
insert the link to the modified file and terraform snippet here
Create a PR from this branch to YOUR master and merge it to make a new release.
place the screenshot from GitHub Actions after successful application of the release with these changes
-
Analyze the Terraform code. Play with terraform plan and terraform graph to investigate the different modules.
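For example (the module name below is illustrative; list the real ones with `terraform state list` or look under /modules):

```bash
# render the dependency graph of the whole configuration (requires Graphviz)
terraform graph | dot -Tpng > graph.png

# limit a plan to a single module to inspect it in isolation
terraform plan -target=module.data-pipeline
```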
describe one selected module and put the output of terraform graph for this module here
-
Reach YARN UI
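The YARN ResourceManager UI is usually served on port 8088 of the cluster master node. One possible way to reach it, assuming SSH tunneling through IAP (the cluster name and zone are placeholders; adjust to your project):

```bash
# forward local port 8088 to the YARN ResourceManager on the Dataproc master
gcloud compute ssh tbd-cluster-m \
  --zone europe-west1-b \
  --tunnel-through-iap \
  -- -L 8088:localhost:8088
```

Then open http://localhost:8088 in a local browser. If the Component Gateway is enabled on the cluster, its link in the Cloud Console is an alternative that needs no tunnel.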
place the port and the screenshot of YARN UI here
-
Draw an architecture diagram (e.g. in draw.io) that includes:
- VPC topology with service assignment to subnets
- Description of the service accounts assigned to the components
- List of buckets for disposal
- Description of the network communication (ports; why it is necessary to specify the driver host) of Apache Spark running from Vertex AI Workbench. In client mode the executors on the cluster open connections back to the driver, so spark.driver.host must be set to an address of the Workbench instance that is routable from the cluster's subnet.
place your diagram here
-
Add usage costs by entering the expected consumption into Infracost.
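One hedged way to do this with the Infracost CLI (the file name is illustrative):

```bash
# generate a usage-file skeleton for the detected resources
infracost breakdown --path . --sync-usage-file --usage-file infracost-usage.yml

# fill in the expected consumption, then produce the estimate
infracost breakdown --path . --usage-file infracost-usage.yml
```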
place the expected consumption you entered here
place the screenshot from infracost output here
-
Some resources are not supported by Infracost yet. Manually estimate the total cost of the infrastructure based on the pricing for the region used in the project. Include the costs of Cloud Composer, Dataproc, and Vertex AI Workbench, and add them to the Infracost estimation.
place your estimation and references here
what are the options for cost optimization?
-
Create a BigQuery dataset and an external table
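A minimal Terraform sketch of the pair, assuming hypothetical variable names (an external table reads files in place from GCS; nothing is ingested into BigQuery storage):

```hcl
resource "google_bigquery_dataset" "demo" {
  project    = var.project_name   # placeholder; see variables.tf
  dataset_id = "demo_dataset"
  location   = var.region
}

resource "google_bigquery_table" "demo_external" {
  project             = var.project_name
  dataset_id          = google_bigquery_dataset.demo.dataset_id
  table_id            = "demo_external"
  deletion_protection = false

  external_data_configuration {
    autodetect    = true   # ORC files carry their schema in the file footer
    source_format = "ORC"
    source_uris   = ["gs://${var.data_bucket_name}/data/*.orc"]   # placeholder path
  }
}
```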
place the code and output here
why does ORC not require a table schema?
-
Start an interactive session from Vertex AI Workbench (steps 7-9 in README):
place the screenshot of notebook here
-
Find and correct the error in spark-job.py
describe the cause and how you found the error
-
Additional tasks using Terraform:
- Add support for arbitrary machine types and worker node counts for the Dataproc cluster and the JupyterLab instance
place the link to the modified file and inserted terraform code
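A hedged sketch of the parametrization (variable and resource names are illustrative, not necessarily the repo's):

```hcl
variable "dataproc_machine_type" {
  type    = string
  default = "e2-standard-2"
}

variable "dataproc_worker_count" {
  type    = number
  default = 2
}

resource "google_dataproc_cluster" "cluster" {
  # ... other arguments as in the repo ...
  cluster_config {
    master_config {
      machine_type = var.dataproc_machine_type
    }
    worker_config {
      num_instances = var.dataproc_worker_count
      machine_type  = var.dataproc_machine_type
    }
  }
}
```

The same pattern applies to the JupyterLab instance: expose a machine type variable and pass it to the google_notebooks_instance resource.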
- Add support for preemptible/spot instances in a Dataproc cluster
place the link to the modified file and inserted terraform code
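A sketch using the provider's preemptible worker block (the SPOT value needs a reasonably recent google provider):

```hcl
resource "google_dataproc_cluster" "cluster" {
  # ...
  cluster_config {
    preemptible_worker_config {
      num_instances  = 2
      preemptibility = "SPOT"   # or "PREEMPTIBLE"
    }
  }
}
```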
- Perform additional hardening of the JupyterLab environment, i.e. disable sudo access and enable secure boot
place the link to the modified file and inserted terraform code
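A sketch of both hardening settings on a user-managed notebook instance; to my knowledge notebook-disable-root is the metadata key that disables root/sudo access, but treat it as an assumption to verify:

```hcl
resource "google_notebooks_instance" "notebook" {
  # ...
  shielded_instance_config {
    enable_secure_boot = true
  }
  metadata = {
    notebook-disable-root = "true"   # assumed key; blocks sudo in JupyterLab
  }
}
```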
- (Optional) Get access to Apache Spark WebUI
place the link to the modified file and inserted terraform code
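One hedged option is Dataproc's Component Gateway, which proxies the YARN and Spark UIs without opening firewall ports:

```hcl
resource "google_dataproc_cluster" "cluster" {
  # ...
  cluster_config {
    endpoint_config {
      enable_http_port_access = true   # exposes web UIs through the Component Gateway
    }
  }
}
```

An SSH tunnel to the master node (as for the YARN UI above) is an alternative.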