IMPORTANT ❗ ❗ ❗ Remember to destroy all resources after each work session. You can recreate the infrastructure by creating a new PR and merging it to master.
-
Authors:
enter your group number
link to your forked repo
-
Fork https://github.com/bdg-tbd/tbd-2023z-phase1 and follow all steps in README.md.
-
Select your project and set budget alerts at 5%, 25%, 50%, and 80% of $50 (in Cloud Console -> Billing -> Budgets & alerts -> Create budget; untick "Discounts" and "Promotions and others" while creating the budget).
-
From the available GitHub Actions, select and run destroy on the main branch.
-
Create a new git branch and add the following two resources in /modules/data-pipeline/main.tf:
resource "google_storage_bucket" "tbd-data-bucket" -> the bucket to store data. Set the following properties (a hedged sketch of both resources follows these lists):
- project // look for variable in variables.tf
- name // look for variable in variables.tf
- location // look for variable in variables.tf
- uniform_bucket_level_access = false #tfsec:ignore:google-storage-enable-ubla
- force_destroy = true
- public_access_prevention = "enforced"
- if Checkov returns an error, add other properties as needed
-
resource "google_storage_bucket_iam_member" "tbd-data-bucket-iam-editor" -> assign the role roles/storage.objectUser to the data service account. Set the following properties:
- bucket // refere to bucket name from tbd-data-bucket
- role // follow the instruction above
- member = "serviceAccount:${var.data_service_account}"
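A minimal sketch of both resources under the assumptions above; the variable names (project_name, data_bucket_name, region) are placeholders, so substitute the actual names from variables.tf:

```hcl
resource "google_storage_bucket" "tbd-data-bucket" {
  project                     = var.project_name     # placeholder; see variables.tf
  name                        = var.data_bucket_name # placeholder; see variables.tf
  location                    = var.region           # placeholder; see variables.tf
  uniform_bucket_level_access = false #tfsec:ignore:google-storage-enable-ubla
  force_destroy               = true
  public_access_prevention    = "enforced"
}

resource "google_storage_bucket_iam_member" "tbd-data-bucket-iam-editor" {
  bucket = google_storage_bucket.tbd-data-bucket.name
  role   = "roles/storage.objectUser"
  member = "serviceAccount:${var.data_service_account}"
}
```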
insert the link to the modified file and terraform snippet here
Create a PR from this branch to YOUR master and merge it to make a new release.
place the screenshot from GitHub Actions after successful application of the release with these changes
-
Analyze the Terraform code. Play with terraform plan and terraform graph to investigate the different modules.
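For example (the module name below is illustrative; list the real ones with `terraform state list` or look under /modules):

```bash
# render the dependency graph of the whole configuration (requires Graphviz)
terraform graph | dot -Tpng > graph.png

# limit a plan to a single module to inspect it in isolation
terraform plan -target=module.data-pipeline
```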
describe one selected module and put the output of terraform graph for this module here
-
Reach YARN UI
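The YARN ResourceManager UI is usually served on port 8088 of the cluster master node. One possible way to reach it, assuming SSH tunneling through IAP (the cluster name and zone are placeholders; adjust to your project):

```bash
# forward local port 8088 to the YARN ResourceManager on the Dataproc master
gcloud compute ssh tbd-cluster-m \
  --zone europe-west1-b \
  --tunnel-through-iap \
  -- -L 8088:localhost:8088
```

Then open http://localhost:8088 in a local browser. If the Component Gateway is enabled on the cluster, its link in the Cloud Console is an alternative that needs no tunnel.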
place the port and the screenshot of YARN UI here
-
Draw an architecture diagram (e.g. in draw.io) that includes:
- VPC topology with service assignment to subnets
- Description of the service accounts assigned to the components
- List of buckets for disposal
- Description of the network communication (ports; why it is necessary to specify the driver host) of Apache Spark running from Vertex AI Workbench. In client mode the executors on the cluster open connections back to the driver, so spark.driver.host must be set to an address of the Workbench instance that is routable from the cluster's subnet.
place your diagram here
-
Add usage costs by entering the expected consumption into Infracost.
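One hedged way to do this with the Infracost CLI (the file name is illustrative):

```bash
# generate a usage-file skeleton for the detected resources
infracost breakdown --path . --sync-usage-file --usage-file infracost-usage.yml

# fill in the expected consumption, then produce the estimate
infracost breakdown --path . --usage-file infracost-usage.yml
```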
place the expected consumption you entered here
place the screenshot from infracost output here
-
Some resources are not supported by Infracost yet. Manually estimate the total cost of the infrastructure based on the pricing for the region used in the project. Include the costs of Cloud Composer, Dataproc, and Vertex AI Workbench, and add them to the Infracost estimation.
place your estimation and references here
what are the options for cost optimization?
-
Create a BigQuery dataset and an external table
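A minimal Terraform sketch of the pair, assuming hypothetical variable names (an external table reads files in place from GCS; nothing is ingested into BigQuery storage):

```hcl
resource "google_bigquery_dataset" "demo" {
  project    = var.project_name   # placeholder; see variables.tf
  dataset_id = "demo_dataset"
  location   = var.region
}

resource "google_bigquery_table" "demo_external" {
  project             = var.project_name
  dataset_id          = google_bigquery_dataset.demo.dataset_id
  table_id            = "demo_external"
  deletion_protection = false

  external_data_configuration {
    autodetect    = true   # ORC files carry their schema in the file footer
    source_format = "ORC"
    source_uris   = ["gs://${var.data_bucket_name}/data/*.orc"]   # placeholder path
  }
}
```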
place the code and output here
why does ORC not require a table schema?
-
Start an interactive session from Vertex AI Workbench (steps 7-9 in README):
place the screenshot of notebook here
-
Find and correct the error in spark-job.py
describe the cause and how you found the error
-
Additional tasks using Terraform:
- Add support for arbitrary machine types and worker node counts for the Dataproc cluster and the JupyterLab instance
place the link to the modified file and inserted terraform code
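A hedged sketch of the parametrization (variable and resource names are illustrative, not necessarily the repo's):

```hcl
variable "dataproc_machine_type" {
  type    = string
  default = "e2-standard-2"
}

variable "dataproc_worker_count" {
  type    = number
  default = 2
}

resource "google_dataproc_cluster" "cluster" {
  # ... other arguments as in the repo ...
  cluster_config {
    master_config {
      machine_type = var.dataproc_machine_type
    }
    worker_config {
      num_instances = var.dataproc_worker_count
      machine_type  = var.dataproc_machine_type
    }
  }
}
```

The same pattern applies to the JupyterLab instance: expose a machine type variable and pass it to the google_notebooks_instance resource.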
- Add support for preemptible/spot instances in a Dataproc cluster
place the link to the modified file and inserted terraform code
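A sketch using the provider's preemptible worker block (the SPOT value needs a reasonably recent google provider):

```hcl
resource "google_dataproc_cluster" "cluster" {
  # ...
  cluster_config {
    preemptible_worker_config {
      num_instances  = 2
      preemptibility = "SPOT"   # or "PREEMPTIBLE"
    }
  }
}
```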
- Perform additional hardening of the JupyterLab environment, i.e. disable sudo access and enable secure boot
place the link to the modified file and inserted terraform code
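A sketch of both hardening settings on a user-managed notebook instance; to my knowledge notebook-disable-root is the metadata key that disables root/sudo access, but treat it as an assumption to verify:

```hcl
resource "google_notebooks_instance" "notebook" {
  # ...
  shielded_instance_config {
    enable_secure_boot = true
  }
  metadata = {
    notebook-disable-root = "true"   # assumed key; blocks sudo in JupyterLab
  }
}
```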
- (Optional) Get access to Apache Spark WebUI
place the link to the modified file and inserted terraform code
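One hedged option is Dataproc's Component Gateway, which proxies the YARN and Spark UIs without opening firewall ports:

```hcl
resource "google_dataproc_cluster" "cluster" {
  # ...
  cluster_config {
    endpoint_config {
      enable_http_port_access = true   # exposes web UIs through the Component Gateway
    }
  }
}
```

An SSH tunnel to the master node (as for the YARN UI above) is an alternative.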