
The goal of phase 2b is to run benchmarking and scalability tests on a sample three-tier lakehouse solution.
In main.tf, change machine_type at:

module "dataproc" {
  depends_on   = [module.vpc]
  source       = "github.com/bdg-tbd/tbd-workshop-1.git?ref=v1.0.36/modules/dataproc"
  project_name = var.project_name
  region       = var.region
  subnet       = module.vpc.subnets[local.notebook_subnet_id].id
  machine_type = "e2-standard-2"
}

and substitute "e2-standard-2" with "e2-standard-4".

If needed, request an increase of the CPU quotas (e.g. to 30 CPUs): https://console.cloud.google.com/apis/api/compute.googleapis.com/quotas?project=tbd-2023z-9918
Using the tbd-tpc-di notebook, perform dbt run with different numbers of executors, i.e., 1, 2, and 5, by changing:

 "spark.executor.instances": "2"

in profiles.yml.
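
To avoid hand-editing the file between runs, the switch can be scripted. The following is a minimal sketch, assuming the setting appears in profiles.yml in exactly the quoted form shown above and that dbt run, started from the notebook's working directory, picks up that file. The helper names and the profiles_path argument are hypothetical and not part of the workshop code.

```python
import re
import subprocess
from pathlib import Path

# Matches the quoted key shown above, followed by a quoted integer value.
_SETTING_RE = re.compile(r'("spark\.executor\.instances"\s*:\s*)"\d+"')


def set_executor_instances(profile_text: str, n: int) -> str:
    """Return profile_text with spark.executor.instances set to n."""
    new_text, count = _SETTING_RE.subn(
        lambda m: f'{m.group(1)}"{n}"', profile_text
    )
    if count == 0:
        raise ValueError("spark.executor.instances not found in profile text")
    return new_text


def run_dbt_with_executors(profiles_path: str, n: int) -> str:
    """Rewrite the profile (hypothetical path), run dbt, return console output."""
    path = Path(profiles_path)
    path.write_text(set_executor_instances(path.read_text(), n))
    # Assumption: the dbt CLI is on PATH and the current directory is the dbt project.
    result = subprocess.run(["dbt", "run"], capture_output=True, text=True)
    return result.stdout
```

The text rewrite is kept separate from the subprocess call so it can be checked on a string before any file is touched.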

In the notebook, collect the console output from dbt run, then parse it to retrieve the total execution time and the execution time of each model. Save the results separately for each number of executors.
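
The parsing step could look roughly like the sketch below. It assumes the dbt 1.x console format, in which each finished model is reported on a line ending in "[OK in 9.12s]" and the run ends with a "Finished running ... (599.12s)." line. If your dbt or adapter version prints different text, adjust the patterns. The function name parse_dbt_output is hypothetical.

```python
import re

# Assumed per-model line, e.g.:
# "12:00:10  1 of 43 OK created sql table model demo.trades ...... [OK in 9.12s]"
MODEL_RE = re.compile(
    r"OK created .*?model\s+(\S+)\s+\.*\s*\[[^\]]*?in ([\d.]+)s\]"
)
# Assumed summary line, e.g.:
# "Finished running 43 table models in 0 hours 9 minutes and 59.12 seconds (599.12s)."
TOTAL_RE = re.compile(r"Finished running .*\(([\d.]+)s\)")


def parse_dbt_output(text):
    """Return (total_seconds, {model_name: seconds}) from dbt run console output."""
    model_times = {}
    total = None
    for line in text.splitlines():
        model_match = MODEL_RE.search(line)
        if model_match:
            model_times[model_match.group(1)] = float(model_match.group(2))
            continue
        total_match = TOTAL_RE.search(line)
        if total_match:
            total = float(total_match.group(1))
    return total, model_times
```

The returned dictionary can be stored under its executor count (for example in a dict keyed by 1, 2, and 5, or as one CSV file per run) so that the runs can be compared later. If the summary line is missing, total stays None.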
Analyze the performance and scalability of the execution times of each model. Visualize and discuss the final results.
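
One common way to quantify scalability is speedup relative to the smallest executor count, plus parallel efficiency (speedup divided by the relative increase in executors). The sketch below is a suggestion, not a prescribed method. The function name and the input shape (a dict mapping executor count to seconds) are assumptions.

```python
def speedup_and_efficiency(times_by_executors):
    """Map executor count -> (speedup, efficiency) relative to the smallest count.

    times_by_executors: dict such as {1: t1, 2: t2, 5: t5}, with times in seconds.
    """
    base_n = min(times_by_executors)
    base_time = times_by_executors[base_n]
    result = {}
    for n in sorted(times_by_executors):
        speedup = base_time / times_by_executors[n]
        # Efficiency of 1.0 means the time shrank in proportion to the added executors.
        efficiency = speedup / (n / base_n)
        result[n] = (speedup, efficiency)
    return result
```

The function can be applied once to the total run times and once per model. Models whose efficiency drops quickly as executors are added are the ones that scale poorly and deserve a closer look in the discussion.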