README-phase-2b.md

  1. The goal of phase 2b is to run benchmarking and scalability tests of a sample three-tier lakehouse solution.

  2. In main.tf, change machine_type at:

module "dataproc" {
  depends_on   = [module.vpc]
  source       = "github.com/bdg-tbd/tbd-workshop-1.git?ref=v1.0.36/modules/dataproc"
  project_name = var.project_name
  region       = var.region
  subnet       = module.vpc.subnets[local.notebook_subnet_id].id
  machine_type = "e2-standard-2"
}

and replace "e2-standard-2" with "e2-standard-4".

  3. If needed, request an increase of the CPU quotas (e.g. to 30 CPUs): https://console.cloud.google.com/apis/api/compute.googleapis.com/quotas?project=tbd-2023z-9918

  4. Using the tbd-tpc-di notebook, perform dbt run with different numbers of executors (1, 2, and 5) by changing:

 "spark.executor.instances": "2"

in profiles.yml.
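
Switching the executor count by hand for every run is easy to get wrong, so the edit can be scripted from the notebook. The following is a minimal sketch, assuming the setting appears in profiles.yml exactly in the quoted form shown above; the helper name `set_executor_instances` and the sample fragment are illustrative, not part of the workshop code.

```python
import re

# Matches the quoted key/value pair as it appears in this README.
EXECUTORS_RE = re.compile(r'("spark\.executor\.instances"\s*:\s*)"\d+"')


def set_executor_instances(profile_text, n):
    """Return profile_text with spark.executor.instances set to n."""
    new_text, count = EXECUTORS_RE.subn(
        lambda m: f'{m.group(1)}"{n}"', profile_text
    )
    if count == 0:
        # Fail loudly instead of silently benchmarking the wrong setup.
        raise ValueError("spark.executor.instances not found in profile text")
    return new_text


# Illustrative fragment only; a real profiles.yml contains more keys.
sample_fragment = '  "spark.executor.instances": "2"\n'
updated_fragment = set_executor_instances(sample_fragment, 5)
print(updated_fragment)
```

In practice the text would be read from and written back to the actual profiles.yml before each dbt run; the file location depends on the notebook environment.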

  5. In the notebook, collect the console output from dbt run, then parse it to retrieve the total execution time and the execution time of each model. Save the results for each number of executors.
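
The parsing step can be sketched with regular expressions. This is a hedged example, assuming dbt prints per-model lines ending in `[OK in <seconds>s]` and a summary line ending in `(<seconds>s).`; the exact log format can differ between dbt versions, and the model names and timings in `sample_output` are made up for illustration.

```python
import re

# Strip ANSI color codes that dbt may emit in console output.
ANSI_RE = re.compile(r"\x1b\[[0-9;]*m")
# Assumed per-model line: "... OK created sql table model <name> ... [OK in 9.32s]"
MODEL_RE = re.compile(r"OK created .*?model\s+(\S+)\s.*\[OK in ([\d.]+)s\]")
# Assumed summary line: "Finished running ... (21.50s)."
TOTAL_RE = re.compile(r"Finished running .*\(([\d.]+)s\)")


def parse_dbt_output(text):
    """Return (total_seconds, {model_name: seconds}) from dbt run output."""
    model_times = {}
    total = None
    for raw_line in text.splitlines():
        line = ANSI_RE.sub("", raw_line)
        model_match = MODEL_RE.search(line)
        if model_match:
            model_times[model_match.group(1)] = float(model_match.group(2))
            continue
        total_match = TOTAL_RE.search(line)
        if total_match:
            total = float(total_match.group(1))
    return total, model_times


# Hypothetical output, not taken from a real run.
sample_output = """\
12:00:01  1 of 2 START sql table model demo_bronze.daily_market ........ [RUN]
12:00:10  1 of 2 OK created sql table model demo_bronze.daily_market ... [OK in 9.32s]
12:00:10  2 of 2 START sql table model demo_silver.trades .............. [RUN]
12:00:22  2 of 2 OK created sql table model demo_silver.trades ......... [OK in 12.05s]
12:00:22  Finished running 2 table models in 0 hours 0 minutes and 21.50 seconds (21.50s).
"""

total_seconds, per_model = parse_dbt_output(sample_output)
print(total_seconds, per_model)
```

Storing one such result per executor count (for example in a dictionary keyed by the number of executors) keeps the later comparison straightforward.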

  6. Analyze the performance and scalability of the execution times of each model. Visualize and discuss the final results.
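
One common way to quantify scalability is speedup, S(n) = T(1) / T(n), and parallel efficiency, E(n) = S(n) / n, where T(n) is the execution time with n executors. The sketch below computes both from total run times; the function name `scalability_table` is illustrative, and the numbers in `example_times` are placeholders, not measurements.

```python
def scalability_table(times_by_executors):
    """Return rows of (executors, seconds, speedup, efficiency).

    The run with the fewest executors is used as the baseline.
    """
    baseline_n = min(times_by_executors)
    baseline_time = times_by_executors[baseline_n]
    rows = []
    for n in sorted(times_by_executors):
        seconds = times_by_executors[n]
        speedup = baseline_time / seconds
        # Efficiency of 1.0 would mean perfectly linear scaling.
        efficiency = speedup / (n / baseline_n)
        rows.append((n, seconds, speedup, efficiency))
    return rows


# PLACEHOLDER values for illustration only; replace with parsed results.
example_times = {1: 600.0, 2: 330.0, 5: 200.0}

rows = scalability_table(example_times)
for n, seconds, speedup, efficiency in rows:
    print(f"{n} executors: {seconds:.1f}s, speedup {speedup:.2f}, efficiency {efficiency:.2f}")
```

The same function can be applied per model by passing that model's times for each executor count, which helps show which models benefit from more executors and which do not.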