cleanup-week1
sejalv committed Jan 15, 2022
1 parent f470a47 commit e2f312d
Showing 48 changed files with 163 additions and 135 deletions.
52 changes: 0 additions & 52 deletions project/terraform/README.md

This file was deleted.

56 changes: 44 additions & 12 deletions week_1_basics_n_setup/1_terraform_gcp/1_terraform_overview.md
@@ -1,18 +1,50 @@
(In Draft mode)

## Terraform Overview

### Concepts

1. Introduction
2. TF state & backend
3. Google Provider as source
* modules/resources: google_storage_bucket, google_bigquery_dataset, google_bigquery_table
4. Code: main, resources, variables, locals, outputs
5. Demo
* GCP CLI client (gcloud) - setup & auth
* tf init, plan & apply
#### Introduction
1. What is Terraform?
* open-source tool by HashiCorp, used for provisioning infrastructure resources
* supports DevOps best practices for change management
* Managing configuration files in source control to maintain an ideal provisioning state
for testing and production environments

2. What is IaC?
* Infrastructure-as-Code
* build, change, and manage your infrastructure in a safe, consistent, and repeatable way
by defining resource configurations that you can version, reuse, and share.

3. Some advantages
* Infrastructure lifecycle management
* Version control commits
* Very useful for stack-based deployments with cloud providers such as AWS, GCP, and Azure, as well as platforms like Kubernetes
* State-based approach to track resource changes throughout deployments


#### Files
* `main.tf`: core configuration (providers and resources)
* `variables.tf`: input variables
* Optional: `resources.tf`, `output.tf`
* `.tfstate`: state file generated by Terraform to track deployed resources

A minimal sketch of these files follows below.
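For instance, assuming this project's `google_storage_bucket.data-lake-bucket` resource is declared in `main.tf`, the companion files might look like this (the variable and output names are illustrative, not the course's exact code):

```hcl
# variables.tf — inputs supplied at plan/apply time via -var or a .tfvars file
variable "project" {
  description = "GCP project ID to deploy into"
  type        = string
}

# output.tf (optional) — values printed after `terraform apply`
output "bucket_url" {
  value = google_storage_bucket.data-lake-bucket.url
}
```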

#### Declarations
* `terraform`: top-level settings block
  * `backend`: where Terraform keeps its state
* `provider`:
  * adds a set of resource types and/or data sources that Terraform can manage
  * the Terraform Registry is the main directory of publicly available providers for most major infrastructure platforms
* `resource`
  * blocks that define the components of your infrastructure
  * Project modules/resources: google_storage_bucket, google_bigquery_dataset, google_bigquery_table
* `variable` & `locals`: runtime inputs and reusable internal values

These blocks are sketched together below.
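A hedged sketch of these declarations in a `main.tf`, assuming a local backend and placeholder names (the locals and bucket naming are illustrative, not necessarily the project's exact values):

```hcl
terraform {
  required_version = ">= 1.0"
  backend "local" {}  # state kept in a local .tfstate file; "gcs" would store it remotely

  required_providers {
    google = {
      source = "hashicorp/google"
    }
  }
}

provider "google" {
  project = var.project
  region  = var.region
}

locals {
  data_lake_bucket = "dtc_data_lake"
}

resource "google_storage_bucket" "data-lake-bucket" {
  name     = "${local.data_lake_bucket}_${var.project}"  # GCS bucket names must be globally unique
  location = var.region
}
```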


#### Execution steps
1. `terraform init`: Initialize & install
2. `terraform plan`: Match changes against the previous state
3. `terraform apply`: Apply changes to cloud
4. `terraform destroy`: Remove your stack from cloud


### Workshop
Continue [here](../../project/terraform): `data-engineering-zoomcamp/project/terraform`
### Terraform Workshop for GCP Infra
Continue [here](../terraform): `week_1_basics_n_setup/1_terraform_gcp/terraform`
43 changes: 38 additions & 5 deletions week_1_basics_n_setup/1_terraform_gcp/2_gcp_overview.md
@@ -1,9 +1,42 @@
(In Draft mode)

## GCP Overview

## Tools & Tech
- Cloud Storage
- BigQuery
### Project infrastructure modules in GCP:
* Google Cloud Storage (GCS): Data Lake
* BigQuery: Data Warehouse

(Concepts explained in Week 2 - Data Ingestion)

### Initial Setup

For this course, we'll use the free trial (up to EUR 300 in credits).

1. Create an account with your Google email ID
2. Set up your first [project](https://console.cloud.google.com/)
    * e.g. "DTC DE Course", and note down the "Project ID"
3. Set up a [service account & authentication](https://cloud.google.com/docs/authentication/getting-started) for this project
    * Grant the `Viewer` role to begin with.
    * Download the service-account keys (.json) for authentication.
4. Download the [SDK](https://cloud.google.com/sdk/docs/quickstart) for local setup
5. Set the environment variable to point to your downloaded GCP keys:
```shell
export GOOGLE_APPLICATION_CREDENTIALS="<path/to/your/service-account-authkeys>.json"

# Refresh token, and verify authentication
gcloud auth application-default login
```
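With these credentials in place, the Terraform Google provider can pick them up automatically via application-default credentials; a minimal sketch (the commented `credentials` argument and `var.credentials_file` are optional assumptions, not required by the course setup):

```hcl
provider "google" {
  project = var.project
  region  = var.region
  # Optional: point at the key file explicitly instead of relying on
  # GOOGLE_APPLICATION_CREDENTIALS / application-default credentials.
  # credentials = file(var.credentials_file)
}
```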

### Setup for Access

1. [IAM Roles](https://cloud.google.com/storage/docs/access-control/iam-roles) for Service account:

Viewer + Storage Admin + Storage Object Admin + BigQuery Admin

2. Enable these APIs for your project:
* https://console.cloud.google.com/apis/library/iam.googleapis.com
* https://console.cloud.google.com/apis/library/iamcredentials.googleapis.com

3. Please ensure `GOOGLE_APPLICATION_CREDENTIALS` env-var is set.
```shell
export GOOGLE_APPLICATION_CREDENTIALS="<path/to/your/service-account-authkeys>.json"
```
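For reference, the same role bindings could also be declared in Terraform instead of the console; a hedged sketch, not part of the course's Terraform code, with a placeholder service-account email:

```hcl
# Illustrative only: binds the Storage Admin role to the course service account.
resource "google_project_iam_member" "sa_storage_admin" {
  project = var.project
  role    = "roles/storage.admin"
  member  = "serviceAccount:your-sa@your-project-id.iam.gserviceaccount.com"  # placeholder
}
```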

31 changes: 11 additions & 20 deletions week_1_basics_n_setup/1_terraform_gcp/README.md
@@ -1,26 +1,17 @@
(In Draft mode)

## Local Setup for Terraform and GCP

### Terraform

Installation: https://www.terraform.io/downloads
### Pre-Requisites
1. Terraform client installation: https://www.terraform.io/downloads
2. Cloud Provider account: https://console.cloud.google.com/

### GCP
### Terraform Concepts
[Terraform Overview](1_terraform_overview.md)

For this course, we'll use a free version (upto EUR 300 credits).
### GCP setup

1. Create an account with your Google email ID
2. Setup your first [project](https://console.cloud.google.com/), eg. "DTC DE Course", and note down the "Project ID"
3. Setup [service account & authentication](https://cloud.google.com/docs/authentication/getting-started) for this project, and download auth-keys (.json).
4. Download [SDK](https://cloud.google.com/sdk/docs/quickstart) for local setup
5. Set environment variable to point to your downloaded GCP auth-keys:
```shell
export GOOGLE_APPLICATION_CREDENTIALS="<path/to/your/service-account-authkeys>.json"

# Refresh token, and verify authentication
gcloud auth application-default login
```
1. [Setup for first-time use](2_gcp_overview.md#initial-setup)
2. [IAM / access specific to this course](2_gcp_overview.md#setup-for-access)

### Workshop
Continue [here](../../project/terraform): `data-engineering-zoomcamp/project/terraform`
### Terraform Workshop for GCP Infra
Continue [here](terraform).
`week_1_basics_n_setup/1_terraform_gcp/terraform`
File renamed without changes.
23 changes: 23 additions & 0 deletions week_1_basics_n_setup/1_terraform_gcp/terraform/README.md
@@ -0,0 +1,23 @@

### Execution

```shell
# Refresh service-account's auth-token for this session
gcloud auth application-default login

# Initialize state file (.tfstate)
terraform init

# Review the planned infra changes against the current state
terraform plan -var="project=<your-project-id>"
```

```shell
# Create new infra
terraform apply -var="project=<your-project-id>"
```

```shell
# Delete infra after your work, to avoid costs on any running services
terraform destroy
```
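The `-var="project=<your-project-id>"` flag supplies the `project` input variable; a sketch of the declaration it maps to (the description text is an assumption, and the exact wording in this project's `variables.tf` may differ):

```hcl
variable "project" {
  description = "Your GCP project ID"
  type        = string
}
```

Because no default is set, `terraform plan` and `terraform apply` will prompt for the value if `-var` is omitted.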
@@ -40,25 +40,10 @@ resource "google_storage_bucket" "data-lake-bucket" {
  force_destroy = true
}

// In-Progress
//
//# DWH
//# Ref: https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/bigquery_dataset
//resource "google_bigquery_dataset" "dataset" {
// dataset_id = var.BQ_DATASET
//}
//
//# May not be needed if covered by DBT
//resource "google_bigquery_table" "table" {
// dataset_id = google_bigquery_dataset.dw.dataset_id
// table_id = var.TABLE_NAME[count.index]
// count = length(var.TABLE_NAME)
//
// external_data_configuration {
// autodetect = true
// source_format = "CSV"
// source_uris = [
// "gs://${var.BUCKET_NAME}/dw/${var.TABLE_NAME[count.index]}/*.csv"
// ]
// }
//}
# DWH
# Ref: https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/bigquery_dataset
resource "google_bigquery_dataset" "dataset" {
dataset_id = var.BQ_DATASET
project = var.project
location = var.region
}
@@ -12,13 +12,13 @@ variable "region" {
  type = string
}

# Not needed for now
variable "bucket_name" {
  description = "The name of the Google Cloud Storage bucket. Must be globally unique."
  default     = ""
}

variable "storage_class" {
  description = "Storage class type for your bucket. Check official docs for more info."
  default     = "STANDARD"
}

variable "BQ_DATASET" {
  description = "BigQuery Dataset that raw data (from GCS) will be written to"
  type        = string
  default     = "trips_data_all"
}

This file was deleted.

3 changes: 0 additions & 3 deletions week_1_basics_n_setup/2_docker_airflow/README.md

This file was deleted.

3 changes: 3 additions & 0 deletions week_1_basics_n_setup/2_docker_postgres_sql/README.md
@@ -0,0 +1,3 @@
(In Draft mode)

## Setup Postgres Env with Docker
File renamed without changes.
File renamed without changes.
@@ -24,7 +24,7 @@
* Remove the `image` tag in `x-airflow-common`, and replace it with your `build` from your Dockerfile.
* Change `AIRFLOW__CORE__LOAD_EXAMPLES` to `false` (optional)

8. Here's how the final versions of your [Dockerfile](./Dockerfile) and [docker-compose.yml](./docker-compose.yaml) should look.
7. Here's how the final versions of your [Dockerfile](./Dockerfile) and [docker-compose.yml](./docker-compose.yaml) should look.


### Execution
@@ -6,6 +6,8 @@
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
from google.cloud import storage
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator
from airflow.providers.google.cloud.operators.bigquery import BigQueryCreateExternalTableOperator

PROJECT_ID = os.environ.get("GCP_PROJECT_ID", "pivotal-surfer-336713")
BUCKET = os.environ.get("GCP_GCS_BUCKET", "dtc_data_lake_pivotal-surfer-336713")
@@ -15,15 +17,11 @@
path_to_local_home = os.environ.get("AIRFLOW_HOME", "/opt/airflow/")
# path_to_creds = f"{path_to_local_home}/google_credentials.json"

default_args = {
    "owner": "airflow",
    "start_date": days_ago(1),
    "depends_on_past": False,
    "retries": 1,
}
DATASET_NAME = os.environ.get("GCP_DATASET_NAME", 'ny_trips_from_dag')
TABLE_NAME = os.environ.get("GCP_TABLE_NAME", 'trips_data_all')


# Takes 15-20 mins to run. Good case for using Spark (distributed processing, in place of chunks)
# NOTE: takes 20 mins, at an upload speed of 800kbps. Faster if your internet has a better upload speed
def upload_to_gcs(bucket, object_name, local_file):
"""
Ref: https://cloud.google.com/storage/docs/uploading-objects#storage-upload-object-python
@@ -44,23 +42,29 @@ def upload_to_gcs(bucket, object_name, local_file):
    blob.upload_from_filename(local_file)


default_args = {
    "owner": "airflow",
    "start_date": days_ago(1),
    "depends_on_past": False,
    "retries": 1,
}

with DAG(
    dag_id="data_ingestion_gcs_dag",
    schedule_interval="@daily",
    default_args=default_args,
    catchup=True,
    catchup=False,
    max_active_runs=1,
    tags=['example'],
) as dag:

    # Takes ~2 mins, depending upon your internet's download speed
    download_dataset_task = BashOperator(
        task_id="download_dataset_task",
        bash_command=f"curl -sS {dataset_url} > {path_to_local_home}/{dataset_file}"
    )

    # NOTE: takes 20 mins, at an upload speed of 800kbps. Faster if your internet has a better upload speed
    upload_to_gcs_task = PythonOperator(
        task_id="upload_to_gcs_task",
    local_to_gcs_task = PythonOperator(
        task_id="local_to_gcs_task",
        python_callable=upload_to_gcs,
        op_kwargs={
            "bucket": BUCKET,
@@ -70,4 +74,17 @@ def upload_to_gcs(bucket, object_name, local_file):
        },
    )

    download_dataset_task >> upload_to_gcs_task
    # gcs_to_bq_task = GCSToBigQueryOperator(
    #     task_id='gcs_to_bq_task',
    #     bucket=BUCKET,
    #     source_objects=[f"raw/{dataset_file}"],
    #     destination_project_dataset_table=f"{DATASET_NAME}.{TABLE_NAME}",
    #     # schema_fields=[
    #     #     {'name': 'name', 'type': 'STRING', 'mode': 'NULLABLE'},
    #     #     {'name': 'post_abbr', 'type': 'STRING', 'mode': 'NULLABLE'},
    #     # ],
    #     write_disposition='WRITE_TRUNCATE',
    # )


    download_dataset_task >> local_to_gcs_task # >> gcs_to_bq_task
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
Empty file.
Empty file.
Empty file.
