cleanup-week1
sejalv committed Jan 15, 2022
1 parent f470a47 commit e2f312d
Showing 48 changed files with 163 additions and 135 deletions.
52 changes: 0 additions & 52 deletions project/terraform/README.md

This file was deleted.

56 changes: 44 additions & 12 deletions week_1_basics_n_setup/1_terraform_gcp/1_terraform_overview.md
@@ -1,18 +1,50 @@
(In Draft mode)

## Terraform Overview

### Concepts

1. Introduction
2. TF state & backend
3. Google Provider as source
* modules/resources: google_storage_bucket, google_bigquery_dataset, google_bigquery_table
4. Code: main, resources, variables, locals, outputs
5. Demo
* GCP CLI client (gcloud) - setup & auth
* tf init, plan & apply
#### Introduction
1. What is Terraform?
* open-source tool by HashiCorp, used for provisioning infrastructure resources
* supports DevOps best practices for change management
* Managing configuration files in source control to maintain an ideal provisioning state
for testing and production environments

2. What is IaC?
* Infrastructure-as-Code
* build, change, and manage your infrastructure in a safe, consistent, and repeatable way
by defining resource configurations that you can version, reuse, and share.

3. Some advantages
* Infrastructure lifecycle management
* Version control commits
* Very useful for stack-based deployments with cloud providers such as AWS, GCP, and Azure, as well as platforms like Kubernetes
* State-based approach to track resource changes throughout deployments


#### Files
* `main.tf`: core configuration (providers and resources)
* `variables.tf`: input variables
* Optional: `resources.tf`, `output.tf`
* `.tfstate`: state file generated by Terraform to track deployed resources

A minimal sketch of these files follows below.
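For instance, assuming this project's `google_storage_bucket.data-lake-bucket` resource is declared in `main.tf`, the companion files might look like this (the variable and output names are illustrative, not the course's exact code):

```hcl
# variables.tf — inputs supplied at plan/apply time via -var or a .tfvars file
variable "project" {
  description = "GCP project ID to deploy into"
  type        = string
}

# output.tf (optional) — values printed after `terraform apply`
output "bucket_url" {
  value = google_storage_bucket.data-lake-bucket.url
}
```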

#### Declarations
* `terraform`: top-level settings block
  * `backend`: where Terraform keeps its state
* `provider`:
  * adds a set of resource types and/or data sources that Terraform can manage
  * the Terraform Registry is the main directory of publicly available providers for most major infrastructure platforms
* `resource`
  * blocks that define the components of your infrastructure
  * Project modules/resources: google_storage_bucket, google_bigquery_dataset, google_bigquery_table
* `variable` & `locals`: runtime inputs and reusable internal values

These blocks are sketched together below.
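A hedged sketch of these declarations in a `main.tf`, assuming a local backend and placeholder names (the locals and bucket naming are illustrative, not necessarily the project's exact values):

```hcl
terraform {
  required_version = ">= 1.0"
  backend "local" {}  # state kept in a local .tfstate file; "gcs" would store it remotely

  required_providers {
    google = {
      source = "hashicorp/google"
    }
  }
}

provider "google" {
  project = var.project
  region  = var.region
}

locals {
  data_lake_bucket = "dtc_data_lake"
}

resource "google_storage_bucket" "data-lake-bucket" {
  name     = "${local.data_lake_bucket}_${var.project}"  # GCS bucket names must be globally unique
  location = var.region
}
```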


#### Execution steps
1. `terraform init`: Initialize & install
2. `terraform plan`: Match changes against the previous state
3. `terraform apply`: Apply changes to cloud
4. `terraform destroy`: Remove your stack from cloud


### Workshop
Continue [here](../../project/terraform): `data-engineering-zoomcamp/project/terraform`
### Terraform Workshop for GCP Infra
Continue [here](../terraform): `week_1_basics_n_setup/1_terraform_gcp/terraform`
43 changes: 38 additions & 5 deletions week_1_basics_n_setup/1_terraform_gcp/2_gcp_overview.md
@@ -1,9 +1,42 @@
(In Draft mode)

## GCP Overview

## Tools & Tech
- Cloud Storage
- BigQuery
### Project infrastructure modules in GCP:
* Google Cloud Storage (GCS): Data Lake
* BigQuery: Data Warehouse

(Concepts explained in Week 2 - Data Ingestion)

### Initial Setup

For this course, we'll use the free trial (up to EUR 300 in credits).

1. Create an account with your Google email ID
2. Set up your first [project](https://console.cloud.google.com/)
    * e.g. "DTC DE Course", and note down the "Project ID"
3. Set up a [service account & authentication](https://cloud.google.com/docs/authentication/getting-started) for this project
    * Grant the `Viewer` role to begin with.
    * Download the service-account keys (.json) for authentication.
4. Download the [SDK](https://cloud.google.com/sdk/docs/quickstart) for local setup
5. Set the environment variable to point to your downloaded GCP keys:
```shell
export GOOGLE_APPLICATION_CREDENTIALS="<path/to/your/service-account-authkeys>.json"

# Refresh token, and verify authentication
gcloud auth application-default login
```
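With these credentials in place, the Terraform Google provider can pick them up automatically via application-default credentials; a minimal sketch (the commented `credentials` argument and `var.credentials_file` are optional assumptions, not required by the course setup):

```hcl
provider "google" {
  project = var.project
  region  = var.region
  # Optional: point at the key file explicitly instead of relying on
  # GOOGLE_APPLICATION_CREDENTIALS / application-default credentials.
  # credentials = file(var.credentials_file)
}
```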

### Setup for Access

1. [IAM Roles](https://cloud.google.com/storage/docs/access-control/iam-roles) for Service account:

Viewer + Storage Admin + Storage Object Admin + BigQuery Admin

2. Enable these APIs for your project:
* https://console.cloud.google.com/apis/library/iam.googleapis.com
* https://console.cloud.google.com/apis/library/iamcredentials.googleapis.com

3. Please ensure `GOOGLE_APPLICATION_CREDENTIALS` env-var is set.
```shell
export GOOGLE_APPLICATION_CREDENTIALS="<path/to/your/service-account-authkeys>.json"
```
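For reference, the same role bindings could also be declared in Terraform instead of the console; a hedged sketch, not part of the course's Terraform code, with a placeholder service-account email:

```hcl
# Illustrative only: binds the Storage Admin role to the course service account.
resource "google_project_iam_member" "sa_storage_admin" {
  project = var.project
  role    = "roles/storage.admin"
  member  = "serviceAccount:your-sa@your-project-id.iam.gserviceaccount.com"  # placeholder
}
```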

31 changes: 11 additions & 20 deletions week_1_basics_n_setup/1_terraform_gcp/README.md
@@ -1,26 +1,17 @@
(In Draft mode)

## Local Setup for Terraform and GCP

### Terraform

Installation: https://www.terraform.io/downloads
### Pre-Requisites
1. Terraform client installation: https://www.terraform.io/downloads
2. Cloud Provider account: https://console.cloud.google.com/

### GCP
### Terraform Concepts
[Terraform Overview](1_terraform_overview.md)

For this course, we'll use a free version (upto EUR 300 credits).
### GCP setup

1. Create an account with your Google email ID
2. Setup your first [project](https://console.cloud.google.com/), eg. "DTC DE Course", and note down the "Project ID"
3. Setup [service account & authentication](https://cloud.google.com/docs/authentication/getting-started) for this project, and download auth-keys (.json).
4. Download [SDK](https://cloud.google.com/sdk/docs/quickstart) for local setup
5. Set environment variable to point to your downloaded GCP auth-keys:
```shell
export GOOGLE_APPLICATION_CREDENTIALS="<path/to/your/service-account-authkeys>.json"

# Refresh token, and verify authentication
gcloud auth application-default login
```
1. [Setup for first-time use](2_gcp_overview.md#initial-setup)
2. [IAM / access specific to this course](2_gcp_overview.md#setup-for-access)

### Workshop
Continue [here](../../project/terraform): `data-engineering-zoomcamp/project/terraform`
### Terraform Workshop for GCP Infra
Continue [here](terraform).
`week_1_basics_n_setup/1_terraform_gcp/terraform`
File renamed without changes.
23 changes: 23 additions & 0 deletions week_1_basics_n_setup/1_terraform_gcp/terraform/README.md
@@ -0,0 +1,23 @@

### Execution

```shell
# Refresh service-account's auth-token for this session
gcloud auth application-default login

# Initialize state file (.tfstate)
terraform init

# Review the planned infra changes against the current state
terraform plan -var="project=<your-project-id>"
```

```shell
# Create new infra
terraform apply -var="project=<your-project-id>"
```

```shell
# Delete infra after your work, to avoid costs on any running services
terraform destroy
```
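The `-var="project=<your-project-id>"` flag supplies the `project` input variable; a sketch of the declaration it maps to (the description text is an assumption, and the exact wording in this project's `variables.tf` may differ):

```hcl
variable "project" {
  description = "Your GCP project ID"
  type        = string
}
```

Because no default is set, `terraform plan` and `terraform apply` will prompt for the value if `-var` is omitted.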
@@ -40,25 +40,10 @@ resource "google_storage_bucket" "data-lake-bucket" {
  force_destroy = true
}

// In-Progress
//
//# DWH
//# Ref: https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/bigquery_dataset
//resource "google_bigquery_dataset" "dataset" {
// dataset_id = var.BQ_DATASET
//}
//
//# May not be needed if covered by DBT
//resource "google_bigquery_table" "table" {
// dataset_id = google_bigquery_dataset.dw.dataset_id
// table_id = var.TABLE_NAME[count.index]
// count = length(var.TABLE_NAME)
//
// external_data_configuration {
// autodetect = true
// source_format = "CSV"
// source_uris = [
// "gs://${var.BUCKET_NAME}/dw/${var.TABLE_NAME[count.index]}/*.csv"
// ]
// }
//}
# DWH
# Ref: https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/bigquery_dataset
resource "google_bigquery_dataset" "dataset" {
dataset_id = var.BQ_DATASET
project = var.project
location = var.region
}
@@ -12,13 +12,13 @@ variable "region" {
  type = string
}

# Not needed for now
variable "bucket_name" {
  description = "The name of the Google Cloud Storage bucket. Must be globally unique."
  default     = ""
}

variable "storage_class" {
  description = "Storage class type for your bucket. Check official docs for more info."
  default     = "STANDARD"
}

variable "BQ_DATASET" {
  description = "BigQuery Dataset that raw data (from GCS) will be written to"
  type        = string
  default     = "trips_data_all"
}

This file was deleted.

3 changes: 0 additions & 3 deletions week_1_basics_n_setup/2_docker_airflow/README.md

This file was deleted.

3 changes: 3 additions & 0 deletions week_1_basics_n_setup/2_docker_postgres_sql/README.md
@@ -0,0 +1,3 @@
(In Draft mode)

## Setup Postgres Env with Docker
File renamed without changes.
File renamed without changes.
@@ -24,7 +24,7 @@
* Remove the `image` tag in `x-airflow-common`, and replace it with your `build` from your Dockerfile.
* Change `AIRFLOW__CORE__LOAD_EXAMPLES` to `false` (optional)

8. Here's how the final versions of your [Dockerfile](./Dockerfile) and [docker-compose.yml](./docker-compose.yaml) should look.
7. Here's how the final versions of your [Dockerfile](./Dockerfile) and [docker-compose.yml](./docker-compose.yaml) should look.


### Execution
@@ -6,6 +6,8 @@
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
from google.cloud import storage
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator
from airflow.providers.google.cloud.operators.bigquery import BigQueryCreateExternalTableOperator

PROJECT_ID = os.environ.get("GCP_PROJECT_ID", "pivotal-surfer-336713")
BUCKET = os.environ.get("GCP_GCS_BUCKET", "dtc_data_lake_pivotal-surfer-336713")
@@ -15,15 +17,11 @@
path_to_local_home = os.environ.get("AIRFLOW_HOME", "/opt/airflow/")
# path_to_creds = f"{path_to_local_home}/google_credentials.json"

default_args = {
    "owner": "airflow",
    "start_date": days_ago(1),
    "depends_on_past": False,
    "retries": 1,
}
DATASET_NAME = os.environ.get("GCP_DATASET_NAME", 'ny_trips_from_dag')
TABLE_NAME = os.environ.get("GCP_TABLE_NAME", 'trips_data_all')


# Takes 15-20 mins to run. Good case for using Spark (distributed processing, in place of chunks)
# NOTE: takes 20 mins, at an upload speed of 800kbps. Faster if your internet has a better upload speed
def upload_to_gcs(bucket, object_name, local_file):
"""
Ref: https://cloud.google.com/storage/docs/uploading-objects#storage-upload-object-python
@@ -44,23 +42,29 @@ def upload_to_gcs(bucket, object_name, local_file):
    blob.upload_from_filename(local_file)


default_args = {
    "owner": "airflow",
    "start_date": days_ago(1),
    "depends_on_past": False,
    "retries": 1,
}

with DAG(
    dag_id="data_ingestion_gcs_dag",
    schedule_interval="@daily",
    default_args=default_args,
    catchup=True,
    catchup=False,
    max_active_runs=1,
    tags=['example'],
) as dag:

    # Takes ~2 mins, depending upon your internet's download speed
    download_dataset_task = BashOperator(
        task_id="download_dataset_task",
        bash_command=f"curl -sS {dataset_url} > {path_to_local_home}/{dataset_file}"
    )

    # NOTE: takes 20 mins, at an upload speed of 800kbps. Faster if your internet has a better upload speed
    upload_to_gcs_task = PythonOperator(
        task_id="upload_to_gcs_task",
    local_to_gcs_task = PythonOperator(
        task_id="local_to_gcs_task",
        python_callable=upload_to_gcs,
        op_kwargs={
            "bucket": BUCKET,
@@ -70,4 +74,17 @@ def upload_to_gcs(bucket, object_name, local_file):
        },
    )

    download_dataset_task >> upload_to_gcs_task
    # gcs_to_bq_task = GCSToBigQueryOperator(
    #     task_id='gcs_to_bq_task',
    #     bucket=BUCKET,
    #     source_objects=[f"raw/{dataset_file}"],
    #     destination_project_dataset_table=f"{DATASET_NAME}.{TABLE_NAME}",
    #     # schema_fields=[
    #     #     {'name': 'name', 'type': 'STRING', 'mode': 'NULLABLE'},
    #     #     {'name': 'post_abbr', 'type': 'STRING', 'mode': 'NULLABLE'},
    #     # ],
    #     write_disposition='WRITE_TRUNCATE',
    # )


    download_dataset_task >> local_to_gcs_task # >> gcs_to_bq_task
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
Empty file.
Empty file.
Empty file.
