Week 2 - Document

Data Lake Introduction

  • Data lake vs Data warehouse
    • a data lake is less organized than a warehouse (no need to define relationships or a schema up front)
    • faster and cheaper to ingest and store data
    • Data swamp → a lake whose data is no longer useful (e.g., tables can’t be joined, inconsistent file types)
  • ETL vs ELT
    • ETL (extract, transform, load) - smaller data → transformed before loading, so it arrives ready for further use (organized)
    • ELT (extract, load, transform) - large data → loaded raw and needs further processing before use
  • Cloud providers and their object storage
    • GCP - Google Cloud Platform - Cloud Storage (GCS)
    • AWS - Amazon Web Services - S3
    • Azure - Microsoft Azure - Azure Blob Storage

Introduction to workflow orchestration (Prefect)

  • orchestration: governing a data flow while respecting orchestration rules and business logic
  • data flow: binding disparate applications together so they can run on a schedule
  • Core features of orchestration (see the Prefect sketch after this list)
    • Remote execution
    • Scheduling
    • Retries
    • Caching
    • Integration with external systems (APIs, databases)
    • Ad-hoc runs
    • Parameterization
    • Alerting when something fails
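
A minimal sketch of how several of these features (retries, caching, parameterization, an ad-hoc run) look in Prefect 2 code; the task/flow names, the example URL, and the retry/cache settings are illustrative, not the course's exact code:

    from datetime import timedelta

    import pandas as pd
    from prefect import flow, task
    from prefect.tasks import task_input_hash


    @task(retries=3, cache_key_fn=task_input_hash, cache_expiration=timedelta(hours=1))
    def extract(url: str) -> pd.DataFrame:
        # retried up to 3 times on failure; result cached for identical inputs
        return pd.read_csv(url)


    @task
    def transform(df: pd.DataFrame) -> pd.DataFrame:
        # drop incomplete rows as a stand-in for real transformation logic
        return df.dropna()


    @flow(log_prints=True)
    def etl(url: str = "https://example.com/data.csv"):
        # parameterization: the URL can be overridden per run
        df = transform(extract(url))
        print(f"loaded {len(df)} rows")


    if __name__ == "__main__":
        etl()  # ad-hoc run; a deployment would add remote execution, schedules, alerting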

Everything about Prefect deployments is skipped in this note.

DE Zoomcamp 2022 - Using Airflow

  • Popular orchestration tools: Airflow, Prefect

  • Airflow consists of 3 main components

    • Webserver: the UI
    • Scheduler (with its Executor): triggers and runs tasks
    • Metadata Database: the backend of the Airflow environment
  • Setting up Airflow

    1. create a sub-directory airflow in the current project directory

    2. set the Airflow user (run in Git Bash inside the airflow directory):

      mkdir -p ./dags ./logs ./plugins
      echo -e "AIRFLOW_UID=$(id -u)" > .env

      or create the .env file manually and put “AIRFLOW_UID=50000” in it

    3. download the latest official docker-compose.yaml setup template

      curl -LfO 'https://airflow.apache.org/docs/apache-airflow/stable/docker-compose.yaml'

      which contains a lot of service definitions; see the cleaned-up version:

      docker-compose_cleaned.yml
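
      (after downloading it, the official Airflow Docker quickstart initializes the stack once with docker compose up airflow-init and then starts it with docker compose up)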

  • Ingest data into a Postgres database (see the DAG sketch after this list)

    • Write the DAG
    • Make it scheduled and parameterized (accept a different URL or save under a different file name)
    • Connect to the Postgres database (create_engine → connect() → load by chunks)
    • If the docker-compose files are run separately, the containers must share a Docker network so they can communicate
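
A rough Airflow 2-style sketch of such an ingestion DAG, scheduled and parameterized via templated values and loading the file into Postgres chunk by chunk; the URL, file path, connection string, and table name are placeholder assumptions, not the course's exact code:

    from datetime import datetime

    import pandas as pd
    from sqlalchemy import create_engine

    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.operators.python import PythonOperator

    # templated per run: each scheduled run downloads a different month's file
    URL_TEMPLATE = (
        "https://example.com/yellow_tripdata_"
        "{{ execution_date.strftime('%Y-%m') }}.csv.gz"
    )
    OUTPUT_FILE = "/opt/airflow/output_{{ execution_date.strftime('%Y-%m') }}.csv.gz"


    def load_to_postgres(file_name: str, table_name: str) -> None:
        # create_engine, then load chunk by chunk (to_sql manages the connection)
        engine = create_engine("postgresql://root:root@pgdatabase:5432/ny_taxi")
        for chunk in pd.read_csv(file_name, compression="gzip", chunksize=100_000):
            chunk.to_sql(name=table_name, con=engine, if_exists="append", index=False)


    with DAG(
        dag_id="ingest_to_postgres",
        schedule_interval="0 6 2 * *",   # scheduled: 06:00 on the 2nd of each month
        start_date=datetime(2021, 1, 1),
        catchup=True,
    ) as dag:
        download = BashOperator(
            task_id="download",
            bash_command=f"curl -sSLf '{URL_TEMPLATE}' > {OUTPUT_FILE}",
        )
        ingest = PythonOperator(
            task_id="ingest",
            python_callable=load_to_postgres,
            op_kwargs={"file_name": OUTPUT_FILE, "table_name": "yellow_taxi_data"},
        )
        download >> ingest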

Transfer Service

  • Data Transfer (Storage Transfer Service) in GCP
    • It can transfer data from S3 (AWS) or Azure Blob (Microsoft) to GCS
      • Note: to transfer from S3 to GCS we need an access key → get one from the AWS console
    • Transfers can also be scheduled (not recommended here due to cost)
    • Configuration is similar to creating a new bucket
  • Done via (see the Python sketch after this list)
    • GCP UI → Data Transfer
    • Terraform → the google_storage_transfer_job resource
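
Besides the UI and Terraform, the same S3 → GCS job can be created programmatically; a rough sketch with the google-cloud-storage-transfer Python client, where the bucket names, credentials, and one-time schedule are illustrative and the field names follow Google's published samples (double-check against the current client library):

    from datetime import datetime

    from google.cloud import storage_transfer


    def create_s3_to_gcs_job(project_id: str, source_bucket: str, sink_bucket: str,
                             aws_access_key_id: str, aws_secret_access_key: str):
        client = storage_transfer.StorageTransferServiceClient()
        today = datetime.utcnow().date()
        run_date = {"year": today.year, "month": today.month, "day": today.day}

        request = storage_transfer.CreateTransferJobRequest(
            {
                "transfer_job": {
                    "project_id": project_id,
                    "description": "one-time S3 -> GCS transfer",
                    "status": storage_transfer.TransferJob.Status.ENABLED,
                    # same start and end date = a single, non-recurring run
                    "schedule": {
                        "schedule_start_date": run_date,
                        "schedule_end_date": run_date,
                    },
                    "transfer_spec": {
                        "aws_s3_data_source": {
                            "bucket_name": source_bucket,
                            # the AWS access key mentioned in the note above
                            "aws_access_key": {
                                "access_key_id": aws_access_key_id,
                                "secret_access_key": aws_secret_access_key,
                            },
                        },
                        "gcs_data_sink": {"bucket_name": sink_bucket},
                    },
                }
            }
        )
        job = client.create_transfer_job(request)
        print(f"created transfer job: {job.name}")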