Skip to content

Wolt90/airbyte_lab

 
 

Repository files navigation

Setting up Airbyte Data Pipelines Labs

Assignment TODO

⚠️ Attention! Always delete resources after you finish your work!

1. Deploy Infrastructure

  1. Get familiar with Managed Clickhouse Management Console

  2. Install and configure yc CLI: Getting started with the command-line interface by Yandex Cloud

    yc init
  3. Populate .env file

    .env is used to store secrets as environment variables.

    Copy template file .env.template to .env file:

    cp .env.template .env

    Open file in editor and set your own values.

    ❗️ Never commit secrets to git

  4. Set environment variables:

    export YC_TOKEN=$(yc iam create-token)
    export YC_CLOUD_ID=$(yc config get cloud-id)
    export YC_FOLDER_ID=$(yc config get folder-id)
    export $(xargs <.env)
  5. Deploy using Terraform

    terraform init
    terraform validate
    terraform fmt
    terraform plan
    terraform apply

    Store terraform output values as Environment Variables:

    export CLICKHOUSE_HOST=$(terraform output -raw clickhouse_host_fqdn)
    export DBT_HOST=${CLICKHOUSE_HOST}
    export DBT_USER=${CLICKHOUSE_USER}
    export DBT_PASSWORD=${TF_VAR_clickhouse_password}

    [EN] Reference: Getting started with Terraform by Yandex Cloud

    [RU] Reference: Начало работы с Terraform by Yandex Cloud

2. Configure Developer Environment

You have got 3 options to set up:

Start with GitHub Codespaces:

GitHub Codespaces

Install dbt environment with Docker:

Install Docker and run commands:

# build & run container
docker-compose build
docker-compose up -d

# alias docker exec command
alias dbt="docker-compose exec dev dbt"

Alternatively, install dbt on local machine:

Install dbt and configure profile manually by yourself. By default, dbt expects the profiles.yml file to be located in the ~/.dbt/ directory.

Use this template and enter your own credentials.

3. Check database connection

Make sure dbt can connect to your target database:

dbt debug

Configure JDBC (DBeaver) connection:

port=8443
socket_timeout=300000
ssl=true
sslrootcrt=<path_to_cert>

If any errors check ENV values are present:

docker-compose exec dev env | grep DBT_

4. Deploy DWH

  1. Install dbt packages

    dbt deps
  2. Stage data sources with dbt macro

    Source data will be staged as EXTERNAL TABLES (S3) using dbt macro init_s3_sources:

    dbt run-operation init_s3_sources

    Statements will be run separately from a list to avoid error:

    DB::Exception: Syntax error (Multi-statements are not allowed)
    
  3. Describe sources in sources.yml files

  4. Build staging models:

    dbt build -s tag:staging

    Check model configurations: engine, order_by, partition_by

  5. Prepare wide table (Data Mart)

    Join all the tables into one f_lineorder_flat:

    dbt build -s f_lineorder_flat

    Pay attentions to models being tested for keys being unique, not null.

5. Model read-optimized Data Mart

Turn the following SQL into dbt model f_orders_stats:

SELECT
    toYear(O_ORDERDATE) AS O_ORDERYEAR
    , O_ORDERSTATUS
    , O_ORDERPRIORITY
    , count(DISTINCT O_ORDERKEY) AS num_orders
    , count(DISTINCT C_CUSTKEY) AS num_customers
    , sum(L_EXTENDEDPRICE * L_DISCOUNT) AS revenue
FROM f_lineorder_flat
WHERE 1=1
GROUP BY
    toYear(O_ORDERDATE)
    , O_ORDERSTATUS
    , O_ORDERPRIORITY

Make sure the tests pass:

dbt build -s f_orders_stats

6. Create PR and make CI tests pass

Github Actions check passed

Shut down your cluster

⚠️ Attention! Always delete resources after you finish your work!

image

terraform destroy

Lesson plan

  • Deploy Clickhouse
  • Configure development environment
  • Configure dbt project (dbt_project.yml)
  • Configure connection (profiles.yml)
  • Prepare source data files (S3)
  • Configure EXTERNAL TABLES (S3)
  • Describe sources in .yml files
  • Basic dbt models and configurations
  • Code compilation + debugging
  • Prepare STAR schema
  • Querying results
  • Testing & Documenting your project

About

Airbyte Data Pipelines Lab

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • HCL 82.5%
  • Dockerfile 17.5%