- Build a DWH powered by ClickHouse and dbt
- Deploy Infrastructure as Code with Terraform and Yandex.Cloud
- Instant development with GitHub Codespaces
- Assignment checks with GitHub Actions
- IaC (Terraform):
  - S3 Bucket
  - VM
  - ClickHouse
- Install Airbyte on the VM (Packer, Vagrant?)
- Access Airbyte through the Web UI
- Configure Pipelines:
  - Postgres to ClickHouse
  - Postgres to S3
- Test assignment with GitHub Actions:
  - Query files on S3 with the ClickHouse S3 table engine
- Get familiar with the Managed ClickHouse Management Console
- Install and configure the `yc` CLI

  Reference: Getting started with the command-line interface by Yandex Cloud

  ```bash
  yc init
  ```
- Populate the `.env` file

  `.env` is used to store secrets as environment variables. Copy the template file `.env.template` to `.env`:

  ```bash
  cp .env.template .env
  ```

  Open the file in an editor and set your own values.

  ❗️ Never commit secrets to git
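  As a sketch, a populated `.env` might look like the following. The variable names are taken from the commands later in this guide; the values are placeholders, not real credentials:

  ```shell
  # .env — example values only; never commit this file to git
  CLICKHOUSE_USER=dbt
  TF_VAR_clickhouse_password=<your_strong_password>
  ```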
- Set environment variables:

  ```bash
  export YC_TOKEN=$(yc iam create-token)
  export YC_CLOUD_ID=$(yc config get cloud-id)
  export YC_FOLDER_ID=$(yc config get folder-id)
  export $(xargs <.env)
  ```
- Deploy using Terraform

  ```bash
  terraform init
  terraform validate
  terraform fmt
  terraform plan
  terraform apply
  ```

  Store the Terraform output values as environment variables:

  ```bash
  export CLICKHOUSE_HOST=$(terraform output -raw clickhouse_host_fqdn)
  export DBT_HOST=${CLICKHOUSE_HOST}
  export DBT_USER=${CLICKHOUSE_USER}
  export DBT_PASSWORD=${TF_VAR_clickhouse_password}
  ```

  [EN] Reference: Getting started with Terraform by Yandex Cloud
  [RU] Reference: Getting started with Terraform by Yandex Cloud (in Russian)
You have several options for setting up the dbt environment:
Install the dbt environment with Docker:

Install Docker and run the commands:

```bash
# build & run container
docker-compose build
docker-compose up -d

# alias docker exec command
alias dbt="docker-compose exec dev dbt"
```
Alternatively, install dbt on your local machine:

Install dbt and configure the profile manually. By default, dbt expects the `profiles.yml` file to be located in the `~/.dbt/` directory. Use this template and enter your own credentials.

Make sure dbt can connect to your target database:

```bash
dbt debug
```
Configure a JDBC (DBeaver) connection:

```
port=8443
socket_timeout=300000
ssl=true
sslrootcrt=<path_to_cert>
```
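The same settings can also be passed in the connection URL; a sketch, assuming the ClickHouse JDBC driver and the `CLICKHOUSE_HOST` value from the Terraform output (database name `default` is an example):

```
jdbc:clickhouse://<CLICKHOUSE_HOST>:8443/default?ssl=true&sslrootcrt=<path_to_cert>&socket_timeout=300000
```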
If you hit any errors, check that the ENV values are present:

```bash
docker-compose exec dev env | grep DBT_
```
- Install dbt packages:

  ```bash
  dbt deps
  ```
- Stage data sources with a dbt macro

  Source data will be staged as EXTERNAL TABLES (S3) using the dbt macro `init_s3_sources`:

  ```bash
  dbt run-operation init_s3_sources
  ```

  The statements are run one at a time from a list to avoid the error:

  ```
  DB::Exception: Syntax error (Multi-statements are not allowed)
  ```
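  For reference, a minimal sketch of how such a macro can loop over statements. The statement list, table name, columns, and bucket URL here are hypothetical; the real `init_s3_sources` implementation may differ:

  ```sql
  {% macro init_s3_sources() %}
      {# Hypothetical DDL list; the real macro derives these from the project sources #}
      {% set statements = [
          "DROP TABLE IF EXISTS src_customer",
          "CREATE TABLE src_customer (C_CUSTKEY UInt32, C_NAME String)
               ENGINE = S3('https://<bucket_url>/customer.csv', 'CSVWithNames')"
      ] %}
      {# Run each statement in its own query: ClickHouse rejects multi-statement queries #}
      {% for statement in statements %}
          {% do run_query(statement) %}
      {% endfor %}
  {% endmacro %}
  ```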
- Describe sources in `sources.yml` files
- Build staging models:

  ```bash
  dbt build -s tag:staging
  ```

  Check the model configurations: `engine`, `order_by`, `partition_by`.
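  With the dbt-clickhouse adapter these settings live in the model's `config` block; a sketch with example values (the engine, keys, and column names are illustrative, not your project's actual settings):

  ```sql
  -- Sketch of a staging model header
  {{
      config(
          engine = 'MergeTree()',
          order_by = ['O_ORDERKEY'],
          partition_by = 'toYear(O_ORDERDATE)'
      )
  }}

  SELECT ...
  ```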
- Prepare a wide table (Data Mart)

  Join all the tables into one `f_lineorder_flat`:

  ```bash
  dbt build -s f_lineorder_flat
  ```

  Pay attention to the models being tested for keys being unique and not null.
  Turn the following SQL into a dbt model `f_orders_stats`:
  ```sql
  SELECT
      toYear(O_ORDERDATE) AS O_ORDERYEAR
      , O_ORDERSTATUS
      , O_ORDERPRIORITY
      , count(DISTINCT O_ORDERKEY) AS num_orders
      , count(DISTINCT C_CUSTKEY) AS num_customers
      , sum(L_EXTENDEDPRICE * L_DISCOUNT) AS revenue
  FROM f_lineorder_flat
  WHERE 1=1
  GROUP BY
      toYear(O_ORDERDATE)
      , O_ORDERSTATUS
      , O_ORDERPRIORITY
  ```
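  A sketch of what the resulting model file might look like; the file path and config values are suggestions, so keep your project's own conventions:

  ```sql
  -- models/f_orders_stats.sql — sketch
  {{
      config(
          engine = 'MergeTree()',
          order_by = ['O_ORDERYEAR', 'O_ORDERSTATUS']
      )
  }}

  SELECT
      toYear(O_ORDERDATE) AS O_ORDERYEAR
      , O_ORDERSTATUS
      , O_ORDERPRIORITY
      , count(DISTINCT O_ORDERKEY) AS num_orders
      , count(DISTINCT C_CUSTKEY) AS num_customers
      , sum(L_EXTENDEDPRICE * L_DISCOUNT) AS revenue
  FROM {{ ref('f_lineorder_flat') }}
  WHERE 1=1
  GROUP BY
      toYear(O_ORDERDATE)
      , O_ORDERSTATUS
      , O_ORDERPRIORITY
  ```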
  Make sure the tests pass:

  ```bash
  dbt build -s f_orders_stats
  ```
- Destroy the cloud resources when you are done:

  ```bash
  terraform destroy
  ```
- Deploy ClickHouse
- Configure development environment
- Configure dbt project (`dbt_project.yml`)
- Configure connection (`profiles.yml`)
- Prepare source data files (S3)
- Configure EXTERNAL TABLES (S3)
- Describe sources in .yml files
- Basic dbt models and configurations
- Code compilation + debugging
- Prepare STAR schema
- Querying results
- Testing & documenting your project