Commit 8c83aee by kzzzr, Mar 8, 2023 (parent commit: 5b4207b)
Pending changes exported from codespace
Showing 18 changed files with 361 additions and 126 deletions.

README.md (91 additions, 105 deletions):

- [ ] IaC (Terraform):
- [x] S3 Bucket
- [x] VM
- [x] Clickhouse
- [ ] Install Airbyte on VM (Packer, Vagrant?)
  - [ ] Will this image work on another YC account / folder?
- [ ] Access Airbyte through Web UI
- [ ] Configure Pipelines
- [ ] Postgres to Clickhouse

## 1. Configure Developer Environment

You have 2 options to set up:

<details><summary>Start with GitHub Codespaces:</summary>
<p>
```bash
alias dbt="docker-compose exec dev dbt"
```
</p>
</details>

<details><summary>Alternatively, install dbt on your local machine:</summary>
<p>

[Install dbt](https://docs.getdbt.com/dbt-cli/install/overview) and [configure your profile](https://docs.getdbt.com/dbt-cli/configure-your-profile) manually. By default, dbt expects the `profiles.yml` file to be located in the `~/.dbt/` directory.

Use this [template](./profiles.yml) and enter your own credentials.
</p>
</details>
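
For reference, a minimal ClickHouse profile might look like the sketch below. The profile name, schema, and connection settings are assumptions (the `DBT_*` variables are the ones exported later in this README); adapt them to the template above.

```bash
# A sketch of ~/.dbt/profiles.yml for dbt-clickhouse; all names and values
# here are assumptions -- replace them with your own.
mkdir -p ~/.dbt
cat > ~/.dbt/profiles.yml <<'EOF'
default:
  target: dev
  outputs:
    dev:
      type: clickhouse
      schema: default
      host: "{{ env_var('DBT_HOST') }}"
      port: 8443
      user: "{{ env_var('DBT_USER') }}"
      password: "{{ env_var('DBT_PASSWORD') }}"
      secure: True
EOF
```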

## 2. Deploy Infrastructure

1. Get familiar with the Managed Clickhouse Management Console

![](./docs/clickhouse_management_console.gif)

1. Install and configure `yc` CLI: [Getting started with the command-line interface by Yandex Cloud](https://cloud.yandex.com/en/docs/cli/quickstart#install)
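
In practice that boils down to something like the following (the install script URL is the official one from the quickstart; `yc init` walks you through authentication interactively):

```bash
curl -sSL https://storage.yandexcloud.net/yandexcloud-yc/install.sh | bash
yc init          # authenticate and pick a default cloud / folder / zone
yc config list   # verify cloud-id and folder-id are set
```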

1. Set environment variables:

```bash
export YC_CLOUD_ID=$(yc config get cloud-id)
export YC_FOLDER_ID=$(yc config get folder-id)
export TF_VAR_folder_id=$(yc config get folder-id)
export $(xargs < .env)

## DEBUG
# export TF_LOG_PATH=./terraform.log
# export TF_LOG=trace
```
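
The `.env` file itself is not shown here; judging by the variables referenced later in this README, a minimal sketch would be (values are placeholders):

```bash
# .env -- keep out of version control; placeholder values only
TF_VAR_clickhouse_password=<strong-password>
CLICKHOUSE_USER=<clickhouse-user>
```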

1. Deploy using Terraform

Get familiar with Cloud Infrastructure: [main.tf](./main.tf) and [variables.tf](./variables.tf)

```bash
terraform init
terraform validate
terraform apply
```

Store terraform output values as Environment Variables:

```bash
export CLICKHOUSE_HOST=$(terraform output -raw clickhouse_host_fqdn)
export DBT_HOST=${CLICKHOUSE_HOST}
export DBT_USER=${CLICKHOUSE_USER}
export DBT_PASSWORD=${TF_VAR_clickhouse_password}
```
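
A quick sanity check that everything landed in the environment (plain shell, nothing project-specific):

```bash
env | grep -E 'DBT_|CLICKHOUSE_|TF_VAR_'
```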

[EN] Reference: [Getting started with Terraform by Yandex Cloud](https://cloud.yandex.com/en/docs/tutorials/infrastructure-management/terraform-quickstart)

[RU] Reference: [Начало работы с Terraform by Yandex Cloud](https://cloud.yandex.ru/docs/tutorials/infrastructure-management/terraform-quickstart)

## 3. Deploy Airbyte

1. Get VM's public IP:
```bash
terraform output -raw yandex_compute_instance_nat_ip_address
```
2. The lab's VM image already has Airbyte installed. However, if you'd like to set it up yourself:
```bash
ssh airbyte@{yandex_compute_instance_nat_ip_address}
sudo mkdir airbyte && cd airbyte
sudo wget https://raw.githubusercontent.com/airbytehq/airbyte-platform/main/{.env,flags.yml,docker-compose.yaml}
sudo docker compose up -d # start the Airbyte containers
```
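To verify the containers actually started (a generic Docker check; service names vary by Airbyte version):
```bash
sudo docker ps --format 'table {{.Names}}\t{{.Status}}'
```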
3. Access the UI at `{yandex_compute_instance_nat_ip_address}:8000` with credentials:
```
airbyte
password
```
![Airbyte UI](./docs/airbyte_ui.png)
## 4. Configure Data Pipelines
1. Configure Postgres Source
Get database credentials: https://github.com/kzzzr/mybi-dbt-showcase/blob/main/dbt_project.yml#L31-L36
❗️ Supply the JDBC URL parameter `prepareThreshold=0` (it disables server-side prepared statements in the PostgreSQL JDBC driver)
![](./docs/airbyte_source_postgres.png)
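For illustration, the resulting JDBC URL would look something like this (host, port, and database are placeholders):
```
jdbc:postgresql://<host>:<port>/<database>?prepareThreshold=0
```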
1. Configure Clickhouse Destination
![](./docs/airbyte_destination_clickhouse.png)
1. Configure S3 Destination
Gather the service account static key pair:
```bash
terraform output -raw yandex_iam_service_account_static_access_key
terraform output -raw yandex_iam_service_account_static_secret_key
```
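Optionally export them for easier copy-paste later (the variable names here are arbitrary):
```bash
export S3_ACCESS_KEY=$(terraform output -raw yandex_iam_service_account_static_access_key)
export S3_SECRET_KEY=$(terraform output -raw yandex_iam_service_account_static_secret_key)
```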
Make sure you choose S3 Bucket Path = `mybi`
![](./docs/airbyte_destination_s3_1.png)
❗️ Set Destination Connector S3 version to `0.1.16`. Otherwise you will get errors with Yandex.Cloud Object Storage.
![](./docs/airbyte_destination_s3_3.png)
1. Sync data to Clickhouse Destination
Only sync tables with the `general_` prefix.
![](./docs/airbyte_sync_clickhouse_1.png)
![](./docs/airbyte_sync_clickhouse_2.png)
![](./docs/airbyte_sync_clickhouse_3.png)
1. Sync data to S3 Destination
Only sync tables with the `general_` prefix.
![](./docs/airbyte_sync_s3_1.png)
![](./docs/airbyte_sync_s3_2.png)
![](./docs/airbyte_sync_s3_3.png)
## 5. Create PR and make CI tests pass
Since you have synced data to an S3 bucket with public access, this data should now be available to ClickHouse as an external table. Set the corresponding variable.
Let's make sure it works:

```bash
dbt debug
dbt test
```

If it works for you, open a PR and see if the CI tests pass.
![Github Actions check passed](./docs/github_checks_passed.png)

----------

If you hit any errors, check that the ENV values are present:

```
docker-compose exec dev env | grep DBT_
```

To inspect the database directly, [configure a JDBC (DBeaver) connection](https://cloud.yandex.ru/docs/managed-clickhouse/operations/connect#connection-ide):

```
port=8443
socket_timeout=300000
ssl=true
sslrootcrt=<path_to_cert>
```
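
As an extra connectivity check, ClickHouse answers `Ok.` on the `/ping` endpoint of its HTTPS port (8443 here; `--insecure` skips certificate verification and is for debugging only):

```bash
curl --insecure "https://${CLICKHOUSE_HOST}:8443/ping"   # expect: Ok.
```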

## 6. Deploy DWH

1. Install dbt packages

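Installing packages is the standard dbt command (packages are declared in `packages.yml`):

```bash
dbt deps
```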

1. Describe sources in [sources.yml](./models/sources/sources.yml) files
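
A sketch of such an entry, assuming the `src_*` tables created by [init_s3_sources](./macros/init_s3_sources.sql) live in the `default` schema (the source name and schema are assumptions; merge this into your existing file rather than overwriting it):

```bash
cat > models/sources/sources.yml <<'EOF'
version: 2

sources:
  - name: tpch
    schema: default
    tables:
      - name: src_customer
      - name: src_orders
      - name: src_lineitem
EOF
```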

1. Build staging models:
```bash
dbt build -s tag:staging
```
Check model configurations: `engine`, `order_by`, `partition_by`
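For instance, a model config covering those options might look like the sketch below (the model name, aliases, and config values are illustrative; `O_ORDERKEY` and `O_ORDERDATE` come from `src_orders`, referenced via the `tpch` source sketched earlier):
```bash
cat > models/staging/stg_orders_example.sql <<'EOF'
{{ config(
    engine='MergeTree()',
    order_by='(id)',
    partition_by='toYYYYMM(created_at)'
) }}

select
    O_ORDERKEY  as id,
    O_ORDERDATE as created_at
from {{ source('tpch', 'src_orders') }}
EOF
```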
1. Prepare wide table (Data Mart)
Join all the tables into one [f_lineorder_flat](./models/):
```bash
dbt build -s f_lineorder_flat
```
Pay attention to the models being tested for unique and not-null keys.
## 7. Model read-optimized Data Mart
Turn the following SQL into dbt model [f_orders_stats](./models/marts/f_orders_stats.sql):
```sql
SELECT
toYear(O_ORDERDATE) AS O_ORDERYEAR
, O_ORDERSTATUS
, O_ORDERPRIORITY
, count(DISTINCT O_ORDERKEY) AS num_orders
, count(DISTINCT C_CUSTKEY) AS num_customers
, sum(L_EXTENDEDPRICE * L_DISCOUNT) AS revenue
FROM f_lineorder_flat
WHERE 1=1
GROUP BY
toYear(O_ORDERDATE)
, O_ORDERSTATUS
, O_ORDERPRIORITY
```
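
The model file is essentially the same query with the hardcoded table swapped for a `ref()` call (a sketch):

```bash
cat > models/marts/f_orders_stats.sql <<'EOF'
SELECT
    toYear(O_ORDERDATE) AS O_ORDERYEAR
    , O_ORDERSTATUS
    , O_ORDERPRIORITY
    , count(DISTINCT O_ORDERKEY) AS num_orders
    , count(DISTINCT C_CUSTKEY) AS num_customers
    , sum(L_EXTENDEDPRICE * L_DISCOUNT) AS revenue
FROM {{ ref('f_lineorder_flat') }}
WHERE 1=1
GROUP BY
    toYear(O_ORDERDATE)
    , O_ORDERSTATUS
    , O_ORDERPRIORITY
EOF
```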

Make sure the tests pass:

```bash
dbt build -s f_orders_stats
```


## Shut down your cluster

⚠️ Attention! Always delete resources after you finish your work!
```bash
terraform destroy
```

## Lesson plan

- [ ] Deploy Clickhouse
- [ ] Configure development environment
- [ ] Configure dbt project (`dbt_project.yml`)
- [ ] Configure connection (`profiles.yml`)
- [ ] Prepare source data files (S3)
- [ ] Configure EXTERNAL TABLES (S3)
- [ ] Describe sources in .yml files
- [ ] Basic dbt models and configurations
- [ ] Code compilation + debugging
- [ ] Prepare STAR schema
- [ ] Querying results
- [ ] Testing & Documenting your project
Binary files added:

- docs/airbyte_destination_clickhouse.png
- docs/airbyte_destination_s3_1.png
- docs/airbyte_destination_s3_3.png
- docs/airbyte_source_postgres.png
- docs/airbyte_sync_clickhouse_1.png
- docs/airbyte_sync_clickhouse_2.png
- docs/airbyte_sync_clickhouse_3.png
- docs/airbyte_sync_s3_1.png
- docs/airbyte_sync_s3_2.png
- docs/airbyte_sync_s3_3.png
- docs/airbyte_ui.png
- docs/github_checks_passed.png
- docs/github_codespaces.png
macros/init_s3_sources.sql (111 additions):

```sql
{% macro init_s3_sources() -%}
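{# Drops and recreates ClickHouse source tables backed by public S3 TPCH files.
   Invoke manually with: dbt run-operation init_s3_sources #}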

{% set sources = [
'DROP TABLE IF EXISTS src_customer'
, 'CREATE TABLE IF NOT EXISTS src_customer
(
C_CUSTKEY UInt32,
C_NAME String,
C_ADDRESS String,
C_NATIONKEY UInt32,
C_PHONE String,
C_ACCTBAL Decimal(15,2),
C_MKTSEGMENT LowCardinality(String),
C_COMMENT String
)
ENGINE = S3(\'https://storage.yandexcloud.net/otus-dwh/tpch-dbgen-1g/customer.tbl\', \'CustomSeparated\')
SETTINGS
format_custom_field_delimiter=\'|\'
,format_custom_escaping_rule=\'CSV\'
,format_custom_row_after_delimiter=\'|\n\'
'
, 'DROP TABLE IF EXISTS src_orders'
, 'CREATE TABLE src_orders
(
O_ORDERKEY UInt32,
O_CUSTKEY UInt32,
O_ORDERSTATUS LowCardinality(String),
O_TOTALPRICE Decimal(15,2),
O_ORDERDATE Date,
O_ORDERPRIORITY LowCardinality(String),
O_CLERK String,
O_SHIPPRIORITY UInt8,
O_COMMENT String
)
ENGINE = S3(\'https://storage.yandexcloud.net/otus-dwh/tpch-dbgen-1g/orders.tbl\', \'CustomSeparated\')
SETTINGS
format_custom_field_delimiter=\'|\'
,format_custom_escaping_rule=\'CSV\'
,format_custom_row_after_delimiter=\'|\n\'
'
, 'DROP TABLE IF EXISTS src_lineitem'
, 'CREATE TABLE src_lineitem
(
L_ORDERKEY UInt32,
L_PARTKEY UInt32,
L_SUPPKEY UInt32,
L_LINENUMBER UInt8,
L_QUANTITY Decimal(15,2),
L_EXTENDEDPRICE Decimal(15,2),
L_DISCOUNT Decimal(15,2),
L_TAX Decimal(15,2),
L_RETURNFLAG LowCardinality(String),
L_LINESTATUS LowCardinality(String),
L_SHIPDATE Date,
L_COMMITDATE Date,
L_RECEIPTDATE Date,
L_SHIPINSTRUCT String,
L_SHIPMODE LowCardinality(String),
L_COMMENT String
)
ENGINE = S3(\'https://storage.yandexcloud.net/otus-dwh/tpch-dbgen-1g/lineitem.tbl\', \'CustomSeparated\')
SETTINGS
format_custom_field_delimiter=\'|\'
,format_custom_escaping_rule=\'CSV\'
,format_custom_row_after_delimiter=\'|\n\'
'
, 'DROP TABLE IF EXISTS src_part'
, 'CREATE TABLE src_part
(
P_PARTKEY UInt32,
P_NAME String,
P_MFGR LowCardinality(String),
P_BRAND LowCardinality(String),
P_TYPE LowCardinality(String),
P_SIZE UInt8,
P_CONTAINER LowCardinality(String),
P_RETAILPRICE Decimal(15,2),
P_COMMENT String
)
ENGINE = S3(\'https://storage.yandexcloud.net/otus-dwh/tpch-dbgen-1g/part.tbl\', \'CustomSeparated\')
SETTINGS
format_custom_field_delimiter=\'|\'
,format_custom_escaping_rule=\'CSV\'
,format_custom_row_after_delimiter=\'|\n\'
'
, 'DROP TABLE IF EXISTS src_supplier'
, 'CREATE TABLE src_supplier
(
S_SUPPKEY UInt32,
S_NAME String,
S_ADDRESS String,
S_NATIONKEY UInt32,
S_PHONE String,
S_ACCTBAL Decimal(15,2),
S_COMMENT String
)
ENGINE = S3(\'https://storage.yandexcloud.net/otus-dwh/tpch-dbgen-1g/supplier.tbl\', \'CustomSeparated\')
SETTINGS
format_custom_field_delimiter=\'|\'
,format_custom_escaping_rule=\'CSV\'
,format_custom_row_after_delimiter=\'|\n\'
'
] %}

{% for src in sources %}
{% set statement = run_query(src) %}
{% endfor %}

{{ print('Initialized source tables – TPCH (S3)') }}

{%- endmacro %}
```