Setting up a Clickhouse cluster, forecasting customer shopping data, and visualising the results
Tools:
- EDA, data processing, feature engineering, machine learning:
  - Python in the Google Colab environment:
    - Pandas
    - Numpy
    - CatBoost
    - Sklearn
- Developer Environment:
  - Devcontainer
- Deploy Infrastructure:
  - Yandex.Cloud
  - Terraform
- Data Pipelines:
  - Airbyte
- Data modeling:
  - DBT
- Data visualization:
  - Yandex DataLens
  - Power BI
- Creating a Forecast Using ML Methods in Google Colab
- Configure Developer Environment
- Deploy Infrastructure to Yandex.Cloud with Terraform
  - Get familiar with Yandex.Cloud web UI
  - Configure yc CLI
  - Populate .env file, Set environment variables
  - Deploy using Terraform: VM with Airbyte installed, S3 Bucket, Clickhouse
- Access Airbyte
- Configure Data Pipelines
  - Configure Object Storage Source
  - Configure Clickhouse Destination
  - Sync data to Destination
- Create data marts with dbt
  - Create DBT Model
  - Run DBT model
- Create dashboards with Yandex Datalens and Power BI
- Delete cloud resources
- Install Docker on your local machine.
- Install devcontainer CLI:
Open the command palette (CMD + SHIFT + P) and type Install devcontainer CLI
- Next, build and open the dev container:
# build dev container
devcontainer build .
# open dev container
devcontainer open .
Verify you are in a development container by running commands:
terraform -v
yc --version
dbt --version
If any of these commands fails to print a version number, you are probably running it on your local machine rather than in the dev container!
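The same check can be scripted so that a missing tool is reported by name. The following is only a minimal sketch: `check_tools` is a hypothetical helper, not part of this repository, and it tests only whether each command is on `PATH`, not which version is installed.

```shell
#!/bin/sh
# check_tools is a hypothetical helper, not part of the repository.
# It reports which of the given command-line tools are missing from PATH.
# Returns 0 when every tool is found, 1 otherwise.
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "ok: $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Usage inside the dev container:
#   check_tools terraform yc dbt
```

A non-zero exit status suggests the shell is running outside the dev container.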
- Get familiar with Yandex.Cloud web UI
We will deploy: a VM with Airbyte installed, an S3 bucket, and a Clickhouse cluster.
- Configure yc CLI: see Getting started with the command-line interface by Yandex Cloud
yc init
- Populate .env file
.env is used to store secrets as environment variables.
Copy the template file .env.template to .env:
cp .env.template .env
Open the file in an editor and set your own values.
❗️ Never commit secrets to git
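Before moving on, it can help to confirm that every key from the template actually received a value. The sketch below assumes both files consist of plain `KEY=VALUE` lines; `env_unset_keys` is a hypothetical helper name, not something shipped with this project.

```shell
#!/bin/sh
# env_unset_keys is a hypothetical helper, not part of the repository.
# Assumes both files contain plain KEY=VALUE lines.
# Prints every KEY from the template whose value is missing or empty in the env file.
env_unset_keys() {
  template=$1
  envfile=$2
  # Extract KEY from each KEY=VALUE line of the template; comments and blanks are skipped.
  sed -n 's/^\([A-Za-z_][A-Za-z0-9_]*\)=.*/\1/p' "$template" |
  while read -r key; do
    # A key counts as set only if KEY= is followed by at least one character.
    grep -q "^${key}=." "$envfile" || echo "$key"
  done
}

# Usage:
#   env_unset_keys .env.template .env
```

An empty output means every templated key has a non-empty value in `.env`.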
- Set environment variables:
export YC_TOKEN=$(yc iam create-token)
export YC_CLOUD_ID=$(yc config get cloud-id)
export YC_FOLDER_ID=$(yc config get folder-id)
export TF_VAR_folder_id=$(yc config get folder-id)
export $(xargs < .env)
## DEBUG
# export TF_LOG_PATH=./terraform.log
# export TF_LOG=trace
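Note that `export $(xargs < .env)` splits on whitespace, so a value containing spaces would be broken into separate words. One possible alternative, shown here only as a sketch, is to source the file with auto-export enabled; this assumes `.env` is valid shell syntax and that values with spaces are quoted. `load_env` is a hypothetical helper name.

```shell
#!/bin/sh
# load_env is a hypothetical helper, not part of the repository.
# Assumes the file is valid shell syntax (KEY=VALUE, values with spaces quoted).
# Exports every variable assigned in the given file (default: ./.env).
load_env() {
  set -a             # auto-export every variable assigned from now on
  . "${1:-./.env}"   # source the file in the current shell
  set +a             # stop auto-exporting
}

# Usage:
#   load_env          # reads ./.env
#   load_env path/to/other.env
```

Because the file is executed as shell code, this should only be used with a `.env` file you wrote yourself.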
- Deploy using Terraform
Configure YC Terraform provider:
cp terraformrc ~/.terraformrc
Get familiar with Cloud Infrastructure: main.tf and variables.tf
terraform init
terraform validate
terraform fmt
terraform plan
terraform apply
Store terraform output values as Environment Variables:
export CLICKHOUSE_HOST=$(terraform output -raw clickhouse_host_fqdn)
export DBT_HOST=${CLICKHOUSE_HOST}
export DBT_USER=${CLICKHOUSE_USER}
export DBT_PASSWORD=${TF_VAR_clickhouse_password}
[EN] Reference: Getting started with Terraform by Yandex Cloud
[RU] Reference: Начало работы с Terraform by Yandex Cloud
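As a variation on the Terraform commands above, the plan can be saved to a file so that `terraform apply` executes exactly what was reviewed. This is a sketch, not the project's prescribed workflow: `tf_deploy` and the file name `tfplan` are hypothetical, and applying a saved plan does not ask for confirmation.

```shell
#!/bin/sh
# tf_deploy and the plan file name "tfplan" are hypothetical choices.
# Runs the Terraform workflow and stops at the first failing step.
tf_deploy() {
  terraform init &&
  terraform validate &&
  terraform plan -out=tfplan &&   # save the reviewed plan to a file
  terraform apply tfplan          # apply exactly the saved plan, without a prompt
}

# Usage (from the directory containing main.tf):
#   tf_deploy
```

Chaining with `&&` means a failed `validate` or `plan` prevents anything from being applied.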
- Get VM's public IP:
terraform output -raw yandex_compute_instance_nat_ip_address
- Lab's VM image already has Airbyte installed
I have prepared a VM image and made it publicly available:
https://cloud.yandex.com/en-ru/docs/compute/concepts/image#public
yc resource-manager cloud add-access-binding y-cloud \
  --role compute.images.user \
  --subject system:allAuthenticatedUsers
TODO: define the VM image with Packer so that everyone can build their own image
However, if you'd like to install Airbyte yourself:
ssh airbyte@{yandex_compute_instance_nat_ip_address}
sudo mkdir airbyte && cd airbyte
sudo wget https://raw.githubusercontent.com/airbytehq/airbyte-platform/main/{.env,flags.yml,docker-compose.yaml}
sudo docker-compose up -d
- Log into web UI at {yandex_compute_instance_nat_ip_address}:8000
With credentials: username airbyte, password password
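The web UI may not respond immediately after the VM boots. A small polling loop, sketched below, can wait until port 8000 answers. It assumes `curl` is available; `wait_for_http` and its default retry values are hypothetical.

```shell
#!/bin/sh
# wait_for_http is a hypothetical helper; the retry defaults are arbitrary.
# Polls a URL until the server answers, or gives up after a number of attempts.
# Arguments: URL, attempts (default 30), delay in seconds (default 10).
wait_for_http() {
  url=$1
  attempts=${2:-30}
  delay=${3:-10}
  i=1
  while [ "$i" -le "$attempts" ]; do
    # Without --fail, curl exits 0 on any HTTP response (including 401),
    # so a non-zero status here means the connection itself failed.
    if curl --silent --output /dev/null --max-time 5 "$url"; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Usage:
#   wait_for_http "http://$(terraform output -raw yandex_compute_instance_nat_ip_address):8000"
```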
- Put the data into object storage
- Configure Object Storage Source
- Configure Clickhouse Destination
terraform output -raw clickhouse_host_fqdn
- Sync the data to Clickhouse Destination for each source
- Check the status of replication for each source
- Check the data in Clickhouse
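One way to check the data from a terminal is the ClickHouse HTTP interface. The sketch below rests on several assumptions that this document does not confirm: that the cluster accepts HTTPS on port 8443, that the host is reachable from your machine, and that the Yandex Cloud CA certificate is already trusted by `curl` (otherwise a `--cacert` option is needed). `ch_query` is a hypothetical helper; the variable names are the ones used elsewhere in this guide.

```shell
#!/bin/sh
# ch_query is a hypothetical helper, not part of the repository.
# Assumptions: HTTPS interface on port 8443, host reachable from this machine,
# and the cluster's CA certificate already trusted by curl.
# Sends one SQL statement to ClickHouse and prints the response.
ch_query() {
  curl --silent --show-error --fail \
    --header "X-ClickHouse-User: ${CLICKHOUSE_USER}" \
    --header "X-ClickHouse-Key: ${TF_VAR_clickhouse_password}" \
    --data-binary "$1" \
    "https://${CLICKHOUSE_HOST}:8443/"
}

# Usage:
#   export CLICKHOUSE_HOST=$(terraform output -raw clickhouse_host_fqdn)
#   ch_query 'SHOW DATABASES'
```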
Export variables to allow connection to Clickhouse in your Yandex Cloud:
export CLICKHOUSE_HOST=$(terraform output -raw clickhouse_host_fqdn)
export DBT_HOST=${CLICKHOUSE_HOST}
export DBT_USER=${CLICKHOUSE_USER}
export DBT_PASSWORD=${TF_VAR_clickhouse_password}
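If `CLICKHOUSE_USER` or `TF_VAR_clickhouse_password` was never loaded from `.env`, the exports above silently produce empty values. A quick guard, sketched here with a hypothetical `require_vars` helper, can catch that before running dbt.

```shell
#!/bin/sh
# require_vars is a hypothetical helper, not part of the repository.
# Checks that each named environment variable is set and non-empty.
# Prints the offending names to stderr; returns 1 if any is unset or empty.
require_vars() {
  status=0
  for name in "$@"; do
    # Indirect lookup: read the value of the variable whose name is in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "unset or empty: $name" >&2
      status=1
    fi
  done
  return "$status"
}

# Usage:
#   require_vars DBT_HOST DBT_USER DBT_PASSWORD
```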
Make sure it works:
dbt debug
Run data modeling:
dbt build
Check your marts in Clickhouse:
- First, change the Clickhouse cluster settings to allow access from DataLens
- Set your source
- Set the data model
- Create measures
- Create the dashboard
Power BI dashboard file:
https://github.com/neworderby/dbt_ml_retail/blob/main/Power%20BI/Retail%20Dashboard.pbix
Delete cloud resources:
terraform destroy