This repository contains the dbt project responsible for the transformations in the larger ELT process of the SPalojoki Data Platform.
The extract and load processes, along with the daily production builds, are orchestrated in a separate spalojoki-dataplatform-composer repository containing Airflow DAGs.
SPalojoki Data Platform follows the medallion architecture introduced by Databricks.
The medallion architecture refines data progressively through three distinct layers: bronze, silver, and gold.
- **Bronze Layer**
  - Description: The landing dataset for all data from external source systems.
  - Structure: Mirrors the source system tables "as-is," with additional metadata columns (e.g., load date/time).
  - Purpose: Maintains a historical archive of source data, enabling data lineage, auditability, and reprocessing without rereading the source data.
- **Silver Layer**
  - Description: Processes data from the Bronze layer by matching, merging, and cleaning it.
  - Structure: Uses 3rd Normal Form (3NF) or similar structures to maximize reusability.
  - Purpose: Ensures data consistency and quality.
- **Gold Layer**
  - Description: Contains consumption-ready, project-specific datasets.
  - Structure: Uses de-normalized, read-optimized data models with fewer joins.
  - Purpose: Applies final transformations, aggregations, and data quality checks for analytics and reporting.
BigQuery uses a three-level structure for organizing data:
- Google Cloud Project
- Datasets
- Tables
SPalojoki Data Platform organizes all data assets across environments (production, development) under one Google Cloud Project for simplicity. Development datasets are differentiated from production by adding developer initials to the dataset name.
```
{developer_initials}__dataset_name
```
Tables are divided into different datasets based on their refinement level, following the Medallion Architecture. There are five dataset types:
The Landing dataset stores data from source systems "as-is," representing the bronze layer in the Medallion Architecture. Each source can have one or more tables within this dataset. The tables in the Landing dataset are configured in dbt as source tables, thereby acting as the single source of truth across both development and production environments.
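For illustration, a Landing table might be declared as a dbt source roughly as follows; the source system name, table, and file path here are hypothetical, not actual project definitions:

```yaml
# models/staging/sources.yml (hypothetical path and names)
version: 2

sources:
  - name: example_source          # illustrative source system name
    schema: analytics__landing    # BigQuery dataset holding the landing tables
    tables:
      - name: orders              # one "as-is" table from the source system
```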
The Staging dataset contains dbt models that pull data from the Landing dataset into the dbt project. Essentially, it serves as a close duplicate of the Landing tables but within the dbt project, and can also be considered part of the bronze layer in the Medallion Architecture.
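As a sketch, a staging model that pulls a landing table into the project could look like this (the column and source names continue the hypothetical example above):

```sql
-- models/staging/stg_orders.sql (hypothetical)
-- Selects the landing table largely as-is into the dbt project.
select
    order_id,
    customer_id,
    ordered_at,
    order_total,
    _loaded_at  -- illustrative load-time metadata column
from {{ source('example_source', 'orders') }}
```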
The Refined dataset represents the silver layer. This dataset contains models that structure data in 3rd Normal Form (3NF) or 3NF-like models. These models are designed to ensure data consistency and quality, making the data more reusable for downstream processing.
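A refined model might then clean and type the staged data into a reusable entity, along these lines (again purely illustrative):

```sql
-- models/refined/orders.sql (hypothetical)
-- Cleans and deduplicates staged data into a reusable, 3NF-style entity.
select distinct
    order_id,
    customer_id,
    ordered_at,
    cast(order_total as numeric) as order_total
from {{ ref('stg_orders') }}
where order_id is not null
```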
The Publish dataset corresponds to the gold layer and contains aggregated, use-case specific models. These models are de-normalized and read-optimized, designed to meet the needs of specific reporting and data consumption use cases.
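A publish model would typically aggregate refined data for one specific reporting need, for example:

```sql
-- models/publish/daily_order_summary.sql (hypothetical)
-- De-normalized, read-optimized model for a specific reporting use case.
select
    date(ordered_at) as order_date,
    count(*) as order_count,
    sum(order_total) as total_order_value
from {{ ref('orders') }}
group by 1
```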
The Intermediate dataset includes transformations that occur between the staging, refined, and publish datasets. These models are used to improve code clarity and manage complex transformations that don't fit neatly into the other categories.
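For instance, a complex join could be isolated into an intermediate model so the downstream refined or publish models stay readable (the `stg_customers` model here is hypothetical):

```sql
-- models/intermediate/int_orders_enriched.sql (hypothetical)
-- Isolates a join so downstream models stay simple and readable.
select
    o.order_id,
    o.ordered_at,
    o.order_total,
    c.customer_name
from {{ ref('stg_orders') }} as o
left join {{ ref('stg_customers') }} as c
    on o.customer_id = c.customer_id
```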
Datasets are differentiated by appending a type suffix to the dataset name:
```
dataset_name__{dataset_type}
```

The dataset name is `analytics`. Consequently, the combined naming convention is:

```
{developer_initials}__analytics__{dataset_type}
```
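For example, a developer with initials `sp` working on silver-layer models would use the dataset `sp__analytics__refined`, while the corresponding production dataset (with no initials prefix) is `analytics__refined`.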
This dbt project uses a local dbt-core environment. Once your local environment is set up, you can develop, test, and run data transformations on development datasets before deploying them to production.
- `dbt build`: Compiles, runs, and tests your dbt models.
- `dbt run`: Executes your dbt models.
- `dbt test`: Runs tests on your dbt models.
- `dbt seed`: Loads CSV files from the `seeds` directory into the database.
- `dbt compile`: Compiles your dbt models into SQL queries.
For more detailed documentation, refer to the dbt commands reference.
Use the `--select` option to specify which models to `build`, `run`, or `test`. The `+` symbol includes dependencies: a leading `+` (as in `+my_model`) selects upstream dependencies, and a trailing `+` (as in `my_model+`) selects downstream dependents.
Example: run a specific model and its upstream dependencies:

```shell
dbt run --select +my_model
```
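Other common selector variations include:

```shell
dbt run --select my_model+        # my_model and everything downstream of it
dbt build --select +my_model+     # my_model plus all upstream and downstream dependencies
dbt test --select models/refined  # all models under a directory (path selector)
```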
- **Add or Modify SQL Files in the Models Directory**
  - Define your data transformations by adding or modifying SQL files in the `models` directory.
  - Use `dbt run` to test and iterate on these models.
  - Inspect results in your development datasets in the Google Cloud Console or the dbt query result inspector.
- **Add/Update Documentation and Tests**
  - Add model and column definitions, as well as tests, to the `.yml` files for `refined` and `publish` models (see the sketch after this workflow).
  - Validate tests by running `dbt test` or `dbt build`.
- **Commit Changes to Version Control**
  - Changes committed to the `main` branch are reflected in the production datasets.
  - These changes take effect after the next production `dbt run` (either at UTC midnight or after the next data ingestion job).
  - To apply changes immediately to production, manually trigger the `daily_dbt_build` DAG in Airflow.
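As referenced in the workflow above, a minimal sketch of a `.yml` entry with model definitions and tests might look like this (the model and column names are illustrative):

```yaml
# models/refined/orders.yml (hypothetical)
version: 2

models:
  - name: orders
    description: Cleaned, 3NF-style orders entity.
    columns:
      - name: order_id
        description: Unique identifier for an order.
        tests:
          - unique
          - not_null
```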
- Python
- Google Cloud CLI
- **Create a Python Virtual Environment and Activate It**

  ```shell
  python -m venv dbt-env
  source dbt-env/bin/activate
  ```

  Ensure the venv directory is excluded in `.gitignore` if stored in the project root.
- **Install necessary dependencies**

  ```shell
  pip install -r requirements.txt
  ```
- **Configure `profiles.yml`**

  Create the following `profiles.yml` file in `/Users/{user}/.dbt` and fill in the missing values:

  ```yaml
  spalojoki_dataplatform:
    target: dev
    outputs:
      dev:
        type: bigquery
        method: oauth
        project: {{GCP_PROJECT_ID}}
        dataset: {{initials}}__analytics
        threads: 4
  ```
  Set up Google Cloud CLI authentication:

  ```shell
  gcloud auth application-default login
  ```

  For alternative authentication methods, refer to the dbt documentation.
- **Check Configuration**

  Run `dbt debug` to verify the configuration.

- **Set up the linting pre-commit hook**

  ```shell
  pre-commit install
  ```
- **Optional: Install the dbt Power User VS Code Plugin**

  Provides lineage and other features.
Daily production builds are orchestrated using Airflow, with the DAGs defined in the spalojoki-dataplatform-composer repository. The production environment dbt profile is stored in the `prod_profile` directory of this repository.
The `main` branch serves as the source for production deployment. Changes merged into `main` are automatically applied during the next daily dbt build or after the subsequent data ingestion job. If you need to apply changes immediately to production, you can manually trigger the `daily_dbt_build` DAG in Airflow.
For details on accessing and configuring Airflow, refer to the spalojoki-infrastructure repository.