NOTE: This project is still very much a work in progress, with much of the larger model restructuring still to come, see the TODO file for more info.
First and foremost, this project is based off of the dbt GA4 Package by Velir, but has been modified and refactored for internal purposes. This project uses Google Analytics 4 BigQuery Exports as its source data, and offers useful base transformations to provide report-ready dimension & fact models that can be used for reporting purposes, blending with other data, and/or feature engineering for ML models.
Find more info about Google Analytics 4 BigQuery Exports here.
Features Overview:
- Four final tables—
ga4__events
,ga4__pages
,ga4__sesssions
, andga4__users
—that are completed unnested to be wide & denomalized for easy querying by the end-user. - Conversion of the the day-shared
events_YYYYMMDD
&events_intraday_YYYYMMDD
tables into singular date-partitionioned incremental base models. - Dynamically flattens
event_params
into their own individual columns. - Dynamically flattens
user_props
into their own individual columns. - Dynamically extracts & flattens URL
query_params
(e.g.,gclid
,fbclid
,_ga
) into their own individual columns. - Custom
Variables
. See here for more info. - Custom
Marcros
. See here for more info.
This project and any future projects that may be based off of this intial dbt_ga4_project
, will be following This Project's Style Guide...IN PROGRESS, which borrows ideals from the following Style Guides:
DAG Overview
NOTE: This DAG Image is NOT current & will continue to CHANGE until all models are finalized.
Model Name | Description |
---|---|
ga4__events | This is the table for event-level metrics & dimensions, that has been transformed to be wide & denomalized for easier quering. |
ga4__pages | This is the table for page-level metrics & dimensions, such as page_views , exits , and users . This table is grouped by page_title , event_date , and page_path . |
ga4__sessions | This is the table for session-level metrics & dimensions, such as is_engaged_session , engagement_duration , and page_views . This table is grouped by both session_key and user_key . |
ga4__users | This is the table for user-level metrics & dimensions, such as first & last_seen_date , geo , and traffic_source . This table is grouped by the hashed user_key dimension, which is based on user_id , or user_pseudo_id if one doesn't exist. |
Model Name | Description |
---|---|
stg_ga4__events | Creates a table with event data that is enhanced with useful event_keys , page_keys , session keys , and user_keys . |
stg_ga4__event_params | Creates a table that unnests all of the event parameters specific to each event (e.g. page_view , click , or scroll ), except for those marked in the dbt_project.yml file. |
stg_ga4__traffic_sources | Creates a table that designates a default_channel_grouping via the source , medium , campaign columns. |
stg_ga4__user_props | Creates a table that unnests the user_properties , except for those marked in the dbt_project.yml file. |
stg_ga4__query_params | Maps any and all query parameters (e.g. gclid , fbclid , etc.) contained in each event's page_location . |
stg_ga4__conversions | Creates a table for the events that you mark as a conversion_event in the dbt_project.yml file. |
int_ga4__events_joined | ...[TO DO]... |
int_ga4__pages_grouped | ...[TO DO]... |
int_ga4__sessions_grouped | ...[TO DO]... |
int_ga4__users_grouped | ...[TO DO]... |
NOTE: These Macros are also not finalized & are likely to change.
get_first(by_column_name
,from_column_name
) source
This macro returns the FIRST
position of a specified from_column_name
, which is partioned by the by_column_name
.
Args:
by_column_name
(required): The name of the column which you want to partition your selction by.from_column_name
(required): The name of the column to get the first value of.
Usage:
{{ get_first('<by_column_name>', '<from_column_name>') }}
Example: Get the landing_page of a corresponding Session by selecting the first page_path
using that Session's session_key
.
SELECT
{{ get_first('session_key', 'page_path') }} AS landing_page
...
get_last(by_column_name
,from_column_name
) source
This macro returns the LAST
position of a specified from_column_name
, which is partioned by the by_column_name
.
Args:
by_column_name
(required): The name of the column which you want to partition your selction by.from_column_name
(required): The name of the column to get the last value of.
Usage:
{{ get_last('<by_column_name>', '<from_column_name>') }}
Example: Get the last event_key
for a corresponding Session using that Session's session_key
.
SELECT
{{ get_last('session_key', 'event_key') }} AS last_session_event_key,
...
extract_hostname_from_url(url
) source
This macro extracts the hostname
from a column containing a url
.
Args:
url
(required): The column containting URLs.
Usage:
{{ extract_hostname_from_url('<url>') }}
Example: Extract the hostname
from the page_location
column.
SELECT
{{ extract_hostname_from_url('page_location') }} AS page_hostname,
...
extract_query_string_from_url(url
) source
This macro extracts the query_string
from a column containing a url
.
Args:
url
(required): The column containting URLs.
Usage:
{{ extract_query_string_from_url('<url>') }}
Example: Extract the query_string
from the page_location
column.
SELECT
{{ extract_query_string_from_url('page_location') }} AS page_query_string,
...
remove_query_parameters(url
, [parameters]
) source
This macro removes the specified parameters
from a column containing a url
.
Args:
url
(required): The column containting URLs.parameters
(required, default=[]
): A list of query parameters to remove from the URL.
Usage:
{{ remove_query_parameters('<url>', '[parameters]') }}
Example: Remove the parameters: gclid
, fbclid
, and _ga
from the page_location
column.
{% set parameters = ['gclid','fbclid','_ga'] %}
SELECT
{{ remove_query_parameters('page_location', parameters) }} AS clean_page_location,
...
unnest_by_key(column_to_unnest
, key_to_extract
, value_type
= "string") source
This macro unnests a single key's value from an array. This macro will dynamically alias the sub-query with the name of the column_to_unnest
.
Args:
column_to_unnest
(required): The array column to unnest the key's value from.key_to_extract
(required): The key by which to get the corresponding value for.value_type
(optional, default="string"): The data type of the key's value column.
Usage:
{{ unnest_by_key('<column_to_unnest>', '<key_to_extract>', '<value_type>') }}
Example: Unnest the corresponding values for the keys: page_location
and ga_session_number
from the nested event_params
column.
SELECT
-- Unnest the default STRING value type
{{ unnest_by_key('event_params', 'page_location') }},
-- Unnest the INT value type
{{ unnest_by_key('event_params', 'ga_session_number', 'int') }},
...
unnest_by_key_alt(column_to_unnest
, key_to_extract
, value_type
= "string") source
This macro unnests a single key's value from an array. This macro allows for a custom alias named sub-query.
Args:
column_to_unnest
(required): The array column to unnest the key's value from.key_to_extract
(required): The key by which to get the corresponding value for.value_type
(optional, default="string"): The data type of the key's value column.
Usage:
{{ unnest_by_key_alt('<column_to_unnest>', '<key_to_extract>', '<value_type>') }} AS <custom_alias_name>,
Example: Unnest the corresponding values for the keys: page_location
and ga_session_number
from the nested event_params
column.
SELECT
-- Unnest the default STRING value type & use a custom alias
{{ unnest_by_key_alt('event_params', 'page_location') }} AS url,
-- Unnest the INT value type & use a custom alias
{{ unnest_by_key_alt('event_params', 'ga_session_number', 'int') }} AS session_number,
...
get_event_params() source
This macro will dynamically return all of the keys
and their corresponding value_types
found in the event_params
array column.
- This macro will exclude event_params added to the
excluded_event_params
variable, which is specified in thedbt_project.yml
file.
Usage / Example:
SELECT
{% for event_param in get_event_params() -%}
{{ unnest_by_key('event_params', event_param['event_param_key'], event_param['event_param_value']) }}
{{- "," if not loop.last }}
{% endfor %}
...
default_channel_grouping(source
, medium
, source_category
) source
This macro determines the default_channel_grouping
and will result in one the following classifications:
Direct
Paid Social
Oraginc Social
Email
Affiliates
Paid Shopping
Paid Search
Display
Other Advertising
Organic Search
Organic Video
Organic Shopping
Audio
SMS
(Other)
Args:
source
(required): The source column used in determining the default channel grouping.medium
(required): The medium column used in determining the default channel grouping.source_category
(required): The source category column used in determining the default channel grouping. These are desiganted in thega4_source_categories.csv
seed file.
Usage:
{{ default_channel_grouping('<source>', '<medium>', '<source_category>') }}
Example:
SELECT
{{ default_channel_grouping('source', 'medium', 'source_category') }} AS default_channel_grouping,
...
Seed File | Description |
---|---|
ga4_source_categories.csv | Google's mapping between source and source_category . More info and the download can be found here. |
Make sure to run dbt seed
before running dbt run
.
...[TO DO]...
This package assumes that you have an existing DBT project with a BigQuery profile and a BigQuery GCP instance available with GA4 event data loaded. Source data is located using the following variables which must be set in your dbt_project.yml
file.
vars:
project: '<gcp_project>' # Set your Project ID here.
dataset: '<ga4_dataset>' # Set your Dataset name here.
start_date: 'YYYYMMDD' # Set the start date that you want to retrieve data from.
frequency: 'daily' # daily|streaming|daily+streaming Match to the type of export configured in GA4; daily+streaming appends today's intraday data to daily data.
If you don't have any GA4 data of your own, you can connect to Google's public data set with the following settings:
vars:
project: 'bigquery-public-data'
dataset: 'ga4_obfuscated_sample_ecommerce'
start_date: '20210120'
Find more info about the GA4 obfuscated dataset here.
NOTE: These Variables are also NOT finalized & are LIKELY to change.
Setting any query_parameter_exclusions
will remove query string parameters from the page_location
field for all downstream processing. Original parameters are captured in a new original_page_location
field. Ex:
vars:
query_parameter_exclusions: ['gclid', 'fbclid', '_ga']
Specific events can be set as conversions with the conversion_events
variable in your dbt_project.yml
file. These events will be counted against each session and included in the final mart models. Ex:
vars:
conversion_events: ['purchase', 'download']
Specific events can be set as considerations with the conversion_events
variable in your dbt_project.yml
file. These events will be counted against each session and included in the final mart models. Ex:
vars:
consideration_events: ['cta_click', 'view_search_results']
Set specific events to be stages in a funnel.
vars:
funnel_stages: ['begin_checkout', 'add_shipping_info', 'add_payment_info', 'purchase']
Exclude specific events from the final tables.
vars:
excluded__events: ['session_start']
Exclude specific event parameters from the final tables.
vars:
excluded__event_params: ['ga_session_id', 'page_location', 'ga_session_number', 'session_engaged', 'engagement_time_msec', 'entrances', 'page_title', 'page_referrer', 'source', 'medium', 'campaign', 'debug_mode', 'term', 'clean_event', 'value', 'tax', 'coupon', 'promotion_name', 'transaction_id']
Exclude specific default columns from the final tables.
vars:
excluded__columns: ['event_previous_timestamp', 'event_bundle_sequence_id', 'event_server_timestamp_offset', 'user_id', 'user_pseudo_id', 'stream_id', 'ga_session_id', 'privacy_info', 'event_dimensions', 'app_info']
Exclude specific user properties from the final tables.
vars:
excluded__user_props: ['logged_in']
Include specific query parameters to be in the final tables.
vars:
included__query_params: ['utm_source', 'utm_medium', 'utm_campaign', 'utm_content', 'utm_term', 'gclid', 'fbclid', 'gclsrc', '_ga']
- GA4 Resources:
- SQL & BigQuery Resources:
- dbt Resources:
- Project References:
- GA4 dbt Package
- Stacktonic dbt Example Project
- Also inspired by this