
datafold/terraform-google-datafold


Datafold Google module

This repository provisions infrastructure resources on Google Cloud for deploying Datafold using the datafold-operator.

About this module

⚠️ Important: This module is now optional. If you already have GKE infrastructure in place, you can configure the required resources independently. This module is primarily intended for customers who need to set up the complete infrastructure stack for GKE deployment.

The module provisions Google Cloud infrastructure resources that are required for Datafold deployment. Application configuration is now managed through the datafoldapplication custom resource on the cluster using the datafold-operator, rather than through Terraform application directories.

Breaking Changes

Load Balancer Deployment (Default Changed)

Breaking Change: The load balancer is no longer deployed by default. The default behavior has been toggled to deploy_lb = false.

  • Previous behavior: Load balancer was deployed by default
  • New behavior: Load balancer deployment is disabled by default
  • Action required: If you need a load balancer, you must explicitly set deploy_lb = true in your configuration so the existing load balancer is not destroyed. If it is destroyed, you will need to redeploy it and then update your DNS to point at the new load balancer IP.
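
As a sketch, keeping the load balancer only requires flipping the flag on your existing module block (the module name and source below are illustrative, not a required layout):

```hcl
# Illustrative module block; only the deploy_lb flag is the point here.
module "datafold" {
  source = "github.com/datafold/terraform-google-datafold"

  # Explicitly keep the load balancer now that the default is false.
  deploy_lb = true

  # ... the rest of your existing configuration stays unchanged ...
}
```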

Application Directory Removal

  • The "application" directory is no longer part of this repository
  • Application configuration is now managed through the datafoldapplication custom resource on the cluster

Prerequisites

  • A Google Cloud account, preferably a new isolated one.
  • Terraform >= 1.4.6
  • A customer contract with Datafold
    • The application does not work without credentials supplied by sales
  • Access to our public helm-charts repository

The full deployment will create the following resources:

  • Google VPC
  • Google subnets
  • Google GCS bucket for ClickHouse backups
  • Google Cloud Load Balancer (optional, disabled by default)
  • Google-managed SSL certificate (if load balancer is enabled)
  • Three persistent disk volumes for local data storage
  • Cloud SQL PostgreSQL database
  • A GKE cluster
  • Service accounts for the GKE cluster to perform actions outside of its cluster boundary:
    • Provisioning persistent disk volumes
    • Updating Network Endpoint Group to route traffic to pods directly
    • Managing GCS bucket access for ClickHouse backups

Infrastructure Dependencies: For a complete list of required infrastructure resources and detailed deployment guidance, see the Datafold Dedicated Cloud GCP Deployment Documentation.

Negative scope

  • This module will not provision DNS names in your zone.

How to use this module

  • See the example for a potential setup, which has dependencies on our helm-charts

The example directory contains a single deployment example for infrastructure setup.

Setting up the infrastructure:

  • It is easiest if you have full admin access in the target project.
  • Pre-create a symmetric encryption key that is used to encrypt/decrypt secrets of this deployment.
    • Use the key alias instead of the full key resource link. Put that into locals.tf
  • Certificate Requirements (depends on load balancer deployment method):
    • If deploying load balancer from this Terraform module (deploy_lb = true): Pre-create and validate the SSL certificate in your DNS, then refer to that certificate in main.tf using its domain name (Replace "datafold.example.com")
    • If deploying load balancer from within Kubernetes: The certificate will be created automatically, but you must wait for it to become available and then validate it in your DNS after the deployment is complete
  • Change the settings in locals.tf
    • provider_region = which region you want to deploy in.
    • project_id = The GCP project ID where you want to deploy.
    • kms_profile = The profile you want to use to issue the deployments. Targets the deployment account.
    • kms_key = A pre-created symmetric KMS key. Its only purpose is the encryption/decryption of deployment secrets.
    • deployment_name = The name of the deployment, used in the Kubernetes namespace, container naming, and the Datadog "deployment" Unified Tag.
  • Run terraform init in the infra directory.
  • Run terraform apply in the infra directory. This should complete successfully.
    • Check in the console if you see the GKE cluster, Cloud SQL database, etc.
    • If you enabled load balancer deployment, check for the load balancer as well.
    • The configuration values needed for application deployment will be output to the console after the apply completes.
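
Putting the steps above together, a minimal invocation wiring up the required inputs (per the Inputs table below) might look like this sketch; every value is a placeholder:

```hcl
# Hypothetical minimal invocation; all values below are placeholders.
module "datafold" {
  source = "github.com/datafold/terraform-google-datafold"

  provider_region = "europe-west4"
  provider_azs    = ["europe-west4-a"]
  project_id      = "my-gcp-project"
  deployment_name = "datafold"
  environment     = "production"
  domain_name     = "datafold.example.com"

  common_tags = {
    owner = "data-platform"
  }

  # CIDRs allowed to reach HTTP/HTTPS, and CIDRs the application may reach out to.
  whitelisted_ingress_cidrs = ["203.0.113.0/24"]
  whitelisted_egress_cidrs  = ["0.0.0.0/0"]
}
```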

Application Deployment: After infrastructure is ready, deploy the application using the datafold-operator. Continue with the Datafold Helm Charts repository to deploy the operator manager and then the application through the operator. The operator is the default and recommended method for deploying Datafold.

Infrastructure Dependencies

This module is designed to provide the complete infrastructure stack for Datafold deployment. However, if you already have GKE infrastructure in place, you can choose to configure the required resources independently.

Required Infrastructure Components:

  • GKE cluster with appropriate node pools
  • Cloud SQL PostgreSQL database
  • GCS bucket for ClickHouse backups
  • Persistent disks for persistent storage (ClickHouse data, ClickHouse logs, Redis data)
  • IAM roles and service accounts for cluster operations
  • Load balancer (optional, can be managed by Google Cloud Load Balancer Controller)
  • VPC and networking components
  • SSL certificate (validation timing depends on deployment method):
    • Terraform-managed LB: Certificate must be pre-created and validated
    • Kubernetes-managed LB: Certificate created automatically, validated post-deployment
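
For the Terraform-managed case, the relevant inputs from the Inputs table can be combined as in this sketch (certificate name and domain are placeholders):

```hcl
# Terraform-managed LB reusing a pre-created, DNS-validated certificate.
deploy_lb       = true
create_ssl_cert = false                       # reuse an existing certificate
ssl_cert_name   = "datafold-example-com-cert" # certificate name in GCP
domain_name     = "datafold.example.com"
```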

Alternative Approaches:

  • Use this module: Provides complete infrastructure setup for new deployments
  • Use existing infrastructure: Configure required resources manually or through other means
  • Hybrid approach: Use this module for some components and existing infrastructure for others

For detailed specifications of each required component, see the Datafold Dedicated Cloud GCP Deployment Documentation. For application deployment instructions, continue with the Datafold Helm Charts repository to deploy the operator manager and then the application through the operator.

Detailed Infrastructure Components

Based on the Datafold GCP Deployment Documentation, this module provisions the following detailed infrastructure components:

Persistent Disks

The Datafold application requires 3 persistent disks for storage, each deployed as encrypted Google Compute Engine persistent disks in the primary availability zone:

  • ClickHouse data disk: Serves as the analytical database storage for Datafold. ClickHouse is a columnar database that excels at analytical queries. The default 40GB allocation usually provides sufficient space for typical deployments, but it can be scaled up based on data volume requirements.
  • ClickHouse logs disk: Stores ClickHouse's internal logs and temporary data. Keeping logs on a separate disk prevents log writes from competing with the data disk for IOPS and I/O throughput.
  • Redis data disk: Provides persistent storage for Redis, which handles task distribution and distributed locks in the Datafold application. Redis is memory-first but benefits from persistence for data durability across restarts.

All persistent disks are encrypted by default using Google-managed encryption keys, ensuring data security at rest.
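
The disk sizes can be overridden through the module inputs listed below; this sketch shows illustrative values only:

```hcl
# Illustrative disk-size overrides (values in GB; defaults per the Inputs table).
clickhouse_data_disk_size = 100 # default 40
redis_data_size           = 50  # default 50, shown for completeness
```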

Load Balancer

The load balancer serves as the primary entry point for all external traffic to the Datafold application. The module offers two deployment strategies:

  • External Load Balancer Deployment: Creates a Google Cloud Load Balancer through Terraform (deploy_lb = true)
  • Kubernetes-Managed Load Balancer: Relies on the Google Cloud Load Balancer Controller running within the GKE cluster, deployed by the Datafold application resource. This means Kubernetes creates the load balancer for you.

GKE Cluster

The Google Kubernetes Engine (GKE) cluster forms the compute foundation for the Datafold application:

  • Network Architecture: The entire cluster is deployed into private subnets with Cloud NAT for egress traffic
  • Security Features: Workload Identity, Shielded nodes, Binary authorization, Network policy, and Private nodes
  • Node Management: Supports up to three managed node pools with automatic scaling
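
Extra node pools can be added through the custom_node_pools input; this sketch adds one spot pool and uses the field set from that input's type (all values illustrative):

```hcl
# One extra spot node pool; field set matches the custom_node_pools type.
custom_node_pools = [
  {
    name               = "batch"
    enabled            = true
    initial_node_count = 1
    machine_type       = "n2-standard-8"
    disk_size_gb       = 100
    disk_type          = "pd-balanced"
    spot               = true
    taints = [
      { key = "workload", value = "batch", effect = "NO_SCHEDULE" }
    ]
    min_node_count  = 0
    max_node_count  = 3
    max_surge       = 1
    max_unavailable = 0
    labels          = { pool = "batch" }
  }
]
```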

IAM Roles and Permissions

The IAM architecture follows the principle of least privilege:

  • GKE service account: Basic permissions for logging, monitoring, and storage access
  • ClickHouse backup service account: Custom role for ClickHouse to make backups and store them on Cloud Storage
  • Datafold service accounts: Pre-defined roles for different application components

Cloud SQL Database

The PostgreSQL Cloud SQL instance serves as the primary relational database:

  • Storage configuration: Starts with a 20GB initial allocation that can automatically scale up to 100GB
  • High availability: Intentionally disabled by default to reduce costs and complexity
  • Security and encryption: Always encrypts data at rest using Google-managed encryption keys
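
The database can be sized through the module inputs below; this sketch shows assumed override values, not recommendations:

```hcl
# Illustrative Cloud SQL sizing overrides.
postgres_allocated_storage = 50                  # initial GB, default 20
postgres_instance          = "db-custom-4-15360" # 4 vCPU / 15 GB custom tier
database_version           = "POSTGRES_15"
```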

Initializing the application

After deploying the application through the operator (see the Datafold Helm Charts repository), establish a shell into the <deployment>-dfshell container. It is likely that the scheduler and server containers are crashing in a loop.

All that remains is to run these commands:

  1. ./manage.py clickhouse create-tables
  2. ./manage.py database create-or-upgrade
  3. ./manage.py installation set-new-deployment-params

Now all containers should be up and running.

Requirements

| Name | Version |
|------|---------|
| dns | 3.2.1 |
| google | >= 6.27.0 |

Providers

| Name | Version |
|------|---------|
| google | >= 6.27.0 |
| random | n/a |

Modules

| Name | Source | Version |
|------|--------|---------|
| clickhouse_backup | ./modules/clickhouse_backup | n/a |
| database | ./modules/database | n/a |
| gke | ./modules/gke | n/a |
| load_balancer | ./modules/load_balancer | n/a |
| networking | ./modules/networking | n/a |
| project-iam-bindings | terraform-google-modules/iam/google//modules/projects_iam | n/a |
| project_factory_project_services | terraform-google-modules/project-factory/google//modules/project_services | ~> 18.0.0 |

Resources

No resources.

Inputs

Name Description Type Default Required
add_onprem_support_group Flag to add onprem support group for datafold-onprem-support@datafold.com bool true no
ch_machine_type The machine type for the ch GKE cluster nodes string "n2-standard-8" no
clickhouse_backup_sa_key SA key from secrets string "" no
clickhouse_data_disk_size Data volume size clickhouse number 40 no
clickhouse_db Db for clickhouse. string "clickhouse" no
clickhouse_gcs_bucket GCS Bucket for clickhouse backups. string "clickhouse-backups-abcguo23" no
clickhouse_get_backup_sa_from_secrets_yaml Flag to toggle getting clickhouse backup SA from secrets.yaml instead of creating new one bool false no
clickhouse_username Username for clickhouse. string "clickhouse" no
cloud_router_bgp Flag to toggle cloud router bgp
object(
{
asn = string
# advertise_mode = optional(string, "DEFAULT")
# advertised_groups = optional(list(string))
# advertised_ip_ranges = optional(
# list(
# object({
# range = string
# description = optional(string)
# })
# ),
# []
# )
keepalive_interval = optional(number)
}
)
null no
cloud_router_nats NATs to deploy on this router.
list(object({
name = string
nat_ip_allocate_option = optional(string)
source_subnetwork_ip_ranges_to_nat = optional(string)
nat_ips = optional(list(string), [])
min_ports_per_vm = optional(number)
max_ports_per_vm = optional(number)
udp_idle_timeout_sec = optional(number)
icmp_idle_timeout_sec = optional(number)
tcp_established_idle_timeout_sec = optional(number)
tcp_transitory_idle_timeout_sec = optional(number)
tcp_time_wait_timeout_sec = optional(number)
enable_endpoint_independent_mapping = optional(bool)
enable_dynamic_port_allocation = optional(bool)

log_config = optional(object({
enable = optional(bool, true)
filter = optional(string, "ALL")
}), {})

subnetworks = optional(list(object({
name = string
source_ip_ranges_to_nat = list(string)
secondary_ip_range_names = optional(list(string))
})), [])

}))
[] no
common_tags Common tags to apply to any resource map(string) n/a yes
create_ssl_cert True to create the SSL certificate, false if not bool false no
custom_node_pools Dynamic extra node pools
list(object({
name = string
enabled = bool
initial_node_count = number
machine_type = string
disk_size_gb = number
disk_type = string
spot = bool
taints = list(object({
key = string
value = string
effect = string
}))
min_node_count = number
max_node_count = number
max_surge = number
max_unavailable = number
labels = map(string)
}))
[] no
database_edition The edition of the database (ENTERPRISE or ENTERPRISE_PLUS). If null, automatically determined based on version. string null no
database_name The name of the database string "datafold" no
database_version Version of the database string "POSTGRES_15" no
datafold_intercom_app_id The app id for the intercom. A value other than "" will enable this feature. Only used if the customer doesn't use slack. string "" no
db_deletion_protection A flag that sets delete protection (applied in terraform only, not on the cloud). bool true no
default_node_disk_size Root disk size for a cluster node number 40 no
deploy_lb Whether to deploy the load balancer from this Terraform module bool false no
deploy_neg_backend Set this to true to connect the backend service to the NEG that the GKE cluster will create bool true no
deploy_vpc_flow_logs Flag whether or not to deploy VPC flow logs bool false no
deployment_name Name of the current deployment. string n/a yes
domain_name Provide valid domain name (used to set host in GCP) string n/a yes
enable_ch_node_pool Whether to enable the ch node pool bool false no
environment Global environment tag to apply on all datadog logs, metrics, etc. string n/a yes
gcs_path Path in the GCS bucket to the backups string "backups" no
github_endpoint URL of Github endpoint to connect to. Useful for Github Enterprise. string "" no
gitlab_endpoint URL of GitLab endpoint to connect to. Useful for GitLab Enterprise. string "" no
host_override A valid domain name if they provision their own DNS / routing string "" no
k8s_authorized_networks Map of CIDR blocks that are able to connect to the K8S control plane map(string)
{
"0.0.0.0/0": "public"
}
no
k8s_cluster_version The version of Kubernetes to use for the GKE cluster. The patch/GKE specific version will be found automatically. string "1.28.11" no
k8s_deletion_protection If deletion protection is enabled (terraform feature) bool true no
k8s_maintenance_day Day for maintenance window. Valid values are MO,TU,WE,TH,FR,SA,SU string "WE" no
k8s_maintenance_end The end date and time for the maintenance window. string "2036-01-01T12:00:00Z" no
k8s_maintenance_start The start date and time for the maintenance window. string "2024-01-01T09:00:00Z" no
k8s_node_auto_upgrade Whether to enable auto-upgrade for the GKE cluster nodes bool true no
k8s_node_version The version of the nodes string "1.28.11" no
lb_app_rules Extra rules to apply to the application load balancer for additional filtering
list(object({
action = string
priority = number
description = string
match_type = string # can be either "src_ip_ranges" or "expr"
versioned_expr = string # optional, only used if match_type is "src_ip_ranges"
src_ip_ranges = list(string) # optional, only used if match_type is "src_ip_ranges"
expr = string # optional, only used if match_type is "expr"
}))
[] no
lb_layer_7_ddos_defence Flag to toggle layer 7 ddos defence bool false no
legacy_naming Flag to toggle legacy behavior - like naming of resources bool true no
machine_type The machine type for the GKE cluster nodes string "e2-highmem-8" no
max_node_count The maximum number of nodes in the cluster number 6 no
mig_disk_type https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance_template#disk_type string "pd-balanced" no
postgres_allocated_storage The amount of allocated storage for the postgres database number 20 no
postgres_instance GCP instance type for PostgreSQL database. string "db-custom-2-7680" no
postgres_ro_username Postgres read-only user name string "datafold_ro" no
postgres_username The username to use for the postgres CloudSQL database string "datafold" no
project_id The project to deploy to, if not set the default provider project is used. string n/a yes
provider_azs Provider AZs list, if empty we get AZs dynamically list(string) n/a yes
provider_region Region for deployment in GCP string n/a yes
redis_data_size Redis volume size number 50 no
remote_storage Type of remote storage for clickhouse backups. string "gcs" no
restricted_roles Flag to stop certain IAM related resources from being updated/changed bool false no
restricted_viewer_role Flag to stop certain IAM related resources from being updated/changed bool false no
ssl_cert_name Provide valid SSL certificate name in GCP OR ssl_private_key_path and ssl_cert_path string "" no
ssl_cert_path SSL certificate path string "" no
ssl_private_key_path Private SSL key path string "" no
vpc_cidr Network CIDR for VPC string "10.0.0.0/16" no
vpc_flow_logs_interval Interval for vpc flow logs string "INTERVAL_5_SEC" no
vpc_flow_logs_sampling Sampling for vpc flow logs string "0.5" no
vpc_id Provide ID of existing VPC if you want to omit creation of new one string "" no
vpc_master_cidr_block cidr block for k8s master, must be a /28 block. string "192.168.0.0/28" no
vpc_secondary_cidr_pods Network CIDR for VPC secondary subnet 1 string "/17" no
vpc_secondary_cidr_services Network CIDR for VPC secondary subnet 2 string "/17" no
whitelist_all_ingress_cidrs_lb Normally we filter on the load balancer, but some customers want to filter at the SG/Firewall. This flag will whitelist 0.0.0.0/0 on the load balancer. bool false no
whitelisted_egress_cidrs List of Internet addresses to which the application has access list(string) n/a yes
whitelisted_ingress_cidrs List of CIDRs that can access the HTTP/HTTPS list(string) n/a yes

Outputs

| Name | Description |
|------|-------------|
| clickhouse_backup_sa | Name of the ClickHouse backup service account |
| clickhouse_data_size | Size in GB of the ClickHouse data volume |
| clickhouse_data_volume_id | Volume ID of the ClickHouse data PD volume |
| clickhouse_gcs_bucket | Name of the GCS bucket for the ClickHouse backups |
| clickhouse_logs_size | Size in GB of the ClickHouse logs volume |
| clickhouse_logs_volume_id | Volume ID of the ClickHouse logs PD volume |
| clickhouse_password | Password to use for ClickHouse |
| cloud_provider | The cloud provider creating all the resources |
| cluster_name | The name of the GKE cluster that was created |
| db_instance_id | The database instance ID |
| deployment_name | The name of the deployment |
| domain_name | The domain name on the HTTPS certificate |
| lb_external_ip | The load balancer IP, if it was provisioned |
| neg_name | The name of the Network Endpoint Group where pods are registered from Kubernetes |
| postgres_database_name | The name of the postgres database |
| postgres_host | The hostname of the postgres database |
| postgres_password | The postgres password |
| postgres_port | The port of the postgres database |
| postgres_username | The postgres username |
| redis_data_size | The size in GB of the Redis data volume |
| redis_data_volume_id | The volume ID of the Redis PD data volume |
| redis_password | The Redis password |
| vpc_cidr | The VPC CIDR range |
| vpc_id | The ID of the Google VPC the cluster runs in |
| vpc_selflink | The self-link of the Google VPC the cluster runs in |
| vpc_subnetwork | The subnet in which the cluster is created |

About

A Terraform module for deploying the Datafold infrastructure on Google Cloud.
