1 Settings for the Data Warehouse and the ETL

Tom Vogels edited this page Aug 14, 2017 · 1 revision

Overview

The configuration files for Arthur govern:

  • Where data is loaded from (upstream databases or S3 locations)
  • Where intermediate data is stored
  • Which schemas and tables are created as the outcome of transformations
  • What users and groups exist and what their privileges are
  • Which errors can or cannot be ignored during an ETL run
  • Where ETL events are sent (see also 3-Logging-and-event-tables)
  • How types in upstream databases can be safely coerced into data types available in Redshift

This configuration is separate from the description of the schema (and attributes) of relations or any queries related to creating or testing relations.

Details specific to the configuration of the data warehouse are in 1.1 Data Warehouse.

Configuration for AWS is in 1.2 AWS.

"Search path" and loading

If the environment variable DATA_WAREHOUSE_CONFIG is set, its value is used first: either the single file it points to, or all files in the directory it points to, are read as configuration files. You can specify additional files or directories using the --config command line option.
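As a minimal sketch of that lookup (the file-or-directory behavior is taken from the description above; all names here are placeholders):

```shell
# The value of DATA_WAREHOUSE_CONFIG may name a single file or a directory;
# a directory contributes all of its files.
mkdir -p demo_cfg
touch demo_cfg/a.yaml demo_cfg/b.yaml
DATA_WAREHOUSE_CONFIG=demo_cfg

if [ -d "$DATA_WAREHOUSE_CONFIG" ]; then
    # A directory: every file inside is a configuration file.
    config_files=$(ls "$DATA_WAREHOUSE_CONFIG")
else
    # A single file: use it as-is.
    config_files=$DATA_WAREHOUSE_CONFIG
fi
echo "$config_files"
rm -r demo_cfg
```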

Some default settings are in the defaults file. But you will have to add at least one more configuration file with settings for a data warehouse.

Two concepts are important here:

  1. Files are read in alphabetical order.
  2. Values in later files overwrite values from earlier ones.

Because of this ordering, avoid a hierarchy inside configuration directories.

Note During deployment, we'll flatten all configuration files into one config directory.
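The override rule can be sketched in plain shell (file names and the ETL_ENV variable are hypothetical; the point is only the ordering):

```shell
# Two settings files both define ETL_ENV; reading them in alphabetical
# order means the later file's value wins.
mkdir -p demo_config
printf 'ETL_ENV=development\n' > demo_config/10_defaults.sh
printf 'ETL_ENV=validation\n'  > demo_config/90_overrides.sh

# Shell globs expand in alphabetical order, matching the read order above.
for f in demo_config/*.sh; do
    . "$f"
done
echo "$ETL_ENV"   # prints "validation"
rm -r demo_config
```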

Credentials

We suggest that you store the credentials for connecting to upstream sources and to the data warehouse itself in three separate credential files.

Upstream databases and development

Use a credentials.sh file to store

  • upstream credentials
  • credentials to connect to a cluster to be used for development

This file should live locally in your config directory. You will also have to copy it into S3, into every directory that is part of your development process.
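As a concrete sketch, a minimal credentials.sh might look like this. The variable names DATA_WAREHOUSE_ADMIN and DATA_WAREHOUSE_ETL are the ones used later on this page; the connection strings are placeholders and their exact format is an assumption — replace them with your cluster's real values:

```shell
# Write a hypothetical credentials.sh for local development.
cat > credentials.sh <<'EOF'
DATA_WAREHOUSE_ADMIN=postgres://admin_user:password@dev-cluster.example.com:5439/dev
DATA_WAREHOUSE_ETL=postgres://etl_user:password@dev-cluster.example.com:5439/dev
EOF

# Then copy the file next to the rest of your configuration in S3
# (bucket and prefix are placeholders):
# aws s3 cp credentials.sh s3://your-etl-bucket/development/config/credentials.sh
```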

Staging and validation

Use a credentials_validation.sh file to store credentials pointing at a scratch database that you can use for validation. This should probably live on the development cluster.

This file could live locally in a validation directory. It needs to contain settings for DATA_WAREHOUSE_ADMIN and DATA_WAREHOUSE_ETL to connect to your cluster. You can then use

arthur.py -c validation initialize
arthur.py -c validation load --skip-copy

Note that Arthur will read both files: config/credentials.sh (because it's in the default directory) and validation/credentials_validation.sh (because it's on the command line). Settings in credentials_validation.sh will override earlier settings in credentials.sh.

Production

Use a credentials_production.sh file to store credentials for the production database. Guard it well.

If you add it to the production directory, then you can use

arthur.py -c production load

Note that Arthur will read both files: config/credentials.sh (because it's in the default directory) and production/credentials_production.sh (because it's on the command line). Settings in credentials_production.sh will override earlier settings in credentials.sh.