1 Settings for the Data Warehouse and the ETL
The configuration files for Arthur govern:
- Where data is loaded from (upstream databases or S3 locations)
- Where intermediate data is stored
- Which schemas and tables are created as outcome of transformations
- What users and groups exist and what their privileges are
- Which errors can be ignored during an ETL run (and which cannot)
- Where ETL events are sent to (see also 3-Logging-and-event-tables)
- How types in upstream databases can be safely coerced into the data types available in Redshift
This configuration is separate from the description of the schema (and attributes) of relations or any queries related to creating or testing relations.
Details specific to the configuration of the data warehouse are in 1.1 Data Warehouse.
Configuration for AWS is in 1.2 AWS.
If the environment variable DATA_WAREHOUSE_CONFIG is set, then the file it points to (or all files in the directory it points to) is read first for configuration settings. You can specify additional files or directories using the --config command line option.
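As a sketch of how this fits together (the directory names below are made up for illustration; DATA_WAREHOUSE_CONFIG and --config are the settings described above):

```shell
# DATA_WAREHOUSE_CONFIG may point at a single file or a directory of config files.
export DATA_WAREHOUSE_CONFIG=./config    # files in ./config are read first

# Additional files or directories can be added on the command line:
arthur.py --config ./validation load
```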
Some default settings are in the defaults file, but you will have to add at least one more configuration file with settings for a data warehouse.
Two concepts are important here:
- Files are read in alphabetical order.
- Values in later files overwrite values from earlier ones (see also 2.5 Avoid a hierarchy inside configuration directories).
Note: During deployment, we'll flatten all configuration files into one config directory.
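The alphabetical-order rule can be demonstrated with plain shell sourcing; the file names and the ETL_BUCKET variable below are invented for this demonstration and are not part of Arthur's configuration:

```shell
# Two hypothetical config files that both set the same variable.
mkdir -p /tmp/config_demo
echo 'ETL_BUCKET="dev-bucket"'  > /tmp/config_demo/00_defaults.sh
echo 'ETL_BUCKET="team-bucket"' > /tmp/config_demo/99_override.sh

# Files are read in alphabetical order; a value in a later file
# overwrites the value from an earlier one.
for f in /tmp/config_demo/*.sh; do . "$f"; done

echo "$ETL_BUCKET"   # team-bucket -- the file sorted last wins
```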
We suggest that you deal with credentials that allow connections to upstream sources or the data warehouse itself by using three different credential files.
Use a credentials.sh file to store:
- upstream credentials
- credentials to connect to a cluster to be used for development

This file should live locally in your config directory. You will have to copy it to S3 into all directories that are part of your development process.
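For illustration, such a file is a set of shell variable assignments. The connection strings and the upstream variable name below are placeholders; DATA_WAREHOUSE_ADMIN and DATA_WAREHOUSE_ETL are the settings this page mentions for connecting to a cluster:

```shell
# config/credentials.sh -- sketch only; all values are placeholders.

# Credentials for the development cluster:
DATA_WAREHOUSE_ADMIN="postgres://admin:<password>@dev-cluster.example.com:5439/dev"
DATA_WAREHOUSE_ETL="postgres://etl:<password>@dev-cluster.example.com:5439/dev"

# Upstream credentials -- the variable name here is hypothetical:
DATA_WAREHOUSE_ORDERS="postgres://reader:<password>@orders-db.example.com:5432/orders"
```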
Use a credentials_validation.sh file to store credentials pointing at a scratch database that you can use for validation. This database should probably live on the development cluster.
This file could live locally in a validation directory. It needs to contain settings for DATA_WAREHOUSE_ADMIN and DATA_WAREHOUSE_ETL to connect to your cluster. You can then use:

```
arthur.py -c validation initialize
arthur.py -c validation load --skip-copy
```
Note that Arthur will read both files, config/credentials.sh (because it's in the default directory) and validation/credentials_validation.sh (because it's on the command line). Settings in credentials_validation.sh will override earlier settings in credentials.sh.
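A minimal sketch of such a validation file, assuming the scratch database lives on the development cluster (host, port, and database names are placeholders):

```shell
# validation/credentials_validation.sh -- sketch only; values are placeholders.
DATA_WAREHOUSE_ADMIN="postgres://admin:<password>@dev-cluster.example.com:5439/validation"
DATA_WAREHOUSE_ETL="postgres://etl:<password>@dev-cluster.example.com:5439/validation"
```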
Use a credentials_production.sh file to store credentials for the production database. Guard it well.
If you add it to the production directory, then you can use:

```
arthur.py -c production load
```

Note that Arthur will read both files, config/credentials.sh (because it's in the default directory) and production/credentials_production.sh (because it's on the command line). Settings in credentials_production.sh will override earlier settings in credentials.sh.