1.1 Data Warehouse (Sources, transformations, users)
Each upstream source will have its own schema in the data warehouse. We support loading data from upstream databases (PostgreSQL) or S3 locations.
The sources array should contain objects that each have a name and an include_tables field. The other fields depend on the type of source.

For an upstream database, each source object has a name, read_access and include_tables field. The name will be used later to create a schema in the data warehouse. The read_access field stores the name of an environment variable that contains the connection string to the upstream database.
Note that arthur.py can read a file with environment variables so that the variables don't necessarily have to be set when starting this script.
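For illustration, such a file might simply contain KEY=value lines (the exact file format that arthur.py accepts is not spelled out here, and the connection string is a made-up example; only the variable name DATABASE_URI matches the example below):

    DATABASE_URI=postgresql://etl_user:s3cret@www-db.example.com:5432/www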
Each source also indicates which user groups should be able to select from or delete/insert/update the rows of its tables. Groups with only SELECT access are listed in a readers array, while groups that can modify the table contents are listed under writers.

Finally, include_tables is a list of "glob" patterns that selects which of the upstream tables should actually be imported into the data warehouse. You can also specify an exclude_tables list with "glob" patterns of tables that should be excluded. The optional description field adds a note about the upstream source.

A first example:
"sources": [
{
"name": "www",
"read_access": "DATABASE_URI",
"include_tables": [ "public.*" ]
}
]
A better example:
"sources": [
{
"name": "www",
"description": "application database from our web site",
"read_access": "DATABASE_URI",
"include_tables": [ "public.*" ],
"exclude_tables": [ "public.passwords" ],
"readers": [ "analyst_ro" ],
"writers": [ "www_loader_rw" ]
}
]
We can also load data directly from locations in S3. This is useful for data that is either not changing (think reference data or historic data) or is generated by other systems that directly write into S3.
Readers and writers work as above.
"sources": [
{
"name": "static",
"s3_bucket": "my-company-library",
"s3_path_template": "${prefix}/static_files",
"include_tables": [ "library.*" ]
"readers": [ "analyst_ro" ],
"writers": [ "static_loader_rw" ]
}
]
The s3_path_template may contain references to prefix, today and yesterday. The value of prefix will be whatever is set in arthur.py, meaning the user name or the value of the --prefix argument. The values of today and yesterday are dates matching a "year/month/day" format.
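For instance, assuming today is referenced with the same ${...} substitution syntax as ${prefix}, a sketch of a source that picks up a daily drop might use (the path and date below are made-up values):

    "s3_path_template": "${prefix}/daily_drop/${today}"

Run with --prefix tom on 2019-06-01, this template would resolve to tom/daily_drop/2019/06/01 within the configured S3 bucket.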
The DW settings schemas section denotes the namespaces where CTAS or VIEW relations are created. These are not data sources, but the transformation result of the data sources. As a result, only the ETL user itself should have rights to modify these tables, so rather than specify readers and writers, schemas only specify groups, which are the user groups allowed SELECT access.
...
"schemas": [
...
{
"name": "star",
"description": "star schema for all orders and users information",
"groups": [ "analyst_ro" ]
}
]
...
Users and groups are created as part of the initialize command. Any group mentioned in schema permissions arrays or as an attribute of a configured user is created.

Permissions to act on a given table are typically driven by a user's group. Permissions on a relation will match what a schema's definition indicates only when that relation is in standard position. When relations are in private positions (backup or staging), only etl has any permissions on them (or their schemas).
Redshift's implementation of a table's or schema's owner works on the basis of users rather than groups. All schemas in the DW are owned by the ETL user, unless the configuration of a user supplies a schema along with that user's "group" and "name". In that case, the user gets a privately owned schema where they (but not other members of their user group) can create tables. This is useful for integrating with in-database BI tools like Looker. However, some manual (i.e., via SQL rather than Arthur) user management may still be needed.
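As a rough sketch of what such a user entry could look like (the surrounding users array and the exact key names are assumptions based on the description above, and the user, group and schema names are made up):

    "users": [
        {
            "name": "looker",
            "group": "analyst_ro",
            "schema": "looker_scratch"
        }
    ]

With settings along these lines, the looker user would be a member of the analyst_ro group and own a looker_scratch schema in which it (but not other members of that group) can create tables.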