Skip to content

1.1 Data Warehouse (Sources, transformations, users)

Tom Vogels edited this page Aug 14, 2017 · 1 revision

Upstream sources

Each upstream source will have its own schema in the data warehouse. We support loading data from upstream databases (PostgreSQL) or S3 locations.

The sources array should then contain objects which have a name and include_tables field. The other fields depend on the type of source.

Upstream databases

The sources array should contain objects which have a name, read_access and include_tables field. The name will be used later to create a schema in the data warehouse. The read_access field is used to store the name of an environment variable that contains the connection string to the upstream database. Note that arthur.py can read a file with environment variables so that the variables don't necessarily have to be set when starting this script.

Each source also indicates what user groups should be able to select from or delete/insert/update the rows of its tables. Groups with only SELECT access are listed in a readers array, while groups that can modify the table contents are writers.

Finally, include_tables is a list of "glob" patterns to select which of the upstream tables should actually be imported into the data warehouse. You can also specify a list with exclude_tables that lists "glob" patterns of tables that should be excluded.

Also, add the optional description field to add a note about the upstream source.

"sources": [
    {
        "name": "www",
        "read_access": "DATABASE_URI",
        "include_tables": [ "public.*" ]
    }
]

A better example:

"sources": [
    {
        "name": "www",
        "description": "application database from our web site",
        "read_access": "DATABASE_URI",
        "include_tables": [ "public.*" ],
        "exclude_tables": [ "public.passwords" ],
        "readers": [ "analyst_ro" ],
        "writers": [ "www_loader_rw" ]
    }
]

S3 Locations

We can also load data directly from locations in S3. This is useful for data that is either not changing (think reference data or historic data) or is generated by other systems that directly write into S3.

Readers and writers work as above.

"sources": [
    {
        "name": "static",
        "s3_bucket": "my-company-library",
        "s3_path_template": "${prefix}/static_files",
        "include_tables": [ "library.*" ]
        "readers": [ "analyst_ro" ],
        "writers": [ "static_loader_rw" ]
    }
]

The s3_path_template may contain references to prefix, today and yesterday. The value prefix will be whatever is set in arthur.py, meaning the user name or the value of the --prefix argument. The values of today and yesterday are dates matching a "year/month/day" format.

Data Warehouse

Adding schemas in the data warehouse for transformations

The DW settings schemas section denotes the namespaces where CTAS or VIEW relations are created.

These are not data sources, but the transformation result of the data sources. As a result, only the ETL user itself should have rights to modify these tables, so rather than specify readers and writers, schemas only specify groups, which are the user groups allowed SELECT access.

    ...
    "schemas": [
        ...
        {
            "name": "star",
            "description": "star schema for all orders and users information",
            "groups": [ "analyst_ro" ]
        }
    ]
    ...

Users and groups

Users and groups are created as part of the initialize command. Any group mentioned in schema permissions arrays or as an attribute of a configured user is created.

Permissions to act on a given table are typically driven by a user's group. Permissions on a relation will match what a schema's definition indicates only when that relation is in standard position. When relations are in private positions (backup or staging), only etl has any permissions on them (or their schemas).

Redshift's implementation of a table/schema's owner works on the basis of users rather than groups. All schemas in the DW are owned by the ETL user, unless the configuration of a user supplies a schema along with that user's "group" and "name". In that case, the user gets a privately owned schema where they (but not other members of their user group) can create tables. This is useful for integrating with in-database BI tools like Looker. However, some manual (ie, via SQL rather than Arthur) user management may still be needed.