-
Notifications
You must be signed in to change notification settings - Fork 11
2 Working with table designs
The Redshift ETL supports the notion of "table designs" which describe
- how to create the table (including how to distribute it)
- what column attributes and constraints exist,
- how to retrieve column values from upstream source in case casts are needed,
- what table attributes and constraints exist.
This is true for tables and CTAS but in case of views they simply list the columns.
The extension of the file must be '.yaml' and the format is YAML. (We encourage the use of a "commented" JSON as a subset of YAML, which makes the syntax of these files easier to debug.)
The basic table design for a table from an upstream data source looks like this:
{
"name": "harryswww.orders",
"source_name": "harryswww.public.orders",
"columns": [
...
]
}
Tables can (and probably should) have a description
field.
Each column must be an object that contains at least the name
, type
and sql_type
information. The "type" here is a generic information that can be readily used when data is in flight, e.g. in Avro data files or Spark Dataframes. The "SQL type" is the type of the column which will be used when the table is created.
Here's a full toy example:
{
"name": "example.toys",
"description": "An example table that stores information about toys",
"source_name": "example.public.toys",
"columns": [
{
"name": "id",
"sql_type": "bigint",
"type": "int"
},
]
}
The exception here is that there is support for "skipped" columns -- these are columns which should not be retrieved from upstream sources but should be tracked so that nobody assumes that they were left off by accident. These columns show up like this:
... {
"name": "encrypted_password",
"skipped": true
} ...
Additional fields for column definitions are:
- the type that the column has in the upstream database:
source_sql_type
- the flag whether column is nullable:
is_null
- the SQL expression which will cast upstream data into the correct column type:
expression
- the compression:
encoding
Attributes describe how the table is distributed over compute nodes: either on all nodes, evenly distributed or distributed by a column.
Constraints describe how a column or a set of columns needs to be unique or references another table.
In most ways, CTAS are just like tables except they do not have an upstream source and are created by running a SQL query instead. Thus all of the information above applies except for "source_name"
which now must be set to the literal "CTAS"
. And note that you also have to provide a file with the query. That file must have the same base name as the table design and have an extension of .sql
.
{
"name": "sales.fact_order",
"source_name": "CTAS",
"description": "Fact table for our order star schema, based on upstream 'www' data",
"columns": [
...
]
}
Additionally, CTAS may have an "identity" column which is most useful for dimension tables.
For views, there's no need (or opportunity) to specify anything other than the names of the columns. So the table design for a view looks like this:
{
"name": "staging.users_without_pii",
"source_name": "VIEW",
"description": "View over users information but without their names and addresses",
"columns": [
{ "name": "id" },
{ "name": "last_visit" },
{ "name": "is_subscriber" },
{ "name": "state_us" }
]
}
Note that "source_name"
must be set to the literal "VIEW"
.