3 Logging and event tables
We use standard Python logging, which by default writes to arthur.log files in the working directory. Log files are rotated at about 5MB, and the last 5 log files are kept.
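For reference, this behavior can be reproduced with the standard library's RotatingFileHandler; a minimal sketch, where the format string and the exact limits are assumptions based on the description above:

import logging
from logging.handlers import RotatingFileHandler

# Rotate arthur.log at roughly 5MB and keep the last 5 files
# (arthur.log.1 through arthur.log.5).
handler = RotatingFileHandler("arthur.log", maxBytes=5 * 1024 * 1024, backupCount=5)
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s"))

root_logger = logging.getLogger()
root_logger.setLevel(logging.INFO)
root_logger.addHandler(handler)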
While the ETL is running, log events are also emitted. They show up as JSON in the log lines for easy consumption by log listeners. It is also possible to have them sent to other data stores.
The minimal version of an event when running locally looks like this:
{
    'etl_id': "UUID",
    'timestamp': float,
    'target': "target table in the warehouse",
    'step': "sub-command like 'dump' or 'load'",
    'event': "what happened: start, finish, or fail"
}
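As a rough illustration only (the helper name and exact fields here are assumptions, not Arthur's actual code), such an event could be assembled and written out as a single JSON line:

import json
import logging
import time
import uuid

event_logger = logging.getLogger("arthur.events")

def emit_event(etl_id, target, step, event, **extra):
    # Keep the whole event on one line so log listeners can parse it
    # with a plain JSON parser.
    payload = dict(etl_id=etl_id, timestamp=time.time(), target=target, step=step, event=event)
    payload.update(extra)
    event_logger.info(json.dumps(payload, sort_keys=True))

emit_event(str(uuid.uuid4()), "www.orders", "load", "start")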
Additional fields may be present. In the case of an error, a code and message will be added within an errors array. (We may choose to add a similar field for warnings at some point.)
{
    ...
    'errors': [
        {
            'code': "PG_ERROR",
            'message': "You don't have permission"
        }
    ]
    ...
}
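Continuing the sketch above, a failure would simply pass the errors list through the extra fields:

emit_event(
    str(uuid.uuid4()), "www.orders", "load", "fail",
    errors=[{"code": "PG_ERROR", "message": "You don't have permission"}],
)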
When running within an AWS cluster, the ID of the EMR cluster is captured, along with (optionally) the IDs of the (master) instance, the data pipeline, and the step.
{
    ...
    'aws': {
        'emr_id': "j-UUID",
        'instance_id': "i-UUID",
        'data_pipeline_id': "df-UUID",
        'step_id': "s-UUID"
    }
    ...
}
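Most of these IDs can be discovered on the cluster itself. The following is a hypothetical sketch: the job-flow.json file and the instance metadata endpoint are standard on EMR/EC2 nodes, but how Arthur actually collects the IDs, and where the data pipeline and step IDs come from, may differ:

import json
import urllib.request

def gather_aws_info():
    # Return the EMR cluster and instance IDs, or {} when running off-cluster.
    try:
        # EMR writes cluster information to this file on every node.
        with open("/mnt/var/lib/info/job-flow.json") as f:
            emr_id = json.load(f)["jobFlowId"]
        # The EC2 instance metadata service knows the current instance's ID.
        with urllib.request.urlopen(
                "http://169.254.169.254/latest/meta-data/instance-id", timeout=1) as response:
            instance_id = response.read().decode()
    except (OSError, KeyError, ValueError):
        return {}
    return {"emr_id": emr_id, "instance_id": instance_id}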
If the environment variable ETL_ENVIRONMENT is set, it is copied to the environment field.
{
    ...
    'environment': "production"
    ...
}
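In other words, the field is only added when the variable is actually set; roughly (the helper name is made up, and the extra fields would be merged into the event as in the sketch above):

import os

def environment_fields():
    # Only add the field when ETL_ENVIRONMENT is set.
    environment = os.environ.get("ETL_ENVIRONMENT")
    return {"environment": environment} if environment else {}

# Usage with the earlier sketch:
#   emit_event(str(uuid.uuid4()), "www.orders", "load", "start", **environment_fields())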
When dumping data from upstream, the source is an object describing the upstream source and, in the "finish" event, the destination is the location of the manifest in S3:
{
    ...
    'source': {
        'name': "www",
        'schema': "public",
        'table': "orders"
    },
    'destination': {
        'bucket_name': "our-data-lake",
        'object_key': "www/public-orders.manifest"
    }
    ...
}
When loading data from S3, the source describes the object location of the data manifest and the destination is the table in the data warehouse:
{
    ...
    'source': {
        'bucket_name': "our-data-lake",
        'object_key': "production/www/public-orders.manifest"
    },
    'destination': {
        'schema': "www",
        'table': "orders"
    }
    ...
}
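Because every event is a self-contained JSON object, a log listener can stay very simple. Here is a minimal sketch that picks failed load events out of a stream of lines, assuming each event arrives on its own line; the filter shown is just an example:

import json
import sys

def failed_loads(lines):
    # Yield (target, errors) for every "fail" event emitted by the load step.
    for line in lines:
        try:
            event = json.loads(line)
        except ValueError:
            continue  # not an event line, skip ordinary log output
        if event.get("step") == "load" and event.get("event") == "fail":
            yield event["target"], event.get("errors", [])

for target, errors in failed_loads(sys.stdin):
    print(target, errors)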