3 Logging and event tables

Logging

We use standard Python logging, which by default writes to arthur.log files in the working directory. Log files are rotated at about 5 MB, and the last five log files are kept.
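
A minimal sketch of an equivalent local setup, using the standard library's rotating file handler (the logger name and format string are assumptions for illustration):

import logging
import logging.handlers

# Roll the log file over at roughly 5 MB and keep the last 5 backups, matching the description above.
logger = logging.getLogger("arthur")  # logger name is an assumption
handler = logging.handlers.RotatingFileHandler(
    "arthur.log", maxBytes=5 * 1024 * 1024, backupCount=5
)
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)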

While the ETL is running, log events are also emitted. They show up as JSON in the log lines for easy consumption by log listeners. It is also possible to have them sent to other data stores.

The minimal version of an event when running locally looks like this:

{
  'etl_id': "UUID",
  'timestamp': float,
  'target': "target table in the warehouse",
  'step': "sub-command like 'dump' or 'load'",
  'event': "what happened: start, finish, or fail"
}
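
As a sketch of how such an event could be assembled and written out as one JSON line (the emit_event helper is hypothetical; the field names follow the example above):

import json
import logging
import uuid
from datetime import datetime, timezone

logger = logging.getLogger("arthur.events")  # logger name is an assumption
ETL_ID = str(uuid.uuid4())  # one ID per ETL run

def emit_event(target, step, event, **extra):
    # Hypothetical helper: build the minimal payload, merge any extra fields, and log one JSON line.
    payload = {
        "etl_id": ETL_ID,
        "timestamp": datetime.now(timezone.utc).timestamp(),
        "target": target,
        "step": step,
        "event": event,
    }
    payload.update(extra)
    logger.info(json.dumps(payload, sort_keys=True))

emit_event("www.orders", "load", "start")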

Additional fields may be present. In the case of an error, a code and message will be added within an array. (We may choose to add a similar field for warnings at some point.)

{
    ...
    'errors': [
        {
            'code': "PG_ERROR",
            'message': "You don't have permission"
        }
    ]
    ...
}
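
Reusing the hypothetical emit_event sketch from above, a failed load might be reported like this:

# Hypothetical usage for a failed load step:
emit_event(
    "www.orders", "load", "fail",
    errors=[{"code": "PG_ERROR", "message": "You don't have permission"}],
)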

When running within an AWS cluster, the ID of the EMR cluster and, optionally, the IDs of the (master) instance, the data pipeline, and the step are captured.

{
    ...
    'aws': {
        'emr_id': "j-UUID",
        'instance_id': "i-UUID",
        'data_pipeline_id': "df-UUID",
        'step_id': "s-UUID"
    }
    ...
}
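
One way to collect these values on an EMR master node is sketched below; the instance metadata endpoint and the EMR job-flow file are standard on EC2/EMR, while the environment variables for the data pipeline and step IDs are assumptions for illustration:

import json
import os
from urllib.request import urlopen

def gather_aws_info():
    # The EC2 instance metadata service answers on any EC2 instance.
    instance_id = urlopen(
        "http://169.254.169.254/latest/meta-data/instance-id", timeout=1
    ).read().decode()
    # On an EMR master node, the job-flow file carries the cluster ("j-...") ID.
    with open("/mnt/var/lib/info/job-flow.json") as f:
        emr_id = json.load(f)["jobFlowId"]
    return {
        "emr_id": emr_id,
        "instance_id": instance_id,
        # Assumed to be handed in by the data pipeline definition:
        "data_pipeline_id": os.environ.get("DATA_PIPELINE_ID"),
        "step_id": os.environ.get("STEP_ID"),
    }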

If the environment variable ETL_ENVIRONMENT is set, it is copied to the environment field.

{
    ...
    'environment': "production"
    ...
}
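
A minimal sketch of that copy, building on the hypothetical emit_event helper above:

import os

extra = {}
if os.environ.get("ETL_ENVIRONMENT"):
    extra["environment"] = os.environ["ETL_ENVIRONMENT"]
emit_event("www.orders", "load", "start", **extra)  # field only present when the variable is set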

When dumping data from upstream, the "finish" event contains a source object describing the upstream source and a destination object giving the location of the manifest in S3:

{
    ...
    'source': {
        'name': "www",
        'schema': "public",
        'table': "orders"
    },
    'destination': {
        'bucket_name': "our-data-lake",
        'object_key': "www/public-orders.manifest"
    }
    ...
}

When loading data from S3, the source describes the object location of the data manifest and the destination is the table in the data warehouse:

{
    ...
    'source': {
        'bucket_name': "our-data-lake",
        'object_key': "production/www/public-orders.manifest"
    },
    'destination': {
        'schema': "www",
        'table': "orders"
    }
    ...
}
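
On the listener side, a few lines of Python are enough to pick out interesting events, assuming each event is written as a bare JSON line (the file name and field names follow the examples above):

import json

with open("arthur.log") as log:
    for line in log:
        try:
            event = json.loads(line)
        except ValueError:
            continue  # skip ordinary, non-JSON log lines
        if event.get("step") == "load" and event.get("event") == "fail":
            print(event.get("target"), event.get("errors"))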