
Tips and Tricks


Tips and tricks to work with Arthur

Loading From Anywhere (for Dev or Debugging)

During development, it is often useful to "clone" an existing environment and do loads from it. Similarly, you may need to load from an existing environment like development or production in order to test or debug something, or you might create one-off environments for performance testing or the like.

1. Clone an Environment

For example, to copy from an environment called "production":

$ cd arthur-redshift-etl
$ source venv/bin/activate
$ bin/sync_env.sh $MY_ETL_S3BUCKET production $NAME_OF_SHINY_NEW_ENV
$ aws s3 cp config/credentials_testing.sh s3://$MY_ETL_S3BUCKET/$NAME_OF_SHINY_NEW_ENV/config/credentials_testing.sh

Note: Make sure to upload a credentials file that has settings pointing to the correct target cluster!
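
If you want to double-check which cluster the uploaded file points at, stream it back from S3 and grep for the endpoint (a quick sanity check; this assumes the credentials file contains the cluster's hostname):

$ aws s3 cp s3://$MY_ETL_S3BUCKET/$NAME_OF_SHINY_NEW_ENV/config/credentials_testing.sh - | grep -i redshift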

2. (Optional) Deploy Code

If you want to run the code you're working on instead of the code that came with the cloned environment, deploy it after the copy:

$ arthur.py sync --prefix $NAME_OF_SHINY_NEW_ENV
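
To confirm the deploy landed under the new prefix, you can list what is there now (plain aws CLI, nothing Arthur-specific):

$ aws s3 ls --recursive "s3://$MY_ETL_S3BUCKET/$NAME_OF_SHINY_NEW_ENV/" | head -50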

3. Fix Manifest Files

The data manifest files in the cloned environment still point to the data files in the old environment. Regenerate the manifests from the new environment so that any load you run there does not accidentally read the old environment's files (which might change underneath you):

$ arthur.py extract --use-existing-csv-files --prefix $NAME_OF_SHINY_NEW_ENV
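
To spot-check the result, stream one of the regenerated manifests back from S3; its "entries" URLs should now point at the new prefix (the path below is illustrative, pick any manifest that exists under the new environment):

$ aws s3 cp "s3://$MY_ETL_S3BUCKET/$NAME_OF_SHINY_NEW_ENV/<path to a manifest file>" - | python -m json.tool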

4. Kick Off Load

# Double-check first:
$ arthur.py ping
$ arthur.py load --prefix $NAME_OF_SHINY_NEW_ENV --dry-run
# Do the real load (in the background)
$ nohup arthur.py load --prefix $NAME_OF_SHINY_NEW_ENV &
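
Since the load runs in the background, its output ends up in nohup.out; tail that file (and, if you like, check the process) to follow its progress:

$ tail -f nohup.out
$ pgrep -fl 'arthur.py load'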

Miscellaneous

Dumping new schemas

Add the new schemas to the database

arthur.py initialize -r

Note that you will have to upload your configuration file with the new schema declaration, for example:

arthur.py sync --deploy ???

And if you're already running a cluster, pull down the configuration for testing:

aws s3 cp --recursive "s3://<BUCKET NAME>/<YOUR USER NAME>/config/" /tmp/redshift_etl/config/

Inspecting the ETL events table in Redshift

While the ETL runs, it writes events to a DynamoDB table. Here's an example:

{
    "environment":"production",
    "etl_id":"FCBC6BDED3F04D2C",
    "target":"tallboy.redshift_pipe_cleaner",
    "monitor_id":"12984C70489A46BA",
    "step":"unload",
    "event":"start",
    "extra": {
        "destination": ...,
        "source": ...
    },
    "timestamp":"1472883599.972402"
}

You can copy this table into Redshift using these steps.

Setting up a schema and table for the events

As the ETL user, run:

CREATE SCHEMA arthur_redshift_etl AUTHORIZATION etl;
GRANT USAGE ON SCHEMA arthur_redshift_etl TO GROUP etl_ro;

CREATE TABLE arthur_redshift_etl.events (
    id BIGINT IDENTITY(1, 1),
    environment VARCHAR(255) NOT NULL ENCODE TEXT255,
    etl_id VARCHAR(32) NOT NULL ENCODE TEXT32K,
    target VARCHAR(255),
    monitor_id VARCHAR(32) NOT NULL ENCODE TEXT32K,
    step VARCHAR(32) ENCODE BYTEDICT,
    event VARCHAR(32) NOT NULL ENCODE BYTEDICT,
    elapsed NUMERIC(18,6),
    "timestamp" NUMERIC(18,6) NOT NULL ENCODE DELTA,
    event_timestamp TIMESTAMP ENCODE DELTA,
    UNIQUE (etl_id, monitor_id, event)
)
DISTKEY ( monitor_id )
SORTKEY ( environment, etl_id, target, step, event )
;

GRANT SELECT ON arthur_redshift_etl.events TO GROUP etl_ro
;

Copying events from DynamoDB

Make sure that the DynamoDB table's provisioned read capacity is high enough before you kick off the COPY.
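
One way to check (and, if necessary, bump) the read capacity is with the AWS CLI; this assumes the table uses provisioned throughput and is named as in the COPY statement below:

$ aws dynamodb describe-table --table-name dw-etl-prod-production-events --query 'Table.ProvisionedThroughput'
# update-table needs both values; keep the write capacity at its current setting
$ aws dynamodb update-table --table-name dw-etl-prod-production-events --provisioned-throughput ReadCapacityUnits=100,WriteCapacityUnits=<current value>

Then run: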

TRUNCATE arthur_redshift_etl.events
;

-- Look up the actual name of the role in IAM!
COPY arthur_redshift_etl.events
FROM 'dynamodb://dw-etl-prod-production-events'
IAM_ROLE 'arn:aws:iam::<account>:role/dw-vpc-dev-RedshiftCopyRole-<UNIQUEID>'
READRATIO 200
COMPUPDATE OFF
STATUPDATE OFF
;

UPDATE arthur_redshift_etl.events
SET event_timestamp = timestamp 'epoch' + "timestamp" * interval '1 second'
;

ANALYZE arthur_redshift_etl.events
;

SELECT COUNT(*) FROM arthur_redshift_etl.events
;