Tips and Tricks
During development, it is often useful to "clone" an existing environment and run loads from that copy. Similarly, you may need to load from an existing environment like development or production in order to test or debug something. Or you might create one-off environments for performance testing or the like.
For example, to copy from an environment called "production":
$ cd arthur-redshift-etl
$ source venv/bin/activate
$ bin/sync_env.sh $MY_ETL_S3BUCKET production $NAME_OF_SHINY_NEW_ENV
$ aws s3 cp config/credentials_testing.sh s3://$MY_ETL_S3BUCKET/$NAME_OF_SHINY_NEW_ENV/config/credentials_testing.sh
Note: Make sure to upload a credentials file that has settings pointing to the correct target cluster!
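What exactly goes into that file depends on your setup; as a purely hypothetical sketch (the variable name and connection string are placeholders, not taken from this project), it is a small shell file exporting the connection settings for the target cluster:
# Hypothetical example -- copy the actual variable names from one of your existing credentials files.
export DATA_WAREHOUSE_ETL="postgres://etl_user:<password>@<your-test-cluster>.redshift.amazonaws.com:5439/<database>"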
If you want to run the code you're working on instead of the code that came with the cloned environment, you need to deploy after the copy:
$ arthur.py sync --prefix $NAME_OF_SHINY_NEW_ENV
The data manifest files in the cloned environment still point to the data files in the old environment. Fix the manifests by regenerating them from the new environment so that any load you run from the new environment does not accidentally read the old environment's files (which might change at any time):
$ arthur.py extract --use-existing-csv-files --prefix $NAME_OF_SHINY_NEW_ENV
# Double check first:
$ arthur.py ping
$ arthur.py load --prefix $NAME_OF_SHINY_NEW_ENV --dry-run
# Do the real load (in the background)
$ nohup arthur.py load --prefix $NAME_OF_SHINY_NEW_ENV &
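Since the load runs in the background under nohup, its console output goes to nohup.out by default (assuming you started it from this directory), so you can follow along with:
$ tail -f nohup.out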
Add the new schemas to the database:
arthur.py initialize -r
Note that you will have to copy up your configuration file with the new schema declaration first:
arthur.py sync --deploy ???
And if you're already running a cluster, pull down the configuration for testing:
aws s3 cp --recursive "s3://<BUCKET NAME>/<YOUR USER NAME>/config/" /tmp/redshift_etl/config/
While the ETL runs, it writes events to a DynamoDB table. Here's an example:
{
    "environment": "production",
    "etl_id": "FCBC6BDED3F04D2C",
    "target": "tallboy.redshift_pipe_cleaner",
    "monitor_id": "12984C70489A46BA",
    "step": "unload",
    "event": "start",
    "extra": {
        "destination": ...,
        "source": ...
    },
    "timestamp": "1472883599.972402"
}
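If you just want to peek at a few raw events without loading anything into Redshift, a quick scan with the AWS CLI works; the table name here is the one used in the COPY statement further below and will differ for your deployment:
$ aws dynamodb scan --table-name dw-etl-prod-production-events --max-items 5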
You can copy this table into Redshift using these steps.
As the ETL user, run:
CREATE SCHEMA arthur_redshift_etl AUTHORIZATION etl;
GRANT USAGE ON SCHEMA arthur_redshift_etl TO GROUP etl_ro;
CREATE TABLE arthur_redshift_etl.events (
id BIGINT IDENTITY(1, 1),
environment VARCHAR(255) NOT NULL ENCODE TEXT255,
etl_id VARCHAR(32) NOT NULL ENCODE TEXT32K,
target VARCHAR(255),
monitor_id VARCHAR(32) NOT NULL ENCODE TEXT32K,
step VARCHAR(32) ENCODE BYTEDICT,
event VARCHAR(32) NOT NULL ENCODE BYTEDICT,
elapsed NUMERIC(18,6),
"timestamp" NUMERIC(18,6) NOT NULL ENCODE DELTA,
event_timestamp TIMESTAMP ENCODE DELTA,
UNIQUE (etl_id, monitor_id, event)
)
DISTKEY ( monitor_id )
SORTKEY ( environment, etl_id, target, step, event )
;
GRANT SELECT ON arthur_redshift_etl.events TO GROUP etl_ro
;
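The COPY below reads the entire DynamoDB table, so if the table uses provisioned capacity it is worth checking (and, if needed, raising) its read throughput first. A sketch using the AWS CLI, with the table name from the COPY statement:
$ aws dynamodb describe-table --table-name dw-etl-prod-production-events \
    --query 'Table.ProvisionedThroughput'
# update-table requires both values, so re-use the table's current write capacity:
$ aws dynamodb update-table --table-name dw-etl-prod-production-events \
    --provisioned-throughput ReadCapacityUnits=200,WriteCapacityUnits=<current write capacity>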
Make sure that your read capacity is high enough, then run:
TRUNCATE arthur_redshift_etl.events
;
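To fill in the IAM_ROLE parameter of the COPY, you can look up the role's ARN with the AWS CLI; the 'RedshiftCopyRole' name filter below is taken from the example ARN and may be named differently in your account:
$ aws iam list-roles --query "Roles[?contains(RoleName, 'RedshiftCopyRole')].Arn" --output text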
-- Look up the actual name of the role in IAM!
COPY arthur_redshift_etl.events
FROM 'dynamodb://dw-etl-prod-production-events'
IAM_ROLE 'arn:aws:iam::<account>:role/dw-vpc-dev-RedshiftCopyRole-<UNIQUEID>'
READRATIO 200
COMPUPDATE OFF
STATUPDATE OFF
;
UPDATE arthur_redshift_etl.events
SET event_timestamp = timestamp 'epoch' + "timestamp" * interval '1 second'
;
ANALYZE arthur_redshift_etl.events
;
SELECT COUNT(*) FROM arthur_redshift_etl.events
;