The Mystique Unicorn app is generating a lot of data. The data science team is looking to get started with Amazon Redshift as their data warehouse platform of choice.
Can you help them with a proof of concept to show them how to use Amazon Redshift?
We will build a simple 1-node Redshift cluster, ingest sample data from S3, and run some simple queries.
In this article, we will build an architecture similar to the one shown above.
This demo, its instructions, scripts and CloudFormation template are designed to be run in us-east-1. With a few modifications you can try it out in other regions as well (not covered here).

- AWS CLI Installed & Configured - Get help here
- AWS CDK Installed & Configured - Get help here
- Python Packages, change the commands below to suit your OS; the following is written for Amazon Linux 2
  - Python3 - yum install -y python3
  - Python Pip - yum install -y python-pip
  - Virtualenv - pip3 install virtualenv
Get the application code
git clone https://github.com/miztiik/redshift-demo
cd redshift-demo
We will use cdk to make our deployments easier. Let's go ahead and install the necessary components.

# You should have npm pre-installed
# If you DON'T have cdk installed
npm install -g aws-cdk

# Make sure you are in the root directory
python3 -m venv .venv
source .venv/bin/activate
pip3 install -r requirements.txt
The very first time you deploy an AWS CDK app into an environment (account/region), you'll need to install a bootstrap stack. Otherwise, just go ahead and deploy using cdk deploy.

cdk bootstrap
cdk ls
# Follow the on-screen prompts
You should see an output of the available stacks,
redshift-demo-vpc-stack
redshift-demo-stack
Let us walk through each of the stacks,
Stack: redshift-demo-stack
To host the Redshift cluster, we need a VPC. This will be created as a dependent stack for us. The cluster database is named comments_cluster, with dwh_user as the superuser. The password for the cluster is generated by AWS Secrets Manager. To interact with other AWS services like S3, we need an IAM role; this will be created along with the cluster. Initiate the deployment with the following command,
cdk deploy redshift-demo-stack
Check the Outputs section of the stack for the cluster endpoint and password.
If you have a PostgreSQL client, you can connect to the cluster with it. But here we will use the Amazon Redshift query editor in the AWS console for admin functions. In this repo, you will find data_samples/src/amzn_reviews_en.json. Copy them to your S3 bucket; we will ingest these files into our cluster. You can find more sample data here [2]. If you have Python installed locally, here is a dataframe view of the amzn_reviews_en.json file. Based on this, we can create a table with the following schema:

>>> df
     language product_category          product_id  ...                                   review_title          reviewer_id stars
0          en     baby_product  product_en_0878845  ...    Not worth the price and very bad cap design  reviewer_en_0987470   1.0
1          en            shoes  product_en_0004522  ...                                       Garbage!  reviewer_en_0731158   1.0
2          en   office_product  product_en_0060687  ...               I do not recommend this printer  reviewer_en_0793876   1.0
3          en   office_product  product_en_0311791  ...   Don't purchase these refurbished cartridges!  reviewer_en_0837288   1.0
4          en     baby_product  product_en_0472877  ...                                      Not worth  reviewer_en_0878169   1.0
...       ...              ...                 ...  ...                                            ...                  ...   ...
4995       en             home  product_en_0068982  ...                                       So Cute!  reviewer_en_0061521   5.0
4996       en          kitchen  product_en_0563046  ...                                        Perfect  reviewer_en_0871798   5.0
4997       en     pet_products  product_en_0848682  ...                         My Doberman Loves Them  reviewer_en_0474236   5.0
4998       en            other  product_en_0536493  ...                                     Five Stars  reviewer_en_0546192   5.0
4999       en          apparel  product_en_0964853  ...                                    SUPER CUTE!  reviewer_en_0850986   5.0

[5000 rows x 8 columns]
>>> print("Schema: {}".format(pfile.schema)) Schema: language: string -- field metadata -- PARQUET:field_id: '1' product_category: string -- field metadata -- PARQUET:field_id: '2' product_id: string -- field metadata -- PARQUET:field_id: '3' review_body: string -- field metadata -- PARQUET:field_id: '4' review_id: string -- field metadata -- PARQUET:field_id: '5' review_title: string -- field metadata -- PARQUET:field_id: '6' reviewer_id: string -- field metadata -- PARQUET:field_id: '7' stars: string -- field metadata -- PARQUET:field_id: '8' -- schema metadata -- org.apache.spark.sql.parquet.row.metadata: '{"type":"struct","fields":[{"' + 545 >>>
CREATE TABLE:
We will name our table product_reviews_json and create it using the following SQL command.

CREATE TABLE IF NOT EXISTS product_reviews_json(
    review_id varchar(100) not null distkey sortkey,
    product_id varchar(100) not null,
    stars varchar(10) not null,
    review_body varchar(10000) not null,
    review_title varchar(1000) not null,
    reviewer_id varchar(100) not null,
    language varchar(2) not null,
    product_category varchar(100) not null,
    primary key(review_id)
);
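If you want to double-check that the table was created with the columns, sort/dist keys and encodings you expect, a quick look at the system catalog can help. This is a minimal sketch (run it in the same query editor session; it assumes the table lives in the default public schema):

/* Sanity check: list the columns of the newly created table */
SELECT "column", type, encoding, distkey, sortkey
FROM pg_table_def
WHERE tablename = 'product_reviews_json';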
To make data ingestion faster, split your files, ideally matching the number of slices in your cluster. Here is a list of best practices to improve performance [1]. One of the recommendations is to use a columnar format like Parquet compressed with Snappy. You can also find the reviews data in Parquet format at data_samples/src/amzn_reviews_en_part00000.parquet.snappy. In S3 you can see the size taken by these two files.

In case you made any error, you can drop the table using DROP TABLE public.product_reviews_json; and try again.

INGEST DATA TO Redshift:
We will use the COPY command [3] to ingest our JSON file.

COPY product_reviews_json
FROM 's3://YOUR-BUCKET-NAME/amzn_reviews_en.json'
IAM_ROLE 'YOUR-REDSHIFT-IAM-ROLE-ARN'
json 'auto ignorecase';
This file has 5000 records; let us confirm we have ingested all of our data.

/* COUNT FOR ALL RECORDS */
SELECT COUNT(*) FROM product_reviews_json;
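The performance comparison below looks at both the JSON and the Parquet copy, so if you also want to load the Parquet file, here is a sketch of what that could look like. The table name product_reviews_parquet is an illustrative placeholder, as are the bucket name and IAM role ARN; note that COPY from Parquet maps columns by position, so the column order below follows the Parquet schema shown earlier.

/* Illustrative target table for the Parquet file.
   Column order matches the Parquet schema, since COPY from
   columnar formats maps columns by position. */
CREATE TABLE IF NOT EXISTS product_reviews_parquet(
    language varchar(2),
    product_category varchar(100),
    product_id varchar(100),
    review_body varchar(10000),
    review_id varchar(100),
    review_title varchar(1000),
    reviewer_id varchar(100),
    stars varchar(10)
);

COPY product_reviews_parquet
FROM 's3://YOUR-BUCKET-NAME/amzn_reviews_en_part00000.parquet.snappy'
IAM_ROLE 'YOUR-REDSHIFT-IAM-ROLE-ARN'
FORMAT AS PARQUET;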
INGEST Performance: If you are using split and compressed files, your ingest performance can improve. Since our file is hardly 2MB, this query shows only a small improvement in performance. But do try with a large file, say a few GBs or TBs.

select datediff(s,starttime,endtime) as duration, *
from stl_query
where query in (
  57991, /* query id of json copy */
  58689  /* query id of parquet copy */
);
TROUBLESHOOTING LOAD ERRORS: In case you get errors while loading data, you can troubleshoot them with these queries. The following query joins STL_LOAD_ERRORS to STL_LOADERROR_DETAIL to view the detailed errors that occurred during the most recent load.

select d.query, substring(d.filename,14,20),
       d.line_number as line,
       substring(d.value,1,16) as value,
       substring(le.err_reason,1,48) as err_reason
from stl_loaderror_detail d, stl_load_errors le
where d.query = le.query
  and d.query = pg_last_copy_id();
 query |     substring     | line |  value   | err_reason
-------+-------------------+------+----------+--------------------------------------------------
   558 | allusers_pipe.txt |  251 | 251      | String contains invalid or unsupported UTF8 code
   558 | allusers_pipe.txt |  251 | ZRU29FGR | String contains invalid or unsupported UTF8 code
   558 | allusers_pipe.txt |  251 | Kaitlin  | String contains invalid or unsupported UTF8 code
   558 | allusers_pipe.txt |  251 | Walter   | String contains invalid or unsupported UTF8 code
The following example uses STL_LOAD_ERRORS with STV_TBL_PERM to create a new view, and then uses that view to determine what errors occurred while loading data into the EVENT table:

create view loadview as
(select distinct tbl, trim(name) as table_name, query, starttime,
   trim(filename) as input, line_number, colname, err_code,
   trim(err_reason) as reason
from stl_load_errors sl, stv_tbl_perm sp
where sl.tbl = sp.id);
Next, the following query actually returns the last error that occurred while loading the EVENT table:

select table_name, query, line_number, colname, starttime,
       trim(reason) as error
from loadview
where table_name = 'event'
order by line_number
limit 1;
The query returns the last load error that occurred for the EVENT table. If no load errors occurred, the query returns zero rows. In this example, the query returns a single error:

 table_name | query | line_number | colname |                          error                          |      starttime
------------+-------+-------------+---------+---------------------------------------------------------+----------------------
 event      |   309 |           0 |       5 | Error in Timestamp value or format [%Y-%m-%d %H:%M:%S]  | 2014-04-22 15:12:44
(1 row)
RUN QUERIES:
Let us say we want to answer this question: "Do people who buy kitchen and grocery items leave higher ratings?"

SELECT stars, COUNT(stars) total_ratings
FROM product_reviews_json
WHERE product_category = 'kitchen' OR product_category = 'grocery'
GROUP BY stars;
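To dig a little deeper, you could also compare average ratings across all categories. Here is a small follow-up query along those lines (a sketch; since stars is stored as varchar in our table, it is cast to a number before averaging):

/* Average rating and review count per category.
   stars is varchar, so cast it before averaging. */
SELECT product_category,
       AVG(CAST(stars AS FLOAT)) AS avg_rating,
       COUNT(*) AS total_reviews
FROM product_reviews_json
GROUP BY product_category
ORDER BY avg_rating DESC;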
VISUALIZING in Amazon Quicksight: To allow visualizing this data in QuickSight, the Redshift cluster has a security group that allows access from the QuickSight IP range in us-east-1.

Housekeeping Redshift: Over a period of time, tables will fill up with data and you may have to prune or remove them. This query will give you the size of your tables,
SELECT "table", size, tbl_rows FROM SVV_TABLE_INFO;
Here we have demonstrated how to create a Redshift cluster, ingest data, and query and visualize that data. You can extend this idea to process the data using AWS Glue and then ingest it into the cluster.
If you want to destroy all the resources created by the stack, execute the command below to delete the stack, or you can delete the stack from the console as well
- Resources created during Deploying The Application
- Delete CloudWatch Lambda LogGroups
- Any other custom resources you have created for this demo
# Delete from cdk
cdk destroy
# Follow any on-screen prompts

# Delete the CF Stack, if you used CloudFormation to deploy the stack
aws cloudformation delete-stack \
    --stack-name "MiztiikAutomationStack" \
    --region "${AWS_REGION}"
This is not an exhaustive list; please carry out any other steps that may be applicable to your needs.
This repository aims to show new developers, Solution Architects & Ops Engineers how to set up a Redshift data warehouse in AWS. Building on that knowledge, these Udemy courses, course #1 and course #2, help you build complete architectures in AWS.
Thank you for your interest in contributing to our project. Whether it is a bug report, new feature, correction, or additional documentation or solutions, we greatly value feedback and contributions from our community. Start here
Buy me a coffee ☕.
Level: 300