The traditional data warehouse archival strategy is to move old data to offsite tapes. This does not fit modern analytics applications, since the archived data is then unavailable for real-time business analytics. Mature Hadoop clusters need a modern archival strategy to keep storage expenses in check as data volumes grow exponentially. The term hybrid here designates an archival solution that is always available and completely transparent to the application layer. This document will cover:
- Use case
- Requirements
- Storage cost analysis
- Design approach
- Architecture diagram
- Code
- How to set up and run the code
Use case
The entire business data set resides in HDFS (HDP clusters) backed by Amazon EBS, and a disaster recovery solution is already in place. Amazon claims S3 storage delivers 99.999999999% durability; in the unlikely event of data loss from S3, the data would be recovered from the disaster recovery site.
Requirements
- Decrease storage costs.
- Archived data should be available for analytics 24x7.
- Hot and cold (archived) data must be accessible simultaneously from the application.
- The solution should be transparent to the application layer. In other words, absolutely no change should be required in the application layer after the hybrid archival strategy is implemented.
- Performance should be acceptable.
Storage cost analysis
For S3
$0.023 per GB-month of usage
Source: https://aws.amazon.com/s3/pricing/
For EBS SSD (gp2)
$0.10 per GB-month of provisioned storage
Including the HDFS replication factor of 3, this becomes a net $0.30 per GB-month.
Source: https://aws.amazon.com/ebs/pricing/
Important Note:
EBS is provisioned storage, whereas S3 is pay-as-you-use. In other words, to allow for future data growth you might provision, say, 1 TB of EBS storage and pay 100% of its cost whether you are using 0% or 90% of it. With S3 you pay only for the storage you actually use: 2 GB costs 2 GB and 500 GB costs 500 GB. Hence the S3 figure in this comparison is roughly halved to approximate how S3 usage will grow relative to the provisioned HDFS EBS storage.
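To make the comparison concrete, here is a rough back-of-the-envelope calculation based on the prices above; the 10 TB cold-data volume is just an assumed example figure:

```bash
# Rough monthly cost comparison for an assumed 10 TB (10240 GB) of cold data.
COLD_GB=10240
awk -v gb="$COLD_GB" 'BEGIN {
  ebs = gb * 0.10 * 3     # gp2 provisioned storage, HDFS replication factor 3
  s3  = gb * 0.023        # S3 Standard, billed only for what is actually stored
  printf "EBS (3x replication): $%.2f/month\n", ebs
  printf "S3:                   $%.2f/month\n", s3
}'
# Prints roughly: EBS (3x replication): $3072.00/month, S3: $235.52/month
```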
Design approach
All of the approaches depend on the work done in the Jira below, where a DataNode is conceptualized as a collection of heterogeneous storages with different durability and performance characteristics. https://issues.apache.org/jira/browse/HDFS-2832
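For background, HDFS-2832 and its follow-ups are what allow a DataNode to expose multiple storage types and let paths be pinned to storage policies. The designs below place cold partitions on S3 rather than using storage policies directly, but the building block it provides looks like this (the path is a placeholder):

```bash
# List the storage policies the cluster knows about (HOT, WARM, COLD, ALL_SSD, ...).
hdfs storagepolicies -listPolicies

# Pin a directory to the COLD policy so its replicas land on ARCHIVE-tagged volumes.
hdfs storagepolicies -setStoragePolicy \
  -path /apps/hive/warehouse/schema_name.db/old_table -policy COLD
```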
Design 1
- A hot table whose partitions are wholly hosted on HDFS.
- A cold (archived) table whose partitions are wholly hosted on S3.
- A view that unions these two tables; this view is the live table exposed to end users.
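For illustration, a minimal sketch of what Design 1 might look like, run here through the Hive CLI; the names test_table_hot, test_table_cold, test_table_live and the bucket s3a://my-archive-bucket are hypothetical:

```bash
# Sketch of Design 1 (all names are placeholders): two physical tables plus a union view.
hive -e "
  -- Cold copy of the table definition, rooted on S3; cold partitions would be
  -- added under this location after the data is copied over.
  CREATE EXTERNAL TABLE schema_name.test_table_cold
  LIKE schema_name.test_table_hot
  LOCATION 's3a://my-archive-bucket/warehouse/test_table_cold';

  -- The union view the application would have to switch to. This switch is the
  -- porting effort that makes Design 1 non-transparent.
  CREATE VIEW schema_name.test_table_live AS
    SELECT * FROM schema_name.test_table_hot
    UNION ALL
    SELECT * FROM schema_name.test_table_cold;
"
```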
Design 2
- Hot data in partitions that are wholly hosted on HDFS.
- Cold (archived) data in partitions that are wholly hosted on S3.
- Both hot and cold data live in the same table.
Design 2 is chosen over Design 1 because Design 1 is not transparent to the application layer: switching from the existing table to the view would inherently push some porting/integration work onto the application.
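A minimal sketch of how Design 2 plays out for a single partition, assuming a partition column ds, the HDP default warehouse path, and a hypothetical bucket s3a://my-archive-bucket; presumably hive_hybrid_storage.sh drives these same steps for every partition listed in the conf file:

```bash
# Sketch of Design 2 for one cold partition (paths, bucket, and partition spec are assumptions).
TABLE="schema_name.test_table"
PART="ds='2017-01-31'"
SRC="/apps/hive/warehouse/schema_name.db/test_table/ds=2017-01-31"
DST="s3a://my-archive-bucket/warehouse/test_table/ds=2017-01-31"

# 1. Copy the partition's files from HDFS to S3.
hadoop distcp "$SRC" "$DST"

# 2. Re-point the same Hive partition at its new S3 location.
#    The table the application queries never changes, which is what keeps
#    the solution transparent to the application layer.
hive -e "ALTER TABLE $TABLE PARTITION ($PART) SET LOCATION '$DST';"

# 3. Optionally remove the HDFS copy once the S3 copy is verified
#    (this is the difference between the retain and delete modes described below).
hdfs dfs -rm -r -skipTrash "$SRC"
```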
How to set up and run the code
- cd /root/scripts/dataCopy
- vi hive_hybrid_storage.sh -- Put the script here
- chmod 755 hive_hybrid_storage.sh
- cd /root/scripts/dataCopy/conf
- vi test_table.conf -- This is where the cold partition names are placed (see the example below)
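The exact format of test_table.conf depends on how hive_hybrid_storage.sh parses it; purely as a hypothetical illustration, it might list one cold partition spec per line:

```bash
# Hypothetical contents of conf/test_table.conf -- the real format may differ.
cat > /root/scripts/dataCopy/conf/test_table.conf <<'EOF'
ds=2016-11-30
ds=2016-12-31
ds=2017-01-31
EOF
```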
Retain the HDFS partition and delete it manually after data verification.
./hive_hybrid_storage.sh schema_name.test_table test_table.conf retain
Delete the HDFS partition as part of the script. The HDFS copy is deleted only after the data has been copied to S3, so you still have the option of copying it back from S3 if you later want to revert the partition location to HDFS.
./hive_hybrid_storage.sh schema_name.test_table test_table.conf delete
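After a retain run, one way to verify the migrated partition before deleting the HDFS copy by hand (all paths and the partition spec are placeholders):

```bash
# Compare the size of the HDFS copy and the S3 copy of a migrated partition.
hdfs dfs -du -s /apps/hive/warehouse/schema_name.db/test_table/ds=2017-01-31
hdfs dfs -du -s s3a://my-archive-bucket/warehouse/test_table/ds=2017-01-31

# Spot-check that the partition is still queryable from its new S3 location.
hive -e "SELECT COUNT(*) FROM schema_name.test_table WHERE ds='2017-01-31';"

# Only after verification, remove the retained HDFS copy.
hdfs dfs -rm -r -skipTrash /apps/hive/warehouse/schema_name.db/test_table/ds=2017-01-31
```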