StevenMih/emrfs-vs-alluxio-benchmark

emrfs-vs-alluxio-benchmark

Uncompromised benchmark to understand the trade-off between EMRFS and Alluxio for Apache Spark applications with Amazon S3 persistence.


Contents: Environment | Setup | Workload | Benchmark | Result | Conclusion | License


Environment

  • EMR release: emr-5.23.0
  • EBS root volume size: 10 GB
  • Region: us-east-1
  • Master Node: m4.2xlarge + 2 x EBS 32 GB GP2
  • Core Nodes: 10 x m4.4xlarge + 4 x EBS 32 GB GP2
  • Apache Spark 2.4.0
  • Python 3.6
  • Alluxio 2.0.0-preview - ASYNC_THROUGH/MUST_CACHE write type - CACHE_PROMOTE read type

Setup

For both scenarios:

  • Launch the EMR cluster with the related bootstrap and config files.
  • Create an EMR notebook and attach it to the cluster.

For Alluxio, after launching the cluster, you need to grant access to the notebook user (livy):

sudo su alluxio
/opt/alluxio/bin/alluxio fs mkdir /test
/opt/alluxio/bin/alluxio fs chown livy:livy /test

Workload

  • Input: 1 x pipe-delimited (|) text file generated with the TPC-H dbgen tool, stored on S3 - 74 GB - 600 M records - 16 columns - No compression
  • Output: 1000 x Parquet files on S3 - 27 GB - SNAPPY compression
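
As a quick sanity check on the figures above, the short sketch below derives the average input record size, the average output file size, and the input-to-output size ratio. It assumes decimal gigabytes (1 GB = 10^9 bytes), which the README does not state; the constant and function names are illustrative only.

```python
# Figures quoted in the Workload list above.
INPUT_GB = 74
INPUT_RECORDS = 600_000_000
OUTPUT_GB = 27
OUTPUT_FILES = 1000

# Assumption: sizes are decimal gigabytes; the README does not say.
BYTES_PER_GB = 10**9


def workload_summary():
    """Derive per-record and per-file averages from the quoted totals."""
    return {
        "input_bytes_per_record": INPUT_GB * BYTES_PER_GB / INPUT_RECORDS,
        "output_mb_per_file": OUTPUT_GB * BYTES_PER_GB / OUTPUT_FILES / 10**6,
        "input_to_output_ratio": INPUT_GB / OUTPUT_GB,
    }


for name, value in workload_summary().items():
    print(f"{name}: {value:.2f}")
```

The ratio reflects both the Parquet columnar encoding and SNAPPY compression together; the numbers here cannot separate the two effects.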

To generate a similar dataset, follow these instructions on an EC2 instance running the Amazon Linux AMI:

sudo yum install gcc make git -y
cd $HOME
mkdir data
git clone https://github.com/gregrahn/tpch-kit.git
cd tpch-kit/dbgen/
make OS=LINUX
export DSS_PATH=$HOME/data
./dbgen -v -T o -s 100
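
To inspect the generated file before uploading it, a few rows can be parsed with the Python standard library. The sketch below rests on two assumptions: that the 16-column benchmark input is the TPC-H lineitem table (its record and column counts match the Workload figures), and that dbgen ends each row with a trailing pipe. The sample row is illustrative, not taken from the benchmark data, and `parse_tbl` is a hypothetical helper.

```python
import csv
import io

# TPC-H lineitem column names (assumption: the benchmark input is lineitem).
LINEITEM_COLUMNS = [
    "l_orderkey", "l_partkey", "l_suppkey", "l_linenumber",
    "l_quantity", "l_extendedprice", "l_discount", "l_tax",
    "l_returnflag", "l_linestatus", "l_shipdate", "l_commitdate",
    "l_receiptdate", "l_shipinstruct", "l_shipmode", "l_comment",
]


def parse_tbl(stream):
    """Yield one dict per pipe-delimited row read from `stream`."""
    reader = csv.reader(stream, delimiter="|", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if not fields:
            continue  # skip blank lines
        # A trailing "|" produces an empty final field; drop it.
        if fields[-1] == "":
            fields = fields[:-1]
        yield dict(zip(LINEITEM_COLUMNS, fields))


# Illustrative row in the assumed layout.
sample = (
    "1|155190|7706|1|17|21168.23|0.04|0.02|N|O|1996-03-13|"
    "1996-02-12|1996-03-22|DELIVER IN PERSON|TRUCK|egular courts above the|\n"
)
rows = list(parse_tbl(io.StringIO(sample)))
print(len(rows[0]), rows[0]["l_shipmode"])
```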

Benchmark

The benchmark routine measures the time elapsed to write the output (in overwrite mode) and to read it back.

It runs each routine 10 times and extracts the minimum, maximum and average values.

P.S. The first read (from the text file) and the first write (without overwrite) are excluded from the measurements.

P.P.S. The target DataFrame was cached to isolate the I/O performance.
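
The measurement loop described above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the `benchmark` function, its parameters, and the stand-in workload are assumptions, and the unrecorded warm-up call only approximates the ignored first read and write.

```python
import statistics
import time


def benchmark(routine, runs=10, warmup=1):
    """Call `routine` repeatedly and return min/avg/max elapsed seconds.

    The first `warmup` calls run but are not recorded.
    """
    for _ in range(warmup):
        routine()

    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        routine()
        timings.append(time.perf_counter() - start)

    return {
        "min": min(timings),
        "avg": statistics.mean(timings),
        "max": max(timings),
    }


# Stand-in workload so the sketch runs without a Spark cluster.
stats = benchmark(lambda: sum(range(100_000)), runs=10)
print({metric: round(seconds, 4) for metric, seconds in stats.items()})
```

With PySpark, the routine passed in could be something like `lambda: df.write.mode("overwrite").parquet(path)` for the write test and `lambda: spark.read.parquet(path).count()` for the read test, where `df`, `spark` and `path` are placeholders and the choice of `count()` as the read action is an assumption.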

Result

Alluxio - ASYNC_THROUGH

Operation  Metric  Time (sec)
write      min     80.2
write      avg     82.8
write      max     86.7
read       min     81.4
read       avg     86.0
read       max     91.6

Alluxio - MUST_CACHE

Operation  Metric  Time (sec)
write      min     79.3
write      avg     84.2
write      max     91.9
read       min     82.4
read       avg     91.0
read       max     95.4

Alluxio - MUST_CACHE - METADATA load disabled

Operation  Metric  Time (sec)
write      min     44.6
write      avg     45.6
write      max     47.8
read       min     80.3
read       avg     83.7
read       max     93.1

EMRFS

Operation  Metric  Time (sec)
write      min     62.2
write      avg     67.4
write      max     77.7
read       min     94.3
read       avg     97.7
read       max     102.7

Conclusion

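  • EMRFS writes faster than Alluxio with metadata load enabled (67.4 s average vs. 82.8 s and 84.2 s). With metadata load disabled, Alluxio MUST_CACHE writes faster (45.6 s average).
  • Alluxio reads faster than EMRFS in every configuration tested (83.7 s to 91.0 s average vs. 97.7 s).
  • EMRFS makes the data available on S3 immediately; Alluxio does not.
  • Alluxio uses less CPU (see the Ganglia PDF files).
  • EMRFS uses less memory (see the Ganglia PDF files).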
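
To put the write and read comparisons in relative terms, the sketch below expresses each Alluxio average as a percentage difference from the EMRFS average. The times are copied from the Result tables; the dictionary layout and the `relative_to_baseline` helper are illustrative assumptions.

```python
# Average times in seconds, copied from the Result tables.
AVG_SECONDS = {
    "EMRFS": {"write": 67.4, "read": 97.7},
    "Alluxio ASYNC_THROUGH": {"write": 82.8, "read": 86.0},
    "Alluxio MUST_CACHE": {"write": 84.2, "read": 91.0},
    "Alluxio MUST_CACHE, metadata load disabled": {"write": 45.6, "read": 83.7},
}


def relative_to_baseline(results, baseline="EMRFS"):
    """Return the percentage difference of each setup versus the baseline.

    Negative values mean the setup was faster than the baseline.
    """
    base = results[baseline]
    return {
        name: {op: (secs - base[op]) / base[op] * 100.0 for op, secs in ops.items()}
        for name, ops in results.items()
        if name != baseline
    }


for name, deltas in relative_to_baseline(AVG_SECONDS).items():
    print(f"{name}: write {deltas['write']:+.1f}%, read {deltas['read']:+.1f}%")
```

These percentages compare averages of 10 runs on a single cluster configuration, so small differences should be read with the min/max spread in mind.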

License

The code and files in this repository are licensed under the MIT License.
