Uncompromised benchmark to understand the trade-off between EMRFS and Alluxio for Apache Spark applications with Amazon S3 persistence.
Contents: Environment | Setup | Workload | Benchmark | Result| Conclusion | License
- EMR release: emr-5.23.0
- EBR root volume size: 10 GB
- Region: us-east-1
- Master Node: m4.2xlarge + 2 x EBS 32 GB GP2
- CORE Nodes: 10 x m4.4xlarge + 4 x EBS 32 GB GP2
- Apache Spark 2.4.0
- Python 3.6
- Alluxio 2.0.0-preview - ASYNC_THROUGH/MUST_CACHE write type - CACHE_PROMOTE read type
For both scenarios:
- launch EMR cluster with the related bootstrap and config files.
- Create and attach a EMR notebook in the cluster
For Alluxio, after launch the cluster, you need to give access for the Notebook user:
sudo su alluxio
/opt/alluxio/bin/alluxio fs mkdir /test
/opt/alluxio/bin/alluxio fs chown livy:livy /test
- Input: 1 x Text file separated by pipes (|) generated by TPC-H on S3 - 74 GB - 600 M records - 16 columns - No compression
- Output: 1000 x Parquet files on S3 - 27 GB - SNAPPY compression
To generate a similar dataset you can follow these instruction on an EC2 with Amazon AMI
sudo yum install gcc make git -y
cd $HOME
mkdir data
git clone https://github.com/gregrahn/tpch-kit.git
cd tpch-kit/dbgen/
make OS=LINUX
export DSS_PATH=$HOME/data
./dbgen -v -T o -s 100
The benchmark routine measures the time elapsed to write the output (overwrite mode) and to read that.
It runs each routine 10x and extract tha minimum, maximum and average values.
P.S. The first read (From text file) and the first write (without overwrite) are ignored.
P.P.S The target Dataframe was cached to isolate the I/O performance.
Alluxio - ASYNC_THROUGH
Operation | Metric | Time (sec) |
---|---|---|
write | min | 80.2 |
write | avg | 82.8 |
write | max | 86.7 |
read | min | 81.4 |
read | avg | 86.0 |
read | max | 91.6 |
Alluxio - MUST_CACHE
Operation | Metric | Time (sec) |
---|---|---|
write | min | 79.3 |
write | avg | 84.2 |
write | max | 91.9 |
read | min | 82.4 |
read | avg | 91.0 |
read | max | 95.4 |
Alluxio - MUST_CACHE - METADATA load disabled
Operation | Metric | Time (sec) |
---|---|---|
write | min | 44.6 |
write | avg | 45.6 |
write | max | 47.8 |
read | min | 80.3 |
read | avg | 83.7 |
read | max | 93.1 |
EMRFS
Operation | Metric | Time (sec) |
---|---|---|
write | min | 62.2 |
write | avg | 67.4 |
write | max | 77.7 |
read | min | 94.3 |
read | avg | 97.7 |
read | max | 102.7 |
- EMRFS writes faster than Alluxio
- Alluxio reads faster than EMRFS
- EMRFS provides the data on S3 immediately, Alluxio does not.
- Alluxio uses less CPU (Check Ganglia PDF files)
- EMRFS uses less memory (Check Ganglia PDF files)
These codes/files are licensed under the MIT License.