This repository contains ESPBench, the enterprise streaming benchmark. ESPBench allows comparing data stream processing systems (DSPSs) and architectures in an enterprise context, i.e., in an environment where streaming data is integrated with existing, historical business data. The repository includes the ESPBench toolkit, example configurations, and an example query implementation using Apache Beam.
For further details, see: Hesse, Guenter, et al. "ESPBench: The Enterprise Stream Processing Benchmark", ACM/SPEC International Conference on Performance Engineering (ICPE), 2021, as well as the ESPBench example implementation results.
1. ESPBench Architecture
2. ESPBench Process
3. Structure of the Project
4. ESPBench Setup and Execution
5. ESPBench Results
The overall architecture is visualized in the image below:
Input data is sent to Apache Kafka by the data sender tool, which is part of this repository. The DSPS runs the queries; it reads input data from Apache Kafka and, when required by a query, from the enterprise database management system (DBMS). After the configured benchmark run time has elapsed, the validator and the result calculator, which are also part of this repository, can check the query result correctness and compute the benchmark results, i.e., the latencies.
A brief process overview of ESPBench is visualized in the activity diagram below:
The entire process is automated using Ansible scripts.
ESPBench
│ README.md
│ .gitignore
│ .gitlab-ci.yml
│ build.sbt
│ scalastyle-config.xml
│
└───ci
│ │ Dockerfile
│
└───implementation
│ └───beam
│ └───...
│
└───project
│ │ build.properties
│ │ Dependencies.scala
│
└───tools
└───commons
│ └───src
│ └───...
│ │ commons.conf
│
└───configuration
│ └───group_vars
│ └───plays
│ └───roles
│ └───...
│ │ ansible.cfg
│ │ hosts
│
└───datasender
│ └───src
│ └───...
│ │ datasender.conf
│
└───tpc-c_gen
│ └───src
│ └───...
│ │ tpc-c.properties
│
└───util
│ └───...
│
└───validator
└───...
Brief instructions follow. If you are looking for a more detailed setup and execution description, please have a look at this file.
All steps are tested on Ubuntu servers.
- Create a user `benchmarker` with sudo access on all involved machines
- Ensure that this user can connect to all machines via ssh without a password, e.g., by adding its ssh public key to the `authorized_keys` files
- Install Apache Kafka and the DSPSs to be tested under `/opt/` - you can find example configurations in the `tools/configuration` directory
- Make Apache Kafka a service:
    - Run `sudo apt install policykit-1`
    - Copy `tools/configuration/kafka/etc/init.d/kafka` to `/etc/init.d/` on the Apache Kafka servers
    - Run `update-rc.d kafka defaults`
- Install PostgreSQL on one server (you can use another DBMS; however, that requires some adaptations in the tools)
- Create the directory `Benchmarks` in the home directory of the user `benchmarker` and clone the repository into this folder
- Build the project using `sbt assembly` (a shell sketch of these environment steps follows below)
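
The following is a minimal shell sketch of these environment steps; the host name `kafka1` is a placeholder for one of your servers:

```sh
# Passwordless ssh for the benchmarker user (repeat for every involved machine):
ssh-copy-id benchmarker@kafka1

# Register Apache Kafka as a service on the Kafka servers:
sudo apt install policykit-1
sudo cp tools/configuration/kafka/etc/init.d/kafka /etc/init.d/kafka
sudo update-rc.d kafka defaults

# Build the project from the root of the cloned repository in ~/Benchmarks:
sbt assembly
```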
- Define the Apache Kafka topic prefix and the benchmark run number, both of which influence the Apache Kafka topic names created by the ansible scripts
- Define the query you want to execute (the config for each query is provided in a comment)
- Define the sending interval, which determines the pause between sending two records; a pause of, e.g., 1,000,000 ns results in an input rate of 1,000 records/s
- Define the benchmark duration
- Define the Apache Kafka bootstrap servers and ZooKeeper servers (a sketch of what such settings could look like follows below)
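
To illustrate these settings, a purely hypothetical sketch in HOCON syntax (the format of the `.conf` files in this repository); all key names and values below are placeholders, the actual keys are documented in the shipped configuration files and `group_vars`:

```
# Hypothetical keys for illustration only -- consult the shipped .conf files
# and group_vars/all for the real names.
kafkaTopicPrefix   = "espbench"      # prefix for the created Kafka topics
benchmarkRunNumber = 1               # run number, also part of the topic names
query              = "<query name>"  # query to execute
sendingInterval    = 1000000         # pause in ns between two records -> 1,000 records/s
benchmarkDuration  = 30              # benchmark duration (unit is an assumption)
kafkaBootstrapServers = "kafka1:9092"
zookeeperServers      = "kafka1:2181"
```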
- After cloning the repository and setting up the systems, change to the `tools/configuration` directory and start the benchmark (as shown in the activity diagram above) from there, e.g., via `ansible-playbook -vvvv plays/benchmark-runner-beam.yml` for the example implementation of this repository. The number of `v`s defines the level of verbosity
- Adapt the `group_vars/all` files if needed
- The directories `plays` and `roles` contain several ansible files, which can be adapted if needed. The starting point that represents the entire process is `plays/benchmark-runner-beam.yml` for the example implementation. These scripts also contain information about, e.g., how to start the data sender or the data generation
- The `hosts` file needs to be edited, i.e., the servers' IP addresses need to be entered (a sketch of a possible inventory follows below)
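
A minimal sketch of what the inventory and a benchmark start could look like; the group names and IP addresses are placeholders, the actual group names are those expected by the plays and roles in `tools/configuration`:

```sh
# hosts (Ansible inventory) -- group names and IPs below are placeholders:
#   [kafka]
#   192.168.1.11
#
#   [database]
#   192.168.1.12
#
#   [dsps]
#   192.168.1.13

# Check that all machines are reachable as the benchmarker user:
ansible all -i hosts -m ping

# Start the benchmark run for the Apache Beam example implementation:
ansible-playbook -vvvv plays/benchmark-runner-beam.yml
```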
- Input data is taken from the DEBS 2012 Grand Challenge and can be downloaded from ftp://ftp.mi.fu-berlin.de/pub/debs2012/
- This data file needs to be converted using the `dos2unix` command and duplicated, so that there are two input files
- The two files need to be extended by a machine ID using the following commands (adapt the file names; see also the preparation sketch after this list):
    - First file: `awk 'BEGIN { FS = OFS = "\t" } { $(NF+1) = 1; print $0 }' input1.csv > output1.csv`
    - Second file: `awk 'BEGIN { FS = OFS = "\t" } { $(NF+1) = 2; print $0 }' input2.csv > output2.csv`
- A third input file is `production_times.csv`, which is generated by the TPC-C data generator that is part of this project
- The file `datasender.conf` contains the Apache Kafka producer configs and the locations of the input data files
- The file `src/main/resources/application.conf` needs the correct DBMS configuration
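
Putting the data preparation together as a shell sketch, assuming the downloaded and extracted DEBS 2012 file is called `allData.txt` (adapt the file names to your download):

```sh
# Convert line endings of the downloaded file:
dos2unix allData.txt

# Duplicate it so that there are two input files:
cp allData.txt input1.csv
cp allData.txt input2.csv

# Append a machine ID column (1 and 2, respectively):
awk 'BEGIN { FS = OFS = "\t" } { $(NF+1) = 1; print $0 }' input1.csv > output1.csv
awk 'BEGIN { FS = OFS = "\t" } { $(NF+1) = 2; print $0 }' input2.csv > output2.csv
```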
The `tpc-c.properties` file contains the default settings for the number of warehouses and the data output directory. Changing the output directory requires corresponding adaptations in the ansible scripts.
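
For illustration only, such a properties file could look as follows; the key names are placeholders, the actual ones are defined in the shipped `tpc-c.properties`:

```
# Hypothetical property names -- check tools/tpc-c_gen/tpc-c.properties for the real keys.
warehouses=10
outputDirectory=/home/benchmarker/Benchmarks/data
```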
The file `src/main/resources/application.conf` needs the correct DBMS configuration.
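
As an illustration of what such a DBMS configuration could look like in HOCON syntax; the key names, host, and credentials below are placeholders, so keep the structure that the existing `application.conf` already uses:

```
# Hypothetical keys and values -- the shipped application.conf defines the actual structure.
database {
  jdbcUrl  = "jdbc:postgresql://192.168.1.12:5432/espbench"
  user     = "benchmarker"
  password = "changeme"
}
```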
If you want to use the example implementation, you need to adapt at least two files accordingly:
- `implementation/beam/src/main/resources/beam.properties`
- The `beamRunner` variable in `build.sbt`
The validator will create a `logs` directory that contains information about, e.g., query result correctness and latencies.