Rheem Benchmarks

This repository provides example applications and further benchmarking tools to evaluate and get started with Rheem.

Below we provide detailed information on our various benchmark components, including running instructions. For the configuration of Rheem itself, please consult the Rheem repository or feel free to reach out on Gitter.

Rheem applications

WordCount

Description. This app takes a text input file and counts the number of occurrences of each word in the text. This simple app has become a kind of "Hello World" program for data processing systems.
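For orientation, the computation described above can be sketched in a few lines of plain Python. This is not the Rheem implementation and does not use the Rheem API; the function name `word_count` and the tokenization rule (lowercasing and splitting on non-word characters) are assumptions made for illustration.

```python
import re
from collections import Counter


def word_count(lines):
    """Count how often each word occurs across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        # Assumed tokenization: lowercase, split on runs of non-word characters.
        words = [w for w in re.split(r"\W+", line.lower()) if w]
        counts.update(words)
    return counts


counts = word_count(["To be or not to be", "that is the question"])
print(counts["to"])  # prints 2
```

A sketch like this can serve as a small reference to sanity-check the app's output on a sample file, provided the tokenization matches.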

Running the app. To run the app, launch the main class:

org.qcri.rheem.apps.wordcount.WordCountScala

Even though this app is written in Scala, you can launch it in a regular JVM. Run the app without parameters to get a description of the required parameters.

Datasets. Find below a list of datasets that can be used to benchmark Rheem in combination with this app:

DBpedia - Long abstracts. NB: Consider stripping off the RDF container around the abstracts. It is not necessary, though.

Word2NVec

Description. Akin to Google's Word2Vec, this app creates vector representations of words from a corpus based on its neighbors. This app is a bit simpler in the sense that it calculates the average neighborhood of each word rather than determining a lower-dimensional representation. The resulting vectors can be used, e.g., to cluster words and find related terms.
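To make the idea of an "average neighborhood" concrete, here is a minimal plain-Python sketch. It is an interpretation of the description above, not the app's actual code: the function name `average_neighborhoods`, the `window` parameter, and the whitespace tokenization are all assumptions.

```python
from collections import defaultdict


def average_neighborhoods(sentences, window=2):
    """Map each word to the average bag of words seen within `window` positions of it."""
    sums = defaultdict(lambda: defaultdict(float))  # word -> neighbor -> total count
    occurrences = defaultdict(int)                  # word -> number of occurrences
    for sentence in sentences:
        words = sentence.lower().split()
        for i, word in enumerate(words):
            occurrences[word] += 1
            lo, hi = max(0, i - window), min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    sums[word][words[j]] += 1.0
    # Divide the summed neighbor counts by the number of occurrences of each word.
    # Words that never had a neighbor do not appear in the result.
    return {
        word: {n: total / occurrences[word] for n, total in neighbors.items()}
        for word, neighbors in sums.items()
    }


vectors = average_neighborhoods(["a b", "a c"], window=1)
print(vectors["a"])
```

The resulting sparse vectors have one dimension per vocabulary word, which is why the description calls this simpler than a learned lower-dimensional embedding.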

Running the app. To run the app, launch the main class:

org.qcri.rheem.apps.simwords.Word2NVec

Even though this app is written in Scala, you can launch it in a regular JVM. Run the app without parameters to get a description of the required parameters.

Datasets. Find below a list of datasets that can be used to benchmark Rheem in combination with this app:

DBpedia - Long abstracts. NB: Consider stripping off the RDF container around the abstracts. It is not necessary, though.

TPC-H Query 3

Description. This app executes a query from the established TPC-H benchmark. We provide several variants that work either on data in databases, in files, or in a mixture of both. Thus, this app requires cross-platform execution.
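As a reference for what the query computes, the following plain-Python sketch evaluates the logic of TPC-H Query 3 (the "shipping priority" query) over in-memory rows. It is not the Rheem plan. The function name `tpch_q3` and the dict-based row layout are assumptions; the default segment and date follow the validation parameters of the TPC-H specification, and the variants shipped with this app may use different parameters.

```python
from collections import defaultdict
from datetime import date


def tpch_q3(customers, orders, lineitems, segment="BUILDING", cutoff=date(1995, 3, 15)):
    """Return (orderkey, revenue, orderdate, shippriority) rows, highest revenue first."""
    # Customers of the requested market segment.
    custkeys = {c["c_custkey"] for c in customers if c["c_mktsegment"] == segment}
    # Their orders placed before the cutoff date, indexed by order key.
    selected = {
        o["o_orderkey"]: o
        for o in orders
        if o["o_custkey"] in custkeys and o["o_orderdate"] < cutoff
    }
    # Sum the discounted prices of line items shipped after the cutoff date.
    revenue = defaultdict(float)
    for item in lineitems:
        if item["l_orderkey"] in selected and item["l_shipdate"] > cutoff:
            revenue[item["l_orderkey"]] += item["l_extendedprice"] * (1 - item["l_discount"])
    rows = [
        (key, rev, selected[key]["o_orderdate"], selected[key]["o_shippriority"])
        for key, rev in revenue.items()
    ]
    # Order by revenue descending, then by order date ascending.
    return sorted(rows, key=lambda r: (-r[1], r[2]))
```

The three inputs correspond to the CUSTOMER, ORDERS, and LINEITEM relations, which is where the cross-platform aspect comes in when some of them live in a database and others in files.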

Running the app. To run the app, launch the main class:

org.qcri.rheem.apps.tpch.TpcH

Even though this app is written in Scala, you can launch it in a regular JVM. Run the app without parameters to get a description of the required parameters. Note that you will have to configure Rheem so that it can access the database. Furthermore, this app depends on the following configuration keys:

rheem.apps.tpch.csv.customer: URL to the CUSTOMER file
rheem.apps.tpch.csv.orders: URL to the ORDERS file
rheem.apps.tpch.csv.lineitem: URL to the LINEITEM file

Datasets. The datasets for this app can be generated with the TPC-H tools. The generated datasets can then be either put into a database and/or a filesystem.

SINDY

Description. This app provides the data profiling algorithm SINDY that discovers inclusion dependencies in a relational database.
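To illustrate what discovering inclusion dependencies means, here is a single-machine Python sketch for the unary case (one column included in another). It follows the general group-by-value-then-intersect idea and is only an illustration, not the app's distributed implementation; the function name and the nested-dict input layout are assumptions.

```python
from collections import defaultdict


def unary_inclusion_dependencies(tables):
    """Find all pairs (A, B) of distinct columns where every value of A also occurs in B.

    `tables` maps a table name to a dict of column name -> list of values.
    """
    # Step 1: for every distinct value, collect the set of columns containing it.
    columns_by_value = defaultdict(set)
    for table, columns in tables.items():
        for column, values in columns.items():
            for value in values:
                if value is not None:  # NULLs are ignored in this sketch
                    columns_by_value[value].add(f"{table}.{column}")
    # Step 2: the columns that include column A are those that co-occur with A
    # for every one of A's values, i.e. the intersection over A's values.
    candidates = {}
    for group in columns_by_value.values():
        for column in group:
            others = group - {column}
            if column in candidates:
                candidates[column] &= others
            else:
                candidates[column] = set(others)
    # Columns without any non-NULL value are skipped here, although they are
    # trivially included in every other column.
    return sorted((a, b) for a, refs in candidates.items() for b in refs)


tables = {
    "orders": {"custkey": [1, 2, 2]},
    "customer": {"custkey": [1, 2, 3]},
}
print(unary_inclusion_dependencies(tables))
```

In this example, `orders.custkey` is included in `customer.custkey` but not the other way around, which is the pattern a foreign key would produce.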

Running the app. To run the app, launch the main class:

org.qcri.rheem.apps.sindy.Sindy

Even though this app is written in Scala, you can launch it in a regular JVM. Run the app without parameters to get a description of the required parameters.

Datasets. Find below a list of datasets that can be used to benchmark Rheem in combination with this app:

CSV files generated with the TPC-H tools
other CSV files

SGD

Description. This app implements the stochastic gradient descent (SGD) algorithm. SGD is an optimization algorithm that minimizes a loss function and can be used in many supervised machine learning tasks. The current implementation uses the logistic loss and can thus be used for classification. Like many other machine learning techniques, SGD is a highly iterative algorithm.
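The iterative nature of SGD with the logistic loss can be seen in the following plain-Python sketch. It is a textbook formulation for 0/1 labels rather than the app's code; the function names, the fixed step size, and the per-epoch shuffling are assumptions chosen for illustration.

```python
import math
import random


def sgd_logistic(points, labels, epochs=100, step=0.1, seed=0):
    """Fit weights (the last entry is the bias) by minimizing the logistic loss with SGD.

    `labels` must contain 0 or 1.
    """
    rng = random.Random(seed)
    w = [0.0] * (len(points[0]) + 1)
    order = list(range(len(points)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the samples in a fresh random order each epoch
        for i in order:
            x = list(points[i]) + [1.0]  # append a constant feature for the bias
            z = sum(wj * xj for wj, xj in zip(w, x))
            prediction = 1.0 / (1.0 + math.exp(-z))
            # Gradient of the logistic loss for one sample: (prediction - label) * x.
            error = prediction - labels[i]
            w = [wj - step * error * xj for wj, xj in zip(w, x)]
    return w


def predict(w, point):
    """Classify a point as 1 if it lies on the positive side of the learned boundary."""
    z = sum(wj * xj for wj, xj in zip(w, list(point) + [1.0]))
    return 1 if z > 0 else 0


weights = sgd_logistic([[-2.0], [-1.0], [1.0], [2.0]], [0, 0, 1, 1])
print(predict(weights, [3.0]), predict(weights, [-3.0]))
```

Every update depends on the weights produced by the previous one, which is what makes the algorithm a demanding workload for data processing platforms.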

Running the app. To run the app, launch the main class:

org.qcri.rheem.apps.sgd.SGD

Even though this app is written in Scala, you can launch it in a regular JVM. Run the app without parameters to get a description of the required parameters.

Datasets. Find below a list of datasets that can be used to benchmark Rheem in combination with this app:

k-means

Description. This app implements k-means, a well-known method to cluster data points in a Euclidean space. Like many other machine learning techniques, k-means is an iterative algorithm.
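The iteration at the heart of k-means can be sketched as follows in plain Python, using Lloyd's algorithm on two-dimensional points. This is an illustration only, not the app's implementation; the function name, the fixed iteration count, and the random initialization from the input points are assumptions.

```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Cluster 2-D points with Lloyd's algorithm and return the final centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumed initialization: k distinct input points
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for x, y in points:
            nearest = min(
                range(k),
                key=lambda c: (x - centroids[c][0]) ** 2 + (y - centroids[c][1]) ** 2,
            )
            clusters[nearest].append((x, y))
        # Update step: move each centroid to the mean of its points,
        # keeping the old centroid if its cluster is empty.
        centroids = [
            (
                sum(p[0] for p in cluster) / len(cluster),
                sum(p[1] for p in cluster) / len(cluster),
            )
            if cluster
            else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids


print(sorted(kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], k=2)))
```

The two-column point layout mirrors the (x, y) schema used for the PostgreSQL variant of the app.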

Running the app. To run the app, launch the main class:

org.qcri.rheem.apps.kmeans.Kmeans

or

org.qcri.rheem.apps.kmeans.postgres.Kmeans

The former assumes data to reside in a filesystem, while the latter assumes data to reside in PostgreSQL. For the latter case, you will have to configure Rheem so that it can access the database. Even though this app is written in Scala, you can launch it in a regular JVM. Run the app without parameters to get a description of the required parameters.

Datasets. We provide a data generator to generate files that can be clustered. You can further load these files into the database assuming the following schema:

CREATE TABLE "<table_name_of_your_choice>" (x float8, y float8);

CrocoPR

Description. This app implements the cross-community PageRank: it takes two graphs as input, merges them, and runs a standard PageRank on the resulting graph. The preprocessing and PageRank steps typically lend themselves to execution on different platforms.
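The two phases described above, merging and ranking, can be sketched in plain Python. How the app actually merges the two graphs is not specified here, so the sketch assumes a deduplicated union of the edge lists; the function name, damping factor, and iteration count are likewise assumptions.

```python
def cross_community_pagerank(edges_a, edges_b, damping=0.85, iterations=50):
    """Merge two edge lists and run power-iteration PageRank on the result."""
    # Preprocessing: assumed merge strategy is a deduplicated union of (source, target) pairs.
    edges = set(edges_a) | set(edges_b)
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    # PageRank: start from a uniform distribution and iterate.
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread evenly over all nodes.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1.0 - damping) / len(nodes) + damping * dangling / len(nodes)
        new_rank = {node: base for node in nodes}
        for src, targets in out_links.items():
            for dst in targets:
                new_rank[dst] += damping * rank[src] / len(targets)
        rank = new_rank
    return rank


ranks = cross_community_pagerank([("a", "b")], [("b", "a"), ("c", "a")])
print(max(ranks, key=ranks.get))  # prints a
```

The merge is a set-oriented data preparation step, while the ranking is an iterative graph computation, which is why the two phases tend to favor different platforms.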

Running the app. To run the app, launch the main class:

org.qcri.rheem.apps.crocopr.CrocoPR

Even though this app is written in Scala, you can launch it in a regular JVM. Run the app without parameters to get a description of the required parameters.

Datasets. This app works on RDF files, more specifically the Wikipedia pagelinks via DBpedia. Note that this app requires two input files. For the purpose of benchmarking, it is fine to use the same input file twice.

Optimizer experiments

Optimizer scalability

Description. This app generates Rheem plans with specific predefined topologies but of arbitrary size. This makes it possible to determine experimentally how well Rheem's optimizer scales to large plans.

Running the app. To run the app, launch the main class:

org.qcri.rheem.apps.benchmark.OptimizerScalabilityTest

Even though this app is written in Scala, you can launch it in a regular JVM. Run the app without parameters to get a description of the required parameters. Furthermore, the following configuration keys may be of interest:

rheem.core.optimizer.pruning.strategies: controls the pruning strategy to be used when enumerating alternative plans
- admissible values: empty or comma-separated list of org.qcri.rheem.core.optimizer.enumeration.LatentOperatorPruningStrategy (default), org.qcri.rheem.core.optimizer.enumeration.TopKPruningStrategy, org.qcri.rheem.core.optimizer.enumeration.RandomPruningStrategy, and org.qcri.rheem.core.optimizer.enumeration.SinglePlatformPruningStrategy (order-sensitive)
rheem.core.optimizer.pruning.topk: controls the k for the top-k pruning
rheem.core.optimizer.enumeration.concatenationprio: controls the order of the enumeration
- admissible values: slots, plans, plans2, none, random
rheem.core.optimizer.enumeration.invertconcatenations: inverts the above-mentioned enumeration order
- admissible values: false (default), true
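When running a series of scalability experiments, it can help to enumerate the combinations of the enumeration settings listed above systematically. The following Python sketch does this; the helper names are hypothetical, and the `key = value` rendering assumes a Java-properties-style configuration file, which should be checked against the Rheem repository.

```python
from itertools import product

# Admissible values as listed above.
CONCATENATION_PRIORITIES = ["slots", "plans", "plans2", "none", "random"]
INVERT_CONCATENATIONS = ["false", "true"]


def enumeration_configs():
    """Yield one dict of configuration entries per combination of enumeration settings."""
    for prio, invert in product(CONCATENATION_PRIORITIES, INVERT_CONCATENATIONS):
        yield {
            "rheem.core.optimizer.enumeration.concatenationprio": prio,
            "rheem.core.optimizer.enumeration.invertconcatenations": invert,
        }


def to_properties(config):
    """Render a configuration dict as `key = value` lines (assumed file format)."""
    return "\n".join(f"{key} = {value}" for key, value in sorted(config.items()))


for config in enumeration_configs():
    print(to_properties(config))
    print()
```

Each printed block could be saved as a separate configuration file and paired with one run of the scalability test.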
