This repository provides example applications and further benchmarking tools to evaluate and get started with Rheem.
Below we provide detailed information on our various benchmark components, including running instructions. For the configuration of Rheem itself, please consult the Rheem repository or feel free to reach out on Gitter.
Description. This app takes a text input file and counts the number of occurrences of each word in the text. This simple app has become a sort of "Hello World" program for data processing systems.
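To make the task concrete, here is a minimal single-machine sketch of word counting in plain Python. It does not use the Rheem API, and the tokenization rule (lowercased runs of letters, digits, and apostrophes) is an assumption; the app itself may split words differently.

```python
import re
from collections import Counter


def count_words(text: str) -> Counter:
    """Count how often each word occurs in `text`."""
    # Assumption: a word is a lowercased run of letters, digits, or apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(words)
```

For example, `count_words("to be or not to be")` maps both `to` and `be` to 2, and `or` and `not` to 1.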
Running the app. To run the app, launch the main class:
org.qcri.rheem.apps.wordcount.WordCountScala
Even though this app is written in Scala, you can launch it in a regular JVM. Run the app without parameters to get a description of the required parameters.
Datasets. Find below a list of datasets that can be used to benchmark Rheem in combination with this app:
- DBpedia - Long abstracts. NB: Consider stripping the RDF container around the abstracts. It's not necessary, though.
Description. Akin to Google's Word2Vec, this app creates vector representations of the words in a corpus based on their neighbors. This app is a bit simpler in the sense that it calculates the average neighborhood of each word rather than determining a lower-dimensional representation. The resulting vectors can be used, e.g., to cluster words and find related terms.
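The idea of an "average neighborhood" can be sketched as follows in plain Python. This is an illustration under assumptions, not the app's code: the function name, the symmetric `window` parameter, and the sparse dictionary representation of the vectors are all hypothetical.

```python
from collections import defaultdict


def average_neighborhoods(tokens, window=1):
    """Map each word to the average count vector of the words around it."""
    sums = defaultdict(lambda: defaultdict(float))  # word -> neighbor -> summed count
    occurrences = defaultdict(int)                  # word -> number of occurrences
    for i, word in enumerate(tokens):
        occurrences[word] += 1
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                sums[word][tokens[j]] += 1.0
    # Divide the summed neighbor counts by the number of occurrences of the word.
    return {
        word: {nb: total / occurrences[word] for nb, total in neighbors.items()}
        for word, neighbors in sums.items()
    }
```

With the tokens `a b a c` and a window of 1, the word `a` occurs twice and sees `b` twice and `c` once, so its vector is `{"b": 1.0, "c": 0.5}`.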
Running the app. To run the app, launch the main class:
org.qcri.rheem.apps.simwords.Word2NVec
Even though this app is written in Scala, you can launch it in a regular JVM. Run the app without parameters to get a description of the required parameters.
Datasets. Find below a list of datasets that can be used to benchmark Rheem in combination with this app:
- DBpedia - Long abstracts. NB: Consider stripping the RDF container around the abstracts. It's not necessary, though.
Description. This app executes a query from the established TPC-H benchmark. We provide several variants that work either on data in databases, in files, or in a mixture of both. Thus, this app requires cross-platform execution.
Running the app. To run the app, launch the main class:
org.qcri.rheem.apps.tpch.TpcH
Even though this app is written in Scala, you can launch it in a regular JVM. Run the app without parameters to get a description of the required parameters. Note that you will have to configure Rheem such that it can access the database. Furthermore, this app depends on the following configuration keys:
- rheem.apps.tpch.csv.customer: URL to the CUSTOMER file
- rheem.apps.tpch.csv.orders: URL to the ORDERS file
- rheem.apps.tpch.csv.lineitem: URL to the LINEITEM file
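As an illustration, the three keys could be set as follows, assuming they are placed in a Java properties file that Rheem is configured to read. The file locations are hypothetical placeholders; replace them with the URLs of your generated TPC-H files.

```properties
# Hypothetical locations of the generated TPC-H files.
rheem.apps.tpch.csv.customer = file:///path/to/tpch/customer.csv
rheem.apps.tpch.csv.orders = file:///path/to/tpch/orders.csv
rheem.apps.tpch.csv.lineitem = file:///path/to/tpch/lineitem.csv
```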
Datasets. The datasets for this app can be generated with the TPC-H tools. The generated datasets can then be put into a database, a filesystem, or both.
Description. This app provides the data profiling algorithm SINDY that discovers inclusion dependencies in a relational database.
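For readers unfamiliar with inclusion dependencies: a column A is included in a column B if every value of A also occurs in B. The following plain-Python sketch discovers such unary dependencies on a single machine by grouping columns per value and intersecting the groups. It illustrates the general idea only; the input layout and the function name are assumptions, and the app's implementation may differ in detail.

```python
from collections import defaultdict


def unary_inds(tables):
    """Find unary inclusion dependencies (A, B), meaning A is included in B.

    `tables` maps a table name to a list of rows; each row maps a column
    name to a value. NULLs (None) are ignored.
    """
    # Step 1: for every value, collect the set of columns it occurs in.
    columns_by_value = defaultdict(set)
    for table, rows in tables.items():
        for row in rows:
            for column, value in row.items():
                if value is not None:
                    columns_by_value[value].add(f"{table}.{column}")
    # Step 2: A can only be included in columns that contain every value of A,
    # so intersect the column sets over all values of A.
    candidates = {}
    for columns in columns_by_value.values():
        for a in columns:
            others = columns - {a}
            candidates[a] = others if a not in candidates else candidates[a] & others
    return sorted((a, b) for a, refs in candidates.items() for b in refs)
```

Given an `orders` table whose `custkey` values all appear in `customer.custkey`, the result contains the pair `("orders.custkey", "customer.custkey")`.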
Running the app. To run the app, launch the main class:
org.qcri.rheem.apps.sindy.Sindy
Even though this app is written in Scala, you can launch it in a regular JVM. Run the app without parameters to get a description of the required parameters.
Datasets. Find below a list of datasets that can be used to benchmark Rheem in combination with this app:
- CSV files generated with the TPC-H tools
- other CSV files
Description. This app implements the stochastic gradient descent (SGD) algorithm. SGD is an optimization algorithm that minimizes a loss function and can be used in many supervised machine learning tasks. The current implementation uses the logistic loss and can thus be used for classification. Like many other machine learning techniques, SGD is a highly iterative algorithm.
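The following plain-Python sketch shows SGD with the logistic loss on a single machine. It is meant to illustrate the iterative update only; the function names, the fixed step size, the bias handling, and the number of epochs are assumptions and not taken from the app.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sgd_logistic(points, labels, step=0.1, epochs=100, seed=0):
    """Fit weights (bias last) for 0/1 labels, one data point per update."""
    rng = random.Random(seed)
    weights = [0.0] * (len(points[0]) + 1)
    indices = list(range(len(points)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the points in random order
        for i in indices:
            x = list(points[i]) + [1.0]  # constant feature for the bias
            prediction = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
            error = prediction - labels[i]  # derivative of the logistic loss w.r.t. the score
            weights = [w - step * error * xi for w, xi in zip(weights, x)]
    return weights


def predict(weights, point):
    """Classify a point as 1 if its predicted probability is at least 0.5."""
    x = list(point) + [1.0]
    return 1 if sigmoid(sum(w * xi for w, xi in zip(weights, x))) >= 0.5 else 0
```

Each update touches a single data point, which is why the algorithm needs many passes over the data.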
Running the app. To run the app, launch the main class:
org.qcri.rheem.apps.sgd.SGD
Even though this app is written in Scala, you can launch it in a regular JVM. Run the app without parameters to get a description of the required parameters.
Datasets. Find below a list of datasets that can be used to benchmark Rheem in combination with this app:
Description. k-means is a well-known method to cluster data points in a Euclidean space. Like many other machine learning techniques, k-means is an iterative algorithm.
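As a reference for what each iteration does, here is a plain-Python sketch of Lloyd's algorithm for two-dimensional points. The random initialization, the fixed iteration count, and the handling of empty clusters are assumptions made for this sketch; the app may make different choices.

```python
import math
import random


def kmeans(points, k, iterations=20, seed=0):
    """Cluster 2D points (x, y) and return the k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: start from k random input points
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl))
            if cl else centroids[c]
            for c, cl in enumerate(clusters)
        ]
    return centroids
```

The assignment and update steps are repeated until the iteration budget is used up, which is what makes the algorithm iterative.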
Running the app. To run the app, launch the main class:
org.qcri.rheem.apps.kmeans.Kmeans
or
org.qcri.rheem.apps.kmeans.postgres.Kmeans
The former assumes data to reside in a filesystem, while the latter assumes data to reside in PostgreSQL. For the latter case, you will have to configure Rheem such that it can access the database. Even though this app is written in Scala, you can launch it in a regular JVM. Run the app without parameters to get a description of the required parameters.
Datasets. We provide a data generator to generate files that can be clustered. You can further load these files into the database assuming the following schema:
CREATE TABLE "<table_name_of_your_choice>" (x float8, y float8);
Description. This app implements the cross-community PageRank: It takes as input two graphs, merges them, and runs a standard PageRank on the resulting graph. The preprocessing and PageRank steps typically lend themselves to be executed on different platforms.
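The two phases can be sketched in plain Python as follows. This is a single-machine illustration under assumptions: "merging" is taken to mean the union of the links of both graphs, and the damping factor, iteration count, and treatment of nodes without outgoing links are not taken from the app.

```python
def merge_graphs(edges_a, edges_b):
    """Preprocessing: union of two edge lists with duplicate links removed."""
    return sorted(set(edges_a) | set(edges_b))


def pagerank(edges, damping=0.85, iterations=50):
    """Standard PageRank by power iteration over (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for src, targets in out_links.items():
            for dst in targets:
                new_rank[dst] += damping * rank[src] / len(targets)
        rank = new_rank
    return rank
```

Calling `pagerank(merge_graphs(edges_a, edges_b))` runs both phases in sequence; the merge is a set-oriented step, whereas the ranking is an iterative graph computation.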
Running the app. To run the app, launch the main class:
org.qcri.rheem.apps.crocopr.CrocoPR
Even though this app is written in Scala, you can launch it in a regular JVM. Run the app without parameters to get a description of the required parameters.
Datasets. This app works on RDF files, more specifically the Wikipedia pagelinks via DBpedia. Note that this app requires two input files. For the purpose of benchmarking, it is fine to use the same input file twice.
Description. This app generates Rheem plans with specific predefined topologies but of arbitrary size. This makes it possible to experimentally determine how well Rheem's optimizer scales to large plans.
Running the app. To run the app, launch the main class:
org.qcri.rheem.apps.benchmark.OptimizerScalabilityTest
Even though this app is written in Scala, you can launch it in a regular JVM. Run the app without parameters to get a description of the required parameters. Furthermore, the following configuration keys may be of interest:
- rheem.core.optimizer.pruning.strategies: controls the pruning strategy to be used when enumerating alternative plans
  - admissible values: empty or comma-separated list of org.qcri.rheem.core.optimizer.enumeration.LatentOperatorPruningStrategy (default), org.qcri.rheem.core.optimizer.enumeration.TopKPruningStrategy, org.qcri.rheem.core.optimizer.enumeration.RandomPruningStrategy, and org.qcri.rheem.core.optimizer.enumeration.SinglePlatformPruningStrategy (order-sensitive)
- rheem.core.optimizer.pruning.topk: controls the k for the top-k pruning
- rheem.core.optimizer.enumeration.concatenationprio: controls the order of the enumeration
  - admissible values: slots, plans, plans2, none, random
- rheem.core.optimizer.enumeration.invertconcatenations: inverts the above-mentioned enumeration order
  - admissible values: false (default), true
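As an illustration, one possible combination of these settings is shown below, assuming the keys are placed in a Java properties file that Rheem is configured to read. The chosen values, including the k of 8, are arbitrary examples picked from the admissible values listed above and are not recommendations.

```properties
# Illustrative values only: use top-k pruning with k = 8.
rheem.core.optimizer.pruning.strategies = org.qcri.rheem.core.optimizer.enumeration.TopKPruningStrategy
rheem.core.optimizer.pruning.topk = 8
rheem.core.optimizer.enumeration.concatenationprio = plans
rheem.core.optimizer.enumeration.invertconcatenations = false
```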