TPC-H queries implemented in Spark using the DataFrames API. Tested under Spark 2.4.0
Savvas Savvides
Under the dbgen directory do:
make
This should generate an executable called dbgen
./dbgen -h
gives you the various options for generating the tables. The simplest case is running:
./dbgen
which generates tables with extension .tbl
with scale 1 (default) for a total of rougly 1GB size across all tables. For different size tables you can use the -s
option:
./dbgen -s 10
will generate roughly 10GB of input data.
You can then either upload your data to hdfs or read them locally.
First compile using:
sbt package
Make sure you set the INPUT_DIR and OUTPUT_DIR in TpchQuery
class before compiling to point to the
location the of the input data and where the output should be saved.
You can then run a query using:
spark-submit --class "main.scala.TpchQuery" --master MASTER target/scala-2.11/spark-tpc-h-queries_2.11-1.0.jar ##
where ## is the number of the query to run e.g 1, 2, ..., 22 and MASTER specifies the spark-mode e.g local, yarn, standalone etc...
-
Data generator (http://www.tpc.org/tpch/)
-
TPC-H for Hive (https://issues.apache.org/jira/browse/hive-600)
-
TPC-H for PIG (https://github.com/ssavvides/tpch-pig)