TPC-DS Benchmarking 2022

Introduction

This project is part of the Data Warehouse course of the Big Data Management and Analytics (BDMA) Erasmus Mundus Joint Master Degree Program. The purpose of this repository is to let others reproduce our results and to better explain the steps of a meaningful TPC-DS benchmark for those seeking open-source solutions.

Reproducing results

The TPC-DS tool itself is not part of this repository; it can be downloaded from the official website.
For our use case, the 3.x version of the TPC-DS tool had build problems, so we could either use an older refined version or build the latest version on Linux.
Our team tried to install WSL (Windows Subsystem for Linux) on Windows 11, but it had driver problems and internet connectivity issues inside Ubuntu, so we used Docker.
Building the TPC-DS tool itself:
1. On Docker/Linux: To build the tool on Ubuntu, install bison and flex, then run make to compile the C source code into executable utilities. This produces dsdgen.sh (Data Generator) and dsqgen.sh (Query Generator). Detailed instructions are provided in tpc-ds-tool > tools > How_To_Guide.doc under the Linux section.
2. On Windows: The tool has to be built using the oldest available Visual Studio Express (in our case 2017). We built the TPC-DS tool (TPCDS-KIT) on Windows just for exploration, but went with the Docker build since it supported the latest version 3.x instead of 2.x.
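For the Linux route, a small pre-flight check can confirm that the prerequisites named above are installed before running make. This is a minimal sketch, not part of the repository: the helper name missing_tools is hypothetical, and the tool list contains only what this README mentions (a C compiler is also needed but is not checked here).

```python
import shutil

# Tools named in the Linux build step above; extend the list if your setup needs more.
REQUIRED_TOOLS = ["bison", "flex", "make"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Install before running make:", ", ".join(missing))
    else:
        print("All build prerequisites found.")
```

On Ubuntu, anything reported as missing can usually be installed through the system package manager.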
Generating data:
1. On Linux: To generate the data, use the command dsdgen -scale 1 -dir ./tmp -suffix .csv -delimiter "^" -parallel 4 -child 1 -quiet n -terminate n &
2. On Windows: dsdgen /SCALE 1 /DIR .\tmp /suffix ".csv" /delimiter "^" /VERBOSE Y /PARALLEL 4 /CHILD 1 /QUIET N
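As we understand the -parallel and -child flags, each child writes only its own slice of the data, so the commands above (child 1 of 4) would need to be repeated for children 2 to 4 to obtain the full data set. The sketch below builds one command per child with the same flags as the Linux command. The function names, the ./dsdgen binary path, and the default output directory are assumptions for illustration.

```python
import subprocess


def build_dsdgen_commands(scale=1, out_dir="./tmp", parallel=4, binary="./dsdgen"):
    """Build one dsdgen argument list per child, mirroring the flags used above."""
    commands = []
    for child in range(1, parallel + 1):
        commands.append([
            binary,
            "-scale", str(scale),
            "-dir", out_dir,
            "-suffix", ".csv",
            "-delimiter", "^",
            "-parallel", str(parallel),
            "-child", str(child),
            "-quiet", "n",
            "-terminate", "n",
        ])
    return commands


def run_all(commands):
    """Start every child process, then wait for all of them and return exit codes."""
    processes = [subprocess.Popen(cmd) for cmd in commands]
    return [process.wait() for process in processes]
```

Calling run_all(build_dsdgen_commands()) from the directory that contains the built binary would start the four children concurrently, which replaces the trailing & of the shell command.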
Generating queries:
1. On Linux: ./dsqgen -DIRECTORY ../query_templates -INPUT ../query_templates/templates.lst -VERBOSE Y -QUALIFY Y -DIALECT netezza
2. On Windows: ./dsqgen /DIRECTORY ../query_templates /INPUT ../query_templates/templates.lst /VERBOSE Y /QUALIFY Y /DIALECT netezza
Once the data and queries have been generated, the Python notebooks in the repository are self-explanatory. Nevertheless:
1. preprocess_db_setup_load_script.ipynb sets up the database and loads the data.
2. query_run_test_script.ipynb was used for a test run of all 99 queries (it took about 2.5 hours at scale factor 1, and we identified 23 queries that need to be updated to match PostgreSQL syntax).
3. The all_queries folder holds the 99 queries and two text files: one listing the queries that produced errors and the other containing the full results of the initial test run.
4. The folder all_queries > updated_queries contains the queries that have been optimized/modified.
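The test run described in the list above can be pictured as a loop that executes each query once, times it, and records any error. The following is a minimal sketch under stated assumptions, not the code from the notebooks: run_test and failed_queries are hypothetical names, and execute stands for any callable that runs one SQL string and raises on failure (for example, a thin wrapper around a database cursor).

```python
import time


def run_test(queries, execute):
    """Run each query once and record elapsed seconds or the error message.

    `queries` maps a query name to its SQL text.
    `execute` runs one SQL string and raises an exception on failure.
    """
    results = {}
    for name, sql in queries.items():
        start = time.perf_counter()
        try:
            execute(sql)
            error = None
        except Exception as exc:  # keep going so one bad query does not stop the run
            error = str(exc)
        results[name] = {
            "ok": error is None,
            "seconds": time.perf_counter() - start,
            "error": error,
        }
    return results


def failed_queries(results):
    """Return the sorted names of the queries that raised an error."""
    return sorted(name for name, result in results.items() if not result["ok"])
```

Writing failed_queries(results) to one text file and the full results dictionary to another would give the same kind of two-file summary that the all_queries folder contains.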

Exceptional Circumstances:

To resolve the "_END not defined" error, add one more step: ensure that the file query_templates/netezza.tpl contains the following line:

define _END = "";
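This check can also be scripted so that it is applied before every dsqgen run. The sketch below appends the line only if it is missing. The function name ensure_end_defined is hypothetical, and the path passed to it should be adjusted to wherever your copy of the template lives.

```python
from pathlib import Path

END_DEFINE = 'define _END = "";'


def ensure_end_defined(template_path):
    """Append the _END definition to the template if it is not already present.

    Returns True if the file was modified, False if the line already existed.
    """
    path = Path(template_path)
    text = path.read_text()
    if any(line.strip() == END_DEFINE for line in text.splitlines()):
        return False
    # Add a newline first when the existing content does not end with one.
    separator = "" if (not text or text.endswith("\n")) else "\n"
    path.write_text(text + separator + END_DEFINE + "\n")
    return True
```

For example, ensure_end_defined("query_templates/netezza.tpl") can be run repeatedly without duplicating the line.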
