This repository contains the code for PilotDB, an online middleware approximate query processing system that
- provides user-specified error guarantees (i.e., a priori error guarantees)
- requires no offline maintenance
- requires no modifications to the underlying database management system
- achieves up to 126x speedups
To achieve these properties, we develop the following techniques in PilotDB:
- TAQA: the two-stage query approximation middleware
- BSAP: efficient block sampling for approximate queries with statistical guarantees
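The two-stage idea behind TAQA can be illustrated with a minimal, self-contained sketch (ours, for illustration only — not the repository's implementation): run a cheap pilot query on a small sample to estimate the aggregate's variance, then derive how large the final-stage sample must be to hit a target relative error with a given failure probability.

```python
import math
import statistics
from statistics import NormalDist

def required_sample_size(pilot_sample, rel_error, failure_prob):
    """Normal-approximation sketch: how many rows the final-stage sample
    needs so that the sample mean lands within rel_error of the true mean
    with probability at least 1 - failure_prob."""
    mu = statistics.mean(pilot_sample)
    sd = statistics.stdev(pilot_sample)
    z = NormalDist().inv_cdf(1 - failure_prob / 2)  # two-sided critical value
    return math.ceil((z * sd / (rel_error * mu)) ** 2)

# Stage 1: a pilot sample of the aggregated column (hypothetical values).
pilot = [1.0, 2.0, 40.0, 3.0, 1.5, 2.5, 60.0, 2.0]
# Stage 2: size the final sample for 5% error at 5% failure probability.
n_final = required_sample_size(pilot, rel_error=0.05, failure_prob=0.05)
print(n_final)
```

Tighter error targets or lower failure probabilities drive the required sample size up, which is why the pilot stage's variance estimate matters.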
> **Warning**: We recommend running on an x86_64 (amd64) machine.
You will need to have the Docker engine installed on your machine.

- Copy the environment template and set your own environment variables:
  ```bash
  cp dockerfile/.env.example .env
  ```
- Build images and start containers:
  ```bash
  cd dockerfile
  docker compose up --build
  ```
- Test connections:
  ```bash
  curl http://localhost:8080/test_postgres_connection
  curl http://localhost:8080/test_sqlserver_connection
  curl http://localhost:8080/test_duckdb_connection
  ```
To use PilotDB, you will need to
- Populate the database with any data you like.
- Prepare a `yaml` file with database configurations and credentials. See examples in `db_configs`.
- Execute an approximate query by running `pilotdb_client.py` as follows:
  ```bash
  python pilotdb_client.py \
      --aqp \
      --query_file path/to/sql/file.sql \
      --db_config_file path/to/db/config/file.yml \
      --error 0.05 \
      --probability 0.05
  ```
  where `--error` and `--probability` specify the target error rate and the failure probability for a priori error guarantees.
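For reference, a database configuration file might look like the following. This is a sketch whose field names mirror the Python `db_config` dictionary shown later in this README; defer to the templates in `db_configs` for the exact schema.

```yaml
# Hypothetical db config sketch -- see db_configs/ for real templates.
dbms: postgres        # or duckdb, sqlserver
username: tester
dbname: testdb
host: hostname
port: 5432
password: password
```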
We use an example to show the power of PilotDB. To run this example, you will need at least 265 GB of disk space.

- Download the pre-generated DuckDB database for TPC-H 1TB:
  ```bash
  wget -O $DUCKDB_DATA https://blobs.duckdb.org/data/tpch-sf1000.db
  ```
- Start containers and test the connection:
  ```bash
  cd dockerfile
  docker compose up
  curl http://localhost:8080/test_duckdb_connection
  ```
- Run queries and compare latency:
  ```bash
  python pilotdb_client.py --mode aqp --query_file example/tpch-6.sql --db_config_file example/duckdb_config.yml --error 0.05 --probability 0.05
  python pilotdb_client.py --mode exact --query_file example/tpch-6.sql --db_config_file example/duckdb_config.yml --error 0.05 --probability 0.05
  ```
To set up PilotDB locally, make sure you have the following tools installed, which might require root permissions:

- uv
- Microsoft ODBC for SQL Server 2022
- PostgreSQL

Install Python packages:
```bash
uv sync
uv pip install -e .
```
Prepare db config files using the template in `db_configs`. Then, you can run PilotDB as a Python package:
```python
import pilotdb

db_config = {
    "dbms": "postgres",  # or duckdb, sqlserver
    "username": "tester",
    "dbname": "testdb",
    "host": "hostname",
    "port": "dbport",
    "password": "password",
}

conn = pilotdb.connect("postgres", db_config)
result = pilotdb.run(
    conn,
    query="SELECT AVG(x) FROM T",
    error=0.05,        # target relative error
    probability=0.05,  # failure probability
)
pilotdb.close(conn)
```
or via the `evaluate.py` script. For example,
```bash
python evaluate.py \
    --benchmark tpch \
    --qid 1 \
    --dbms postgres \
    --db_config_file db_configs/postgres.yml \
    --process_mode aqp \
    --error 0.05 \
    --failure_probability 0.05
```
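To make the a priori guarantee concrete: with an error of 0.05 and a failure probability of 0.05, the returned answer should deviate from the exact answer by more than 5% relative error with probability at most 5%. A tiny helper (ours, for illustration) that checks a single result against this promise:

```python
def within_guarantee(approx: float, exact: float, error: float = 0.05) -> bool:
    """True if `approx` is within the target relative error of `exact`."""
    return abs(approx - exact) <= error * abs(exact)

print(within_guarantee(101.0, 100.0))  # 1% relative error -> True
print(within_guarantee(110.0, 100.0))  # 10% relative error -> False
```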
You can follow the procedures below to reproduce the experimental results in our paper.
You need to set up the underlying database management system with data loaded.
- Make sure PostgreSQL and SQL Server are installed.
- Generate data for the benchmarks:
  - TPC-H: the official TPC-H toolkit.
  - SSB: we use the community version of the data generation script, since the original SSB dataset is not open-sourced. We double-checked the correctness of those scripts.
  - ClickBench: the official GitHub repo of ClickBench by ClickHouse.
  - Instacart: the official Instacart data released on Kaggle.
  - DSB: the official GitHub repo of DSB by Microsoft.
Based on the experiment you want to reproduce, you can run the corresponding script in the folder `experiments`.

- To run the experiments of the query speedups on various DBMSs and workloads (Section 5.3), execute the bash scripts `experiments/query_speedups/run_pilotdb_<benchmark>.sh`, where `<benchmark>` $\in$ {tpch, ssb, clickbench, instacart}.
- To run the experiments of the query speedups on skewed data distributions (Section 5.3), execute the bash script `experiments/query_speedups/run_pilotdb_dsb.sh`.
- To run the experiments of the query speedups under different target errors (Section 5.3), execute the bash script `experiments/query_speedups/run_pilotdb_varying_errors.sh`.
- To run the ablation study (Section 5.5), execute the bash scripts `experiments/ablation/run_pilotdb_<config>.sh`, where `<config>` $\in$ {oracle, uniform}.
- To run the experiments of PilotDB without BSAP (Supplementary Material B.1), execute the bash script `experiments/error_guarantees/run_pilotdb_nobsap.sh`.
- To run the sensitivity experiments (Supplementary Material B.2), execute the bash scripts `experiments/sensitivity/run_sensitivity_<knob>.sh`, where `<knob>` $\in$ {pilot-rate, selectivity}.
- To reproduce Figure 4, execute the bash script `experiments/others/sampling_method_system_efficiency/run.sh`.
- To reproduce the failure example in Section 4.3, execute the python script `others/standard_clt_coverage_example/run_duckdb.py`.
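The failure example above concerns the undercoverage of standard CLT confidence intervals on skewed data. The effect is easy to reproduce with a simplified Monte Carlo sketch (ours, not the repository's DuckDB experiment): on a log-normal population with small samples, the nominal 95% interval covers the true mean noticeably less than 95% of the time.

```python
import math
import random
import statistics

random.seed(0)
TRUE_MEAN = math.exp(0.5)          # mean of LogNormal(0, 1)
N, TRIALS, Z = 10, 2000, 1.96      # small samples, nominal 95% CI

covered = 0
for _ in range(TRIALS):
    sample = [random.lognormvariate(0.0, 1.0) for _ in range(N)]
    mean = statistics.mean(sample)
    half = Z * statistics.stdev(sample) / math.sqrt(N)  # standard CLT half-width
    if mean - half <= TRUE_MEAN <= mean + half:
        covered += 1

coverage = covered / TRIALS
print(f"empirical coverage: {coverage:.3f}")  # noticeably below the nominal 0.95
```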
If you use PilotDB, please cite:

```bibtex
@article{zhu2025pilotdb,
  title={PilotDB: Database-Agnostic Online Approximate Query Processing with A Priori Error Guarantees},
  author={Zhu, Yuxuan and Jin, Tengjun and Baziotis, Stefanos and Zhang, Chengsong and Mendis, Charith and Kang, Daniel},
  journal={Proceedings of the ACM on Management of Data},
  volume={3},
  number={3},
  pages={1--28},
  year={2025},
  publisher={ACM New York, NY, USA}
}
```