This repository contains the code for PilotDB, an online middleware approximate query processing system that
- provides user-specified error guarantees (i.e., a priori error guarantees)
- requires no offline maintenance
- requires no modifications to the underlying database management system
- achieves up to 126x speedups
To achieve these properties, we develop the following techniques in PilotDB:
- TAQA: the two-stage query approximation middleware
- BSAP: efficient block sampling for approximate queries with statistical guarantees
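The two-stage idea behind TAQA can be illustrated with a minimal, self-contained sketch (ours, for illustration only — not the repository's implementation): run a cheap pilot query on a small sample to estimate the aggregate's variance, then derive how large the final-stage sample must be to hit a target relative error with a given failure probability.

```python
import math
import statistics
from statistics import NormalDist

def required_sample_size(pilot_sample, rel_error, failure_prob):
    """Normal-approximation sketch: how many rows the final-stage sample
    needs so that the sample mean lands within rel_error of the true mean
    with probability at least 1 - failure_prob."""
    mu = statistics.mean(pilot_sample)
    sd = statistics.stdev(pilot_sample)
    z = NormalDist().inv_cdf(1 - failure_prob / 2)  # two-sided critical value
    return math.ceil((z * sd / (rel_error * mu)) ** 2)

# Stage 1: a pilot sample of the aggregated column (hypothetical values).
pilot = [1.0, 2.0, 40.0, 3.0, 1.5, 2.5, 60.0, 2.0]
# Stage 2: size the final sample for 5% error at 5% failure probability.
n_final = required_sample_size(pilot, rel_error=0.05, failure_prob=0.05)
print(n_final)
```

Tighter error targets or lower failure probabilities drive the required sample size up, which is why the pilot stage's variance estimate matters.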
> **Warning**: We recommend running on an x86_64 (amd64) machine.
You will need to have the Docker engine installed on your machine.

- Copy the environment template and set your own environment variables:
  ```bash
  cp dockerfile/.env.example .env
  ```
- Build images and start containers:
  ```bash
  cd dockerfile
  docker compose up --build
  ```
- Test connections:
  ```bash
  curl http://localhost:8080/test_postgres_connection
  curl http://localhost:8080/test_sqlserver_connection
  curl http://localhost:8080/test_duckdb_connection
  ```
To use PilotDB, you will need to
- Populate the database with any data you like.
- Prepare a `yaml` file with database configurations and credentials. See examples in `db_configs`.
- Execute an approximate query by running `pilotdb_client.py` as follows:
  ```bash
  python pilotdb_client.py \
      --aqp \
      --query_file path/to/sql/file.sql \
      --db_config_file path/to/db/config/file.yml \
      --error 0.05 \
      --probability 0.05
  ```
  where `--error` and `--probability` specify the target error rate and the failure probability for a priori error guarantees.
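For reference, a database configuration file might look like the following. This is a sketch whose field names mirror the Python `db_config` dictionary shown later in this README; defer to the templates in `db_configs` for the exact schema.

```yaml
# Hypothetical db config sketch -- see db_configs/ for real templates.
dbms: postgres        # or duckdb, sqlserver
username: tester
dbname: testdb
host: hostname
port: 5432
password: password
```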
We use an example to show the power of PilotDB. To run this example, you will need at least 265 GB of disk space.

- Download the pre-generated DuckDB database for TPC-H 1TB:
  ```bash
  wget -O $DUCKDB_DATA https://blobs.duckdb.org/data/tpch-sf1000.db
  ```
- Start containers and test the connection:
  ```bash
  cd dockerfile
  docker compose up
  curl http://localhost:8080/test_duckdb_connection
  ```
- Run queries and compare latency:
  ```bash
  python pilotdb_client.py --mode aqp --query_file example/tpch-6.sql --db_config_file example/duckdb_config.yml --error 0.05 --probability 0.05
  python pilotdb_client.py --mode exact --query_file example/tpch-6.sql --db_config_file example/duckdb_config.yml --error 0.05 --probability 0.05
  ```
To set up PilotDB locally, make sure you have the following tools installed, which might require root permissions:

- uv
- Microsoft ODBC for SQL Server 2022
- PostgreSQL

Install Python packages:
```bash
uv sync
uv pip install -e .
```
Prepare db config files using the template in `db_configs`. Then, you can run PilotDB as a Python package:
```python
import pilotdb

db_config = {
    "dbms": "postgres",  # or duckdb, sqlserver
    "username": "tester",
    "dbname": "testdb",
    "host": "hostname",
    "port": "dbport",
    "password": "password",
}

conn = pilotdb.connect("postgres", db_config)
result = pilotdb.run(
    conn,
    query="SELECT AVG(x) FROM T",
    error=0.05,        # target relative error
    probability=0.05,  # failure probability
)
pilotdb.close(conn)
```
or via the `evaluate.py` script. For example,
```bash
python evaluate.py \
    --benchmark tpch \
    --qid 1 \
    --dbms postgres \
    --db_config_file db_configs/postgres.yml \
    --process_mode aqp \
    --error 0.05 \
    --failure_probability 0.05
```
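To make the a priori guarantee concrete: with an error of 0.05 and a failure probability of 0.05, the returned answer should deviate from the exact answer by more than 5% relative error with probability at most 5%. A tiny helper (ours, for illustration) that checks a single result against this promise:

```python
def within_guarantee(approx: float, exact: float, error: float = 0.05) -> bool:
    """True if `approx` is within the target relative error of `exact`."""
    return abs(approx - exact) <= error * abs(exact)

print(within_guarantee(101.0, 100.0))  # 1% relative error -> True
print(within_guarantee(110.0, 100.0))  # 10% relative error -> False
```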
You can follow the procedures below to reproduce the experimental results in our paper.
You need to set up the underlying database management system with data loaded.
- Make sure PostgreSQL and SQL Server are installed.
- Generate data for the benchmarks:
  - TPC-H: the official TPC-H toolkit.
  - SSB: we use the community version of the data generation script, since the original SSB dataset is not open-sourced. We double-checked the correctness of those scripts.
  - ClickBench: the official GitHub repo of ClickBench by ClickHouse.
  - Instacart: the official Instacart data released on Kaggle.
  - DSB: the official GitHub repo of DSB by Microsoft.
Based on the experiment you want to reproduce, you can run the corresponding script in the folder `experiments`.

- To run the experiments of the query speedups on various DBMSs and workloads (Section 5.3), execute the bash scripts `experiments/query_speedups/run_pilotdb_<benchmark>.sh`, where `<benchmark>` $\in$ {tpch, ssb, clickbench, instacart}.
- To run the experiments of the query speedups on skewed data distributions (Section 5.3), execute the bash script `experiments/query_speedups/run_pilotdb_dsb.sh`.
- To run the experiments of the query speedups under different target errors (Section 5.3), execute the bash script `experiments/query_speedups/run_pilotdb_varying_errors.sh`.
- To run the ablation study (Section 5.5), execute the bash scripts `experiments/ablation/run_pilotdb_<config>.sh`, where `<config>` $\in$ {oracle, uniform}.
- To run the experiments of PilotDB without BSAP (Supplementary Material B.1), execute the bash script `experiments/error_guarantees/run_pilotdb_nobsap.sh`.
- To run the sensitivity experiments (Supplementary Material B.2), execute the bash scripts `experiments/sensitivity/run_sensitivity_<knob>.sh`, where `<knob>` $\in$ {pilot-rate, selectivity}.
- To reproduce Figure 4, execute the bash script `experiments/others/sampling_method_system_efficiency/run.sh`.
- To reproduce the failure example in Section 4.3, execute the python script `others/standard_clt_coverage_example/run_duckdb.py`.
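The failure example above concerns the undercoverage of standard CLT confidence intervals on skewed data. The effect is easy to reproduce with a simplified Monte Carlo sketch (ours, not the repository's DuckDB experiment): on a log-normal population with small samples, the nominal 95% interval covers the true mean noticeably less than 95% of the time.

```python
import math
import random
import statistics

random.seed(0)
TRUE_MEAN = math.exp(0.5)          # mean of LogNormal(0, 1)
N, TRIALS, Z = 10, 2000, 1.96      # small samples, nominal 95% CI

covered = 0
for _ in range(TRIALS):
    sample = [random.lognormvariate(0.0, 1.0) for _ in range(N)]
    mean = statistics.mean(sample)
    half = Z * statistics.stdev(sample) / math.sqrt(N)  # standard CLT half-width
    if mean - half <= TRUE_MEAN <= mean + half:
        covered += 1

coverage = covered / TRIALS
print(f"empirical coverage: {coverage:.3f}")  # noticeably below the nominal 0.95
```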
If you use PilotDB, please cite:

```bibtex
@article{zhu2025pilotdb,
  title={PilotDB: Database-Agnostic Online Approximate Query Processing with A Priori Error Guarantees},
  author={Zhu, Yuxuan and Jin, Tengjun and Baziotis, Stefanos and Zhang, Chengsong and Mendis, Charith and Kang, Daniel},
  journal={Proceedings of the ACM on Management of Data},
  volume={3},
  number={3},
  pages={1--28},
  year={2025},
  publisher={ACM New York, NY, USA}
}
```