- This repository hosts the source code and supplementary materials for our:
- VLDB 2024 submission (accepted), GPTuner: A Manual-Reading Database Tuning System via GPT-Guided Bayesian Optimization
- SIGMOD 2024 Demo submission (accepted), A Demonstration of GPTuner: A GPT-Based Manual-Reading Database Tuning System
- GPTuner collects and refines heterogeneous domain knowledge, unifies a structured view of the refined knowledge, and uses the knowledge to (1) select important knobs, (2) optimize the value range of each knob and (3) explore the optimized space with a novel Coarse-to-Fine Bayesian Optimization Framework.
- A video demonstration is available on YouTube!
Stay tuned for the latest updates and enhancements in this project!
Remember to star and subscribe for the newest features and improvements!
GPTuner is a manual-reading database tuning system that suggests satisfactory knob configurations with reduced tuning costs. The figure above presents the tuning workflow, which involves seven steps:
1. User provides the DBMS to be tuned (e.g., PostgreSQL or MySQL), the target workload, and the optimization objective (e.g., latency or throughput).
2. GPTuner collects and refines heterogeneous knowledge from different sources (e.g., GPT-4, DBMS manuals, and web forums) to construct the Tuning Lake, a collection of DBMS tuning knowledge.
3. GPTuner unifies the refined tuning knowledge from the Tuning Lake into a structured view accessible to machines (e.g., JSON); see the sketch after this list.
4. GPTuner reduces the dimensionality of the search space by selecting important knobs to tune (fewer knobs to tune means fewer dimensions).
5. GPTuner optimizes the search space in terms of the value range of each knob based on the structured knowledge.
6. GPTuner explores the optimized space via a novel Coarse-to-Fine Bayesian Optimization framework.
7. Finally, GPTuner identifies satisfactory knob configurations within the resource limits (e.g., the maximum optimization time or number of iterations specified by the user).
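To make step 3 concrete, the sketch below illustrates what a structured piece of knob knowledge might look like after unification. It is only an illustration: the knob name and the field names (`suggested_values`, `min_value`, `max_value`, `special_value`) are assumptions for this example, not the exact schema produced by GPTuner.

```python
# Illustrative sketch of structured knob knowledge (field names are assumptions,
# not GPTuner's exact schema): one entry of the machine-readable view built in step 3.
knob_knowledge = {
    "shared_buffers": {
        "suggested_values": ["25% of RAM"],  # distilled from GPT-4, manuals, and forums
        "min_value": "128MB",                # lower bound hinted by the tuning knowledge
        "max_value": "40% of RAM",           # upper bound hinted by the tuning knowledge
        "special_value": None,               # a value with special semantics, if any
    }
}

print(knob_knowledge["shared_buffers"]["suggested_values"])
```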
Step 1: Install PostgreSQL
The following instructions have been tested on Ubuntu 20.04 and PostgreSQL v14.9:
sudo apt-get update
sudo apt-get install postgresql-14
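Once PostgreSQL is installed, you can verify that the server accepts connections with a short check like the one below. This is a minimal sketch: the connection parameters are assumptions that depend on your local setup, and it requires `psycopg2` (e.g., `pip install psycopg2-binary`).

```python
# Minimal connectivity check. The connection parameters below are assumptions for a
# local default install; adjust them to your environment.
import psycopg2

conn = psycopg2.connect(
    host="localhost",
    port=5432,
    dbname="postgres",
    user="postgres",
    password="your_password",  # hypothetical credential, replace with your own
)
with conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone()[0])  # e.g., "PostgreSQL 14.9 ..."
conn.close()
```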
Step 2: Install BenchBase with our script
- Note: the script has been tested with `openjdk version "17.0.8.1" 2023-08-24` (you may need to update OpenJDK to version 21 to keep pace with the newest BenchBase); please prepare your Java environment first
cd ./scripts
sh install_benchbase.sh postgres
Step 3: Build the target benchmark
- Note: modify `./benchbase/target/benchbase-postgres/config/postgres/sample_{your_target_benchmark}_config.xml` to customize your tuning settings first
sh build_benchmark.sh postgres tpch
Step 4: Install Python dependencies
sudo pip install -r requirements.txt
Step 5: Run GPTuner
- Note: modify `configs/postgres.ini` to specify the target DBMS first; the `restart` and `recover` commands depend on your environment, and we provide a Docker version
- Note: modify `src/run_gptuner.py` to set up your `api_base`, `api_key` and `model` first
# PYTHONPATH=src python src/run_gptuner.py <dbms> <benchmark> <timeout> <seed>
PYTHONPATH=src python src/run_gptuner.py postgres tpch 180 -seed=100
where `<dbms>` specifies the DBMS (e.g., postgres or mysql), `<benchmark>` is the target workload (e.g., tpch or tpcc), `<timeout>` is the maximum time allowed to stress-test the benchmark, and `<seed>` is the random seed used by the optimizer.
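Before launching a long tuning run, you may want to sanity-check the connection, `restart`, and `recover` settings you edited in `configs/postgres.ini`. The following is a minimal sketch that only prints whatever sections and keys the file actually contains; no specific key names are assumed.

```python
# Print every section and key/value pair in configs/postgres.ini so the DBMS
# connection, restart, and recover settings can be double-checked before tuning.
import configparser

config = configparser.ConfigParser(interpolation=None)  # raw values, no '%' interpolation
config.read("configs/postgres.ini")

for section in config.sections():
    print(f"[{section}]")
    for key, value in config.items(section):
        print(f"  {key} = {value}")
```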
The optimization result is stored in `optimization_results/{dbms}/{stage}/{seed}/runhistory.json`, where `{dbms}` is the target DBMS, `{stage}` is coarse or fine, and `{seed}` is the random seed given by the user.
- the `data` block contains the following information; we explain the project-related fields below (for more details, please refer to the SMAC3 library):
  - `config_id`: `i` is the identifier for the knob configuration evaluated in the i-th iteration
  - `instance`
  - `budget`
  - `seed`
  - `cost`: the optimization objective (e.g., throughput or latency)
  - `time`
  - `status`
  - `starttime`
  - `endtime`
  - `additional_info`
- the `configs` block contains the knob configuration of the i-th iteration, for example:
"configs": {
"1": {
"effective_io_concurrency": 200,
"random_page_cost": 1.2
},
}
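To pull the best-performing configuration out of `runhistory.json`, a small helper like the one below can be used. This is a minimal sketch under two assumptions: each item in the `data` block is a list whose fields follow the order given above (`config_id`, `instance`, `budget`, `seed`, `cost`, ...), which may differ across SMAC3 versions, and the file path is just an example matching the command shown earlier (postgres, fine stage, seed 100).

```python
# Find the knob configuration with the lowest cost in a runhistory.json file.
# Assumption: each item in "data" is a list ordered as
# [config_id, instance, budget, seed, cost, time, status, starttime, endtime, additional_info];
# adjust the indices if your SMAC3 version uses a different layout.
import json

with open("optimization_results/postgres/fine/100/runhistory.json") as f:  # example path
    runhistory = json.load(f)

best_entry = min(runhistory["data"], key=lambda entry: entry[4])  # entry[4] is cost
best_config_id, best_cost = best_entry[0], best_entry[4]

print(f"best config id: {best_config_id}, cost: {best_cost}")
print(runhistory["configs"][str(best_config_id)])  # the corresponding knob settings
```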
Step 1: Complete Steps 1 to 4 in the Quick Start section
Step 2: Launch the GUI with Streamlit
PYTHONPATH=src python -m streamlit run src/demo/entrypage.py
Step 3: Follow our video demonstration to use the GUI
We compare GPTuner with state-of-the-art methods that either use or do not use natural language knowledge as input:
- DB-BERT (SIGMOD '22): a DBMS tuning tool that uses BERT to read the manuals and uses the gained information to guide Reinforcement Learning (RL)
- SMAC: the best Bayesian Optimization (BO)-based method evaluated in an Experimental Evaluation (VLDB '22)
- GP: the classic Gaussian Process-based BO approach used in iTuned (VLDB '09) and OtterTune (SIGMOD '17)
- DDPG++: an RL-based tuning method proposed in CDBTune (SIGMOD '19) and improved in Inquiry (VLDB '21)
We compare GPTuner with baselines on different DBMS (PostgreSQL and MySQL), benchmarks (TPC-H and TPC-C) and metrics (throughput and latency). We present the results on PostgreSQL in this repository. For more details, please refer to our paper or technical report.
- `configs/`
  - `postgres.ini`: Configuration file to optimize PostgreSQL
  - `mysql.ini`: Configuration file to optimize MySQL
- `optimization_results/`
  - `temp_results/`: Temporary storage for optimization results
  - `postgres/`
    - `coarse/`: Coarse-stage optimization results for PostgreSQL
    - `fine/`: Fine-stage optimization results for PostgreSQL
- `scripts/`
  - `install_benchbase.sh`: Script to install the BenchBase benchmark tool
  - `build_benchmark.sh`: Script to build benchmark environments
  - `recover_postgres.sh`: Script to recover the state of the PostgreSQL database
  - `recover_mysql.sh`: Script to recover the state of the MySQL database
- `knowledge_collection/`
  - `postgres/`
    - `target_knobs.txt`: List of target knobs for PostgreSQL tuning
    - `knob_info/`
      - `system_view.json`: Information from PostgreSQL system views (pg_settings)
      - `official_document.json`: Information from PostgreSQL official documentation
    - `knowledge_sources/`
      - `gpt/`: Knowledge sourced from GPT models
      - `manual/`: Knowledge from DBMS manuals
      - `web/`: Knowledge extracted from web sources
      - `dba/`: Knowledge from database administrators
    - `tuning_lake/`: Data lake for DBMS tuning knowledge
    - `structured_knowledge/`
      - `special/`: Specialized structured knowledge
      - `normal/`: General structured knowledge
    - `example_pool/`: Pool of examples for the prompt ensemble algorithm
- `sql/`: Provide SQL statements if you need query-level knob selection
- `src/`: Source code
  - `demo/`: Module to execute the GUI (demonstration code)
  - `dbms/`
    - `dbms_template.py`: Template for database management systems
    - `postgres.py`: Implementation for PostgreSQL
    - `mysql.py`: Implementation for MySQL
  - `knowledge_handler/`
    - `gpt.py`: Module for interactions with GPT
    - `knowledge_preparation.py`: Module for knowledge preparation (Sec. 5.1)
    - `knowledge_transformation.py`: Module for knowledge transformation (Sec. 5.2)
  - `space_optimizer/`
    - `knob_selection.py`: Module for knob selection (Sec. 6.1)
    - `default_space.py`: Definition of the default search space
    - `coarse_space.py`: Definition of the coarse search space (Sec. 6.2)
    - `fine_space.py`: Definition of the fine search space (Sec. 6.2)
  - `config_recommender/`
    - `workload_runner.py`: Module to run workloads
    - `coarse_stage.py`: Recommender for the coarse-stage configuration (Sec. 7)
    - `fine_stage.py`: Recommender for the fine-stage configuration (Sec. 7)
  - `run_gptuner.py`: Main script to run GPTuner
- Paper version
  - GPTuner uses the OpenAI completion API of `gpt-4` or `gpt-3.5-turbo`
  - GPTuner leverages tuning knowledge from `GPT-4`, `DBMS official manuals` and `web contents`
  - GPTuner supports `PostgreSQL` and `MySQL`
  - GPTuner stress-tests workloads through the `BenchBase` tool
- Future implementation (we warmly invite and appreciate your contributions!)
  - GPTuner employs `locally deployed large language models` as well
  - GPTuner collects web contents through `web-gpt` and `web-crawler`
  - GPTuner uses a `generic` stress-test tool, supporting `any given workload` optimization
  - GPTuner refines its `knowledge_collection` with a `human-in-the-loop` mechanism
  - GPTuner supports more `DBMS`
  - to be continued...
If you use this codebase, or otherwise find our work valuable, please cite:
@article{10.14778/3659437.3659449,
author = {Lao, Jiale and Wang, Yibo and Li, Yufei and Wang, Jianping and Zhang, Yunjia and Cheng, Zhiyuan and Chen, Wanghu and Tang, Mingjie and Wang, Jianguo},
title = {GPTuner: A Manual-Reading Database Tuning System via GPT-Guided Bayesian Optimization},
year = {2024},
issue_date = {April 2024},
publisher = {VLDB Endowment},
volume = {17},
number = {8},
issn = {2150-8097},
url = {https://doi.org/10.14778/3659437.3659449},
doi = {10.14778/3659437.3659449},
abstract = {Modern database management systems (DBMS) expose hundreds of configurable knobs to control system behaviours. Determining the appropriate values for these knobs to improve DBMS performance is a long-standing problem in the database community. As there is an increasing number of knobs to tune and each knob could be in continuous or categorical values, manual tuning becomes impractical. Recently, automatic tuning systems using machine learning methods have shown great potentials. However, existing approaches still incur significant tuning costs or only yield sub-optimal performance. This is because they either ignore the extensive domain knowledge available (e.g., DBMS manuals and forum discussions) and only rely on the runtime feedback of benchmark evaluations to guide the optimization, or they utilize the domain knowledge in a limited way. Hence, we propose GPTuner, a manual-reading database tuning system that leverages domain knowledge extensively and automatically to optimize search space and enhance the runtime feedback-based optimization process. Firstly, we develop a Large Language Model (LLM)-based pipeline to collect and refine heterogeneous knowledge, and propose a prompt ensemble algorithm to unify a structured view of the refined knowledge. Secondly, using the structured knowledge, we (1) design a workload-aware and training-free knob selection strategy, (2) develop a search space optimization technique considering the value range of each knob, and (3) propose a Coarse-to-Fine Bayesian Optimization Framework to explore the optimized space. Finally, we evaluate GPTuner under different benchmarks (TPC-C and TPC-H), metrics (throughput and latency) as well as DBMS (PostgreSQL and MySQL). Compared to the state-of-the-art approaches, GPTuner identifies better configurations in 16x less time on average. Moreover, GPTuner achieves up to 30\% performance improvement (higher throughput or lower latency) over the best-performing alternative.},
journal = {Proc. VLDB Endow.},
month = {may},
pages = {1939--1952},
numpages = {14}
}
@inproceedings{10.1145/3626246.3654739,
author = {Lao, Jiale and Wang, Yibo and Li, Yufei and Wang, Jianping and Zhang, Yunjia and Cheng, Zhiyuan and Chen, Wanghu and Zhou, Yuanchun and Tang, Mingjie and Wang, Jianguo},
title = {A Demonstration of GPTuner: A GPT-Based Manual-Reading Database Tuning System},
year = {2024},
isbn = {9798400704222},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3626246.3654739},
doi = {10.1145/3626246.3654739},
abstract = {Selecting appropriate values for the configurable knobs of Database Management Systems (DBMS) is crucial to improve performance. But because such complexity has surpassed the abilities of even the best human experts, database community turns to machine learning (ML)-based automatic tuning systems. However, these systems still incur significant tuning costs or only yield sub-optimal performance, attributable to their overly high reliance on black-box optimization and an oversight of domain knowledge. This paper demonstrates GPTuner, a manual-reading database tuning system that leverages Large Language Model (LLM) to bridge the gap between black-box optimization and white-box domain knowledge. This demonstration empowers (1) regular users with limited tuning experience to gain qualitative insights on the features of knobs, and optimize their DBMS performance automatically and efficiently, (2) database administrators and experts to further enhance GPTuner by simply contributing their invaluable tuning suggestions in natural language. Finally, we offer visitors the opportunity to explore a range of DBMS and optimization metrics, coupled with the flexibility to tailor their target workloads to their specific needs.},
booktitle = {Companion of the 2024 International Conference on Management of Data},
pages = {504--507},
numpages = {4},
keywords = {bayesian optimization, database tuning, large language model},
location = {Santiago, Chile},
series = {SIGMOD/PODS '24}
}