A curated paper list of awesome AI4DB theory, frameworks, resources, tools and other awesomeness, for data engineers.
The repository is under construction. New PRs are welcome; please follow this entry format:
paperName (with PDF link) [MeetingName Year] GitHub link if the code is open-sourced (optional)
Thanks to all authors of the papers/repositories I cite :D
- AI4DB Paper Sets
- LEON: A New Framework for ML-Aided Query Optimization [VLDB 23]
- LOGER: A Learned Optimizer towards Generating Efficient and Robust Query Execution Plans [VLDB 23]
- Eraser: Eliminating Performance Regression on Learned Query Optimizer [VLDB 24]
- AutoSteer: Learned Query Optimization for Any SQL Database [VLDB 24]
- Modeling Shifting Workloads for Learned Database Systems [SIGMOD 24]
- Stage: Query Execution Time Prediction in Amazon Redshift [SIGMOD 24]
- Roq: Robust Query Optimization Based on a Risk-aware Learned Cost Model [arXiv 24]
- RobOpt: A Tool for Robust Workload Optimization Based on Uncertainty-Aware Machine Learning [SIGMOD Demo 24]
- Towards Exploratory Query Optimization for Template-based SQL Workloads [ICDE 24]
- DSB: a decision support benchmark for workload-driven and traditional database systems [VLDB 21]
- Expand your training limits! generating training data for ml-based data management [VLDB 21]
- LearnedSQLGen: Constraint-aware SQL Generation using Reinforcement Learning [SIGMOD 22]
- Hit the Gym: Accelerating Query Execution to Efficiently Bootstrap Behavior Models for Self-Driving Database Management Systems [VLDB 24]
- Cardinality Estimation: An Experimental Survey [VLDB 17]
- Are We Ready For Learned Cardinality Estimation? [VLDB 21]
- Cardinality Estimation in DBMS: A Comprehensive Benchmark Evaluation [VLDB 21]
- Learned cardinality estimation: A design space exploration and a comparative evaluation [VLDB 22]
- Learned Cardinality Estimation: An In-depth Study [SIGMOD 22]
- A Comparative Study and Component Analysis of Query Plan Representation Techniques in ML4DB Studies [VLDB 24]
- Selectivity estimation for range predicates using lightweight models [VLDB 19]
- Deep learning models for selectivity estimation of multiattribute queries [SIGMOD 20]
- Learned Cardinalities: Estimating Correlated Joins with Deep Learning [CIDR 19]
- An End-to-End Learning-based Cost Estimator [VLDB 19]
- Flow-Loss: Learning Cardinality Estimates That Matter [VLDB 21]
- Speeding Up End-to-end Query Execution via Learning-based Progressive Cardinality Estimation [SIGMOD 23]
- Robust Query Driven Cardinality Estimation under Changing Workloads [VLDB 23]
- AutoCE: An Accurate and Efficient Model Advisor for Learned Cardinality Estimation [ICDE 23]
- Asm: Harmonizing autoregressive model, sampling, and multi-dimensional statistics merging for cardinality estimation [SIGMOD 24]
- Adding Domain Knowledge to Query-Driven Learned Databases [SIGMOD 24]
- Self-tuning, gpu-accelerated kernel density models for multidimensional selectivity estimation [SIGMOD 15]
- Deep Unsupervised Cardinality Estimation [VLDB 19]
- Quicksel: Quick selectivity learning with mixture models [SIGMOD 20]
- Pre-training Summarization Models of Structured Datasets for Cardinality Estimation [VLDB 22]
- DeepDB: Learn from Data, not from Queries! [VLDB 20]
- NeuroCard: One Cardinality Estimator for All Tables [VLDB 21]
- FLAT: Fast, Lightweight and Accurate Method for Cardinality Estimation [VLDB 21]
- BayesCard: Revitilizing Bayesian Frameworks for Cardinality Estimation [arXiv 21]
- Glue: Adaptively Merging Single Table Cardinality to Estimate Join Query Size [arXiv 21]
- Fauce: fast and accurate deep ensembles with uncertainty for cardinality estimation [VLDB 21]
- FACE: a normalizing flow based cardinality estimator [VLDB 22]
- FactorJoin: A New Cardinality Estimation Framework for Join Queries [SIGMOD 22] (Bounded)
- Cardinality Estimation of LIKE Predicate Queries using Deep Learning [SIGMOD 25]
- Bao: Making Learned Query Optimization Practical [SIGMOD 21]
- FASTgres: Making Learned Query Optimizer Hinting Effective [VLDB 23]
- COOOL: A Learning-To-Rank Approach for SQL Hint Recommendations [VLDB 23]
- Zero-Shot Cost Models for Out-of-the-box Learned Cost Prediction [VLDB 22]
- Cost-based or Learning-based? A Hybrid Query Optimizer for Query Plan Selection [VLDB 22]
- Lero: A Learning-to-Rank Query Optimizer [VLDB 23]
- Lero: applying learning-to-rank in query optimizer [VLDBJ 24]
- Learning to Optimize Join queries With Deep Reinforcement Learning [SIGMOD 16]
- Deep Reinforcement Learning for Join Order Enumeration [arXiv 18]
- Reinforcement Learning with Tree-LSTM for Join Order Selection [ICDE 20]
- Db2une: Tuning Under Pressure via Deep Learning [VLDB 24]
- PilotScope: Steering Databases with Machine Learning Drivers [VLDB 24]
- Cosine: A Cloud-Cost Optimized Self-Designing Key-Value Storage Engine [VLDB 22]
- TreeLine: An Update-In-Place Key-Value Store for Modern Storage [VLDB 22]
- Learning to Optimize LSM-trees: Towards A Reinforcement Learning based Key-Value Store for Dynamic Workloads [SIGMOD 23]
- Limousine: Blending Learned and Classical Indexes to Self-Design Larger-than-Memory Cloud Storage Engines [SIGMOD 24]
- The Case for Learned Index Structures [SIGMOD 18]
- FITing-Tree: A Data-aware Index Structure [SIGMOD 19]
- ALEX: An Updatable Adaptive Learned Index [arXiv 20]
- The PGM-index: a fully-dynamic compressed learned index with provable worst-case bounds [VLDB 20]
- RadixSpline: a single-pass learned index [aiDM 20]
- Why Are Learned Indexes So Effective? [ICML 20]
- A Pluggable Learned Index Method via Sampling and Gap Insertion [arXiv 21]
- Updatable Learned Index with Precise Positions [VLDB 21]
- The next 50 years in database indexing or: the case for automatically generated index structures [VLDB 21]
- Tuning Hierarchical Learned Indexes on Disk and Beyond [SIGMOD 22]
- APEX: A High-Performance Learned Index on Persistent Memory [VLDB 22]
- FINEdex: A Fine-grained Learned Index Scheme for Scalable and Concurrent Memory Systems [VLDB 22]
- Are Updatable Learned Indexes Ready? [VLDB 22]
- CARMI: A Cache-Aware Learned Index with a Cost-based Construction Algorithm [VLDB 22]
- NFL: Robust Learned Index via Distribution Transformation [VLDB 22]
- Cutting Learned Index into Pieces: An In-depth Inquiry into Updatable Learned Indexes [ICDE 23]
- Learning Multi-dimensional Indexes [SIGMOD 20]
- LISA: A Learned Index Structure for Spatial Data [SIGMOD 20]
- Effectively Learning Spatial Indices [VLDB 20]
- The ML-Index: A Multidimensional, Learned Index for Point, Range, and Nearest-Neighbor Queries [EDBT 20]
- Tsunami: A Learned Multi-dimensional Index for Correlated Data and Skewed Workloads [VLDB 21]
- NEIST: a Neural-Enhanced Index for Spatio-Temporal Queries [TKDE 21]
- RW-Tree: A Learned Workload-aware Framework for R-tree Construction [ICDE 22]
- The Case for Automatic Database Administration using Deep Reinforcement Learning [arXiv 18]
- AI Meets AI: Leveraging Query Executions to Improve Index Recommendations [SIGMOD 19]
- Online Index Selection Using Deep Reinforcement Learning for a Cluster Database [ICDEW 20]
- SMARTIX: A database indexing agent based on reinforcement learning [Applied Intelligence 20]
- Magic mirror in my hand, which is the best in the land? An Experimental Evaluation of Index Selection Algorithms [VLDB 20]
- An Index Advisor Using Deep Reinforcement Learning [CIKM 20]
- Automated Database Indexing Using Model-Free Reinforcement Learning [ICAPS 20]
- DBA bandits: Self-driving index tuning under ad-hoc, analytical workloads with safety guarantees [ICDE 21]
- Index selection for NoSQL database with deep reinforcement learning [Information Sciences 21]
- MANTIS: Multiple Type and Attribute Index Selection using Deep Reinforcement Learning [IDEAS 21]
- AutoIndex: An Incremental Index Management System for Dynamic Workloads [ICDE 22]
- SWIRL: Selection of Workload-aware Indexes using Reinforcement Learning [EDBT 22]
- Indexer++: Workload-Aware Online Index Tuning with Transformers and Reinforcement Learning [SAC 22]
- Budget-aware Index Tuning with Reinforcement Learning [SIGMOD 22]
- ISUM: Efficiently Compressing Large and Complex Workloads for Scalable Index Tuning [SIGMOD 22]
- DISTILL: low-overhead data-driven techniques for filtering and costing indexes for scalable index tuning [VLDB 22]
- HMAB: Self-Driving Hierarchy of Bandits for Integrated Physical Database Design Tuning [VLDB 22]
- SmartIndex: An Index Advisor with Learned Cost Estimator [CIKM 22]
- Learned Index Benefits: Machine Learning Based Index Performance Estimation [VLDB 23]
- No DBA? No Regret! Multi-Armed Bandits for Index Tuning of Analytical and HTAP Workloads With Provable Guarantees [TKDE 23]
- IA2: Leveraging Instance-Aware Index Advisor with Reinforcement Learning for Diverse Workloads [EuroMLSys 24]
- Leveraging Dynamic and Heterogeneous Workload Knowledge to Boost the Performance of Index Advisors [PVLDB 24]
- Refactoring Index Tuning Process with Benefit Estimation [PVLDB 24]
- Breaking It Down: An In-Depth Study of Index Advisors [PVLDB 24]
- TRAP: Tailored Robustness Assessment for Index Advisors via Adversarial Perturbation [ICDE 24]
- Automatic Database Index Tuning: A Survey [TKDE 24]
- Robustness of Updatable Learning-based Index Advisors against Poisoning Attack [SIGMOD 24]
- Wii: Dynamic Budget Reallocation In Index Tuning [SIGMOD 24]
- Wred: Workload Reduction for Scalable Index Tuning [SIGMOD 24]
- ML-Powered Index Tuning: An Overview of Recent Progress and Open Challenges [SIGMOD 24]
- Automatic Database Management System Tuning Through Large-scale Machine Learning [SIGMOD 17]
- Deploying a Steered Query Optimizer in Production at Microsoft [SIGMOD 22]
- Detect, Distill and Update: Learned DB Systems Facing Out of Distribution Data [SIGMOD 23]
- AutoSteer: Learned Query Optimization for Any SQL Database [SIGMOD 23]
- Auto-WLM: Machine Learning Enhanced Workload Management in Amazon Redshift [SIGMOD 23]
- GPTuner: A Manual-Reading Database Tuning System via GPT-Guided Bayesian Optimization [VLDB 24]
- D-Bot: Database Diagnosis System using Large Language Models [VLDB 24]
- LLMTune: Accelerate Database Knob Tuning with Large Language Models [VLDB 24]
- ArcheType: A Novel Framework for Open-Source Column Type Annotation using Large Language Models [VLDB 24]
- A Survey on Large Language Models for Code Generation [arXiv 24]
- Fuzz4All: Universal Fuzzing with Large Language Models [ICSE 24]
- LLM-PBE: Assessing Data Privacy in Large Language Models [VLDB 24]
- Are Large Language Models a Good Replacement of Taxonomies? [VLDB 24]
- A survey on augmenting knowledge graphs (KGs) with large language models (LLMs): models, evaluation metrics, benchmarks, and challenges [Discover Artificial Intelligence 24]
- 𝜆-Tune: Harnessing Large Language Models for Automated Database System Tuning [SIGMOD 25]
- LLM-R2: A Large Language Model Enhanced Rule-based Rewrite System for Boosting Query Efficiency [VLDB 25]
- Chameleon: a Heterogeneous and Disaggregated Accelerator System for Retrieval-Augmented Language Models [VLDB 25]
- Large Language Model-Based Agents for Software Engineering: A Survey [arXiv 25]
- Evaluating Instruction-Tuned Large Language Models on Code Comprehension and Generation [arXiv 25]