Query Proxy

PuffinDB is built upon the strong belief that SQL is Cloud Data's preferred language, and that countless wonderful SQL query generators like Malloy, PRQL, Baxter, or WHERE TRUE will emerge over the coming years. Furthermore, this trend will be accelerated by the advent of large language models like OpenAI's ChatGPT. These query generators will all share a common set of requirements that could be factorized by PuffinDB, with the intent of reducing R&D efforts for vendors, and delivering a better overall experience to users:

Integration with the user's client
Integration with the user's data lake
SQL parsing | serializing
SQL query optimization
Scale-out and scale-up
IP protection

Architecture

To achieve these goals, query generation is architected around four main components:

A client-side DuckDB extension
A cloud-side Query Proxy operated by PuffinDB
A cloud-side Query Generator running on the vendor's cloud or the user's VPC (depending on IP protection requirements)
A cloud-side Registry of supported query generators managed by PuffinDB

The puffindb DuckDB extension is installed once by the user. From there, support for any number of query generators can be added from DuckDB's SQL API, using the public Registry managed by PuffinDB. Once a query is submitted by the user through DuckDB's SQL API, it is sent to the cloud-side Query Proxy, which forwards it to the Query Generator. The query is then translated into SQL and sent back to the user's DuckDB engine for execution.

Several options are currently being considered for the packaging of Query Generators. One of them is to use WebAssembly, which would allow both client-side and cloud-side execution. This proposed meta-extension mechanism would also make it easier to develop certain DuckDB extensions, without having to compile them for different platforms. Another option is to provide bindings between DuckDB and a collocated CPython runtime (barebone or via Pyodide).

Query Optimization

The exact same mechanism used for remote query generation will be used for remote query optimization by the distributed query planner. Furthermore, the integration of technologies like WeTune within the distributed query planner will ensure that machine-generated queries are properly optimized, even in cases where DuckDB's built-in optimizer might prove suboptimal (query pre-optimization).

Query Delegation

The exact same architecture will be used for delegating queries to third-party engines with SELECT THROUGH. This feature can be used for:

Data integration
Data validation
Outlier detection
Error correction
Prediction
Recommendation
Synthetic data generation
etc.

Out of the box, PuffinDB will provide integration with the following databases:

Athena
Databricks
DuckDB
EMR Spark SQL
RDS for Aurora
RDS for MariaDB
RDS for MySQL
RDS for Oracle
RDS for PostgreSQL
RDS for SQL Server
Redshift
Snowflake
SQLite

Dialect Translation

When delegating a subquery to a third-party SQL engine with SELECT THROUGH, the PuffinDB extension will handle SQL dialect translation, using SQLGlot deployed on a collocated CPython runtime. This will make it easy to write composite queries involving multiple SQL engines while using a single dialect. The following dialects are currently supported:

BigQuery
ClickHouse
Databricks
Drill
DuckDB
Hive
MySQL
Oracle
Postgres
Presto
Redshift
Snowflake
Spark
SQLite
StarRocks
Tableau
Teradata
Trino
T-SQL

Benefits for Vendors

No need to develop and distribute any proprietary DuckDB extension
No need to develop yet another SQL parser | serializer
Access to the metadata of tables managed by the most popular lakehouses (Apache Iceberg, Apache Hudi, Delta Lake)
Direct integration with the user's VPC
Scale-out and scale-up of complex and | or large queries through distributed SQL engine
IP protection when running query generator on vendor's cloud

Benefits for Users

No need to install one DuckDB extension for every query generator
Cloud-side acceleration of query generation and query optimization
Faster queries (through query pre-optimization)
Data and metadata confidentiality when running query generator on user's VPC or user's client

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Query Proxy.md

Query Proxy.md

Query Proxy

Architecture

Query Optimization

Query Delegation

Dialect Translation

Benefits for Vendors

Benefits for Users

Files

Query Proxy.md

Latest commit

History

Query Proxy.md

File metadata and controls

Query Proxy

Architecture

Query Optimization

Query Delegation

Dialect Translation

Benefits for Vendors

Benefits for Users