PuffinDB is built upon the strong belief that SQL is Cloud Data's preferred language, and that countless wonderful SQL query generators like Malloy, PRQL, Baxter, or WHERE TRUE will emerge over the coming years. Furthermore, this trend will be accelerated by the advent of large language models like OpenAI's ChatGPT. These query generators will all share a common set of requirements that could be factorized by PuffinDB, with the intent of reducing R&D efforts for vendors, and delivering a better overall experience to users:
- Integration with the user's client
- Integration with the user's data lake
- SQL parsing | serializing
- SQL query optimization
- Scale-out and scale-up
- IP protection
To achieve these goals, query generation is architected around four main components:
- A client-side DuckDB extension
- A cloud-side Query Proxy operated by PuffinDB
- A cloud-side Query Generator running on the vendor's cloud or the user's VPC (depending on IP protection requirements)
- A cloud-side Registry of supported query generators managed by PuffinDB
The puffindb
DuckDB extension is installed once by the user. From there, support for any number of query generators can be added from DuckDB's SQL API, using the public Registry managed by PuffinDB. Once a query is submitted by the user through DuckDB's SQL API, it is sent to the cloud-side Query Proxy, which forwards it to the Query Generator. The query is then translated into SQL and sent back to the user's DuckDB engine for execution.
Several options are currently being considered for the packaging of Query Generators. One of them is to use WebAssembly, which would allow both client-side and cloud-side execution. This proposed meta-extension mechanism would also make it easier to develop certain DuckDB extensions, without having to compile them for different platforms. Another option is to provide bindings between DuckDB and a collocated CPython runtime (barebone or via Pyodide).
The exact same mechanism used for remote query generation will be used for remote query optimization by the distributed query planner. Furthermore, the integration of technologies like WeTune within the distributed query planner will ensure that machine-generated queries are properly optimized, even in cases where DuckDB's built-in optimizer might prove suboptimal (query pre-optimization).
The exact same architecture will be used for delegating queries to third-party engines with SELECT THROUGH
. This feature can be used for:
- Data integration
- Data validation
- Outlier detection
- Error correction
- Prediction
- Recommendation
- Synthetic data generation
- etc.
Out of the box, PuffinDB will provide integration with the following databases:
- Athena
- Databricks
- DuckDB
- EMR Spark SQL
- RDS for Aurora
- RDS for MariaDB
- RDS for MySQL
- RDS for Oracle
- RDS for PostgreSQL
- RDS for SQL Server
- Redshift
- Snowflake
- SQLite
When delegating a subquery to a third-party SQL engine with SELECT THROUGH
, the PuffinDB extension will handle SQL dialect translation, using SQLGlot deployed on a collocated CPython runtime. This will make it easy to write composite queries involving multiple SQL engines while using a single dialect. The following dialects are currently supported:
- BigQuery
- ClickHouse
- Databricks
- Drill
- DuckDB
- Hive
- MySQL
- Oracle
- Postgres
- Presto
- Redshift
- Snowflake
- Spark
- SQLite
- StarRocks
- Tableau
- Teradata
- Trino
- T-SQL
- No need to develop and distribute any proprietary DuckDB extension
- No need to develop yet another SQL parser | serializer
- Access to the metadata of tables managed by the most popular lakehouses (Apache Iceberg, Apache Hudi, Delta Lake)
- Direct integration with the user's VPC
- Scale-out and scale-up of complex and | or large queries through distributed SQL engine
- IP protection when running query generator on vendor's cloud
- No need to install one DuckDB extension for every query generator
- Cloud-side acceleration of query generation and query optimization
- Faster queries (through query pre-optimization)
- Data and metadata confidentiality when running query generator on user's VPC or user's client