The idea is to provide a simpler (and more performant) process than the one discussed in the "Using DuckDB" documentation. We would use the DuckDB Java client, which avoids requiring the admin to install DuckDB on the system. I believe we could also have ERDDAP create the stub/view file, to simplify that step as well (alternatively, we could support simple configurations through datasets.xml, as well as pointing to a provided .db file to allow more complex configurations).
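To illustrate the stub/view idea: the "database" ERDDAP points at could be little more than a DuckDB view over the remote Parquet files. A minimal sketch, reusing the NOAA GHCN example below (the view name is hypothetical):

```sql
-- Install and load the httpfs extension so DuckDB can read from S3.
INSTALL httpfs;
LOAD httpfs;

-- A stub database could contain just a view like this; ERDDAP would then
-- query the view with ordinary SELECT ... WHERE statements.
CREATE VIEW ghcn_tmax_2023 AS
SELECT *
FROM read_parquet(
  's3://noaa-ghcn-pds/parquet/by_year/YEAR=2023/ELEMENT=TMAX/*.parquet',
  hive_partitioning = true);
```

ERDDAP (or the admin) would only need to materialize this view definition into a small .db file, or run it at dataset load time.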
DuckDB provides a Java client that can be installed through Maven (though it requires the C++ redistributable on Windows). We may be able to use ADBC (Apache Arrow Database Connectivity) for additional performance improvements.
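For reference, the JDBC-based client is published on Maven Central as org.duckdb:duckdb_jdbc, with the native library bundled in the jar; the Java ADBC artifacts live under the org.apache.arrow.adbc group (the ADBC artifact name and both version numbers below are examples and should be verified against Maven Central):

```xml
<!-- DuckDB's JDBC driver; bundles the native library -->
<dependency>
  <groupId>org.duckdb</groupId>
  <artifactId>duckdb_jdbc</artifactId>
  <version>1.1.3</version> <!-- example version; use the latest -->
</dependency>

<!-- ADBC driver manager (artifact name assumed; verify before use) -->
<dependency>
  <groupId>org.apache.arrow.adbc</groupId>
  <artifactId>adbc-driver-manager</artifactId>
  <version>0.14.0</version> <!-- example version -->
</dependency>
```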
Discussed in #447
Originally posted by boshek March 23, 2026
#215 delivered EDDTableFromParquetFiles, an implementation for reading local or cached Parquet files via parquet-java. But if my understanding is correct, this implementation reads Parquet files in full and then filters the result in memory.
That work is amazing, but I think it could be extended in this way: Parquet data in S3, queried live with DuckDB, without downloading files to the ERDDAP host first. This type of workflow can be so useful, and it would be nice to bring it to ERDDAP. As an example, here is DuckDB counting 4.36 million rows of data:
```
D select count(*)
  from read_parquet(
    's3://noaa-ghcn-pds/parquet/by_year/YEAR=2023/ELEMENT=TMAX/*.parquet',
    hive_partitioning = true);
┌────────────────┐
│  count_star()  │
│     int64      │
├────────────────┤
│    4355909     │
│ (4.36 million) │
└────────────────┘
```
and here it is using predicate pushdown to efficiently query these Parquet files stored in S3:
```
D select
    count(*) as n_rows,
    s_flag
  from read_parquet(
    's3://noaa-ghcn-pds/parquet/by_year/YEAR=2023/ELEMENT=TMAX/*.parquet',
    hive_partitioning = true
  )
  where ELEMENT = 'TMAX' and DATA_VALUE >= 300 and DATE = '20230105'
  group by S_FLAG;
┌────────┬─────────┐
│ n_rows │ S_FLAG  │
│ int64  │ varchar │
├────────┼─────────┤
│      5 │ U       │
│     19 │ 7       │
│    186 │ a       │
│     10 │ W       │
│    318 │ S       │
└────────┴─────────┘
```
Indeed #281 took one approach using DuckDB's JDBC driver.
This discussion proposes an alternative path that could avoid JDBC entirely. Full disclosure: some of this was fleshed out with Claude. It is still not entirely fleshed out (hence opening this as a discussion), but I am gauging whether anyone has tried this and whether folks would be interested:
## Proposal: EDDTableFromDuckDB via ADBC
ADBC (Arrow Database Connectivity) is a database connectivity standard built on Apache Arrow. DuckDB has first-class ADBC support, and Java bindings exist.
Critically, ADBC talks to DuckDB's native API rather than going through the JDBC layer, which means read_parquet('s3://...'), Hive-partitioned datasets, and Iceberg tables are all available without the .db file constraint that blocked #281.
The proposed data flow:
- User issues an ERDDAP request
- ERDDAP translates constraints to SQL WHERE clauses (same logic as EDDTableFromDatabase)
- DuckDB receives query via ADBC
- DuckDB pushes predicates down to Parquet row groups on S3
- Only matching data is returned across the wire
- ERDDAP maps Arrow record batches to its PrimitiveArray columns
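To make the translation step concrete: a hypothetical ERDDAP request on the GHCN dataset above, projecting a few columns and constraining DATE and DATA_VALUE, might be rewritten into SQL along these lines (the column names follow the example queries above; the exact SQL generated would depend on the implementation):

```sql
-- Generated from the ERDDAP query's projection and constraint list;
-- DuckDB uses Parquet row-group statistics and the Hive partition layout
-- to skip non-matching data, so only matching rows cross the wire.
SELECT DATE, DATA_VALUE, S_FLAG
FROM read_parquet(
  's3://noaa-ghcn-pds/parquet/by_year/YEAR=2023/ELEMENT=TMAX/*.parquet',
  hive_partitioning = true)
WHERE DATE = '20230105'
  AND DATA_VALUE >= 300;
```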
Is there appetite for a dataset type that introduces a DuckDB process dependency, or is the preference to keep ERDDAP dependencies minimal?