The idea is to provide a simpler (and more performant) process than the one discussed in the "Using DuckDB" documentation. We would use the DuckDB Java client, which avoids requiring the admin to install DuckDB on the system. I believe we could also have ERDDAP create the stub/view file, to simplify that step as well (alternatively, we could support simple configurations through datasets.xml, as well as pointing to a provided .db file to allow more complex configurations).
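To illustrate the stub/view idea: the "database" ERDDAP points at could be little more than a DuckDB view over the remote Parquet files. A minimal sketch, reusing the NOAA GHCN example below (the view name is hypothetical):

```sql
-- Install and load the httpfs extension so DuckDB can read from S3.
INSTALL httpfs;
LOAD httpfs;

-- A stub database could contain just a view like this; ERDDAP would then
-- query the view with ordinary SELECT ... WHERE statements.
CREATE VIEW ghcn_tmax_2023 AS
SELECT *
FROM read_parquet(
  's3://noaa-ghcn-pds/parquet/by_year/YEAR=2023/ELEMENT=TMAX/*.parquet',
  hive_partitioning = true);
```

ERDDAP (or the admin) would only need to materialize this view definition into a small .db file, or run it at dataset load time.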
DuckDB provides a Java client that can be installed through Maven (though it requires the C++ redistributable on Windows). We may be able to use ADBC (Apache Arrow Database Connectivity) for additional performance improvements.
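For reference, the JDBC-based client is published on Maven Central as org.duckdb:duckdb_jdbc, with the native library bundled in the jar; the Java ADBC artifacts live under the org.apache.arrow.adbc group (the ADBC artifact name and both version numbers below are examples and should be verified against Maven Central):

```xml
<!-- DuckDB's JDBC driver; bundles the native library -->
<dependency>
  <groupId>org.duckdb</groupId>
  <artifactId>duckdb_jdbc</artifactId>
  <version>1.1.3</version> <!-- example version; use the latest -->
</dependency>

<!-- ADBC driver manager (artifact name assumed; verify before use) -->
<dependency>
  <groupId>org.apache.arrow.adbc</groupId>
  <artifactId>adbc-driver-manager</artifactId>
  <version>0.14.0</version> <!-- example version -->
</dependency>
```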
Discussed in #447
Originally posted by boshek March 23, 2026
#215 delivered EDDTableFromParquetFiles, an implementation for reading local or cached Parquet files via parquet-java. But if my understanding is correct, this implementation reads Parquet files in full and then filters the result in memory.
That work is amazing, but I think it could be extended in this way: Parquet data in S3, queried live with DuckDB, without downloading files to the ERDDAP host first. This type of workflow can be so useful, and it would be nice to bring it to ERDDAP. As an example, here is DuckDB counting 4.36 million rows of data:
```
D select count(*)
  from read_parquet(
    's3://noaa-ghcn-pds/parquet/by_year/YEAR=2023/ELEMENT=TMAX/*.parquet',
    hive_partitioning = true);
┌────────────────┐
│  count_star()  │
│     int64      │
├────────────────┤
│    4355909     │
│ (4.36 million) │
└────────────────┘
```
and here it is using predicate pushdown to efficiently query these Parquet files stored in S3:
```
D select
    count(*) as n_rows,
    s_flag
  from read_parquet(
    's3://noaa-ghcn-pds/parquet/by_year/YEAR=2023/ELEMENT=TMAX/*.parquet',
    hive_partitioning = true
  )
  where ELEMENT = 'TMAX' and DATA_VALUE >= 300 and DATE = '20230105'
  group by S_FLAG;
┌────────┬─────────┐
│ n_rows │ S_FLAG  │
│ int64  │ varchar │
├────────┼─────────┤
│      5 │ U       │
│     19 │ 7       │
│    186 │ a       │
│     10 │ W       │
│    318 │ S       │
└────────┴─────────┘
```
Indeed #281 took one approach using DuckDB's JDBC driver.
This discussion proposes an alternative path that could avoid JDBC entirely. Full disclosure: some of this was fleshed out with Claude. It is still not entirely fleshed out (hence opening this as a discussion), but I am gauging whether anyone has tried this and whether folks would be interested:
## Proposal: EDDTableFromDuckDB via ADBC
ADBC (Arrow Database Connectivity) is a database connectivity standard built on Apache Arrow. DuckDB has first-class ADBC support, and Java bindings exist.
Critically, ADBC talks to DuckDB's native API rather than going through the JDBC layer, which means read_parquet('s3://...'), Hive-partitioned datasets, and Iceberg tables are all available without the .db file constraint that blocked #281.
The proposed data flow:
- User issues an ERDDAP request
- ERDDAP translates constraints to SQL WHERE clauses (same logic as EDDTableFromDatabase)
- DuckDB receives query via ADBC
- DuckDB pushes predicates down to Parquet row groups on S3
- Only matching data is returned across the wire
- ERDDAP maps Arrow record batches to its PrimitiveArray columns
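To make the translation step concrete: a hypothetical ERDDAP request on the GHCN dataset above, projecting a few columns and constraining DATE and DATA_VALUE, might be rewritten into SQL along these lines (the column names follow the example queries above; the exact SQL generated would depend on the implementation):

```sql
-- Generated from the ERDDAP query's projection and constraint list;
-- DuckDB uses Parquet row-group statistics and the Hive partition layout
-- to skip non-matching data, so only matching rows cross the wire.
SELECT DATE, DATA_VALUE, S_FLAG
FROM read_parquet(
  's3://noaa-ghcn-pds/parquet/by_year/YEAR=2023/ELEMENT=TMAX/*.parquet',
  hive_partitioning = true)
WHERE DATE = '20230105'
  AND DATA_VALUE >= 300;
```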
Is there appetite for a dataset type that introduces a DuckDB process dependency, or is the preference to keep ERDDAP dependencies minimal?