Replies: 13 comments 8 replies
-
@boshek @ChrisJohnNOAA First, I am not certain exactly how ERDDAP™ accesses data in a Parquet file; I will leave that to Chris. But since you like Claude.AI, give it this prompt: "Does the Java Parquet driver require you to read in the entire file, or does it allow you to select without reading the entire file into memory?" You will see that, no, it is not required to read the entire file into memory to make an extract.

Generally we do not like to add dataset types that are specific to one application. But if I understand your suggestion in a broader sense, you are requesting that we add ADBC as one of the available services alongside JDBC. This would still mean that ERDDAP™ is accessed in the usual way, through a URL, but ERDDAP™'s access would be more efficient for supported "filetypes". This might be considered since it is application neutral, but as with all requests we have to weigh the time and effort to develop it against the benefits relative to what already exists, how many people would benefit from this, and how this compares to other things that our development efforts could be put to.
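For reference, `parquet-java` does support row-group filtering and predicate pushdown, so a selective extract does not require materializing the whole file. A minimal sketch (the file name, column name, and filter value are hypothetical, and this assumes `parquet-avro` and the Hadoop client jars on the classpath):

```java
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.filter2.compat.FilterCompat;
import org.apache.parquet.filter2.predicate.FilterApi;
import org.apache.parquet.filter2.predicate.FilterPredicate;
import org.apache.parquet.hadoop.ParquetReader;

public class SelectiveParquetRead {
  public static void main(String[] args) throws Exception {
    // Predicate pushdown: row groups whose column statistics cannot
    // contain station_id == 42 are skipped without being read from disk.
    FilterPredicate pred = FilterApi.eq(FilterApi.intColumn("station_id"), 42);

    try (ParquetReader<GenericRecord> reader =
        AvroParquetReader.<GenericRecord>builder(new Path("obs.parquet"))
            .withFilter(FilterCompat.get(pred))
            .build()) {
      GenericRecord rec;
      while ((rec = reader.read()) != null) {
        System.out.println(rec); // only rows matching the predicate
      }
    }
  }
}
```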
-
The `EDDTableFromParquetFiles` implementation currently reads entire local files. You can see the read implementation here: https://github.com/ERDDAP/erddap/blob/main/WEB-INF/classes/gov/noaa/pfel/coastwatch/pointdata/Table.java#L14059 We could possibly add support for partial file reads (I think it is likely very complicated to get significant performance benefits; my guess is the main benefit of what DuckDB is doing above is reducing over-the-wire data, not local file parsing) and for cloud files. If there's an easy way to support that through `EDDTableFromDuckDB`, it might be a more maintainable approach.

`EDDTableFromDuckDB` via ADBC sounds interesting. Can DuckDB ADBC be installed through Maven (this is how we manage dependencies for ERDDAP)? The instructions involve manually downloading a library, which may be difficult to support broadly for ERDDAP installations across all operating systems. I also didn't see examples for Java, which concerned me. I'd also be curious which dependencies we'd need to add (Apache Arrow, DuckDB ADBC, others?), how large they are, and how well maintained they are. Is there a full list of the cloud providers and data formats it supports? That last question is about defining how much benefit this would bring to ERDDAP.
-
This seems really promising to me! I would love to be able to use ERDDAP to read cloud-hosted Hive-partitioned datasets efficiently. Interestingly, the ADBC Java offering seems to be a JDBC wrapper: https://arrow.apache.org/adbc/current/java/quickstart.html It is installable via Maven, with a few dependencies: https://central.sonatype.com/artifact/org.apache.arrow.adbc/adbc-driver-jdbc/dependencies All told it adds ~3MB of dependency jars.
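For concreteness, pulling the ADBC JDBC driver in through Maven would look something like this (the version number here is illustrative; check Maven Central for the current release):

```xml
<dependency>
  <groupId>org.apache.arrow.adbc</groupId>
  <artifactId>adbc-driver-jdbc</artifactId>
  <version>0.12.0</version>
</dependency>
```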
-
@srstsavage @ChrisJohnNOAA @boshek I don't think it is correct that ADBC Java is just a JDBC wrapper. LLMs can be wrong, but see below my query to Claude. An advantage of ADBC is that it is column oriented while JDBC is row oriented, so it is much more efficient for applications and formats that are also column oriented.

My question: "Does the ADBC (Arrow) Java library just build on top of JDBC, or does it only use that if the target doesn't support Arrow?"

Claude's answer: The ADBC Java library is not built on top of JDBC. The JDBC adapter is a separate, explicitly chosen driver for legacy databases, not a fallback that kicks in automatically. Here's the architecture:

- `adbc-driver-flight-sql`: for databases supporting the Flight SQL wire protocol (Dremio, etc.)
- Database has a native ADBC driver (Snowflake, DuckDB, BigQuery): pure Arrow, no JDBC involved.

The JDBC adapter exists to maximize database coverage, not as a performance shortcut.
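For anyone curious what ADBC Java usage looks like, here is a sketch loosely based on the Arrow ADBC Java quickstart. The JDBC-backed driver is used here only because it is the easiest to demo; the connection URL is hypothetical, and the exact class names and builder calls should be checked against the current quickstart:

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.arrow.adbc.core.AdbcConnection;
import org.apache.arrow.adbc.core.AdbcDatabase;
import org.apache.arrow.adbc.core.AdbcDriver;
import org.apache.arrow.adbc.core.AdbcStatement;
import org.apache.arrow.adbc.driver.jdbc.JdbcDriver;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.ipc.ArrowReader;

public class AdbcSketch {
  public static void main(String[] args) throws Exception {
    Map<String, Object> params = new HashMap<>();
    AdbcDriver.PARAM_URI.set(params, "jdbc:duckdb:"); // hypothetical target

    try (BufferAllocator allocator = new RootAllocator();
        AdbcDatabase db = new JdbcDriver(allocator).open(params);
        AdbcConnection conn = db.connect();
        AdbcStatement stmt = conn.createStatement()) {
      stmt.setSqlQuery("SELECT 42 AS answer");
      try (AdbcStatement.QueryResult result = stmt.executeQuery();
          ArrowReader reader = result.getReader()) {
        // Results arrive as Arrow record batches (columnar), not row-by-row.
        while (reader.loadNextBatch()) {
          System.out.println(reader.getVectorSchemaRoot().contentToTSVString());
        }
      }
    }
  }
}
```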
-
@srstsavage @ChrisJohnNOAA @boshek Also, we may already be able to do this. I have written up how to use DuckDB from ERDDAP™ using JDBC. But that just sends the SQL query; DuckDB does its own thing (and my timings on this were very good). Again from Claude:

My question: "If I access a parquet file using DuckDB, but access DuckDB using JDBC, does DuckDB just use the JDBC to get the SQL query, and then use its more efficient access libraries to read the parquet file?"

Claude's answer: Yes, exactly right, and this is an important and commonly misunderstood point. The JDBC layer is just the query delivery mechanism; DuckDB's execution engine operates entirely independently of it. JDBC receives your SQL string and passes it to DuckDB's query parser/planner; that's all JDBC does on the input side. So the overhead of using JDBC with DuckDB is negligible on the read side: DuckDB reads the Parquet file exactly as efficiently as if you called it natively. This is why the ADBC DuckDB driver (which keeps results in Arrow columnar format end-to-end) is faster than the JDBC driver for large result sets, but both will read the Parquet file with equal efficiency. The bottleneck with JDBC is getting the data out of DuckDB, not getting data into DuckDB's query engine.
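A sketch of that pattern (the bucket path and column are made up; this assumes `org.duckdb:duckdb_jdbc` on the classpath and DuckDB's `httpfs` extension available for `s3://` access):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DuckDbJdbcSketch {
  public static void main(String[] args) throws Exception {
    // "jdbc:duckdb:" opens an in-memory DuckDB instance.
    try (Connection conn = DriverManager.getConnection("jdbc:duckdb:");
        Statement st = conn.createStatement()) {
      // JDBC only delivers this SQL string; DuckDB's own engine performs
      // the Parquet scan, pushing the filter and projection down to S3.
      try (ResultSet rs = st.executeQuery(
          "SELECT count(*) FROM read_parquet('s3://my-bucket/obs/*.parquet') "
              + "WHERE station_id = 42")) {
        while (rs.next()) {
          System.out.println(rs.getLong(1));
        }
      }
    }
  }
}
```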
-
@srstsavage @ChrisJohnNOAA @boshek Also: can DuckDB itself be accessed through ADBC? Yes, and this actually closes the loop on the whole conversation nicely; from the earlier search results we already have the answer. So the most useful thing, in my opinion, is to have an ADBC dataset type in ERDDAP™ and then, to do what Sam requested, use that to hook into DuckDB, as I have done with JDBC. That should provide the fastest solution while still maintaining the ERDDAP™ structure.
-
@srstsavage @ChrisJohnNOAA @boshek and as my coffee kicks in, my writeup on using ERDDAP™ with duckDB through JDBC: https://erddap.github.io/docs/server-admin/admin-tips/duckdb |
-
Also, some useful tools if we want to add an ADBC dataset type: ADBC quickstarts - https://github.com/columnar-tech/adbc-quickstarts/tree/main/java I have been following ADBC and Arrow Flight for a while and was thinking of suggesting adding them, but was waiting for more adoption and stable tooling; that appears to have arrived.
-
Great discussion! One question of @ChrisJohnNOAA's that hasn't been answered yet is:

> Is there a full list of the cloud providers and data formats it supports?

From the duckdb docs:

And for data sources, duckdb supports lots, but I think some of the most exciting here would be Parquet and Iceberg. So the scope of what this would unlock is broad.
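As an illustration of the kind of access this would unlock (paths and partition columns are made up; `s3://` access assumes the `httpfs` extension, and Iceberg support assumes the `iceberg` extension):

```sql
-- Hive-partitioned Parquet on S3: partition columns become queryable,
-- and partition pruning plus predicate pushdown limit what is fetched.
INSTALL httpfs; LOAD httpfs;
SELECT station_id, avg(temp)
FROM read_parquet('s3://my-bucket/obs/year=*/month=*/*.parquet',
                  hive_partitioning = true)
WHERE year = 2024
GROUP BY station_id;

-- Iceberg table via the iceberg extension.
INSTALL iceberg; LOAD iceberg;
SELECT count(*) FROM iceberg_scan('s3://my-bucket/warehouse/obs');
```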
-
It appears that DuckDB now has a standard interface for the small DuckDB database that needs to be constructed to use it with other data stores. Again being lazy, I turned to Claude. I guess I should redo the writeup and test again with more examples. But anything DuckDB can now do, ERDDAP™ can do through JDBC. As I have said, the big difference with ADBC is speed.
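In datasets.xml terms, the JDBC route amounts to an `EDDTableFromDatabase` dataset pointed at a DuckDB stub file; a rough fragment is below (the paths, datasetID, and table name are hypothetical, and Roy's writeup is the authoritative recipe):

```xml
<dataset type="EDDTableFromDatabase" datasetID="myDuckdbData">
  <sourceUrl>jdbc:duckdb:/data/stub.duckdb</sourceUrl>
  <driverName>org.duckdb.DuckDBDriver</driverName>
  <catalogName></catalogName>
  <schemaName>main</schemaName>
  <tableName>obs_view</tableName>
  ...
</dataset>
```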
-
I think the big question here is: is there a benefit to directly integrating DuckDB (rather than using the EDDTableFromDatabase approach Roy demonstrated)? I assume it would simplify the process for admins. It might also allow us to take advantage of additional DuckDB features or gain a performance advantage (for example, through ADBC). That is a good set of cloud providers supported; I had previously only found a page that listed AWS, Azure, and a non-specific mention of other providers.
-
@ChrisJohnNOAA I just checked and you are correct, but it is in memory. I am pretty certain that when I checked a while back it could not directly read foreign formats. One drawback is that an in-memory view doesn't persist across connections. It does persist with the way I approached it (the view information is stored in the duckdb file), or you can use a startup script to recreate the view before making the connection. So, depending on how long it takes to create the view, you may find that creating the stub file gives better performance. The stub file does not have to be on the ERDDAP™ server (I think); it just needs to be on a reachable server that has DuckDB installed - I mean, it has to be installed somewhere. But I think you will get better performance if the DuckDB instance is on the ERDDAP™ server, even if the data files are not, though I am not certain of this.
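The stub-file approach amounts to persisting the view definition in a small DuckDB database file that JDBC then connects to (file and bucket names below are made up):

```sql
-- Run once, e.g. with:  duckdb /data/stub.duckdb
INSTALL httpfs; LOAD httpfs;

-- The view definition is stored inside stub.duckdb, so it survives
-- across connections; the Parquet data itself stays in S3.
CREATE VIEW obs_view AS
SELECT * FROM read_parquet('s3://my-bucket/obs/*.parquet');
```

ERDDAP™ can then connect with `jdbc:duckdb:/data/stub.duckdb` and query `obs_view` like any table.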
-
I've opened #448 to track this idea. There's also an existing issue #434 about updating the cache strategy. While not well defined at this time, I believe it's similar to what @rmendels mentioned above. |
-
#215 delivered
`EDDTableFromParquetFiles`, which is an implementation for reading local or cached Parquet files via `parquet-java`. But if my understanding is correct, this implementation will read Parquet files in full and then filter the result in memory. That work is amazing, but I think it could be extended in this way: Parquet data in S3, queried live with duckdb, without downloading files to the ERDDAP host first. This type of workflow can be so useful and it would be nice to bring it to ERDDAP. As an example, here is duckdb going from 4.36 million rows of data:
and using the pushdown predicates to efficiently query these parquet files stored in s3:
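An illustrative query of this kind, with invented bucket and column names (and assuming the `httpfs` extension for `s3://` access):

```sql
-- Only the columns and row groups needed to answer the query are fetched
-- from S3; DuckDB pushes the filter and projection into the Parquet scan.
INSTALL httpfs; LOAD httpfs;

SELECT time, station_id, temp
FROM read_parquet('s3://my-bucket/obs/*.parquet')
WHERE station_id = 42
  AND time >= TIMESTAMP '2024-01-01';
```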
Indeed #281 took one approach using DuckDB's JDBC driver.
This discussion proposes an alternative path that could potentially avoid JDBC entirely. Full disclosure: some of this was fleshed out with Claude. It is not entirely fleshed out (hence opening this as a discussion), but I am gauging whether any folks have tried this and whether folks would be interested in it:
Proposal:
`EDDTableFromDuckDB` via ADBC

ADBC (Arrow Database Connectivity) is a database connectivity standard built on Apache Arrow. DuckDB has first-class ADBC support, and Java bindings exist.
Critically, ADBC talks to DuckDB's native API rather than the JDBC layer, which means
`read_parquet('s3://...')`, Hive-partitioned datasets, and Iceberg tables are all available without the `.db` file constraint that blocked #281.

The proposed data flow:
User ERDDAP request
Is there appetite for a dataset type that introduces a DuckDB process dependency, or is the preference to keep ERDDAP dependencies minimal?