Demo implementation of minimal Java & .Net support and C language bindings #2929
Draft
jamesblackburn wants to merge 10 commits into duckdb
Updates CLAUDE.md with benchmarking, code review, and DuckDB guidance. Adds .gitignore entries for build artifacts. Updates Claude docs for C++ pipeline, Arrow, and Python module references.
Adds LazyRecordBatchIterator with async prefetch, Arrow output frame construction, and segment-level filter evaluation. Extends version store API with create_lazy_record_batch_iterator_with_metadata() for streaming reads from storage backends. Includes C++ unit tests and benchmarks.
Adds lib.sql(), lib.explain(), and lib.duckdb() context manager for executing SQL queries on ArcticDB symbols via DuckDB. Includes pushdown optimizer that extracts column projections, WHERE filters, date range filters, and LIMIT from SQL AST and pushes them to ArcticDB's storage engine. Data is streamed to DuckDB via Arrow record batches for memory efficiency. Adds OutputFormat enum, options module, and duckdb optional dependency.
Adds comprehensive test coverage for SQL queries, pushdown optimization, context manager, schema DDL, lazy streaming, and error handling. Includes ASV benchmarks, profiling scripts, SQL tutorial, Jupyter demo notebook, OutputFormat API docs, FAQ update, and internal Claude docs.
uniform_int_distribution<size_t>(0.0, 1.0) truncated its bounds to {0, 1},
producing a 50% trigger rate regardless of the configured probability.
Use uniform_real_distribution<double> to get the correct [0, 1) range.
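The fix described in this commit message can be illustrated with a minimal sketch. The helper names `should_trigger` and `count_triggers` are hypothetical and do not come from the PR; the sketch only shows how a real-valued distribution over [0, 1) yields a trigger rate close to the configured probability.

```cpp
#include <cassert>
#include <cstddef>
#include <random>

// Hypothetical helper (not from the PR): returns true with probability p.
// uniform_real_distribution<double>(0.0, 1.0) draws from [0, 1), so the
// comparison fires at roughly the configured rate instead of a fixed 50%.
inline bool should_trigger(std::mt19937_64& rng, double p) {
    assert(p >= 0.0 && p <= 1.0);
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    return dist(rng) < p;
}

// Counts how many of `trials` draws trigger; handy for checking the rate.
inline std::size_t count_triggers(std::mt19937_64& rng, double p, std::size_t trials) {
    std::size_t hits = 0;
    for (std::size_t i = 0; i < trials; ++i) {
        if (should_trigger(rng, p)) ++hits;
    }
    return hits;
}
```

With `p = 0.1` and a large number of trials, the hit count should land near 10% of the trials; with the integer distribution it would sit near 50% for any non-zero `p`.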
Expose ArcticDB's read path through a stable extern "C" API wrapping LocalVersionedEngine and LazyRecordBatchIterator. The ArrowArrayStream interface enables zero-copy data access from Java, .NET, Excel, and any other language with Arrow FFI support.
New files:
- arcticdb_c.h: Public C API header with opaque handles and visibility macros
- arcticdb_c.cpp: Implementation (LMDB open, write, read stream, list symbols)
- arrow_stream.hpp: ArrowArrayStream wrapper for LazyRecordBatchIterator
- test_c_api_smoke.cpp: Standalone smoke test (4 tests)
- test_c_api_stream_smoke.cpp: GTest suite (6 tests including version-specific reads)
- C_BINDINGS.md: Technical documentation for the C API layer
Java bindings use Panama FFM API (Java 21 preview) with dlopen(RTLD_LAZY) for lazy symbol resolution. .NET bindings use P/Invoke with DllImport and Marshal for struct interop.
Both implementations provide:
- ArcticNative: low-level FFM/P/Invoke bindings matching the C API structs
- ArcticLibrary: high-level AutoCloseable/IDisposable wrapper
- Full ArrowArrayStream consumption (schema + batched array reads)
- Integration tests: open/close, write/list, read stream, versioned reads, missing symbol error handling
CMake: fix arcticdb_c link order (duplicate arcticdb_core_static + AWS SDK for single-pass linker) and link against libpython to resolve Python symbols from static constructors in arcticdb_core_static.
- C_BINDINGS.md: add Java/dotnet binding sections, update architecture diagram, document Python linkage and CMake link order decisions
- ARCHITECTURE.md: add java/, dotnet/, bindings/ to directory structure and architecture diagram, add to testing table
- language_bindings.md: user-facing tutorial with setup, usage examples, and test commands for both Java (Panama FFM) and .NET (P/Invoke)
- mkdocs.yml: add Language Bindings tutorial to nav
Zero-dependency Rust crate using std::ffi for FFI interop with libarcticdb_c.so. Includes safe ArcticLibrary wrapper with Drop-based cleanup and 5 integration tests (open/close, write+list, read stream, versioned read, missing symbol error).
Rust bindings: add read_dataframe() that extracts actual column data from Arrow streams (ColumnData enum, DataFrame struct with serde::Serialize). Gateway server (excel/gateway/): axum-based HTTP server wrapping the Rust bindings with endpoints for library management, symbol listing, data reads, and test data writes. Row-oriented JSON wire format for Excel compatibility. Office.js add-in (excel/addin/): custom functions (ARCTICDB.READ, ARCTICDB.LIST) returning dynamic spilling arrays, task pane for server connection and symbol browsing with click-to-load, and ribbon commands.
Language Bindings: Java (Panama FFM) & .NET (P/Invoke) via C API
This PR makes ArcticDB's read path accessible from Java and .NET through a
stable C shared library (libarcticdb_c.so) and the Arrow C Stream Interface.
What's new
C API layer (cpp/arcticdb/bindings/)
An extern "C" API with opaque handles and fixed-size error structs, designed to
be consumed from any language with FFI support:
LazyRecordBatchIterator)
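The general shape of such an API (opaque handle plus caller-allocated, fixed-size error struct) might look like the sketch below. All names here (`DemoLibrary`, `DemoError`, `demo_library_*`) are hypothetical and are not the functions declared in arcticdb_c.h; the error layout (4-byte code plus 512-byte message) is an assumption chosen only because it is consistent with a 516-byte struct.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <exception>
#include <string>
#include <vector>

extern "C" {

// Fixed-size error struct: the caller allocates it, so no heap ownership
// crosses the FFI boundary. Layout is an assumption for illustration.
typedef struct DemoError {
    int32_t code;       // 0 means success
    char message[512];  // NUL-terminated description
} DemoError;

// Opaque handle: callers only ever hold a pointer to an incomplete type.
typedef struct DemoLibrary DemoLibrary;

int demo_library_open(const char* path, DemoLibrary** out, DemoError* err);
int demo_library_symbol_count(const DemoLibrary* lib, int64_t* out, DemoError* err);
void demo_library_close(DemoLibrary* lib);

}  // extern "C"

// Implementation side (C++), invisible to FFI callers.
struct DemoLibrary {
    std::string path;
    std::vector<std::string> symbols;
};

// Fills the caller's error struct and returns the non-zero code.
static int fail(DemoError* err, int32_t code, const char* msg) {
    assert(code != 0);
    if (err) {
        err->code = code;
        std::snprintf(err->message, sizeof(err->message), "%s", msg);
    }
    return code;
}

extern "C" int demo_library_open(const char* path, DemoLibrary** out, DemoError* err) {
    if (!path || !out) return fail(err, 1, "null argument");
    // Exceptions must not escape an extern "C" function; map them to codes.
    try {
        *out = new DemoLibrary{path, {}};
    } catch (const std::exception& e) {
        return fail(err, 2, e.what());
    }
    if (err) { err->code = 0; err->message[0] = '\0'; }
    return 0;
}

extern "C" int demo_library_symbol_count(const DemoLibrary* lib, int64_t* out, DemoError* err) {
    if (!lib || !out) return fail(err, 1, "null argument");
    *out = static_cast<int64_t>(lib->symbols.size());
    return 0;
}

extern "C" void demo_library_close(DemoLibrary* lib) { delete lib; }
```

Because the handle type is incomplete on the caller's side and the error struct has a fixed size, a binding only needs pointer-sized values and one flat byte layout to call every function.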
The read path replicates
PythonVersionStore::create_lazy_record_batch_iterator_with_metadata() in pure
C++ — no Python at runtime. Data is delivered as Arrow record batches through
the standard get_schema() / get_next() / release() consumption pattern.
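The consumption pattern named above can be sketched in a few lines of C++. The three struct definitions follow the published Arrow C data and C stream interface specifications; `consume_stream` and the toy producer (`make_toy_stream` and the `toy_*` callbacks) are hypothetical helpers written for this illustration and are not part of the PR.

```cpp
#include <cassert>
#include <cstdint>

// Struct definitions as published in the Arrow C data / C stream specs.
struct ArrowSchema {
    const char* format;
    const char* name;
    const char* metadata;
    int64_t flags;
    int64_t n_children;
    ArrowSchema** children;
    ArrowSchema* dictionary;
    void (*release)(ArrowSchema*);
    void* private_data;
};

struct ArrowArray {
    int64_t length;
    int64_t null_count;
    int64_t offset;
    int64_t n_buffers;
    int64_t n_children;
    const void** buffers;
    ArrowArray** children;
    ArrowArray* dictionary;
    void (*release)(ArrowArray*);
    void* private_data;
};

struct ArrowArrayStream {
    int (*get_schema)(ArrowArrayStream*, ArrowSchema* out);
    int (*get_next)(ArrowArrayStream*, ArrowArray* out);
    const char* (*get_last_error)(ArrowArrayStream*);
    void (*release)(ArrowArrayStream*);
    void* private_data;
};

// Drains a stream and returns the total row count, or -1 on error.
// The consumer releases every batch, the schema, and finally the stream.
int64_t consume_stream(ArrowArrayStream* stream) {
    assert(stream && stream->release);
    ArrowSchema schema{};
    if (stream->get_schema(stream, &schema) != 0) {
        stream->release(stream);
        return -1;
    }
    int64_t rows = 0;
    for (;;) {
        ArrowArray batch{};
        if (stream->get_next(stream, &batch) != 0) { rows = -1; break; }
        if (batch.release == nullptr) break;  // released array marks end of stream
        rows += batch.length;
        batch.release(&batch);
    }
    schema.release(&schema);
    stream->release(stream);
    return rows;
}

// Toy producer (hypothetical): yields `remaining` empty struct batches.
struct ToyState { int remaining; int64_t rows_per_batch; };
static const void* kNoBuffers[1] = {nullptr};

static void toy_release_schema(ArrowSchema* s) { s->release = nullptr; }
static void toy_release_array(ArrowArray* a) { a->release = nullptr; }

static int toy_get_schema(ArrowArrayStream*, ArrowSchema* out) {
    *out = ArrowSchema{"+s", "", nullptr, 0, 0, nullptr, nullptr, toy_release_schema, nullptr};
    return 0;
}
static int toy_get_next(ArrowArrayStream* s, ArrowArray* out) {
    auto* st = static_cast<ToyState*>(s->private_data);
    if (st->remaining == 0) { out->release = nullptr; return 0; }
    --st->remaining;
    *out = ArrowArray{st->rows_per_batch, 0, 0, 1, 0, kNoBuffers,
                      nullptr, nullptr, toy_release_array, nullptr};
    return 0;
}
static const char* toy_last_error(ArrowArrayStream*) { return nullptr; }
static void toy_release_stream(ArrowArrayStream* s) {
    delete static_cast<ToyState*>(s->private_data);
    s->release = nullptr;
}

ArrowArrayStream make_toy_stream(int batches, int64_t rows_per_batch) {
    return ArrowArrayStream{toy_get_schema, toy_get_next, toy_last_error,
                            toy_release_stream, new ToyState{batches, rows_per_batch}};
}
```

A real consumer would read column buffers from each batch before releasing it; the loop structure and release order stay the same.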
Java bindings (java/)
Uses Java 21's Panama FFM API (preview) — no JNI, no generated code:
handles via Linker.downcallHandle(), function pointer invocation for Arrow
callbacks
records
unused Python symbols
.NET bindings (dotnet/)
Uses standard P/Invoke with .NET 8:
DllImportResolver for library path
Marshal.GetDelegateForFunctionPointer() for Arrow callbacks
Implementation decisions
Link order fix for libarcticdb_c.so: arcticdb_core_static and AWS SDK .a files
are duplicated on the linker command line to work around the linker's
single-pass static archive resolution. Without this, AWS symbols referenced by
the core library are left unresolved.
Python symbol resolution: arcticdb_core_static contains pybind11 code with
static constructors that reference Python symbols at dlopen time — even though
the C API never calls Python at runtime. We link libarcticdb_c.so against
Python3::Python to satisfy these references. Long-term, separating the core
engine from Python binding code in CMake would eliminate this dependency.
Java lazy loading: Java's System.load() uses RTLD_NOW, which fails on any
unresolved symbols. The Java bindings call dlopen directly via FFM with
RTLD_LAZY, then wrap the handle in a SymbolLookup backed by dlsym. This
cleanly avoids the Python symbol issue without requiring LD_PRELOAD hacks.
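The underlying mechanism is the POSIX `dlopen`/`dlsym` pair. The sketch below is a C++ analogue of that approach rather than the Java code from the PR: `LazyLibrary` is a hypothetical wrapper that opens a library with `RTLD_LAZY`, which defers function symbol binding until first call, and resolves entry points by name. On older glibc versions it may need to be linked with `-ldl`.

```cpp
#include <dlfcn.h>

#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical RAII wrapper over dlopen/dlsym (not from the PR).
class LazyLibrary {
public:
    // A null path opens the running program itself (POSIX behaviour).
    explicit LazyLibrary(const char* path)
        : handle_(dlopen(path, RTLD_LAZY | RTLD_LOCAL)) {
        if (!handle_) {
            const char* e = dlerror();
            error_ = e ? e : "dlopen failed";
        }
    }
    ~LazyLibrary() {
        if (handle_) dlclose(handle_);
    }
    LazyLibrary(const LazyLibrary&) = delete;
    LazyLibrary& operator=(const LazyLibrary&) = delete;

    bool ok() const { return handle_ != nullptr; }
    const std::string& error() const { return error_; }

    // Looks up a function by name; returns nullptr if it is not found.
    template <typename Fn>
    Fn lookup(const char* name) const {
        assert(handle_);
        return reinterpret_cast<Fn>(dlsym(handle_, name));
    }

private:
    void* handle_;
    std::string error_;
};
```

For example, `LazyLibrary self(nullptr); auto fn = self.lookup<std::size_t (*)(const char*)>("strlen");` resolves `strlen` from the already-loaded C library, and a load failure surfaces as a message in `error()` instead of an exception.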
Arrow struct layouts: Hand-defined to match the x86_64 Linux ABI — ArcticError
(516 bytes), ArcticArrowArrayStream (40 bytes, 5 pointers), ArrowSchema (72
bytes), ArrowArray (80 bytes). Both Java and .NET define these identically.
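These sizes can be checked at compile time. In the sketch below the Arrow structs follow the published Arrow C data and C stream interface definitions, while `ArcticErrorSketch` is an assumed layout (a 4-byte code followed by a 512-byte message) chosen because it adds up to 516 bytes; the real ArcticError definition lives in arcticdb_c.h and may differ. `binding_sizes_match` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct ArrowSchema {
    const char* format;
    const char* name;
    const char* metadata;
    int64_t flags;
    int64_t n_children;
    ArrowSchema** children;
    ArrowSchema* dictionary;
    void (*release)(ArrowSchema*);
    void* private_data;
};

struct ArrowArray {
    int64_t length;
    int64_t null_count;
    int64_t offset;
    int64_t n_buffers;
    int64_t n_children;
    const void** buffers;
    ArrowArray** children;
    ArrowArray* dictionary;
    void (*release)(ArrowArray*);
    void* private_data;
};

struct ArrowArrayStream {
    int (*get_schema)(ArrowArrayStream*, ArrowSchema*);
    int (*get_next)(ArrowArrayStream*, ArrowArray*);
    const char* (*get_last_error)(ArrowArrayStream*);
    void (*release)(ArrowArrayStream*);
    void* private_data;
};

// Assumed layout for illustration only: 4 + 512 = 516 bytes.
struct ArcticErrorSketch {
    int32_t code;
    char message[512];
};

static_assert(sizeof(void*) == 8, "sizes below assume a 64-bit target");
static_assert(sizeof(ArcticErrorSketch) == 516, "4-byte code + 512-byte message");
static_assert(sizeof(ArrowArrayStream) == 40, "5 pointers");
static_assert(sizeof(ArrowSchema) == 72, "3 pointers, 2 int64, 4 pointers");
static_assert(sizeof(ArrowArray) == 80, "5 int64, 5 pointers");
static_assert(offsetof(ArrowSchema, release) == 56, "release callback offset");
static_assert(offsetof(ArrowArray, release) == 64, "release callback offset");

// Runtime cross-check against sizes hard-coded in a binding (hypothetical use).
inline bool binding_sizes_match(std::size_t err, std::size_t stream,
                                std::size_t schema, std::size_t array) {
    assert(err > 0 && stream > 0 && schema > 0 && array > 0);
    return err == sizeof(ArcticErrorSketch) && stream == sizeof(ArrowArrayStream) &&
           schema == sizeof(ArrowSchema) && array == sizeof(ArrowArray);
}
```

Keeping assertions like these next to the C header would make a layout change fail the build before it silently breaks hand-defined offsets in the Java or .NET bindings.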
Test coverage
10 integration tests (5 per language), all passing:
Documentation
consumption pattern, Java/dotnet binding details, design decisions
examples, and test commands for both languages
tables
Further work
arctic_library_open_azure() to the C API
Arrow arrays)
summary ReadResult)
native library
dependency by splitting arcticdb_core_object into Python-free and
Python-dependent targets