
feat: support session catalog#318

Open
LuciferYang wants to merge 2 commits into lance-format:main from LuciferYang:feat-222

Conversation

@LuciferYang
Contributor

Summary

Adds LanceNamespaceSparkSessionCatalog — a CatalogExtension implementation that replaces Spark's built-in spark_catalog, allowing Lance tables to coexist with Hive/Parquet tables in the default catalog without requiring a separate named catalog.

This follows the same pattern as Iceberg's SparkSessionCatalog, with composition-based routing between an internal Lance catalog and the delegate session catalog.

Usage:

spark.sql.catalog.spark_catalog = org.lance.spark.LanceNamespaceSparkSessionCatalog
spark.sql.catalog.spark_catalog.impl = dir
spark.sql.catalog.spark_catalog.root = /path/to/lance/database
spark.sql.extensions = org.lance.spark.extensions.LanceSparkSessionExtensions
-- Lance tables
CREATE TABLE default.my_table (id INT, name STRING) USING lance;

-- Non-Lance tables still work
CREATE TABLE default.parquet_table (id INT) USING parquet;

-- SQL extensions (OPTIMIZE, VACUUM, CREATE INDEX) work through the session catalog
OPTIMIZE default.my_table WITH (target_rows_per_fragment=20000);

Key design decisions

  • default-provider config (delegate / lance / error, default delegate): controls routing when the v2 catalog's createTable is called without a provider (e.g., programmatic catalog access). Unlike Iceberg which defaults to Iceberg, we default to delegate to preserve backward compatibility. Note: CREATE TABLE via SQL without USING may go through Spark's v1 Hive path before reaching the v2 catalog — see Known Limitations.
  • loadTable optimization: uses tableExists pre-check instead of Iceberg's exception-driven routing, avoiding exception overhead for non-Lance table loads. A TOCTOU safety catch ensures correctness under concurrent modification.
  • Dual-existence logging: dropTable logs WARN, purgeTable logs ERROR when a table exists in both Lance and delegate catalogs — Iceberg has no such detection.
  • Composition over inheritance: wraps an internal LanceNamespaceSparkCatalog via buildLanceCatalog() rather than extending BaseLanceNamespaceSparkCatalog, matching Iceberg's approach.
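
To make the routing concrete, here is a minimal, self-contained Java sketch (not the actual implementation) of the two decisions above: the tableExists pre-check with a TOCTOU fallback in loadTable, and default-provider routing in createTable. The catalogs are simulated as in-memory sets; all names and return values are illustrative.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the session-catalog routing described above.
// Real code would delegate to a Lance catalog and the Spark session catalog;
// here both are simulated as sets of table names.
public class SessionCatalogRoutingSketch {
    enum DefaultProvider { DELEGATE, LANCE, ERROR }

    final Set<String> lanceTables = new HashSet<>();
    final Set<String> delegateTables = new HashSet<>();
    DefaultProvider defaultProvider = DefaultProvider.DELEGATE;

    // loadTable: tableExists pre-check avoids exception-driven routing; the
    // catch covers the window where the table is dropped between the check
    // and the load (TOCTOU), falling through to the delegate.
    String loadTable(String name) {
        if (lanceTables.contains(name)) {
            try {
                return "lance:" + name; // would call lanceCatalog.loadTable(...)
            } catch (RuntimeException e) {
                // TOCTOU safety: fall through to the delegate
            }
        }
        if (delegateTables.contains(name)) {
            return "delegate:" + name;
        }
        throw new IllegalArgumentException("no such table: " + name);
    }

    // createTable without an explicit provider routes by default-provider
    // (delegate by default, preserving backward compatibility).
    String createTable(String name, String provider) {
        if ("lance".equalsIgnoreCase(provider)) { lanceTables.add(name); return "lance"; }
        if (provider != null) { delegateTables.add(name); return "delegate"; }
        switch (defaultProvider) {
            case LANCE:    lanceTables.add(name);    return "lance";
            case DELEGATE: delegateTables.add(name); return "delegate";
            default: throw new IllegalArgumentException("no provider and default-provider=error");
        }
    }

    public static void main(String[] args) {
        SessionCatalogRoutingSketch c = new SessionCatalogRoutingSketch();
        System.out.println(c.createTable("t1", "lance"));   // lance
        System.out.println(c.createTable("t2", "parquet")); // delegate
        System.out.println(c.createTable("t3", null));      // delegate (default)
        System.out.println(c.loadTable("t1"));              // lance:t1
        System.out.println(c.loadTable("t2"));              // delegate:t2
    }
}
```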

Changes

| File | Description |
| --- | --- |
| BaseLanceNamespaceSparkSessionCatalog.java | Base class: routing, delegation, config validation, RollbackStagedTable |
| LanceNamespaceSparkSessionCatalog.java (3.4, 3.5) | Thin version-specific subclasses |
| BaseTestSparkSessionCatalog.java | 14 integration tests |
| TestSparkSessionCatalog.java (3.4, 3.5) | Concrete test classes |
| docs/src/config.md | Session catalog setup, default-provider docs |

Test plan

  • 14 tests pass on Spark 3.5 (305 total, 0 failures)
  • 14 tests pass on Spark 3.4 (256 total, 0 failures)
  • Spark 4.0 cross-compilation clean
  • make lint clean (Checkstyle + Spotless)
  • SQL extension tests: OPTIMIZE, VACUUM, CREATE INDEX through session catalog
  • Non-Lance SQL extension graceful failure (OPTIMIZE on parquet → clear error)
  • Config validation: invalid default-provider / drop-behavior → IllegalArgumentException
  • Lance + Parquet coexistence: both table types queryable in same catalog

Known limitations

Confirmed not applicable:

  • No format interception — unlike Iceberg's parquet-enabled/avro-enabled/orc-enabled flags, USING parquet always routes to the delegate. Lance has no file-format wrapping capability, so this feature does not apply.

Needs investigation:

  • default-provider config may not take effect via SQL — in testing, CREATE TABLE t(id INT) without USING goes through Spark's v1 Hive code path (LazySimpleSerDe) before reaching the v2 catalog's createTable. The default-provider routing works via the catalog API directly. Further investigation is needed to determine whether Spark configuration (e.g., spark.sql.catalogImplementation) or other settings can route no-USING DDL through the v2 path.
  • Lance functions not resolvable via SQL in session catalog — lance_fragment_or_rand is registered via FunctionCatalog and works through the catalog API (e.g., Lance's internal write distribution), but SELECT lance_fragment_or_rand(0) FROM t fails with UNRESOLVED_ROUTINE. This may be a namespace configuration issue rather than a fundamental Spark limitation — Iceberg's catalog functions (e.g., system.rollback_to_snapshot) are resolvable via SQL. Needs further investigation.
  • Cross-provider ALTER TABLE not supported — routing is based on table existence, so altering a Parquet table into Lance (or vice versa) is not handled. It may be possible to detect provider-change in TableChange[] arguments, but this needs investigation. Current workaround: use CREATE TABLE ... AS SELECT.
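
The CTAS workaround mentioned above can be sketched as follows (table names are illustrative, not from the PR's tests):

-- Migrate a Parquet table to Lance via CTAS, since cross-provider
-- ALTER TABLE is not handled by the existence-based routing:
CREATE TABLE default.my_table_lance USING lance
AS SELECT * FROM default.parquet_table;
-- Once verified, drop the original:
-- DROP TABLE default.parquet_table;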

Not implemented (follow-up):

  • listTables returns delegate-only listing — Lance-only tables may not appear in SHOW TABLES. Could be addressed by merging Lance + delegate listings with deduplication. Current behavior matches Iceberg's session catalog.
  • No recursive retry in staging replace — stageReplace/stageCreateOrReplace have a TOCTOU window without Iceberg's recursive retry. Low priority since concurrent table replacement is uncommon.
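
The proposed merged-listing fix could look like this hypothetical Java sketch: combine the delegate and Lance listings and deduplicate by name, so Lance-only tables appear in SHOW TABLES. This is a sketch of the follow-up idea, not code from the PR.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical sketch: merge Lance and delegate table listings with
// deduplication, preserving delegate ordering first.
public class MergedListTablesSketch {
    static List<String> listTables(List<String> lance, List<String> delegate) {
        LinkedHashSet<String> merged = new LinkedHashSet<>(delegate); // delegate first
        merged.addAll(lance);                                         // dedup by name
        return new ArrayList<>(merged);
    }

    public static void main(String[] args) {
        List<String> merged = listTables(
            List.of("lance_only", "shared"),
            List.of("parquet_only", "shared"));
        System.out.println(merged); // [parquet_only, shared, lance_only]
    }
}
```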

Closes #222

@github-actions github-actions bot added the enhancement New feature or request label Mar 20, 2026
@LuciferYang LuciferYang changed the title feat: support session catalog feat: support session catalog (#222) Mar 20, 2026
@LuciferYang LuciferYang changed the title feat: support session catalog (#222) feat: support session catalog Mar 20, 2026