Summary
Adds `LanceNamespaceSparkSessionCatalog`, a `CatalogExtension` implementation that replaces Spark's built-in `spark_catalog`, allowing Lance tables to coexist with Hive/Parquet tables in the default catalog without requiring a separate named catalog.

This follows the same pattern as Iceberg's `SparkSessionCatalog`, with composition-based routing between an internal Lance catalog and the delegate session catalog.

Usage:
Key design decisions
- `default-provider` config (`delegate`/`lance`/`error`, default `delegate`): controls routing when the v2 catalog's `createTable` is called without a provider (e.g., programmatic catalog access). Unlike Iceberg, which defaults to Iceberg, we default to `delegate` to preserve backward compatibility. Note: `CREATE TABLE` via SQL without `USING` may go through Spark's v1 Hive path before reaching the v2 catalog; see Known limitations.
- `loadTable` optimization: uses a `tableExists` pre-check instead of Iceberg's exception-driven routing, avoiding exception overhead for non-Lance table loads. A TOCTOU safety catch ensures correctness under concurrent modification.
- `dropTable` logs WARN and `purgeTable` logs ERROR when a table exists in both the Lance and delegate catalogs. Iceberg has no such detection.
- The internal Lance catalog is a `LanceNamespaceSparkCatalog` built via `buildLanceCatalog()` rather than by extending `BaseLanceNamespaceSparkCatalog`, matching Iceberg's approach.

Changes
- `BaseLanceNamespaceSparkSessionCatalog.java`: `RollbackStagedTable`
- `LanceNamespaceSparkSessionCatalog.java` (3.4, 3.5)
- `BaseTestSparkSessionCatalog.java`
- `TestSparkSessionCatalog.java` (3.4, 3.5)
- `docs/src/config.md`: `default-provider` docs

Test plan
- `make lint` clean (Checkstyle + Spotless)
- `default-provider`/`drop-behavior` → `IllegalArgumentException`

Known limitations
Confirmed not applicable:
- No equivalent of the `parquet-enabled`/`avro-enabled`/`orc-enabled` flags: `USING parquet` always routes to the delegate. Lance has no file-format wrapping capability, so this feature does not apply.

Needs investigation:
- `default-provider` config may not take effect via SQL. In testing, `CREATE TABLE t(id INT)` without `USING` goes through Spark's v1 Hive code path (`LazySimpleSerDe`) before reaching the v2 catalog's `createTable`. The `default-provider` routing works via the catalog API directly. Further investigation is needed to determine whether Spark configuration (e.g., `spark.sql.catalogImplementation`) or other settings can route no-`USING` DDL through the v2 path.
- `lance_fragment_or_rand` is registered via `FunctionCatalog` and works through the catalog API (e.g., Lance's internal write distribution), but `SELECT lance_fragment_or_rand(0) FROM t` fails with `UNRESOLVED_ROUTINE`. This may be a namespace configuration issue rather than a fundamental Spark limitation, since Iceberg's catalog functions (e.g., `system.rollback_to_snapshot`) are resolvable via SQL. Needs further investigation.
- Changing a table's provider via `ALTER TABLE` is not supported. Routing is based on table existence, so altering a Parquet table into Lance (or vice versa) is not handled. It may be possible to detect a provider change in the `TableChange[]` arguments, but this needs investigation. Current workaround: use `CREATE TABLE ... AS SELECT`.

Not implemented (follow-up):
- `listTables` returns a delegate-only listing, so Lance-only tables may not appear in `SHOW TABLES`. This could be addressed by merging the Lance and delegate listings with deduplication. Current behavior matches Iceberg's session catalog.
- `stageReplace`/`stageCreateOrReplace` have a TOCTOU window without Iceberg's recursive retry. Low priority since concurrent table replacement is uncommon.

Closes #222
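
For reviewers, the routing described under Key design decisions can be modeled without Spark on the classpath. The sketch below is not the code in this PR: `RoutingSketch`, `FakeCatalog`, and `NoSuchTable` are hypothetical stand-ins. Two behaviors are assumptions made for illustration: falling back to the delegate when a table vanishes between the `tableExists` check and the load, and throwing `IllegalArgumentException` in `error` mode.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical, Spark-free model of the session catalog routing. */
final class RoutingSketch {

  /** Mirrors the documented default-provider values: delegate, lance, error. */
  enum DefaultProvider {
    DELEGATE, LANCE, ERROR;

    static DefaultProvider parse(String raw) {
      if (raw == null) {
        return DELEGATE; // documented default
      }
      try {
        return valueOf(raw.trim().toUpperCase(Locale.ROOT));
      } catch (IllegalArgumentException e) {
        throw new IllegalArgumentException("Invalid default-provider: " + raw, e);
      }
    }
  }

  /** Thrown by the stand-in catalogs when a table is missing. */
  static final class NoSuchTable extends RuntimeException {
    NoSuchTable(String name) {
      super("Table not found: " + name);
    }
  }

  /** Stand-in for a catalog; it only tracks table names. */
  static final class FakeCatalog {
    final String label;
    final Set<String> tables = new HashSet<>();

    FakeCatalog(String label) {
      this.label = label;
    }

    boolean tableExists(String name) {
      return tables.contains(name);
    }

    String loadTable(String name) {
      if (!tables.contains(name)) {
        throw new NoSuchTable(name);
      }
      return label + ":" + name;
    }
  }

  final FakeCatalog lance = new FakeCatalog("lance");
  final FakeCatalog delegate = new FakeCatalog("delegate");

  /** Pre-checks with tableExists so non-Lance loads never pay for an exception. */
  String loadTable(String name) {
    if (lance.tableExists(name)) {
      try {
        return lance.loadTable(name);
      } catch (NoSuchTable e) {
        // Assumed TOCTOU handling: the table was dropped after the check,
        // so fall through to the delegate instead of failing here.
      }
    }
    return delegate.loadTable(name);
  }

  /** Picks the catalog that should handle createTable for a given provider. */
  String createTarget(String provider, DefaultProvider fallback) {
    if (provider != null) {
      return "lance".equalsIgnoreCase(provider) ? "lance" : "delegate";
    }
    switch (fallback) {
      case LANCE:
        return "lance";
      case DELEGATE:
        return "delegate";
      default:
        // Assumed exception type for the error mode.
        throw new IllegalArgumentException("createTable requires an explicit provider");
    }
  }
}
```

In this model, a name registered only in `delegate` never touches the Lance load path, which is the saving the `tableExists` pre-check is meant to provide.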