feat: enable create / drop tags and using tags as version on select queries #198

Draft
hamersaw wants to merge 3 commits into lance-format:main from hamersaw:feature/support-tags

@hamersaw
Collaborator

@hamersaw hamersaw commented Feb 2, 2026

Adding support for tags in various APIs:

Spark SQL

To create a new tag at the specified <version>, or at the latest version if none is provided:
ALTER TABLE <table> CREATE TAG <tag> [VERSION AS OF <version>]

To delete an existing tag
ALTER TABLE <table> DROP TAG <tag>

To query a table using tag as version
SELECT * FROM <table> VERSION AS OF <tag>

Spark API

spark.read()
    .option("version", "<tag>")
    ...
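
Since the same "version" value may now be either a version number or a tag name, the reader has to tell the two apart. The sketch below shows one way that could work; it is not taken from this PR. The function `resolve_version`, the tag mapping, and the rule that an all-digit value is a version number are all assumptions made for illustration.

```python
from typing import Dict


def resolve_version(value: str, tags: Dict[str, int]) -> int:
    """Map a "version" option value to a concrete version number.

    Hypothetical helper: `tags` stands in for whatever tag lookup the
    dataset provides.
    """
    text = value.strip()
    # Assumption: an all-digit value is a version number, not a tag name.
    if text.isascii() and text.isdigit():
        return int(text)
    # Otherwise treat the value as a tag and look up the tagged version.
    if text in tags:
        return tags[text]
    raise KeyError(f"unknown tag: {text!r}")


if __name__ == "__main__":
    example_tags = {"v1.0": 3, "nightly": 7}  # hypothetical tag -> version
    print(resolve_version("5", example_tags))
    print(resolve_version("v1.0", example_tags))
```

One consequence of this rule is that a tag whose name consists only of digits could never be selected, so a real implementation would need to reject such names or pick a different precedence.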

Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>
…' support

Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>
Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>
@fangbo
Collaborator

fangbo commented Feb 3, 2026

Hi @hamersaw, I have a question to discuss.

Lance has a branch feature. If we support querying from a branch, what do you think the Spark SQL grammar should be for specifying a branch and a tag?

If we use:

SELECT * FROM <table> VERSION AS OF <tag>

we cannot specify a branch in the SQL.

@hamersaw
Collaborator Author

hamersaw commented Feb 3, 2026

@fangbo, this is next up on my "random backfill" TODO. I tried to add an ON BRANCH <branch> clause to Spark SQL statements because that would be wildly ergonomic, e.g. SELECT * FROM <table> [ON BRANCH <branch>] [VERSION AS OF <tag>], but there is no clean way to do that without rewriting most of the grammar to be Lance-specific (similar to how ALTER TABLE <table> CREATE TAG is proposed here, built on your COLUMN work).

I looked around a bit and saw that Iceberg supports this by adding a prefixed ID to the table (e.g. SELECT * FROM db.table.branch_foo). Without major grammar updates this seems the least intrusive approach, because it needs to be supported across a TON of statements (e.g. SELECT, ALTER TABLE, etc.). I'm certainly going to open a discussion before submitting a PR on this and am VERY interested in others' thoughts!
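
To make the prefixed-ID idea concrete, here is a minimal sketch of how a catalog might split a multi-part identifier such as db.table.branch_foo into the table path and a branch name. The function name `split_branch` and the `branch_` prefix handling are illustrative assumptions, not the Lance or Iceberg implementation.

```python
from typing import List, Optional, Tuple

BRANCH_PREFIX = "branch_"  # prefix used in the db.table.branch_foo example


def split_branch(identifier: str) -> Tuple[List[str], Optional[str]]:
    """Split 'db.table.branch_foo' into (['db', 'table'], 'foo').

    Returns (parts, None) when the last part carries no branch prefix.
    """
    parts = identifier.split(".")
    last = parts[-1]
    has_branch = (
        len(parts) > 1
        and last.startswith(BRANCH_PREFIX)
        and len(last) > len(BRANCH_PREFIX)
    )
    if has_branch:
        return parts[:-1], last[len(BRANCH_PREFIX):]
    return parts, None


if __name__ == "__main__":
    print(split_branch("db.table.branch_foo"))
    print(split_branch("db.table"))
```

This purely syntactic split is ambiguous when a real table is itself named branch_foo, so an actual catalog would presumably try to load the full identifier as a table first and only then fall back to the branch interpretation.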

@fangbo
Collaborator

fangbo commented Feb 4, 2026

Thanks for your reply. In addition, INSERT, UPDATE, DELETE, and MERGE INTO should also be supported for branches. One of my customers currently uses Lance branches. I think Iceberg's approach of adding a prefixed ID to the table (e.g. SELECT * FROM db.table.branch_foo) is a feasible solution.

@jackye1995
Contributor

jackye1995 commented Feb 11, 2026

I think I agree with @fangbo that we should probably treat tag and branch both using VERSION AS OF, instead of using a new syntax which requires SQL extension for branch.

What about the following:

SELECT * FROM TABLE VERSION AS OF "ref/<branch_name>/<tag_name_or_version_number>"

and branch_name=main means the main branch?

For example:

SELECT * FROM TABLE VERSION AS OF "ref/main/1"

SELECT * FROM TABLE VERSION AS OF "ref/main/v1.0"

SELECT * FROM TABLE VERSION AS OF "ref/staging/10"

SELECT * FROM TABLE VERSION AS OF "ref/staging/v1.0"

One problem is what to do if the branch name has a / in it. My thinking is that the character immediately after ref is used as the delimiter, so we can also write ref$release/v2.0$10.
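
The delimiter rule can be sketched as a small parser. This is only an illustration of the proposal as described in this comment: the names `Ref` and `parse_ref` are hypothetical, and treating an all-digit target as a version number (rather than a tag name) is an added assumption.

```python
from typing import NamedTuple, Union


class Ref(NamedTuple):
    branch: str
    target: Union[int, str]  # version number, or tag name


def parse_ref(text: str) -> Ref:
    """Parse 'ref<d><branch><d><tag_or_version>' where <d> is the
    single character that follows the literal 'ref'."""
    if not text.startswith("ref") or len(text) < 4:
        raise ValueError(f"not a ref expression: {text!r}")
    delim = text[3]
    parts = text[4:].split(delim)
    # Exactly two non-empty fields are expected: branch and target.
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"expected ref{delim}<branch>{delim}<target>: {text!r}")
    branch, target = parts
    # Assumption: an all-digit target is a version number.
    if target.isascii() and target.isdigit():
        return Ref(branch, int(target))
    return Ref(branch, target)


if __name__ == "__main__":
    print(parse_ref("ref/main/1"))
    print(parse_ref("ref$release/v2.0$10"))
```

With "/" as the delimiter, a branch named release/v2.0 produces three fields and is rejected, which is exactly the case the alternative "$" delimiter is meant to handle.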

What do we think? @fangbo @hamersaw

@fangbo
Collaborator

fangbo commented Feb 11, 2026

SELECT * FROM TABLE VERSION AS OF "ref/<branch_name>/<tag_name_or_version_number>"

For a branch in Lance, data can be inserted, updated, and deleted, and the schema can also be changed. One of our customers treats a branch as a new table in Spark and executes Spark DML such as UPDATE, DELETE FROM, and MERGE INTO on this branch.

Although VERSION AS OF "ref/<branch_name>/<tag_name_or_version_number>" is a good expression for SELECT, Spark currently does not support UPDATE, DELETE, or MERGE INTO using this expression. So I think how to express branches in Spark DML is a tricky problem.

@hamersaw
Collaborator Author

I think that's the difficulty. Adding some level of branch support in the VERSION AS OF <version> clause is relatively simple. IMO a better approach would be to figure out first-class branch support across the Lance Spark SQL extension so that this works with everything (e.g. INSERT, UPDATE, etc.).

@jackye1995
Contributor

Spark currently does not support update, delete , merge into using this expression. So I think it is a tricky problem about how to express branches in Spark DML.

The time travel syntax is purely for read, that's expected.

For reference, in Iceberg we took two approaches for DML:

  1. Table name convention: use some way to express the branch in the table name, for example we can do <table_name>__branch/<branch_name>. This also works for SELECT, and ironically it is easier to use than the dedicated AS OF time travel syntax, because no-code applications can integrate with it without changing the underlying SQL statement. For SELECT you can use either the table name convention or the time travel syntax, but not both.
  2. Environment variable: this is more typical for write-audit-publish workflows, where you set an env config or Spark option like WAP_BRANCH=<branch_name>, and reads and writes then automatically switch to that branch.
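
The two approaches above can be combined in one resolution step, sketched below. The helper `resolve_branch`, the use of a plain mapping for the session config, and the rule that an explicit branch in the table name takes precedence over WAP_BRANCH are all assumptions for illustration; the thread does not specify how the two would interact.

```python
from typing import Mapping, Optional, Tuple

BRANCH_MARKER = "__branch/"  # from the <table_name>__branch/<branch_name> example


def resolve_branch(
    table_name: str, conf: Mapping[str, str]
) -> Tuple[str, Optional[str]]:
    """Return (base_table, branch) for a table reference.

    A branch of None means the default (main) branch is used.
    """
    base, marker, branch = table_name.partition(BRANCH_MARKER)
    if marker:
        if not base or not branch:
            raise ValueError(f"malformed branch table name: {table_name!r}")
        # Assumption: an explicit branch in the name wins over the config.
        return base, branch
    # No branch in the name: fall back to the session-level setting, if any.
    return table_name, conf.get("WAP_BRANCH")


if __name__ == "__main__":
    print(resolve_branch("events__branch/staging", {}))
    print(resolve_branch("events", {"WAP_BRANCH": "audit"}))
    print(resolve_branch("events", {}))
```

Because the branch is resolved from the table reference alone, the same step would apply unchanged to SELECT, INSERT, UPDATE, DELETE, and MERGE INTO, which is the main appeal of the table name convention discussed here.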

What do we think? @fangbo @hamersaw

@hamersaw
Collaborator Author

hamersaw commented Feb 11, 2026

I really like this idea. This is what I was trying to propose, but certainly put more eloquently. I think if we support branches integrated in the table identifier, then that covers all of our bases. We could still add to VERSION AS OF for syntactic sugar, but I think it's a bit more ergonomic to centralize branches in the table ID.

@fangbo
Collaborator

fangbo commented Feb 12, 2026

  1. table name convention: use some way to express branch in table name, for example we can do <table_name>__branch/<branch_name>. This also works for SELECT, and it's actually ironically easier to use than the dedicated AS OF time travel syntax, because it is very friendly for no-code applications to integrate with without the need to change underlying SQL statement. You can either use table name convention, or use time travel syntax for SELECT, you cannot specify both.

+1, Good idea. Actually this method treats the branch as a normal table. I think it aligns better with developers' usage habits.

@jackye1995
Contributor

I think it's a bit more ergonomic to centralize branches in the table ID.

I agree. As someone who was part of the group that originally designed the syntax, I think at this point it's a failed experiment. Engines never agreed on the right syntax (FOR SYSTEM_VERSION AS OF vs VERSION AS OF), it's hard to integrate because the caller has to change the SQL syntax, and it was never properly integrated into the write path.

I am good with implementing it directly in the table identifier to support both read and write, and we can implement the syntactic sugar later.


3 participants