feat: support update/merge-into using RewriteColumns mode #199

Open
fangbo wants to merge 3 commits into lance-format:main from fangbo:update-rewrite-column

@fangbo
Collaborator

@fangbo fangbo commented Feb 3, 2026

This is for #166

  1. Add a new parameter rewrite_columns that makes update/merge-into use RewriteColumns mode. If the parameter is false, RewriteRows mode is used: the rows are deleted and the updated rows are inserted as new rows.
  2. The UpdateColumnsExtractor rule extracts the updated columns from the Spark SQL statement (UPDATE or MERGE INTO). The extracted columns are injected into LancePositionDeltaOperation.
  3. If rewrite_columns is true, LancePositionDeltaOperation.representUpdateAsDeleteAndInsert returns false, so LanceDeltaWriter.update is invoked to update the columns.
  4. In LanceDeltaWriter.update, the new values of the updated columns are collected per Fragment. Fragment.updateColumns is then invoked to update those columns.
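
A minimal sketch of the switch described in items 1 and 3, for illustration only. `RewriteModeSketch` and `describeUpdatePath` are hypothetical stand-ins, not classes or methods from this PR; only the name `representUpdateAsDeleteAndInsert` and its true/false meaning come from the description above.

```java
import java.util.List;

// Hypothetical stand-in for the operation described above; not the lance-spark class.
final class RewriteModeSketch {
    private final boolean rewriteColumns;
    private final List<String> updatedColumns;

    RewriteModeSketch(boolean rewriteColumns, List<String> updatedColumns) {
        this.rewriteColumns = rewriteColumns;
        this.updatedColumns = List.copyOf(updatedColumns);
    }

    // Item 3: returns false in RewriteColumns mode, so an update is not
    // split into a delete plus an insert.
    boolean representUpdateAsDeleteAndInsert() {
        return !rewriteColumns;
    }

    // Describes which write path an update would take in this sketch.
    String describeUpdatePath() {
        return representUpdateAsDeleteAndInsert()
            ? "RewriteRows: delete old rows, insert updated rows"
            : "RewriteColumns: rewrite columns " + updatedColumns;
    }
}
```

With `rewriteColumns` set to true and a single updated column, the sketch reports the column path; with it set to false, it reports the delete-and-insert path.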

@github-actions github-actions bot added the enhancement New feature or request label Feb 3, 2026
@fangbo fangbo changed the title feat: support update using RewriteColumns mode feat: support update/merge into using RewriteColumns mode Feb 4, 2026
@fangbo fangbo changed the title feat: support update/merge into using RewriteColumns mode feat: support update/merge-into using RewriteColumns mode Feb 4, 2026
@fangbo
Collaborator Author

fangbo commented Feb 4, 2026

@jackye1995 @jiaoew1991 @hamersaw Do you think this approach is reasonable?

@fangbo fangbo force-pushed the update-rewrite-column branch 2 times, most recently from 6799b3a to 9e01904 Compare February 6, 2026 02:21
}

} catch {
case _: Exception =>
Contributor

this should not just fail silently

Collaborator Author

Sorry, that was my mistake. I have changed the code to throw a RuntimeException that tells users to set spark.sql.lance.rewrite_columns to false to disable this feature.
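
A rough sketch of the "fail loudly" behavior discussed in this thread, under stated assumptions: `ColumnExtractionSketch`, `Extractor`, and `extractOrFail` are hypothetical names, and the message text is illustrative rather than the wording used in the PR. Only the configuration key is taken from the comment above.

```java
import java.util.List;

// Hypothetical helper; the real rule in the PR may be structured differently.
final class ColumnExtractionSketch {
    static final String REWRITE_COLUMNS_KEY = "spark.sql.lance.rewrite_columns";

    // Stand-in for whatever step can fail while reading the updated columns.
    interface Extractor {
        List<String> extract() throws Exception;
    }

    // Instead of swallowing the exception, rethrow it with a hint on how to
    // disable the feature, and keep the original exception as the cause.
    static List<String> extractOrFail(Extractor extractor) {
        try {
            return extractor.extract();
        } catch (Exception e) {
            throw new RuntimeException(
                "Failed to extract updated columns; set " + REWRITE_COLUMNS_KEY
                    + " to false to disable RewriteColumns mode",
                e);
        }
    }
}
```

Keeping the original exception as the cause preserves the stack trace for debugging while the message points users at the workaround.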

"spark.sql.extensions", "org.lance.spark.extensions.LanceSparkSessionExtensions")
.config("spark.sql.catalog." + catalogName + ".impl", "dir")
.config("spark.sql.catalog." + catalogName + ".root", tempDir.toString())
.config("spark.sql.catalog." + catalogName + ".storage.rewrite_columns", "false")
Contributor

seems strange that we are making this a storage option. I think this should be more like a Spark SQL conf, so we can do something like SET to enable/disable it if necessary.

Collaborator Author

Good suggestion! I have defined the key spark.sql.lance.rewrite_columns as SparkUtil.REWRITE_COLUMNS. The method SparkUtil#rewriteColumns reads the rewrite mode from the Spark session configuration.
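
A minimal sketch of a session-level lookup like the one described here. It uses a plain `Map` as a stand-in for the Spark session configuration so that it runs without Spark; `RewriteColumnsConfSketch` is a hypothetical class, and the default of `false` when the key is unset is an assumption, not something this thread confirms.

```java
import java.util.Map;

// Hypothetical stand-in for SparkUtil; a Map plays the role of the session conf.
final class RewriteColumnsConfSketch {
    static final String REWRITE_COLUMNS = "spark.sql.lance.rewrite_columns";

    // Reads the flag from the (stand-in) session configuration.
    // Assumption: an unset key is treated as "false".
    static boolean rewriteColumns(Map<String, String> sessionConf) {
        return Boolean.parseBoolean(sessionConf.getOrDefault(REWRITE_COLUMNS, "false"));
    }
}
```

With a session-level key, the mode could presumably be toggled per session with a statement such as `SET spark.sql.lance.rewrite_columns=false`, which is the kind of usage the reviewer asked for.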

writer.writeBatch();
writer.end();
} catch (IOException e) {
throw new RuntimeException("Cannot write schema root", e);
Contributor

I don't think you need to rethrow as RuntimeException? The method already declares throws IOException.
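
To illustrate the point, a small sketch with hypothetical names (`PropagateIoSketch`, `writePayload`) and a plain `OutputStream` in place of the Arrow writer from the diff: when the method signature already declares `throws IOException`, the exception can simply propagate and no try/catch wrapper is needed.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical example; the writer in the PR is not an OutputStream.
final class PropagateIoSketch {
    // The signature declares IOException, so a failed write surfaces to the
    // caller as the original checked exception, without wrapping.
    static void writePayload(OutputStream out, String payload) throws IOException {
        out.write(payload.getBytes(StandardCharsets.UTF_8));
        out.flush();
    }
}
```

Callers then decide how to handle the `IOException`, and the original exception type is not hidden behind a `RuntimeException`.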

private final boolean useQueuedWriteBuffer;
private final int queueDepth;
private final int batchSize;
private final boolean rewriteColumns;
Contributor

looks like this is added but never used?

}

public void setUpdatedColumns(List<String> updatedColumns) {
LOG.info("Set updated columns: {}", updatedColumns);
Contributor

this feels like a debug message, or could just be removed?

private final Set<Long> fieldsModified = new HashSet<>();
private final Map<Integer, FragmentMetadata> updatedFragments = new HashMap<>();

private int currentUpdateFragmentId = -1;
Contributor

can we use Optional instead to be clear?
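
One way this could look, as a sketch only: `FragmentCursorSketch` and `moveTo` are hypothetical, and `OptionalInt` is one possible choice for an `int` field (the comment above only says "Optional"). The empty state replaces the `-1` sentinel.

```java
import java.util.OptionalInt;

// Hypothetical holder for the fragment currently being updated.
final class FragmentCursorSketch {
    // Empty means "no fragment yet", replacing the -1 sentinel.
    private OptionalInt currentUpdateFragmentId = OptionalInt.empty();

    // Moves to the given fragment and returns true if this starts a new
    // fragment (no previous fragment, or a different id).
    boolean moveTo(int fragmentId) {
        boolean changed = !currentUpdateFragmentId.isPresent()
            || currentUpdateFragmentId.getAsInt() != fragmentId;
        currentUpdateFragmentId = OptionalInt.of(fragmentId);
        return changed;
    }

    OptionalInt current() {
        return currentUpdateFragmentId;
    }
}
```

The "not set" state is then explicit in the type, so readers do not need to know that `-1` is reserved.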

Contributor

@jackye1995 jackye1995 left a comment

mostly looks good to me, could you also update the related documentation?

@fangbo
Collaborator Author

fangbo commented Feb 11, 2026

> mostly looks good to me, could you also update the related documentation?

@jackye1995 Thanks for your review. I have addressed the comments and added documentation in update.md.


## Column Rewrite Mode

`lance-spark` introduces a column rewrite mode for `UPDATE` and `MERGE INTO` operations, which can significantly improve performance for narrow updates that only affect a few columns.
Contributor

I think you should also update the MERGE doc? Currently the update doc talks about both update and merge.

Collaborator Author

> I think you should also update the MERGE doc? Currently the update doc talks about both update and merge.

Thanks for your advice. I have added a new document, merge-into.md.

@fangbo fangbo force-pushed the update-rewrite-column branch from 4d62dc0 to 4ceb03b Compare February 24, 2026 02:28
@fangbo
Collaborator Author

fangbo commented Feb 25, 2026

@jackye1995 I have rebased this PR and fixed the issues you commented on. Could you please review it again? Thanks a lot.

@fangbo fangbo force-pushed the update-rewrite-column branch 3 times, most recently from 6f9677d to e4fe695 Compare February 26, 2026 02:22
@fangbo fangbo force-pushed the update-rewrite-column branch from a963cae to ececb75 Compare March 18, 2026 03:21
