[Cherry-pick] Do not split commits that contain DVs in CDF streaming #1904

larsk-db · 2023-07-12T07:08:54Z

Cherry-pick of 9f70ee5 for branch-2.4.

When the add and remove file entries for a commit that contains deletion vectors (DVs) get split into different batches, we don't recognise that they belong together and that we need to produce CDF by calculating the diff of the two DVs.
This PR prevents splitting of these commits, just like we already do for AddCDCFile actions to prevent splitting the update_{pre/post}_image information.

GitOrigin-RevId: 11438c01ecc69f7c55c3a8826fa540f6b984e4c4
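
For readers unfamiliar with the idea, here is a minimal sketch of "admit a commit's actions all together or not at all". It is not the DeltaSource code: `Action`, `ByteBudget`, and `admitAll` are hypothetical names, and the rule "admit while any budget is left, then charge the whole commit" is an assumption chosen so that an oversized commit still makes progress.

```scala
object AdmissionSketch {
  // Hypothetical stand-in for a file action; only the size matters here.
  final case class Action(path: String, sizeInBytes: Long)

  // Hypothetical per-batch byte budget (an assumption, not Delta's class).
  final class ByteBudget(private var bytesLeft: Long) {
    // Admits every action of a commit together, or none of them, so the
    // add/remove pair of a DV commit can never land in different batches.
    def admitAll(commitActions: Seq[Action]): Boolean = {
      if (commitActions.isEmpty) {
        true
      } else if (bytesLeft > 0) {
        // Charge the whole commit at once; the budget may go negative.
        bytesLeft -= commitActions.foldLeft(0L)((total, a) => total + a.sizeInBytes)
        true
      } else {
        false
      }
    }

    def remaining: Long = bytesLeft
  }
}
```

With a budget of 100 bytes, a commit holding two 60-byte actions is admitted as a unit and leaves the budget at -20; the next commit is then deferred to a later batch instead of being partially admitted.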


jaceklaskowski · 2023-07-12T11:05:58Z

core/src/main/scala/org/apache/spark/sql/delta/sources/DeltaSource.scala

@@ -1011,12 +1011,12 @@ case class DeltaSource(
    var commitProcessedInBatch = false

    /**
-     * This overloaded method checks if all the AddCDCFiles for a commit can be accommodated by
+     * This overloaded method checks if all the FileActions for a commit can be accommodated by


nit: Remove "This overloaded method"

Move the description to be @return and describe what it does (returns).

jaceklaskowski · 2023-07-12T11:08:06Z

core/src/main/scala/org/apache/spark/sql/delta/sources/DeltaSource.scala

-        actions.foldLeft(0L) { (l, r) => l + r.size }
+    def admit(fileActions: Seq[FileAction]): Boolean = {
+      def getSize(actions: Seq[FileAction]): Long = {
+        actions.foldLeft(0L) { (l, r) => l + r.getFileSize }


nit: Replace l with totalFileSize (or at least totalFileSizeSoFar 😉)
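
A sketch of what the suggested rename could look like, assuming a hypothetical `SizedAction` stand-in that exposes `getFileSize` like the actions in the diff. The `sum`-based variant is only an alternative for comparison, not something proposed in this PR.

```scala
object FoldNamingSketch {
  // Hypothetical stand-in for a FileAction; only getFileSize is modelled.
  final case class SizedAction(getFileSize: Long)

  // Same fold as in the diff, with a descriptive accumulator name.
  def getSize(actions: Seq[SizedAction]): Long =
    actions.foldLeft(0L) { (totalFileSize, action) =>
      totalFileSize + action.getFileSize
    }

  // Equivalent alternative that avoids naming the accumulator at all.
  def getSizeViaSum(actions: Seq[SizedAction]): Long =
    actions.map(_.getFileSize).sum
}
```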

jaceklaskowski · 2023-07-12T11:13:56Z

core/src/main/scala/org/apache/spark/sql/delta/sources/DeltaSourceCDCSupport.scala

+              hasAddsOrRemoves(indexedFile) || hasNoFileActionAndStartIndex(indexedFile)
+            }
+            .filter(isValidIndexedFile(_, fromVersion, fromIndex, endOffset))
+          val filteredFileActions = filteredFiles.flatMap(f => Option(f.getFileAction))


This flatMap(... => Option(...)) construct makes it hard(er) to get the gist of the change. Why not simply filter(_.getFileAction != null)?

Looks clear(er) and allocates fewer objects on the JVM.
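
One thing worth keeping in mind when comparing the two forms: they do not produce the same type. The sketch below uses hypothetical stand-ins (`Action`, `Indexed`) rather than the real Delta classes, and only assumes that `getFileAction` may return null, as the diff implies.

```scala
object NullFilteringSketch {
  final case class Action(path: String)
  // Hypothetical stand-in for an indexed file; getFileAction may be null.
  final case class Indexed(index: Long, getFileAction: Action)

  val files: Seq[Indexed] =
    Seq(Indexed(0L, Action("a")), Indexed(1L, null), Indexed(2L, Action("b")))

  // As in the diff: drops nulls and unwraps to the action in one step.
  val viaFlatMap: Seq[Action] = files.flatMap(f => Option(f.getFileAction))

  // filter alone drops nulls but keeps the wrapper type ...
  val viaFilter: Seq[Indexed] = files.filter(_.getFileAction != null)
  // ... so an extra map is needed to get the same result as flatMap.
  val viaFilterMap: Seq[Action] = viaFilter.map(_.getFileAction)
}
```

So `filter(_.getFileAction != null)` would need a following `map(_.getFileAction)` to be a drop-in replacement for a value that is expected to hold the actions themselves.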

jaceklaskowski · 2023-07-12T11:17:36Z

core/src/test/scala/org/apache/spark/sql/delta/DeltaCDCStreamSuite.scala

+  test("double delete-only on the same file") {
+    withTempDir { tableDir =>
+      val tablePath = tableDir.toString
+      spark.range(start = 0L, end = 10L, step = 1L, numPartitions = 1).toDF("id")


Why toDF("id")?

Just making the name of the column explicit, rather than having to know what spark.range happens to name its column.

larsk-db · 2023-07-12T12:21:42Z

@jaceklaskowski I appreciate the comments, but this is a cherry-pick from master. I don't feel like anything less than a correctness issue justifies diverging the branches, since that'll just make future backports harder.

jaceklaskowski reviewed Jul 12, 2023


vkorukanti approved these changes Aug 22, 2023


vkorukanti merged commit 9ef19f1 into delta-io:branch-2.4 Aug 22, 2023
