GC combine to one command and handle deduplications #2196
Conversation
clients/spark/core/src/main/scala/io/treeverse/clients/GarbageCollector.scala
```diff
 Console.err.println(
-  "Usage: ... <repo_name> <runID> s3://storageNamespace/prepared_commits_table s3://storageNamespace/output_destination_table"
+  "Usage: ... <repo_name> <region>"
```
About the region: can we use `withForceGlobalBucketAccessEnabled` to start the client without a region?
That's for the previous SDK version. It seems they haven't solved it yet for the current version:
aws/aws-sdk-java-v2#2229
aws/aws-sdk-java-v2#52
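For reference, here is a minimal sketch of what the v1 SDK offered (this PR's client is on SDK v2, where the linked issues track the missing equivalent, so take this only as an illustration of the v1 API):

```scala
import com.amazonaws.services.s3.{AmazonS3, AmazonS3ClientBuilder}

// SDK v1 only: a client that can reach buckets in any region without being
// told one up front. No v2 equivalent exists yet (see the issues above),
// which is why <region> must be passed as a command-line argument here.
val s3: AmazonS3 = AmazonS3ClientBuilder
  .standard()
  .withForceGlobalBucketAccessEnabled(true)
  .build()
```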
```scala
val addressesDFLocation = args(3)

val region = args(1)
val previousRunID = "" //args(2) // TODO(Guys): get previous runID from arguments or from storage
```
So we're letting go of the previous run feature for now?
No, it will be the first thing we add after the release.
```scala
}

private def subtractDeduplications(expired: Dataset[Row], activeRangesDF: Dataset[Row], conf: APIConfigurations, repo: String, spark: SparkSession): Dataset[Row] = {
  val expiredAddr: Set[String] = expired.select("address").collect().map(_.getString(0)).toSet
```
Suggested change:

```diff
-    val expiredAddr: Set[String] = expired.select("address").collect().map(_.getString(0)).toSet
+    val deleteCandidateAddresses: Set[String] = expired.select("address").collect().map(_.getString(0)).toSet
```
```scala
val ranges: Seq[String] = activeRangesDF.select("range_id").collect().map(_.getString(0)).toSeq.distinct
val rangesRDD = spark.sparkContext.parallelize(ranges)

val activeAddresses = rangesRDD.flatMap(range => {
```
Suggested change:

```diff
-    val activeAddresses =rangesRDD.flatMap(range=> {
+    val activeDeleteCandidateAddresses = activeRangeRDD.flatMap(range => {
```
I am not sure about the code that selects things; IIUC it will not let Spark do its thing and parallelize the run properly :-(
```scala
val location =
  new ApiClient(conf.apiURL, conf.accessKey, conf.secretKey).getRangeURL(repo, rangeID)
SSTableReader
  .forRange(new Configuration(), location)
```
Is this truly the configuration we want here?
```scala
private def subtractDeduplications(expired: Dataset[Row], activeRangesDF: Dataset[Row], conf: APIConfigurations, repo: String, spark: SparkSession): Dataset[Row] = {
  val expiredAddr: Set[String] = expired.select("address").collect().map(_.getString(0)).toSet
  val ranges: Seq[String] = activeRangesDF.select("range_id").collect().map(_.getString(0)).toSeq.distinct
```
Not sure I understand: doesn't `collect` directly move everything to the driver program and destroy parallelism? At the very least, I would run `distinct` or its equivalent on the cluster.
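A sketch of keeping the deduplication on the cluster instead of collecting first (assuming `activeRangesDF` has a `range_id` column as in the quoted code):

```scala
// Deduplicate as a distributed shuffle; only the already-distinct ids
// ever reach the driver (and only if something later collects them).
val rangesRDD = activeRangesDF
  .select("range_id")
  .distinct()
  .rdd
  .map(_.getString(0))
```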
```scala
val activeAddresses = rangesRDD.flatMap(range => {
  getEntryAddressesIfInSet(range, conf, repo, expiredAddr)
}).map(x => Row(x))
```
This looks like a flatMap with side data. I think it will be somewhat tricky, because `expiredAddr` needs to be broadcast to all the workers. It looks like some kind of JOIN, so I would prefer if you could write `ranges` as an RDD (or any other Spark table, I don't really know the difference...) and join with that here.
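One way the join-based version could look (a sketch only; `entryAddressesOfRange` is a hypothetical stand-in for expanding a range id into the entry addresses it contains, and `spark` is the session in scope):

```scala
import org.apache.spark.sql.{Dataset, Row, SparkSession}

// Hypothetical helper: expand one range id into its entry addresses.
def entryAddressesOfRange(rangeID: String): Seq[String] = ???

def subtractActive(expired: Dataset[Row], activeRangesDF: Dataset[Row],
                   spark: SparkSession): Dataset[Row] = {
  import spark.implicits._

  // Expand ranges to addresses on the workers; nothing is collected to
  // the driver and no Set has to be broadcast by hand.
  val activeAddressesDF = activeRangesDF
    .select("range_id").distinct().as[String]
    .flatMap(entryAddressesOfRange _)
    .toDF("address")

  // Expired addresses that appear in no active range: a left anti join.
  expired.join(activeAddressesDF, Seq("address"), "left_anti")
}
```

Spark then plans the anti-join itself, and will broadcast the smaller side automatically when it is under the broadcast-join threshold.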
Force-pushed 9c19252 to 02ab338.
Thanks!
Please add a comment on l. 42 about what an empty location means.
```scala
if (metaRangeID != "") {
  val metaRange = metadataApi.getMetaRange(repoName, metaRangeID)
  val location = metaRange.getLocation
  URI.create(getStorageNamespace(repoName) + "/" + location).normalize().toString
```
As you probably know, I am not a fan of constructing URIs by hand and prefer `resolve`, which is specified by a standard. The issue here is sensitivity to trailing and leading slashes.
(Not a blocker)
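A sketch of the `resolve`-based alternative, using the names from the quoted snippet (the trailing-slash handling is the assumption being made explicit):

```scala
import java.net.URI

// resolve() follows RFC 3986 reference resolution. The trailing slash on
// the base is what makes `location` resolve *under* the namespace instead
// of replacing its last path segment.
val ns = getStorageNamespace(repoName)
val base = URI.create(if (ns.endsWith("/")) ns else ns + "/")
val metaRangeLocation = base.resolve(location).normalize().toString
```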
```scala
    (new String(range.id), range.message.minKey.toByteArray, range.message.maxKey.toByteArray)
  )
  .toSet
if (location == "") Set[(String, Array[Byte], Array[Byte])]()
```
I don't understand: when is `location == ""` a valid return? If it has special meaning, please add a comment to document that meaning.
@arielshaqed, thanks for the review, PTAL
Cool! Thanks...