
GC combine to one command and handle deduplications #2196

Merged
6 commits merged into master on Jul 6, 2021

Conversation


@guy-har guy-har commented Jul 5, 2021

No description provided.

Console.err.println(
-  "Usage: ... <repo_name> <runID> s3://storageNamespace/prepared_commits_table s3://storageNamespace/output_destination_table"
+  "Usage: ... <repo_name> <region>"
Contributor:

About the region - can we use withForceGlobalBucketAccessEnabled to start the client without the region?

Contributor Author:

That's for the previous SDK version. It seems they haven't solved it for the current version yet:
aws/aws-sdk-java-v2#2229
aws/aws-sdk-java-v2#52
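For reference, a sketch of the v1-only escape hatch discussed above (assuming the AWS SDK for Java v1 `AmazonS3ClientBuilder`; per the linked issues, SDK v2 has no equivalent yet, which is why `<region>` stays a required argument):

```scala
// Sketch only: AWS SDK for Java v1 can resolve a bucket's region at request
// time, so the client could be built without an explicit region.
import com.amazonaws.services.s3.{AmazonS3, AmazonS3ClientBuilder}

val v1Client: AmazonS3 = AmazonS3ClientBuilder
  .standard()
  .withForceGlobalBucketAccessEnabled(true) // v1-only cross-region access
  .build()
```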

val addressesDFLocation = args(3)

val region = args(1)
val previousRunID = "" //args(2) // TODO(Guys): get previous runID from arguments or from storage
Contributor:

So we're letting go of the previous run feature for now?

Contributor Author:

No, it will be the first thing we add after the release.

}

private def subtractDeduplications(expired: Dataset[Row], activeRangesDF: Dataset[Row], conf: APIConfigurations, repo: String, spark: SparkSession): Dataset[Row] = {
val expiredAddr: Set[String] = expired.select("address").collect().map(_.getString(0)).toSet
Contributor:

Suggested change
val expiredAddr: Set[String] = expired.select("address").collect().map(_.getString(0)).toSet
val deleteCandidateAddresses: Set[String] = expired.select("address").collect().map(_.getString(0)).toSet

val ranges: Seq[String] = activeRangesDF.select("range_id").collect().map(_.getString(0)).toSeq.distinct
val rangesRDD = spark.sparkContext.parallelize(ranges)

val activeAddresses =rangesRDD.flatMap(range=> {
Contributor:

Suggested change
val activeAddresses =rangesRDD.flatMap(range=> {
val activeDeleteCandidateAddresses = activeRangeRDD.flatMap(range => {

@arielshaqed (Contributor) left a comment:

I am not sure about the code that selects things, IIUC it will not let Spark do its thing and parallelize the run properly :-(

val location =
new ApiClient(conf.apiURL, conf.accessKey, conf.secretKey).getRangeURL(repo, rangeID)
SSTableReader
.forRange(new Configuration(), location)
Contributor:

Is this truly the configuration we want here?
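One alternative this question may be hinting at (a sketch, assuming a `spark` session is in scope at this point): reuse the Hadoop configuration Spark already carries, rather than an empty `new Configuration()` that ignores any configured S3 credentials and endpoints. Note the trade-off: Spark's context is not serializable, so inside an RDD closure a fresh `Configuration` may still be necessary.

```scala
// Sketch: reuse Spark's Hadoop configuration (which already carries
// fs.s3a.* credentials and endpoint settings) instead of an empty one.
val hadoopConf = spark.sparkContext.hadoopConfiguration
SSTableReader.forRange(hadoopConf, location)
```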


private def subtractDeduplications(expired: Dataset[Row], activeRangesDF: Dataset[Row], conf: APIConfigurations, repo: String, spark: SparkSession): Dataset[Row] = {
val expiredAddr: Set[String] = expired.select("address").collect().map(_.getString(0)).toSet
val ranges: Seq[String] = activeRangesDF.select("range_id").collect().map(_.getString(0)).toSeq.distinct
Contributor:

Not sure I understand: doesn't collect directly move everything to the driver program and destroy parallelism? At the very least, I would run distinct or its equivalent on the cluster.
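A sketch of keeping the deduplication on the cluster, as suggested above (hypothetical variable names; only the final, already-distinct result is pulled to the driver, if a driver-side `Seq` is needed at all):

```scala
// Sketch: run distinct on the executors; keep the result as a DataFrame
// so downstream steps can stay distributed.
val rangeIdsDF = activeRangesDF.select("range_id").distinct()

// Only if a driver-side Seq is unavoidable, collect after distinct():
val ranges: Seq[String] = rangeIdsDF.collect().map(_.getString(0)).toSeq
```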

Comment on lines 243 to 245
val activeAddresses =rangesRDD.flatMap(range=> {
getEntryAddressesIfInSet(range, conf, repo, expiredAddr)
}).map(x => Row(x))
Contributor:

This looks like a flatMap with side data. I think that it will be somewhat tricky, because expiredAddr needs to be broadcast to all the workers. It looks like it is some kind of JOIN, so I would prefer if you could write ranges as an RDD (or any other Spark table, I don't really know the difference...) and join with that here.
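A sketch of the join-based shape suggested here (`entryAddressesDF` is hypothetical: one row per entry address found in an active range; both inputs stay distributed, and Spark can broadcast the smaller side itself):

```scala
// Sketch: express "is this expired address still referenced by an active
// range?" as a join instead of shipping a Set inside a flatMap closure.
val stillReferenced = expired.join(entryAddressesDF, Seq("address"), "left_semi")
val safeToDelete    = expired.join(entryAddressesDF, Seq("address"), "left_anti")
```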

@guy-har guy-har requested a review from arielshaqed July 5, 2021 11:28
@arielshaqed (Contributor) left a comment:

Thanks!

Please add a comment on l. 42 about what an empty location means.

if (metaRangeID != "") {
val metaRange = metadataApi.getMetaRange(repoName, metaRangeID)
val location = metaRange.getLocation
URI.create(getStorageNamespace(repoName) + "/" + location).normalize().toString
Contributor:

As you probably know, I am not a fan of constructing URIs by hand and prefer resolve, which is specified by a standard. The issue here is sensitivity to trailing and leading slashes.
(Not a blocker)
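A minimal sketch of the resolve-based construction (pure `java.net.URI`; the bucket name and paths are illustrative). Note that RFC 3986 resolution requires the base to end with a slash and the relative part not to start with one:

```scala
import java.net.URI

// Sketch: standard URI resolution instead of string concatenation.
// The namespace must end in "/" for the location to resolve under it.
val ns = "s3://bucket/repo-namespace"
val base = URI.create(if (ns.endsWith("/")) ns else ns + "/")
val full = base.resolve("ranges/r1").normalize().toString
// full == "s3://bucket/repo-namespace/ranges/r1"
```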

(new String(range.id), range.message.minKey.toByteArray, range.message.maxKey.toByteArray)
)
.toSet
if (location == "") Set[(String, Array[Byte], Array[Byte])]()
Contributor:

I don't understand: when is location == "" a valid return? If it has special meaning, please add a comment to document that meaning.
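If an empty location indeed encodes "no metarange for this commit" (an assumption the author should confirm), the documenting comment being asked for could look like:

```scala
// Assumption to confirm: an empty location means the commit has no
// metarange (e.g. an empty repository or branch), so there are no ranges
// to enumerate and the result set is trivially empty.
if (location == "") Set[(String, Array[Byte], Array[Byte])]()
```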

Contributor Author:

guy-har commented Jul 6, 2021

@arielshaqed, Thanks for the review, PTAL

@guy-har guy-har requested a review from arielshaqed July 6, 2021 08:27
@guy-har guy-har marked this pull request as ready for review July 6, 2021 08:27
@arielshaqed (Contributor) left a comment:

Cool! Thanks...

@guy-har guy-har merged commit 80405c0 into master Jul 6, 2021
@guy-har guy-har deleted the feature/gc-combined branch July 6, 2021 12:19
@guy-har guy-har linked an issue Jul 7, 2021 that may be closed by this pull request
Successfully merging this pull request may close these issues.

[GC] - Logic for finding expired addresses
3 participants