[CELEBORN-1388] Use finer grained locks in changePartitionManager #2462
Conversation
@CodingCat, please follow the contribution guide to run
eh? I remember this was part of a unit test to detect any inconsistency between CelebornConf and the doc file, is it removed?
and after I ran it, it didn't generate any new change to the markdown file? anything I missed?
@CodingCat, is
buildConf("celeborn.client.shuffle.batchHandleChangePartition.parallelism") | ||
.categories("client") | ||
.internal | ||
.doc("max number of change partition requests which can be concurrently processed ") |
Please keep the first letter of the document's first word capitalized.
      None
    }
  }
}.filter(_.isDefined).map(_.get).toArray
I am not so sure of this change - it feels like it will have a lot more contention, as each entry in the requests map will need to acquire a lock. While uncontended locks are cheap to acquire, they are still not zero cost.
If you have a test bed to validate perf, how about this?
val batchPartitions = inBatchPartitions.get(shuffleId)
val distinctPartitions = requests.synchronized {
  // For each partition only need handle one request
  requests.asScala.filter { case (partitionId, _) =>
    !batchPartitions.contains(partitionId)
  }.map { case (partitionId, request) =>
    batchPartitions.add(partitionId)
    request.asScala.maxBy(_.epoch)
  }.toArray
}
Essentially minimize the time within the synchronized block itself by removing unnecessary costs.
based on our observation in our production system, it won't bring more contention ...
if we use requests.synchronized, all celeborn-dispatcher threads + all celeborn-client-life-cycle-manager-change-partition-executor threads will compete for the same object for locking, even though they are likely working on different partitions ... check the following screenshots
after this change, with a huge spark application with 300TB of shuffle data, I don't see such intensive lock contention anymore
Fair enough. Thanks for sharing the stack trace.
I would suggest that the changes I gave are relevant irrespective of the locking strategy - they minimize the time spent within the critical section.
it feels like it will have a lot more contention - as each entry in the requests map will need to acquire a lock
To reduce the frequency of acquiring locks, I think we can calculate the lock bucket for each partition id first, then group the partition ids by lock bucket, then acquire the lock and process each group (in random order). Though I'm not sure how beneficial this will be.
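For illustration, a rough sketch of that bucketing idea (hypothetical code; it assumes the locks array introduced in this PR and the per-shuffle requests map, and elides the actual request handling):

import scala.collection.JavaConverters._
import scala.util.Random

// Group pending partition ids by lock bucket, then take each bucket's lock
// once and handle all of that bucket's partitions under it, so each lock is
// acquired once per group instead of once per partition.
val byBucket = requests.asScala.keys.toSeq.groupBy(partitionId => partitionId % locks.length)
Random.shuffle(byBucket.toSeq).foreach { case (bucket, partitionIds) =>
  locks(bucket).synchronized {
    partitionIds.foreach { partitionId =>
      // ... handle the change partition request for partitionId ...
    }
  }
}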
@@ -151,7 +156,7 @@ class ChangePartitionManager(
           oldPartition,
           cause)

-      requests.synchronized {
+      locks(partitionId % locks.length).synchronized {
         if (requests.containsKey(partitionId)) {
           requests.get(partitionId).add(changePartition)
           logTrace(s"[handleRequestPartitionLocation] For $shuffleId, request for same partition" +
We could replace this with a computeIfAbsent?
Something like:
requests.synchronized {
  var newEntry = false
  val set = requests.computeIfAbsent(partitionId, (v1: Integer) => {
    // If new slot for the partition has been allocated, reply and return.
    // Else register and allocate for it.
    getLatestPartition(shuffleId, partitionId, oldEpoch).foreach { latestLoc =>
      context.reply(
        partitionId,
        StatusCode.SUCCESS,
        Some(latestLoc),
        lifecycleManager.workerStatusTracker.workerAvailable(oldPartition))
      logDebug(s"New partition found, old partition $partitionId-$oldEpoch return it." +
        s" shuffleId: $shuffleId $latestLoc")
      return
    }
    newEntry = true
    new util.HashSet[ChangePartitionRequest]()
  })
  set.add(changePartition)
  if (!newEntry) {
    logTrace(s"[handleRequestPartitionLocation] For $shuffleId, request for same partition" +
      s"$partitionId-$oldEpoch exists, register context.")
  }
}
if I understand the suggested code correctly, you essentially create a set in requests for each partition and keep adding requests to it,
I thought the same when iterating on the PR, however it turns out we cannot do it ....
basically it is not what the original code was doing... the original code always adds a new set containing a single request to the hash map, i.e. line 178 - 179
The original code is doing the same.
If the partition exists, it adds to the existing set - else it creates a new set and adds the entry.
(Removing other parts of the code, it is essentially:)
if (requests.containsKey(partitionId)) {
  requests.get(partitionId).add(changePartition)
} else {
  // an early exit condition, followed by:
  val set = new util.HashSet[ChangePartitionRequest]()
  set.add(changePartition)
  requests.put(partitionId, set)
}
It is probing the map multiple times though, which is something we can avoid.
(The return in the getLatestPartition case I suggested looks wrong though - we should return null from the mapping function and exit if set is null.)
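For reference, a hedged sketch of that correction (same hypothetical shape as the snippet above; returning null from the mapping function leaves the map unchanged, so computeIfAbsent returns null):

requests.synchronized {
  val set = requests.computeIfAbsent(partitionId, (_: Integer) => {
    // If a new slot for the partition has already been allocated, reply and
    // return null so no entry is inserted for this partition.
    getLatestPartition(shuffleId, partitionId, oldEpoch) match {
      case Some(latestLoc) =>
        context.reply(
          partitionId,
          StatusCode.SUCCESS,
          Some(latestLoc),
          lifecycleManager.workerStatusTracker.workerAvailable(oldPartition))
        null
      case None =>
        new util.HashSet[ChangePartitionRequest]()
    }
  })
  if (set == null) {
    // Already replied with the latest location; nothing to register.
    return
  }
  set.add(changePartition)
}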
requests.putIfAbsent(partitionId, set)
requests.get(partitionId).synchronized {
  getLatestPartition(shuffleId, partitionId, oldEpoch).foreach { latestLoc =>
    context.reply(
      partitionId,
      StatusCode.SUCCESS,
      Some(latestLoc),
      lifecycleManager.workerStatusTracker.workerAvailable(oldPartition))
    logDebug(s"New partition found, old partition $partitionId-$oldEpoch return it." +
      s" shuffleId: $shuffleId $latestLoc")
    return
  }
  requests.get(partitionId).add(changePartition)
}
this was my original code; somehow it makes the application get stuck, which is why I feel this putIfAbsent approach changed the original semantics in a stealthy way
This is strictly not the same as what exists in the main branch - I have not analyzed it in greater detail, but the critical sections are different.
Note that the changes I proposed above are meant to remove avoidable probes into the map and improve performance while not changing the critical sections ... but if the version I proposed does cause deadlocks/hangs, I would be very curious to know why! (a stack trace would definitely help) thanks.
I have updated the code, will run more tests in our env
val requestSet = inBatchPartitions.get(shuffleId)
requests.asScala.map { case (partitionId, request) =>
  locks(partitionId % locks.length).synchronized {
    if (!inBatchPartitions.contains(partitionId)) {
Do you mean requestSet.contains(partitionId) here?
Since requestSet is just a util.HashSet, it's not thread safe. Multiple threads can concurrently modify requestSet if they are processing different partition ids, which I think may cause undefined behavior. Maybe we need to change it to a ConcurrentHashMap.
yeah, I meant requestSet
@@ -3899,6 +3901,14 @@ object CelebornConf extends Logging {
     .booleanConf
     .createWithDefault(true)

+  val CLIENT_BATCH_HANDLE_CHANGE_PARTITION_PARALLELISM: ConfigEntry[Int] =
+    buildConf("celeborn.client.shuffle.batchHandleChangePartition.parallelism")
Maybe celeborn.client.shuffle.batchHandleChangePartition.partitionBuckets is better?
updated
Just another optimization: in ChangePartitionManager#replySuccess and ChangePartitionManager#replyFailure, we should remove from requestsMap before removing from inBatchPartitions, to avoid redundant change partition requests. And removing from inBatchPartitions should be guarded by locks.
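A hedged sketch of that ordering (requestsMap, inBatchPartitions and the locks array are the names used in the comment above; the reply logic itself is elided):

// In replySuccess / replyFailure: drop the pending requests first, then clear
// the in-batch marker, both under the same per-partition lock, so a concurrent
// handleRequestPartitionLocation cannot see the partition as "not in batch"
// while its old request entry still exists.
locks(partitionId % locks.length).synchronized {
  requestsMap.get(shuffleId).remove(partitionId)
  inBatchPartitions.get(shuffleId).remove(partitionId)
}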
cc @AngersZhuuuu @RexXiong @FMX could you also take a look at this PR?
val requestSet = inBatchPartitions.get(shuffleId)
requests.asScala.map { case (partitionId, request) =>
  locks(partitionId % locks.length).synchronized {
    if (!requestSet.contains(partitionId)) {
contains for ConcurrentHashMap is actually containsValue. It's better to use ConcurrentHashMap.newKeySet() instead of ConcurrentHashMap[Integer, Unit].
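A small standalone illustration of the pitfall and the suggested alternative (not code from the PR):

import java.util.concurrent.ConcurrentHashMap

// contains(Object) on ConcurrentHashMap is a legacy alias for containsValue,
// so it checks values, not keys.
val asMap = new ConcurrentHashMap[Integer, Unit]()
asMap.put(1, ())
asMap.contains(1)     // false: 1 is a key here, not a value
asMap.containsKey(1)  // true

// A concurrent key set reads naturally and avoids the trap.
val inBatch = ConcurrentHashMap.newKeySet[Integer]()
inBatch.add(1)
inBatch.contains(1)   // true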
oops, fixed
-      requests.synchronized {
+      locks(partitionId % locks.length).synchronized {
         if (requests.containsKey(partitionId)) {
           requests.get(partitionId).add(changePartition)
@CodingCat I think the "partition Id" of different shuffles can be repeated. In the previous implementation the lock was scoped to a single "shuffleId", but in your new implementation the same "partition Id" of different stages can contend for the same lock. Although a spark application won't run too many stages concurrently, the spark thrift server might run many stages.
The locks variable can be changed to avoid the lock contention between different stages:
private val locks = JavaUtils.newConcurrentHashMap[Int, Array[AnyRef]]()
I think creating an array of AnyRef won't cost more than the contended locks. 256 AnyRef objects consume about 2 KB of memory, so this suggestion won't introduce memory pressure.
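A hedged sketch of that suggestion (lockBuckets is a hypothetical config value and the helper name is illustrative):

// One small lock array per shuffle, created lazily, so the same partition id
// in different shuffles never contends on the same monitor.
private val locks = JavaUtils.newConcurrentHashMap[Int, Array[AnyRef]]()

private def lockFor(shuffleId: Int, partitionId: Int): AnyRef = {
  val shuffleLocks =
    locks.computeIfAbsent(shuffleId, (_: Int) => Array.fill(lockBuckets)(new AnyRef))
  shuffleLocks(partitionId % shuffleLocks.length)
}

// e.g. lockFor(shuffleId, partitionId).synchronized { ... }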
excellent point! just changed the code
LGTM, thanks! Merging to main(v0.5.0)
What changes were proposed in this pull request?
This PR proposes to use finer-grained locks in ChangePartitionManager when handling requests for different partitions.
Why are the changes needed?
We observed intensive lock contention when many partitions are split: most change-partition-executor threads compete for the ConcurrentHashMap used in ChangePartitionManager. This ConcurrentHashMap holds requests per partition, but we were locking the whole map instead of locking at the per-partition level.
With this change, the driver memory footprint is significantly reduced due to the increased processing throughput.
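In essence, the change replaces one monitor guarding the whole map with a small stripe of per-bucket monitors. A minimal self-contained sketch of the pattern (not the actual Celeborn code; names and types are simplified):

import java.util.concurrent.ConcurrentHashMap

class StripedRequests(buckets: Int = 256) {
  private val requests = new ConcurrentHashMap[Int, java.util.HashSet[String]]()
  private val locks = Array.fill(buckets)(new AnyRef)

  def add(partitionId: Int, request: String): Unit = {
    // Only callers whose partitions fall into the same bucket contend on this
    // monitor, instead of every caller serializing on one map-wide lock.
    locks(partitionId % locks.length).synchronized {
      var set = requests.get(partitionId)
      if (set == null) {
        set = new java.util.HashSet[String]()
        requests.put(partitionId, set)
      }
      set.add(request)
    }
  }
}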
Does this PR introduce any user-facing change?
One more config.
How was this patch tested?
Tested in production.
Closes apache#2462 from CodingCat/finer_grained_locks.
Authored-by: CodingCat <zhunansjtu@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>