Don't send on unbuffered channel while holding a lock #397

Conversation

@ejweber ejweber commented Feb 13, 2024

Which issue(s) this PR fixes:

longhorn/longhorn#7919

What this PR does / why we need it:

There is a potential deadlock caused by pm.getProcessesToUpdateConditions. See the linked issue for details.

We only need the ProcessManager lock while we are iterating through the Process map. We should not continue to hold it while sending on an unbuffered channel.
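
For illustration, a minimal sketch of that pattern, using hypothetical types and fields rather than the actual instance-manager code: copy what we need from the map while holding the read lock, release it, and only then send on the unbuffered channel.

```go
// Hypothetical sketch of the locking pattern described above; the types,
// fields, and signature are illustrative, not the real instance-manager code.
package main

import (
	"fmt"
	"sync"
)

type Process struct{ Name string }

type Manager struct {
	lock      sync.RWMutex
	processes map[string]*Process
}

// Collect the processes of interest while holding the lock, release it,
// and only then send on the unbuffered channel. Sending while the lock is
// still held can deadlock if the receiver needs the same lock to proceed.
func (m *Manager) getProcessesToUpdateConditions(out chan<- *Process) {
	m.lock.RLock()
	toSend := make([]*Process, 0, len(m.processes))
	for _, p := range m.processes {
		toSend = append(toSend, p)
	}
	m.lock.RUnlock() // release before any potentially blocking send

	for _, p := range toSend {
		out <- p // no lock held while this blocks
	}
}

func main() {
	m := &Manager{processes: map[string]*Process{"a": {Name: "a"}}}
	ch := make(chan *Process) // unbuffered
	go m.getProcessesToUpdateConditions(ch)
	fmt.Println((<-ch).Name)
}
```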

Longhorn 7919

Signed-off-by: Eric Weber <eric.weber@suse.com>
@ejweber ejweber force-pushed the 7919-checkMountPointStatusForEngine-deadlock branch from 73660c6 to 9f14426 Compare February 13, 2024 21:41
ejweber commented Feb 13, 2024

I tested this by following the reproduction steps from longhorn/longhorn#7919 (comment) with a modified instance-manager that had these changes AND a MountCheckInterval of 250 ms. So far, we have survived 125 iterations without deadlock, but I will continue to let it run.

@ejweber ejweber marked this pull request as ready for review February 13, 2024 22:17
@ejweber ejweber requested a review from a team as a code owner February 13, 2024 22:17
@PhanLe1010 (Contributor) left a comment

LGTM

A side question: do we observe high memory and CPU consumption once there are many goroutines stuck in this case? It might show up if we leave the deadlock in place for a long time and the goroutines keep building up.

@james-munson (Contributor) left a comment

Nice.


ejweber commented Feb 15, 2024

A side question: do we observe high memory and CPU consumption once there are many goroutines stuck in this case? It might show up if we leave the deadlock in place for a long time and the goroutines keep building up.

Good question. I put my cluster into the stuck state and am monitoring it. For now, CPU usage is stable and low, but memory appears to be slowly leaking in the instance-manager pod as goroutines like the following build up:

goroutine profile: total 334
77 @ 0x43e7ae 0x44fcf8 0x44fccf 0x46da45 0xa73385 0xa73354 0xa72805 0x92e5d4 0x8fc103 0x90120c 0x8fa079 0x471a41
#	0x46da44	sync.runtime_SemacquireRWMutexR+0x24										/usr/local/go/src/runtime/sema.go:82
#	0xa73384	sync.(*RWMutex).RLock+0x84											/usr/local/go/src/sync/rwmutex.go:71
#	0xa73353	github.com/longhorn/longhorn-instance-manager/pkg/process.(*Manager).findProcess+0x53				/go/src/github.com/longhorn/longhorn-instance-manager/pkg/process/process_manager.go:350
#	0xa72804	github.com/longhorn/longhorn-instance-manager/pkg/process.(*Manager).ProcessDelete+0xc4				/go/src/github.com/longhorn/longhorn-instance-manager/pkg/process/process_manager.go:275
#	0x92e5d3	github.com/longhorn/longhorn-instance-manager/pkg/imrpc._ProcessManagerService_ProcessDelete_Handler+0x1b3	/go/src/github.com/longhorn/longhorn-instance-manager/pkg/imrpc/imrpc.pb.go:1302
#	0x8fc102	google.golang.org/grpc.(*Server).processUnaryRPC+0xe02								/go/src/github.com/longhorn/longhorn-instance-manager/vendor/google.golang.org/grpc/server.go:1372
#	0x90120b	google.golang.org/grpc.(*Server).handleStream+0xfeb								/go/src/github.com/longhorn/longhorn-instance-manager/vendor/google.golang.org/grpc/server.go:1783
#	0x8fa078	google.golang.org/grpc.(*Server).serveStreams.func2.1+0x58							/go/src/github.com/longhorn/longhorn-instance-manager/vendor/google.golang.org/grpc/server.go:1016

ProcessDelete keeps getting called because we are trying to detach volumes. Each server goroutine that is supposed to answer the request gets hung on the deadlocked RWMutex.
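
To illustrate why even read-only handlers pile up (this is a standalone toy example, not the real code): once one goroutine holds the RWMutex read lock while blocked on an unbuffered channel send and a writer queues up behind it, Go's sync.RWMutex makes every subsequent RLock call wait as well.

```go
// Toy illustration of the pile-up described above; it is not the
// instance-manager code, just the same locking shape.
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

func main() {
	var mu sync.RWMutex
	ch := make(chan int) // unbuffered; nobody is receiving yet

	// Goroutine A: holds the read lock while blocked on the channel send,
	// like getProcessesToUpdateConditions before this fix.
	go func() {
		mu.RLock()
		defer mu.RUnlock()
		ch <- 1 // blocks until someone receives
	}()
	time.Sleep(100 * time.Millisecond)

	// Goroutine B: a writer queues up behind A's read lock.
	go func() {
		mu.Lock()
		mu.Unlock()
	}()
	time.Sleep(100 * time.Millisecond)

	// Goroutines C..: read-only callers (think ProcessDelete -> findProcess).
	// Because a writer is pending, these RLock calls block too.
	for i := 0; i < 5; i++ {
		go func(i int) {
			mu.RLock()
			defer mu.RUnlock()
			fmt.Println("reader", i, "got the lock")
		}(i)
	}
	time.Sleep(100 * time.Millisecond)

	fmt.Println("goroutines while stuck:", runtime.NumGoroutine())

	<-ch // receiving unblocks A; it releases the lock and the backlog drains
	time.Sleep(100 * time.Millisecond)
	fmt.Println("goroutines after draining:", runtime.NumGoroutine())
}
```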

[screenshot: instance-manager pod memory usage over time]

From 445 MiB to 549 MiB in an hour and a half or so.
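
For reference, dumps in the format shown above are what Go's pprof goroutine profile produces with debug=1. A hypothetical way to expose and fetch such a profile (not necessarily how this one was collected):

```go
// Hypothetical sketch: expose net/http/pprof so a goroutine dump like the
// one above can be fetched; not necessarily how this profile was collected.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
)

func main() {
	// Fetch with: curl 'http://localhost:6060/debug/pprof/goroutine?debug=1'
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```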

@PhanLe1010 PhanLe1010 merged commit ecd20d6 into longhorn:master Feb 15, 2024
5 of 6 checks passed

ejweber commented Feb 15, 2024

@mergify backport v1.6.x

mergify bot commented Feb 15, 2024

backport v1.6.x

✅ Backports have been created

@innobead (Member)

@mergify backport v1.5.x v1.4.x

mergify bot commented Feb 16, 2024

backport v1.5.x v1.4.x

✅ Backports have been created

@innobead (Member)

In this case, the takeaway is that the process manager's lock (and locks generally) should only be held for a very short time, and never across blocking operations, to prevent potential scalability issues.
