Don't send on unbuffered channel while holding a lock #397

Conversation

@ejweber ejweber commented Feb 13, 2024

Which issue(s) this PR fixes:

longhorn/longhorn#7919

What this PR does / why we need it:

There is a potential deadlock caused by pm.getProcessesToUpdateConditions. See the linked issue for details.

We only need the ProcessManager lock while we are iterating through the Process map. We should not continue to hold it while sending on an unbuffered channel.
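
For illustration, a minimal sketch of that pattern, using hypothetical types and fields rather than the actual instance-manager code: copy what we need from the map while holding the read lock, release it, and only then send on the unbuffered channel.

```go
// Hypothetical sketch of the locking pattern described above; the types,
// fields, and signature are illustrative, not the real instance-manager code.
package main

import (
	"fmt"
	"sync"
)

type Process struct{ Name string }

type Manager struct {
	lock      sync.RWMutex
	processes map[string]*Process
}

// Collect the processes of interest while holding the lock, release it,
// and only then send on the unbuffered channel. Sending while the lock is
// still held can deadlock if the receiver needs the same lock to proceed.
func (m *Manager) getProcessesToUpdateConditions(out chan<- *Process) {
	m.lock.RLock()
	toSend := make([]*Process, 0, len(m.processes))
	for _, p := range m.processes {
		toSend = append(toSend, p)
	}
	m.lock.RUnlock() // release before any potentially blocking send

	for _, p := range toSend {
		out <- p // no lock held while this blocks
	}
}

func main() {
	m := &Manager{processes: map[string]*Process{"a": {Name: "a"}}}
	ch := make(chan *Process) // unbuffered
	go m.getProcessesToUpdateConditions(ch)
	fmt.Println((<-ch).Name)
}
```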

Longhorn 7919

Signed-off-by: Eric Weber <eric.weber@suse.com>
@ejweber ejweber force-pushed the 7919-checkMountPointStatusForEngine-deadlock branch from 73660c6 to 9f14426 Compare February 13, 2024 21:41
ejweber commented Feb 13, 2024

I tested this by following the reproduction steps from longhorn/longhorn#7919 (comment) with a modified instance-manager that had these changes AND a MountCheckInterval of 250 ms. So far, we have survived 125 iterations without deadlock, but I will continue to let it run.

@ejweber ejweber marked this pull request as ready for review February 13, 2024 22:17
@ejweber ejweber requested a review from a team as a code owner February 13, 2024 22:17
@PhanLe1010 (Contributor) left a comment

LGTM

A side question: do we observe high memory and CPU consumption once there are many goroutines stuck in this case? It might show up if we leave the deadlock in place for a long time and the goroutines keep building up.

@james-munson (Contributor) left a comment

Nice.


ejweber commented Feb 15, 2024

A side question: do we observe high memory and CPU consumption once there are many goroutines stuck in this case? It might show up if we leave the deadlock in place for a long time and the goroutines keep building up.

Good question. I put my cluster into the stuck state and am monitoring it. For now, CPU usage is stable and low, but memory appears to be slowly leaking in the instance-manager pod as goroutines like the following build up:

goroutine profile: total 334
77 @ 0x43e7ae 0x44fcf8 0x44fccf 0x46da45 0xa73385 0xa73354 0xa72805 0x92e5d4 0x8fc103 0x90120c 0x8fa079 0x471a41
#	0x46da44	sync.runtime_SemacquireRWMutexR+0x24										/usr/local/go/src/runtime/sema.go:82
#	0xa73384	sync.(*RWMutex).RLock+0x84											/usr/local/go/src/sync/rwmutex.go:71
#	0xa73353	github.com/longhorn/longhorn-instance-manager/pkg/process.(*Manager).findProcess+0x53				/go/src/github.com/longhorn/longhorn-instance-manager/pkg/process/process_manager.go:350
#	0xa72804	github.com/longhorn/longhorn-instance-manager/pkg/process.(*Manager).ProcessDelete+0xc4				/go/src/github.com/longhorn/longhorn-instance-manager/pkg/process/process_manager.go:275
#	0x92e5d3	github.com/longhorn/longhorn-instance-manager/pkg/imrpc._ProcessManagerService_ProcessDelete_Handler+0x1b3	/go/src/github.com/longhorn/longhorn-instance-manager/pkg/imrpc/imrpc.pb.go:1302
#	0x8fc102	google.golang.org/grpc.(*Server).processUnaryRPC+0xe02								/go/src/github.com/longhorn/longhorn-instance-manager/vendor/google.golang.org/grpc/server.go:1372
#	0x90120b	google.golang.org/grpc.(*Server).handleStream+0xfeb								/go/src/github.com/longhorn/longhorn-instance-manager/vendor/google.golang.org/grpc/server.go:1783
#	0x8fa078	google.golang.org/grpc.(*Server).serveStreams.func2.1+0x58							/go/src/github.com/longhorn/longhorn-instance-manager/vendor/google.golang.org/grpc/server.go:1016

ProcessDelete keeps getting called because we are trying to detach volumes. Each server goroutine that is supposed to answer the request gets hung on the deadlocked RWMutex.
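
To illustrate why even read-only handlers pile up (this is a standalone toy example, not the real code): once one goroutine holds the RWMutex read lock while blocked on an unbuffered channel send and a writer queues up behind it, Go's sync.RWMutex makes every subsequent RLock call wait as well.

```go
// Toy illustration of the pile-up described above; it is not the
// instance-manager code, just the same locking shape.
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

func main() {
	var mu sync.RWMutex
	ch := make(chan int) // unbuffered; nobody is receiving yet

	// Goroutine A: holds the read lock while blocked on the channel send,
	// like getProcessesToUpdateConditions before this fix.
	go func() {
		mu.RLock()
		defer mu.RUnlock()
		ch <- 1 // blocks until someone receives
	}()
	time.Sleep(100 * time.Millisecond)

	// Goroutine B: a writer queues up behind A's read lock.
	go func() {
		mu.Lock()
		mu.Unlock()
	}()
	time.Sleep(100 * time.Millisecond)

	// Goroutines C..: read-only callers (think ProcessDelete -> findProcess).
	// Because a writer is pending, these RLock calls block too.
	for i := 0; i < 5; i++ {
		go func(i int) {
			mu.RLock()
			defer mu.RUnlock()
			fmt.Println("reader", i, "got the lock")
		}(i)
	}
	time.Sleep(100 * time.Millisecond)

	fmt.Println("goroutines while stuck:", runtime.NumGoroutine())

	<-ch // receiving unblocks A; it releases the lock and the backlog drains
	time.Sleep(100 * time.Millisecond)
	fmt.Println("goroutines after draining:", runtime.NumGoroutine())
}
```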

[screenshot: instance-manager pod memory usage over time]

From 445 MiB to 549 MiB in an hour and a half or so.
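
For reference, dumps in the format shown above are what Go's pprof goroutine profile produces with debug=1. A hypothetical way to expose and fetch such a profile (not necessarily how this one was collected):

```go
// Hypothetical sketch: expose net/http/pprof so a goroutine dump like the
// one above can be fetched; not necessarily how this profile was collected.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
)

func main() {
	// Fetch with: curl 'http://localhost:6060/debug/pprof/goroutine?debug=1'
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```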

@PhanLe1010 PhanLe1010 merged commit ecd20d6 into longhorn:master Feb 15, 2024
5 of 6 checks passed

ejweber commented Feb 15, 2024

@mergify backport v1.6.x

mergify bot commented Feb 15, 2024

backport v1.6.x

✅ Backports have been created

@innobead (Member)

@mergify backport v1.5.x v1.4.x

mergify bot commented Feb 16, 2024

backport v1.5.x v1.4.x

✅ Backports have been created

@innobead (Member)

In this case, the takeaway is that the process manager's lock (and locks generally) should only be held for a very short time, and never across blocking operations, to prevent potential scalability issues.
