proof of concept: stop sequentially, fixing racy shutdown #1464

Groxx · 2025-10-15T00:23:05Z

Probable fix for our racy shutdown issue: just wait for each coroutine's goroutine to exit before stopping the next one.

Since this makes it fairly likely that incorrect code will get stuck during shutdown, it also reports failures to stop "quickly" (within 1 second). That threshold is quite permissive, as ~all correct code should finish on the order of a millisecond or less. Extreme system load could still cause it to be exceeded, but behavior stays correct even then, so I think we can address those log/metric complaints if/when they occur.

This also includes an earlier proof-of-concept to fix what might be a source of our goroutine leaks: using the wrong context writes to the wrong aboutToBlock channel, so more writes occur on that channel than reads, which causes a deadlock.

The good(?) news is that fixing our shutdown logic reveals this deadlock in the ContextMisuse test - tests get stuck, and a ctrl-\ shows its stacks. Merging in the wrong-chan-write fix resolves that, and tests run quickly/normally again.

Thanks to some internal tooling to look for deadlocked goroutines, this test:

> go.uber.org/cadence/internal.(*WorkflowTestSuiteUnitTest).Test_ContextMisuse

appears to trigger a goroutine leak.

Since it's not doing anything *impossible* for users to do (it's just using the wrong context arg), it seems like this implies we've got incorrect goroutine shutdown code.

I'm not yet fully confident that this is The Fix and that it does not cause other issues, but it seems pretty likely so far.

I don't know if this would explain the in-production leaks we've seen in the past but not reproduced (those seem to be query-triggered somehow), but it *does* seem quite likely that this has leaked somewhere. Hopefully it also caused errors that pointed to the bad code and led to a fix.

Groxx added 4 commits September 23, 2025 15:58

fix a test-only coroutine leak, as it is causing some noise for other leak searches

82db4f3

Merge branch 'maybe-bad-yield' into stop-coroutines

0ec622f

probable fix for racy goroutine shutdown: leak instead

eeb648b

this relies on changes merged in from maybe-bad-yield, as otherwise some tests deadlock. Signed-off-by: Steven L <imgroxx@gmail.com>
