proof of concept: stop sequentially, fixing racy shutdown #1464
+116
−1
Probable fix for our racy shutdown issue: just wait for each coroutine's goroutine to exit before stopping the next.
Since waiting like this is fairly likely to get stuck when code is incorrect, it also reports failures to stop "quickly" (within 1 second). That threshold is quite permissive, as ~all correct code should finish on the order of a millisecond or less. Extreme system load could still cause it to be exceeded, but shutdown retains correct behavior even then, so I think we can address those log/metric complaints if/when they occur.
This also includes an earlier proof-of-concept to fix what might be a source of our goroutine leaks: using the wrong context writes to the wrong `aboutToBlock` channel, which causes more writes than reads on that channel, and therefore a deadlock.
The good(?) news is that fixing our shutdown logic reveals this deadlock in the `ContextMisuse` test: tests get stuck, and a `ctrl-\` shows its stacks. Merging in the wrong-chan-write fix resolves that, and tests run quickly/normally again.