-
Notifications
You must be signed in to change notification settings - Fork 375
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Formalize Barrier behavior during waiting #3464
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've left a few comments!
I think in general this class has been problematic (e.g. see https://github.com/DataDog/ruby-guild/issues/50 and #3002 and #2990 ), so I left several suggestions for further simplifying it, which hopefully would also help in reducing the flakiness.
@once = false | ||
@timeout = timeout | ||
@lifted = false | ||
@deadline = timeout && Core::Utils::Time.get_time + timeout |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I'm not sure this is the correct intended semantics.
From reading the code, my understanding is that the intention is that the timeout
gets configured at component initialization time, but the actual timeout
would only start counting later, when the worker gets lazily initialized.
I guess cc @lloeki can help clarify :)
# If timeout is provided in this call, waits up to the smaller of | ||
# the provided wait timeout and the barrier timeout since the | ||
# barrier was created. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is a private class, and we never actually need to provide this second timeout
in production, I'm wondering if we should remove this feature until we need it.
# If neither wait timeout is provided in this call nor the | ||
# barrier timeout in the constructor, waits indefinitely until | ||
# the barrier is lifted. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should remove this feature too, since again, the one user of this class doesn't actually use this functionality ;)
# workaround for rubocop & steep trying to mangle the code | ||
if timeout && timeout.public_send(:<=, 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If Rubocop is being annoying, I suggest using an inline # rubocop:disable
instead of making the code "worse" just to make it happy.
I'm curious about the issue with steep
-- maybe it's something we can fix in the type signatures? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
steep refuses to permit timeout <= 0
on account of timeout allegedly being nil
(the preceding check for it being truthy doesn't count, apparently).
# - starting with Ruby 3.2, ConditionVariable#wait returns nil on | ||
# timeout and an integer otherwise | ||
# - before Ruby 3.2, ConditionVariable returns itself | ||
# so we have to rely on @once having been set | ||
if RUBY_VERSION >= '3.2' | ||
lifted = @condition.wait(@mutex, timeout) | ||
else | ||
@condition.wait(@mutex, timeout) | ||
lifted = @once | ||
end | ||
# so we have to rely on @lifted having been set | ||
lifted = if RUBY_VERSION >= '3.2' | ||
!!@condition.wait(@mutex, timeout) | ||
else | ||
@condition.wait(@mutex, timeout) | ||
@lifted | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: To be honest, I'm not sure it's worth keeping a multiple line comment + 2 implementations, rather than just using the one implementation that works on all Rubies ;)
|
||
@once ||= true | ||
@mutex.synchronize do | ||
@once ||= true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this still correct? Should this be @lifted
?
end | ||
|
||
before do | ||
record |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: I believe this can be simplified by making record
a let!
Thank you for the review @ivoanjo , I discussed this with @lloeki and this PR definitely needs more work. At a minimum, my assumption of when the "work" process starts was incorrect; to continue in the spirit of this PR I would need to add a method that would start the "timeout from work start". Construction time isn't at all correct to take for the moment when work starts. Second issue that @lloeki raised was that in the existing implementation, waiting threads wouldn't all be unblocked together, but in the implementation proposed in this PR they would be (as soon as the work timeout expires). Unblocking all threads at the same time was not an appealing idea. And generally, whether the overall approach taken by this PR was the desired one (i.e. whether timeouts should count from when the work started or from when the waits started), is up for debate. I believe this PR is orthogonal to the thread leak issue. I opened it because I thought it would make the code easier to reason about but given the feedback perhaps the barrier can be left as is for now and I'll be focusing on the start/stop calls right now. |
What does this PR do?
This PR modifies the Barrier class to have, in my opinion, more understandable/easier to reason about timeout behavior during the waits.
The new behavior is as follows:
Previous behavior:
Examples
New behavior: after 1 second of waiting, control returns to caller.
Previous behavior: after 2 seconds of waiting, control returns to caller.
New behavior: control returns instantly.
Previous behavior: after 2 seconds of waiting, control returns to caller.
after 1 second second thread waits with 1 second timeout.
New behavior: Each waiting thread waits 1 second, then continues.
Previous behavior: First thread waits 1 second, second thread continues immediately.
Motivation:
The existing behavior is I think confusing and, although probably does not present issues given the class is used in a very limited way by dd-trace-rb, may cause issues if the usage of Barrier was expanded.
Additional Notes:
This PR is part of the fix for thread leaks which will be in a follow-up PR.
How to test the change?
This PR adds several unit tests which check elapsed time of wait operations. The shortest timeouts are set to 0.25 seconds, which is hopefully sufficient to not produce flakiness in CI while keeping the test runs quick.
The entire barrier_spec.rb takes 7 seconds to execute currently.