Formalize Barrier behavior during waiting #3464

Closed

p-datadog wants to merge 20 commits into DataDog:master from p-datadog:barrier-refactor

Contributor

p-datadog commented Feb 16, 2024

What does this PR do?

This PR modifies the Barrier class so that its timeout behavior during waits is, in my opinion, easier to understand and reason about.

The new behavior is as follows:

A timeout can be defined for the barrier. Any wait will wait up to this timeout from the moment the barrier is created (and its corresponding operation, presumably, started).
If a timeout is provided for the individual wait operation, the wait will be no longer than this timeout (but will not end later than barrier creation + barrier timeout).
If multiple waits are done, they will continue waiting until the barrier timeout elapses from barrier construction (or until the individual wait timeout expires, whichever comes first).
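
The rules above amount to clamping each wait to the time remaining before the barrier's deadline. Below is a minimal sketch of that calculation. The helper name `effective_wait` and its parameters are hypothetical and not taken from the PR; `now` is assumed to be a monotonic clock reading in seconds.

```ruby
# Hypothetical helper, not part of the PR: computes how long a single
# wait may block under the new rules.
#
# deadline     - monotonic time at which the barrier timeout expires, or nil
# wait_timeout - per-wait timeout in seconds, or nil
# now          - current monotonic time
#
# Returns the number of seconds to wait, or nil to wait indefinitely.
def effective_wait(deadline, wait_timeout, now)
  # Time left until the barrier deadline, never negative.
  remaining = deadline && [deadline - now, 0].max
  # Pick the smaller of the two limits; ignore whichever is not set.
  [remaining, wait_timeout].compact.min
end
```

With a barrier created at time 0 with a 2 second timeout (deadline 2.0), `effective_wait(2.0, nil, 1.0)` gives 1 second, `effective_wait(2.0, nil, 3.0)` gives 0, and `effective_wait(nil, nil, 1.0)` gives `nil`, meaning the wait is unbounded.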

Previous behavior:

A timeout can be defined for the barrier, but this timeout is applied to the wait operations, not to the barrier operation.
When waiting, the wait timeout replaces the barrier timeout, even if the wait timeout is longer.
A wait always lasts up to the full selected timeout, even if the barrier was created (and thus the barrier operation started) a long time ago.
Upon completion of (any) wait, the barrier is marked as having been lifted.

Examples

Start operation with 2 second timeout, after 1 second, wait for it.

New behavior: after 1 second of waiting, control returns to caller.
Previous behavior: after 2 seconds of waiting, control returns to caller.

Start operation with 2 second timeout, after 3 seconds, wait for it.

New behavior: control returns instantly.
Previous behavior: after 2 seconds of waiting, control returns to caller.

Start operation with 2 second timeout, wait for it with 1 second timeout;
after 1 second, a second thread waits with 1 second timeout.

New behavior: Each waiting thread waits 1 second, then continues.
Previous behavior: First thread waits 1 second, second thread continues immediately.

Motivation:

I think the existing behavior is confusing. Although it probably does not present issues, given that the class is used in a very limited way by dd-trace-rb, it may cause issues if the usage of Barrier were expanded.

Additional Notes:

This PR is part of the fix for thread leaks which will be in a follow-up PR.

How to test the change?

This PR adds several unit tests which check elapsed time of wait operations. The shortest timeouts are set to 0.25 seconds, which is hopefully sufficient to not produce flakiness in CI while keeping the test runs quick.

The entire barrier_spec.rb takes 7 seconds to execute currently.

p added 3 commits

February 16, 2024 13:17


          move tests to own file

aa67ab1


          verify timing

5fcde1f


          define barrier behavior

bd5ded8

p-datadog requested a review from a team as a code owner

February 16, 2024 18:37

github-actions bot added the core label

p added 17 commits

February 16, 2024 13:40


          rubocop

7d63a93

ugh

f1f889d


          are you smart?

7f9195d


          nope

7e7be77


          nope


          Revert "nope"

This reverts commit 8106692.


          Revert "nope"

0c8c034

This reverts commit 7e7be77.


          Revert "are you smart?"

This reverts commit 7f9195d.


          Revert "ugh"

7d13ea6

This reverts commit f1f889d.


          Revert "rubocop"

dd784af

This reverts commit 7d63a93.


          rubocop

3b7a6f1


          rubocop

163d08c


          steep

1c46f57


          rubocop


          note

d54d815


          rubocop

3e91f3b


          fix locking post cherry-picks

0f2fb9e

ivoanjo reviewed

Member

ivoanjo left a comment

I've left a few comments!

I think in general this class has been problematic (e.g. see https://github.com/DataDog/ruby-guild/issues/50 and #3002 and #2990 ), so I left several suggestions for further simplifying it, which hopefully would also help in reducing the flakiness.

lib/datadog/core/remote/component.rb

    
                          @once = false

                          @timeout = timeout

                          @lifted = false

                          @deadline = timeout && Core::Utils::Time.get_time + timeout

Member

ivoanjo Feb 19, 2024

So I'm not sure this is the correct intended semantics.

From reading the code, my understanding is that the intention is that the timeout gets configured at component initialization time, but the actual timeout would only start counting later, when the worker gets lazily initialized.

I guess cc @lloeki can help clarify :)

lib/datadog/core/remote/component.rb

Comment on lines +120 to +122

+                        # If timeout is provided in this call, waits up to the smaller of
+                        # the provided wait timeout and the barrier timeout since the
+                        # barrier was created.

Member

ivoanjo Feb 19, 2024

Since this is a private class, and we never actually need to provide this second timeout in production, I'm wondering if we should remove this feature until we need it.

lib/datadog/core/remote/component.rb

Comment on lines +124 to +126

+                        # If neither wait timeout is provided in this call nor the
+                        # barrier timeout in the constructor, waits indefinitely until
+                        # the barrier is lifted.

Member

ivoanjo Feb 19, 2024

I wonder if we should remove this feature too, since again, the one user of this class doesn't actually use this functionality ;)

lib/datadog/core/remote/component.rb

Comment on lines +152 to +153

		# workaround for rubocop & steep trying to mangle the code
		if timeout && timeout.public_send(:<=, 0)

Member

ivoanjo Feb 19, 2024

If Rubocop is being annoying, I suggest using an inline # rubocop:disable instead of making the code "worse" just to make it happy.

I'm curious about the issue with steep -- maybe it's something we can fix in the type signatures? 🤔

Contributor Author

p-datadog Feb 22, 2024

steep refuses to permit timeout <= 0 on account of timeout allegedly being nil (the preceding check for it being truthy doesn't count, apparently).

lib/datadog/core/remote/component.rb

Comment on lines 158 to +167

    
                            # - starting with Ruby 3.2, ConditionVariable#wait returns nil on

                            #   timeout and an integer otherwise

                            # - before Ruby 3.2, ConditionVariable returns itself

                            # so we have to rely on @once having been set

                            if RUBY_VERSION >= '3.2'

                              lifted = @condition.wait(@mutex, timeout)

                            else

                              @condition.wait(@mutex, timeout)

                              lifted = @once

                            end

                            # so we have to rely on @lifted having been set

                            lifted = if RUBY_VERSION >= '3.2'

                                       !!@condition.wait(@mutex, timeout)

                                     else

                                       @condition.wait(@mutex, timeout)

                                       @lifted

                                     end

Member

ivoanjo Feb 19, 2024

Minor: To be honest, I'm not sure it's worth keeping a multiple line comment + 2 implementations, rather than just using the one implementation that works on all Rubies ;)
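
A minimal sketch of what a single, version-independent implementation might look like: it ignores the return value of `ConditionVariable#wait` and reads the flag under the mutex instead. The class name `SketchBarrier` and the method names are illustrative only and are not the PR's code.

```ruby
# Illustrative sketch, not the PR's implementation: the result of a wait
# is taken from the @lifted flag rather than from the return value of
# ConditionVariable#wait, so the same code path works on all Rubies.
class SketchBarrier
  def initialize
    @lifted = false
    @mutex = Mutex.new
    @condition = ConditionVariable.new
  end

  # Blocks until the barrier is lifted or until timeout seconds elapse.
  # Returns true if the barrier was lifted, false otherwise.
  def wait_once(timeout = nil)
    @mutex.synchronize do
      return true if @lifted

      # The return value is deliberately ignored. The wait may also end
      # early without a signal, in which case @lifted is still false.
      @condition.wait(@mutex, timeout)
      @lifted
    end
  end

  # Marks the barrier as lifted and wakes up every waiting thread.
  def lift
    @mutex.synchronize do
      @lifted = true
      @condition.broadcast
    end
  end
end
```

Because the flag is written and read under the same mutex, a waiter that wakes up after `lift` observes `@lifted` as true regardless of the Ruby version.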

lib/datadog/core/remote/component.rb

-                          @once ||= true
+                          @mutex.synchronize do
+                            @once ||= true

Member

ivoanjo Feb 19, 2024

Is this still correct? Should this be @lifted?

spec/datadog/core/remote/component/barrier_spec.rb

+                  end
+                  before do
+                    record

Member

ivoanjo Feb 19, 2024

Minor: I believe this can be simplified by making record a let!

Contributor Author

p-datadog commented Feb 22, 2024

Thank you for the review @ivoanjo , I discussed this with @lloeki and this PR definitely needs more work.

At a minimum, my assumption about when the "work" process starts was incorrect; to continue in the spirit of this PR I would need to add a method that would start the "timeout from work start". Construction time is not the correct moment to treat as the start of the work.
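
As a rough illustration of that idea, the deadline could be set by an explicit call made when the work begins, instead of in the constructor. The class `DeadlineTracker`, its `start` and `remaining` methods, and the choice to report the full timeout before `start` is called are all assumptions made for this sketch; none of them come from the PR.

```ruby
# Hypothetical sketch: the deadline is fixed when the work starts,
# not when the object is constructed.
class DeadlineTracker
  def initialize(timeout)
    @timeout = timeout
    @deadline = nil
  end

  # Called when the work actually begins. Later calls do not move the
  # deadline (an assumption of this sketch).
  def start(now = Process.clock_gettime(Process::CLOCK_MONOTONIC))
    @deadline ||= now + @timeout
  end

  # Seconds left before the deadline, never negative. Before start is
  # called, the full timeout is reported (an assumption of this sketch).
  def remaining(now = Process.clock_gettime(Process::CLOCK_MONOTONIC))
    return @timeout if @deadline.nil?

    [@deadline - now, 0].max
  end
end
```

A tracker built with a 2 second timeout and started at time 10.0 reports 1 second remaining at time 11.0 and 0 at time 13.0.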

The second issue that @lloeki raised was that in the existing implementation, waiting threads wouldn't all be unblocked together, but in the implementation proposed in this PR they would be (as soon as the work timeout expires). Unblocking all threads at the same time was not an appealing idea.

And generally, whether the overall approach taken by this PR was the desired one (i.e. whether timeouts should count from when the work started or from when the waits started), is up for debate.

I believe this PR is orthogonal to the thread leak issue. I opened it because I thought it would make the code easier to reason about, but given the feedback, perhaps the barrier can be left as is for now, and I'll focus on the start/stop calls instead.

p-datadog closed this

Labels

core