Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set default timeout on curl handles #66

Merged
merged 7 commits into from
Nov 15, 2023
Merged

Set default timeout on curl handles #66

merged 7 commits into from
Nov 15, 2023

Conversation

dmehala
Copy link
Collaborator

@dmehala dmehala commented Oct 19, 2023

Why?

Now, drain guarantee after being executed all ongoing requests are processed (unless requests are posted while draining 👀) and if for whatever reason the request could not be processed, like timeout, the reason will be logged.

Content:

  • Curl default implementation set a 2s default timeout.
  • Drain waits indefinitely until the predicate is valid.

@dmehala dmehala requested review from cgilmour and dgoffredo October 19, 2023 13:19
@dgoffredo
Copy link
Contributor

class HTTPClient Use Cases

  1. Envoy

    Doesn't use libcurl. Request timeout of 1 second is hard-coded in implementation.
    drain() does nothing because the implementation is not threaded.

  2. nginx-datadog

    The motivating use case. The purpose of drain() is to guarantee that dd-trace-cpp I/O does not block worker shutdown for too long.
    It uses the drain() timeout hard-coded into class DatadogAgent.

  3. class DatadogAgent

    Flushes batches of trace segments every two seconds (by default). It's error handler logs, and its success handler interprets the response for sampling configuration.
    drain() is called in the destructor, so that when the tracer/application is shutting down, the last two seconds of trace segments have a chance to be sent to the agent.
    DatadogAgent doesn't specify a timeout in its requests, because timeouts are not part of the HTTPClient interface.

  4. Custom code using DatadogAgent

    Aside from Envoy and nginx-datadog, a hypothetical customer might instrument their own C++ code using dd-trace-cpp. They would be instantiating class Tracer, which implicitly instantiates class DatadogAgent, which implicitly creates a class Curl.
    Currently, the only thing that such a user can control is the flush interval. They can't adjust libcurl's request timeouts, and they can't adjust the timeout for drain().

  5. Custom code using HTTPClient

    A even more hypothetical customer might use class Curl for other purposes, in addition to sending traces to the agent. Such a customer can send requests to any endpoint, but cannot specify a timeout. They can drain() with a deadline, though.

Use cases in light of these changes

It doesn't make any difference to Envoy.

The only difference to nginx-datadog is that timeouts at any time will log an error, instead of quietly failing only when drain() is called. This is the reason for your changes.

The difference for class DatadogAgent is that it can no longer guarantee an upper bound on how long its destructor might block, though with knowledge of the particular implementation of the underlying HTTPClient, it will typically be two seconds.

The difference for direct use of class Curl is that there is now a hard-coded timeout, and drain() will block for no longer than that timeout.

My Recommendation

Let's add all of the bells and whistles.

Here's an idea:

  • Add a std::chrono::steady_clock::time_point deadline parameter to HTTPClient::post.
  • Convey that deadline in struct CurlImpl::Request. Don't convert it into a CURLOPT_TIMEOUT_MS until the moment before the handle is added to the multi handle by the run() thread.
  • Add an int request_timeout_milliseconds = 2000 member to struct DatadogAgentConfig, and a corresponding std::chrono::steady_clock::duration request_timeout to FinalizedDatadogAgentConfig.
  • Add an int shutdown_timeout_milliseconds = 2000 member to struct DatadogAgentConfig, and a corresponding std::chrono::steady_clock::duration shutdown_timeout to FinalizedDatadogAgentConfig.
  • Restore the deadline for drain().
  • Have ~DatadogAgent() use its configured shutdown timeout to calculate a deadline for drain().
  • Have DatadogAgent::flush() use its configured request timeout to calculate a deadline for HTTPClient::post.

That's the maximalist solution. What you have proposed in this PR is an improvement over the status quo in the sense that agent request taking more than two seconds log an error, which is good, but the omission of a deadline from the signature of drain() degrades the appearance of multi-purpose interfaces. I haven't thought of a middle ground between what you propose and the all-out modifications described above.

What do you think?

@@ -464,6 +473,7 @@ CURLMcode CurlImpl::log_on_error(CURLMcode result) {
void CurlImpl::run() {
int num_messages_remaining;
CURLMsg *message;
const int max_wait_milliseconds = 1000;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

post and libcurl both will awaken the thread if necessary. What purpose does having this timeout less than the request timeout serve?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's from another experimentation. Back when my hypothesis was drain() has been called but the thread is waiting for an event from curl or that previous 10s timeout. In that scenario, drain() would be useless. Is it happening ? I don't know 🤷‍♂️ One thing I am sure of is it is not here on purpose, so good and catch and:

  • revert

@dmehala
Copy link
Collaborator Author

dmehala commented Oct 20, 2023

Custom code using DatadogAgent
They would be instantiating class Tracer, which implicitly instantiates class DatadogAgent, which implicitly creates a class Curl

The consensus is that library is only used by us for webserver integrations. I do not know if we want to be more than that, but if we do, it will be useful to let users of the library make use of their own transport library. That's a discussion for another time.

Your recommandation is going deeper and makes everything more explicit. I do like it even if I am still not convinced the removal of a deadline in drain() degrades the appearance of multi-purpose interfaces.

All in all, my only suggestion is to increase the shutdown timeout. It must always be bigger than the request timeout to actually give a chance for the request to be processed. Otherwise, in some cases (shutdown) it will makes no difference with our current solution. I may be wrong, what's your thought?

@dgoffredo
Copy link
Contributor

You're right that this code is used only for reverse proxy integrations. My question is: what usage do we want the class interfaces to allow?

If our only concern is nginx-datadog (Envoy does its own thing), then an unbounded drain with a hard-coded libcurl timeout is fine. It's what nginx wants, and it's one timeout instead of two.

But reading the code without knowledge of nginx, drain and post timeouts mean different things. The post timeout says "I consider it an error if the peer takes longer than this." The drain timeout says "I promise not to block shutdown for longer than this, regardless of how long peers might take." The reason I initially had a timeout in drain but not in post is that the thing I wish to guarantee is "I won't block shutdown for too long." I did not intend to guarantee that an HTTP request takes less than some time, because in a sense it can take as long as it wants. Realistically, as you point out, taking longer than the drain deadline means that we drop traces on shutdown, and there's also the fact that if requests regularly take longer than the flush interval then we will eventually run out of memory...

All in all, my only suggestion is to increase the shutdown timeout. It must always be bigger than the request timeout to actually give a chance for the request to be processed. Otherwise, in some cases (shutdown) it will makes no difference with our current solution.

The request timeout is not an upper bound on typical request latency, but a line that we draw in the sand to say "if it's taking this long, something is wrong and the application owner might want to know about it." When we set the drain timeout to be similar to or even less than the post timeout, all we're saying is that we are more interested in shutting down quickly than we are in flushing traces in the subset of cases where things are taking way too long.

I do like it even if I am still not convinced the removal of a deadline in drain() degrades the appearance of multi-purpose interfaces.

That fine, it's a vague claim on my part.

Here's a compromise between what I recommend and what you have proposed:

  • Keep drain without a timeout. The contract is now "wait until everything is done or times out."
  • Add the post deadline in the way I described. Now a caller must specify a timeout, and so the "until it times out" in drain is understood, assuming the caller of drain is the same as the caller of post, which is true and a reasonable assumption for the future.

src/datadog/curl.cpp Outdated Show resolved Hide resolved
@pr-commenter
Copy link

pr-commenter bot commented Nov 2, 2023

Benchmarks

Benchmark execution time: 2023-11-13 12:45:10

Comparing candidate commit 5fad53e in PR branch dmehala/fix-curl-issue with baseline commit f9e4217 in branch main.

Found 1 performance improvements and 0 performance regressions! Performance is the same for 0 metrics, 0 unstable metrics.

scenario:BM_TraceTinyCCSource

  • 🟩 execution_time [-3.794ms; -3.714ms] or [-4.615%; -4.517%]

src/datadog/http_client.h Outdated Show resolved Hide resolved
test/test_curl.cpp Outdated Show resolved Hide resolved
test/test_curl.cpp Outdated Show resolved Hide resolved
test/test_curl.cpp Outdated Show resolved Hide resolved
Copy link
Contributor

@dgoffredo dgoffredo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since my suggestions were merged, I approve of these changes.

Hopefully the merge conflicts aren't too much trouble to sort out.

dmehala and others added 6 commits November 13, 2023 11:53
- curl default implementation set a 2s default timeout
- drain waits indefinitely until the predicate is valid
- Need to talk about `clock`
- `easy_setopt_timeout`  ->  `easy_setopt_timeout_ms`
- remove `HTTPClient::Deadline` alias
- pass `Clock` into `CurlImpl`
    - `finalize_config(const TracerConfig&, const Clock&)`
    - `FinalizedTracerConfig::clock`
    - `finalize_config(const DatadogAgentConfig&, const Clock&)`
    - `default_http_client(const shared_ptr<Logger>&, const Clock&)`
    - `FinalizedDatadogAgentConfig::clock`
    - `DatadogAgent::DatadogAgent`
- more specific `Error::CURL_DEADLINE_EXCEEDED_BEFORE_REQUEST_START`,
  and a little context in the error message
- `dummy_timeout`  ->  `dummy_deadline`
- `TEST_CASE` for `Error::CURL_DEADLINE_EXCEEDED_BEFORE_REQUEST_START`
@dmehala dmehala force-pushed the dmehala/fix-curl-issue branch from f52ddad to abed062 Compare November 13, 2023 12:10
Copy link
Contributor

@dgoffredo dgoffredo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dmehala dmehala merged commit 7038b02 into main Nov 15, 2023
2 checks passed
@dmehala dmehala deleted the dmehala/fix-curl-issue branch November 15, 2023 13:05
cataphract pushed a commit to cataphract/dd-trace-cpp that referenced this pull request Mar 28, 2024
- Remove dependency on `nginx-version-info` to build the module locally. Only `build-in-docker` need it now.
- Rework CMake target.
- Update README.
- Remove support for old NGINX versions.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants