Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

heterogeneous trace context extraction #72

Merged
merged 12 commits into from
Nov 17, 2023

Conversation

dgoffredo
Copy link
Contributor

@dgoffredo dgoffredo commented Nov 16, 2023

These changes remedy the following scenario:

A tracer is configured to inject in the W3C (tracecontext, i.e. OpenTelemetry) style, perhaps among other styles. The same tracer is configured to extract first in one or more non-W3C styles (DATADOG and/or B3), and then in the W3C style.

When an incoming request contains the desired non-W3C headers, such as X-Datadog-Trace-Id, then trace context is extracted from them, and any W3C headers in the request are ignored.

This is problematic when the tracestate W3C header contains non-Datadog information. We have ways of putting Datadog information into tracestate, but there is not a way to put arbitrary tracestate into Datadog headers.

So, this pull request adds the following logic: If we extract due to non-W3C headers, but we are configured also to extract W3C headers, then look for the tracestate header and keep some of its data if the trace ID from the corresponding traceparent header matches the trace ID extracted from the non-W3C header.

See the included unit test for clarification.

Note to the reviewer: There are some unnecessary changes included in this pull request, per discussions we've had previously about where the extraction code should live. I didn't go all-in on the reorganization, but extraction_util is a start. See the individual commits if you want to subtract that noise.

@pr-commenter
Copy link

pr-commenter bot commented Nov 16, 2023

Benchmarks

Benchmark execution time: 2023-11-17 20:49:17

Comparing candidate commit e69b0e8 in PR branch david.goffredo/heterogeneous-trace-context-extraction with baseline commit 9781acf in branch main.

Found 0 performance improvements and 1 performance regressions! Performance is the same for 0 metrics, 0 unstable metrics.

scenario:BM_TraceTinyCCSource

  • 🟥 execution_time [+2.593ms; +2.668ms] or [+3.310%; +3.406%]

@dgoffredo dgoffredo force-pushed the david.goffredo/heterogeneous-trace-context-extraction branch from 33c2926 to 89a565c Compare November 17, 2023 01:23
@dgoffredo dgoffredo marked this pull request as ready for review November 17, 2023 01:32
@dgoffredo dgoffredo requested a review from dmehala November 17, 2023 01:32
@dgoffredo
Copy link
Contributor Author

I'll run these against Zach's parametric system tests tomorrow.

@dgoffredo dgoffredo force-pushed the david.goffredo/heterogeneous-trace-context-extraction branch from 89a565c to fa83395 Compare November 17, 2023 01:38
Copy link
Collaborator

@dmehala dmehala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There might be an issue with the merge algorithm. Otherwise, LGTM :shipit: . Thank you for extraction_util:

NOTES

The structure of the code managing extraction according to propagation style reflects too much the transition we have internally. Let me explain: I haven't read the RFCs, but it seems that we wanted to process one type of propagation at a time, then now we decide to process several at the same time to the point they can be interchangeable. IMO it would be easier and more efficient to process and validate all propagation style in one pass.

Another thing, I noticed that the code to extract traceparent uses std::regex. Although very useful, this library is notoriously known to be very slow. Since traceparent format is easy, we can achieve the same result without it. Here is a quick benchmark.

}
extracted_contexts.push_back(std::move(*data));
extracted_contexts.back().headers_examined = audited_reader.entries_found;
Copy link
Collaborator

@dmehala dmehala Nov 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible extract to add those examined headers instead? It's a bit weird to have an extract function and the caller need to enrich the result.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be possible, yes. That was my initial design.

I talked myself into the current design because:

  • I didn't want to add code to the extractors saying "I looked at this header" when the DictReader could just do that for them.
  • Tracking the headers examined is for error messages. The extractors are responsible for creating their own error messages, but I did not want them to contribute information to the error messages of their caller.

As I read this, it doesn't sound very convincing. Suffice it to say, I looked into my c++ crystal ball and didn't like the code that notes the headers within the extractors, and so did this two-step instead.

Maybe adding style and headers_examined as members to ExtractedData was a mistake. Unnecessary, for sure. It doesn't bother me yet, though.

Maybe I just wanted an excuse to make further use of the DictReader interface... as it stands, the implementation of visit is yet more dead code.

You are free to change my mind about this. But again, I'll leave this as is for this PR.

@@ -33,6 +34,13 @@ struct ExtractedData {
// `additional_datadog_w3c_tracestate` is null.
// `additional_datadog_w3c_tracestate` is used for the `W3C` injection style.
Optional<std::string> additional_datadog_w3c_tracestate;
// `style` is the extraction style used to obtain this `ExtractedData`. It's
// for diagnostics.
Optional<PropagationStyle> style;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Optional<PropagationStyle> style;
PropagationStyle style;

The propagation style is not inferred from the input, it can't be missing. That's why I do think it semantically makes sense to use PropagationStyle::NONE instead. Plus, it's easier to manipulate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At first I had this non-Optional. Then I noticed the case where all of the ExtractedData had neither trace ID nor parent ID. What then is the style? You suggest NONE, but that denotes an explicit style that can be configured; it is not a placeholder for the absence of a style.

NONE's existence is questionable, but it is specified both in the current RFC that I am following, as well as in its earlier incarnation that I implemented previously (1, 2).

Maybe what you're noticing is the awkwardness of ExtractedData::style and ExtractedData::headers_examined. After all, they are not the extracted data.

Or maybe what you're noticing is the now dual purpose of ExtractedData: It used to be the result of examining headers for one propagation style. Now it is either that, or the result of mergeing zero or more styles. Perhaps ExtractedData is the wrong type for this latter notion.

This also speaks to your summarizing comment that this hybrid "one at a time, but not really" design is questionable. I see your point there, but I think that the "one at a time, with special merging logic" is precisely what is needed to satisfy the RFC. For what it's worth, it's also what Ruby does. As we continue to discover heterogeneous propagation scenarios "in the wild," we will likely modify this design again.

I'm going to leave this as is for this PR, but you make a good point. Let's improve this in a future PR.

Comment on lines +1216 to +1218
{{"traceparent", "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"},
{"tracestate", "dd=foo:bar,lol=wut"}},
{{"traceparent", "00-4bf92f3577b34da6a3ce929d0e0e4736-000000000000002a-01"},
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity, why the parent-id changed from 00f067aa0ba902b7 to 000000000000002a?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parent ID that was extracted is the ID of some remote span (in the calling service). Then we create a new span, with its own ID, and inject its information. So, the injected span ID is the ID that we generated for our one local span.

It's 0x2a because of MockIDGenerator.

Tracer tracer{*finalized_config, std::make_shared<MockIDGenerator>()};

MockDictReader reader{test_case.extracted_headers};
auto span = tracer.extract_span(reader);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do know how Catch2 will handle the exception thrown by MockIDGenerator (a more suggestive name could beForbiddenIDGenerator?) if any, I guess it will be better to handle it with REQUIRE_NOTHROW.

Expected<Span> span;
REQUIRE_NOTHROW(span = tracer.extract_span(reader))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The assumption "this code won't generate a trace ID" is not what is under test here.

Yours is a good idea, but that code won't compile due to the definitions of Span and Expected.

Here's the best that I could come up with:

  Tracer tracer{*finalized_config, std::make_shared<MockIDGenerator>()};

  MockDictReader reader{test_case.extracted_headers};
  MockDictWriter writer;
  REQUIRE_NOTHROW([&]() {
    auto span = tracer.extract_span(reader);
    REQUIRE(span);
    span->inject(writer);
  }());

  REQUIRE(writer.items == test_case.expected_injected_headers);

It bothers me that now we're asserting the non-throwness of three statements instead of the one expression tracer.extract_span(reader).

Another way around this is to use a level of indirection:

Optional<Expected<Span>> maybe_span;
REQUIRE_NOTHROW(maybe_span = tracer.extract_span(reader));
REQUIRE(maybe_span.has_value());
REQUIRE(*maybe_span);
Span& span = **maybe_span;

There are probably other ways to put a REQUIRE_NOTHROW in there, but since that behavior is not what is under test, I'll leave the code as is.

@@ -2,11 +2,13 @@

#include <algorithm>
#include <cassert>
#include <sstream>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch.

Comment on lines +260 to +262
if (result.style == PropagationStyle::W3C) {
return result;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems there is an implicit assumption on contexts order in the vector. From my understanding with std::vector<ExtractedData> context{w3c_extracted, datadog_extracted} with w3c_extracted and datatdog_extracted respectively being ExtractedData from extract_w3c() and extract_datadog() then datadog headers well not be merged. Here is a failing test case:

    {__LINE__, "tracestate from subsequent style",
     {PropagationStyle::W3C, PropagationStyle::DATADOG}, ///< I switched W3C & DATADOG order
     {PropagationStyle::W3C},
     {{"x-datadog-trace-id", "48"}, {"x-datadog-parent-id", "64"},
      {"x-datadog-origin", "Kansas"}, {"x-datadog-sampling-priority", "2"},
      {"traceparent", "00-00000000000000000000000000000030-0000000000000040-01"},
      {"tracestate", "competitor=stuff,dd=o:Nebraska;s:1;ah:choo"}}, // origin is different
     {{"traceparent", "00-00000000000000000000000000000030-000000000000002a-01"},
      {"tracestate", "dd=s:2;o:Kansas;ah:choo,competitor=stuff"}}},

I does work most of the time because extract_datadog is called before extract_w3c. Is it the intended behavior am I missing something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case that you illustrate, the traceparent and tracestate headers will be accepted by the W3C style, which is listed first, and then all x-datadog-* headers will be ignored. This is indeed the intended behavior. The merge operation is not "symmetric" with respect to the relative ordering of W3C and non-W3C styles.

The sole initial goal of these changes, as I understand them, is to make sure that information from the tracestate header is always propagated when it is known to be related to whichever trace we decide to be part of.

In your example, the actual and expected injected headers would be traceparent unchanged, and then tracestate mostly unchanged (perhaps the dd items could be reordered and others could appear, and the dd= would be first).

Copy link
Contributor Author

@dgoffredo dgoffredo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the thoughtful review, Damien.


I haven't read the RFCs, but it seems that we wanted to process one type of propagation at a time, then now we decide to process several at the same time to the point they can be interchangeable. IMO it would be easier and more efficient to process and validate all propagation style in one pass.

The logic specified in the RFC, as I understand it, is that still we are trying styles in order, but there's a new caveat with regard to the tracestate header, under certain conditions.

DictReader has both lookup and visit because I anticipated a "look at all of the headers in one pass as they're parsed" use case, which is how nginx does things under the hood. However, Envoy exposes a map-like interface. The resulting compromise is that dd-trace-cpp uses a map-like interface, and the map is constructed upfront if necessary (e.g. in nginx).

So, there is little benefit to processing all styles in one pass, and doing so would obscure the fact that the relative priority of configured styles is still important.


Another thing, I noticed that the code to extract traceparent uses std::regex. Although very useful, this library is notoriously known to be very slow. Since traceparent format is easy, we can achieve the same result without it. Here is a quick benchmark.

On the contrary -- std::regex is not slow, it's fast. So that solves that problem!

I considered writing the traceparent parser without regex. tracestate, x-datadog-tags, and most other parsing done by the library is by-hand, without regex. I chose to use regex for traceparent for the following reasons:

  • The nine lines of commented regex pattern at the top of extract_traceparent describe the grammar of what is being parsed, and the parser has no choice but to comply with the grammar.
  • The numeric parsing done after the regex match does not need to be checked for errors, because the correctness of the hex encoding is enforced by the regular expression.
  • Every time I see a hand-written parser, I curse under my breath and reverse engineer the regular expression that it is implementing. There are usually bugs.

Most other parsing done by the library is of the kind "X-separated list of zero or more Y." This does not lend itself particularly well to regex, and is straightforward enough to parse with some careful looping. traceparent, on the other hand, has a known length prefix with a known structure, and so lends itself well to parsing via regex.

I appreciate your working benchmark and alternative non-regex implementation. The alternative implementation (and the adapted regex-based one) does not parse the hex integers, and so the notational burden of that error handling is omitted from your example. Also, your code (which is impressive) requires that I learn the simple machine model of the parser, explain to myself the numeric offsets, and simulate the parse for a variety of examples until I am confident that it does what I like.

Compare this to the regex version, which is pretty much just a single regex pattern.

If you can prove that Envoy or nginx are burdened by this use of std::regex, then I will happily accept your alternative parser. For now, though, let's use std::regex.


I have some immediate changes based on your review that I will push shortly.

Then I'll mull over your review a little longer to see what else I might change.

Then I'll check these changes against Datadog's internal system tests.

Then I'll merge.

@@ -33,6 +34,13 @@ struct ExtractedData {
// `additional_datadog_w3c_tracestate` is null.
// `additional_datadog_w3c_tracestate` is used for the `W3C` injection style.
Optional<std::string> additional_datadog_w3c_tracestate;
// `style` is the extraction style used to obtain this `ExtractedData`. It's
// for diagnostics.
Optional<PropagationStyle> style;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At first I had this non-Optional. Then I noticed the case where all of the ExtractedData had neither trace ID nor parent ID. What then is the style? You suggest NONE, but that denotes an explicit style that can be configured; it is not a placeholder for the absence of a style.

NONE's existence is questionable, but it is specified both in the current RFC that I am following, as well as in its earlier incarnation that I implemented previously (1, 2).

Maybe what you're noticing is the awkwardness of ExtractedData::style and ExtractedData::headers_examined. After all, they are not the extracted data.

Or maybe what you're noticing is the now dual purpose of ExtractedData: It used to be the result of examining headers for one propagation style. Now it is either that, or the result of mergeing zero or more styles. Perhaps ExtractedData is the wrong type for this latter notion.

This also speaks to your summarizing comment that this hybrid "one at a time, but not really" design is questionable. I see your point there, but I think that the "one at a time, with special merging logic" is precisely what is needed to satisfy the RFC. For what it's worth, it's also what Ruby does. As we continue to discover heterogeneous propagation scenarios "in the wild," we will likely modify this design again.

I'm going to leave this as is for this PR, but you make a good point. Let's improve this in a future PR.

Comment on lines +260 to +262
if (result.style == PropagationStyle::W3C) {
return result;
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case that you illustrate, the traceparent and tracestate headers will be accepted by the W3C style, which is listed first, and then all x-datadog-* headers will be ignored. This is indeed the intended behavior. The merge operation is not "symmetric" with respect to the relative ordering of W3C and non-W3C styles.

The sole initial goal of these changes, as I understand them, is to make sure that information from the tracestate header is always propagated when it is known to be related to whichever trace we decide to be part of.

In your example, the actual and expected injected headers would be traceparent unchanged, and then tracestate mostly unchanged (perhaps the dd items could be reordered and others could appear, and the dd= would be first).

Comment on lines +92 to +93
// `ExtractedData`. The order of the elements of `contexts` must correspond to
// the order of the configured extraction propagation styles.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note this last sentence in the function contract.

@@ -2,11 +2,13 @@

#include <algorithm>
#include <cassert>
#include <sstream>
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch.

}
extracted_contexts.push_back(std::move(*data));
extracted_contexts.back().headers_examined = audited_reader.entries_found;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be possible, yes. That was my initial design.

I talked myself into the current design because:

  • I didn't want to add code to the extractors saying "I looked at this header" when the DictReader could just do that for them.
  • Tracking the headers examined is for error messages. The extractors are responsible for creating their own error messages, but I did not want them to contribute information to the error messages of their caller.

As I read this, it doesn't sound very convincing. Suffice it to say, I looked into my c++ crystal ball and didn't like the code that notes the headers within the extractors, and so did this two-step instead.

Maybe adding style and headers_examined as members to ExtractedData was a mistake. Unnecessary, for sure. It doesn't bother me yet, though.

Maybe I just wanted an excuse to make further use of the DictReader interface... as it stands, the implementation of visit is yet more dead code.

You are free to change my mind about this. But again, I'll leave this as is for this PR.

Comment on lines +1216 to +1218
{{"traceparent", "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"},
{"tracestate", "dd=foo:bar,lol=wut"}},
{{"traceparent", "00-4bf92f3577b34da6a3ce929d0e0e4736-000000000000002a-01"},
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parent ID that was extracted is the ID of some remote span (in the calling service). Then we create a new span, with its own ID, and inject its information. So, the injected span ID is the ID that we generated for our one local span.

It's 0x2a because of MockIDGenerator.

Tracer tracer{*finalized_config, std::make_shared<MockIDGenerator>()};

MockDictReader reader{test_case.extracted_headers};
auto span = tracer.extract_span(reader);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The assumption "this code won't generate a trace ID" is not what is under test here.

Yours is a good idea, but that code won't compile due to the definitions of Span and Expected.

Here's the best that I could come up with:

  Tracer tracer{*finalized_config, std::make_shared<MockIDGenerator>()};

  MockDictReader reader{test_case.extracted_headers};
  MockDictWriter writer;
  REQUIRE_NOTHROW([&]() {
    auto span = tracer.extract_span(reader);
    REQUIRE(span);
    span->inject(writer);
  }());

  REQUIRE(writer.items == test_case.expected_injected_headers);

It bothers me that now we're asserting the non-throwness of three statements instead of the one expression tracer.extract_span(reader).

Another way around this is to use a level of indirection:

Optional<Expected<Span>> maybe_span;
REQUIRE_NOTHROW(maybe_span = tracer.extract_span(reader));
REQUIRE(maybe_span.has_value());
REQUIRE(*maybe_span);
Span& span = **maybe_span;

There are probably other ways to put a REQUIRE_NOTHROW in there, but since that behavior is not what is under test, I'll leave the code as is.

@dgoffredo
Copy link
Contributor Author

When I run the parametric system tests locally, I get 97 errors of the following type:

----------------------------------------------------------------------------------- Captured stdout setup ------------------------------------------------------------------------------------
[]
____________________________________________ ERROR at setup of Test_Span_Sampling.test_multi_rule_independent_rate_limiters_sss013[library_env0] _____________________________________________
[gw11] linux -- Python 3.10.12 /home/david/src/system-tests/venv/bin/python
Could not create docker network 'apm_shared_tests_network_6e867f', see the log file '/tmp/tmpzactx35z'

The referenced log file (/tmp/tmpzactx35z, in this case) never exists.

I'm going to proceed with a release of this code.

@dgoffredo dgoffredo merged commit 4080fa5 into main Nov 17, 2023
2 checks passed
@dgoffredo dgoffredo deleted the david.goffredo/heterogeneous-trace-context-extraction branch November 17, 2023 22:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants