heterogeneous trace context extraction #72

dgoffredo · 2023-11-16T19:40:45Z

These changes remedy the following scenario:

A tracer is configured to inject in the W3C (tracecontext, i.e. OpenTelemetry) style, perhaps among other styles. The same tracer is configured to extract first in one or more non-W3C styles (DATADOG and/or B3), and then in the W3C style.

When an incoming request contains the desired non-W3C headers, such as X-Datadog-Trace-Id, then trace context is extracted from them, and any W3C headers in the request are ignored.

This is problematic when the tracestate W3C header contains non-Datadog information. We have ways of putting Datadog information into tracestate, but there is not a way to put arbitrary tracestate into Datadog headers.

So, this pull request adds the following logic: If we extract due to non-W3C headers, but we are configured also to extract W3C headers, then look for the tracestate header and keep some of its data if the trace ID from the corresponding traceparent header matches the trace ID extracted from the non-W3C header.

See the included unit test for clarification.

Note to the reviewer: There are some unnecessary changes included in this pull request, per discussions we've had previously about where the extraction code should live. I didn't go all-in on the reorganization, but extraction_util is a start. See the individual commits if you want to subtract that noise.

pr-commenter · 2023-11-16T19:48:41Z

Benchmarks

Benchmark execution time: 2023-11-17 20:49:17

Comparing candidate commit e69b0e8 in PR branch david.goffredo/heterogeneous-trace-context-extraction with baseline commit 9781acf in branch main.

Found 0 performance improvements and 1 performance regressions! Performance is the same for 0 metrics, 0 unstable metrics.

scenario:BM_TraceTinyCCSource

🟥 execution_time [+2.593ms; +2.668ms] or [+3.310%; +3.406%]

This is used by bin/check to do a "dry run" format.

dgoffredo · 2023-11-17T01:33:06Z

I'll run these against Zach's parametric system tests tomorrow.

dmehala

There might be an issue with the merge algorithm. Otherwise, LGTM . Thank you for extraction_util:

NOTES

The structure of the code managing extraction according to propagation style reflects too much the transition we have internally. Let me explain: I haven't read the RFCs, but it seems that we wanted to process one type of propagation at a time, then now we decide to process several at the same time to the point they can be interchangeable. IMO it would be easier and more efficient to process and validate all propagation style in one pass.

Another thing, I noticed that the code to extract traceparent uses std::regex. Although very useful, this library is notoriously known to be very slow. Since traceparent format is easy, we can achieve the same result without it. Here is a quick benchmark.

dmehala · 2023-11-17T08:08:29Z

src/datadog/tracer.cpp

    }
+    extracted_contexts.push_back(std::move(*data));
+    extracted_contexts.back().headers_examined = audited_reader.entries_found;


Would it be possible extract to add those examined headers instead? It's a bit weird to have an extract function and the caller need to enrich the result.

That would be possible, yes. That was my initial design.

I talked myself into the current design because:

I didn't want to add code to the extractors saying "I looked at this header" when the DictReader could just do that for them.

Tracking the headers examined is for error messages. The extractors are responsible for creating their own error messages, but I did not want them to contribute information to the error messages of their caller.

As I read this, it doesn't sound very convincing. Suffice it to say, I looked into my c++ crystal ball and didn't like the code that notes the headers within the extractors, and so did this two-step instead.

Maybe adding style and headers_examined as members to ExtractedData was a mistake. Unnecessary, for sure. It doesn't bother me yet, though.

Maybe I just wanted an excuse to make further use of the DictReader interface... as it stands, the implementation of visit is yet more dead code.

You are free to change my mind about this. But again, I'll leave this as is for this PR.

dmehala · 2023-11-17T10:32:51Z

src/datadog/extracted_data.h

@@ -33,6 +34,13 @@ struct ExtractedData {
  // `additional_datadog_w3c_tracestate` is null.
  // `additional_datadog_w3c_tracestate` is used for the `W3C` injection style.
  Optional<std::string> additional_datadog_w3c_tracestate;
+  // `style` is the extraction style used to obtain this `ExtractedData`. It's
+  // for diagnostics.
+  Optional<PropagationStyle> style;


Suggested change

Optional<PropagationStyle> style;

PropagationStyle style;

The propagation style is not inferred from the input, it can't be missing. That's why I do think it semantically makes sense to use PropagationStyle::NONE instead. Plus, it's easier to manipulate.

At first I had this non-Optional. Then I noticed the case where all of the ExtractedData had neither trace ID nor parent ID. What then is the style? You suggest NONE, but that denotes an explicit style that can be configured; it is not a placeholder for the absence of a style.

NONE's existence is questionable, but it is specified both in the current RFC that I am following, as well as in its earlier incarnation that I implemented previously (1, 2).

Maybe what you're noticing is the awkwardness of ExtractedData::style and ExtractedData::headers_examined. After all, they are not the extracted data.

Or maybe what you're noticing is the now dual purpose of ExtractedData: It used to be the result of examining headers for one propagation style. Now it is either that, or the result of mergeing zero or more styles. Perhaps ExtractedData is the wrong type for this latter notion.

This also speaks to your summarizing comment that this hybrid "one at a time, but not really" design is questionable. I see your point there, but I think that the "one at a time, with special merging logic" is precisely what is needed to satisfy the RFC. For what it's worth, it's also what Ruby does. As we continue to discover heterogeneous propagation scenarios "in the wild," we will likely modify this design again.

I'm going to leave this as is for this PR, but you make a good point. Let's improve this in a future PR.

dmehala · 2023-11-17T08:26:49Z

test/test_tracer.cpp

+     {{"traceparent", "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"},
+      {"tracestate", "dd=foo:bar,lol=wut"}},
+     {{"traceparent", "00-4bf92f3577b34da6a3ce929d0e0e4736-000000000000002a-01"},


Out of curiosity, why the parent-id changed from 00f067aa0ba902b7 to 000000000000002a?

The parent ID that was extracted is the ID of some remote span (in the calling service). Then we create a new span, with its own ID, and inject its information. So, the injected span ID is the ID that we generated for our one local span.

It's 0x2a because of MockIDGenerator.

dmehala · 2023-11-17T08:35:03Z

test/test_tracer.cpp

+  Tracer tracer{*finalized_config, std::make_shared<MockIDGenerator>()};
+
+  MockDictReader reader{test_case.extracted_headers};
+  auto span = tracer.extract_span(reader);


I do know how Catch2 will handle the exception thrown by MockIDGenerator (a more suggestive name could beForbiddenIDGenerator?) if any, I guess it will be better to handle it with REQUIRE_NOTHROW.

Expected<Span> span; REQUIRE_NOTHROW(span = tracer.extract_span(reader))

The assumption "this code won't generate a trace ID" is not what is under test here.

Yours is a good idea, but that code won't compile due to the definitions of Span and Expected.

Here's the best that I could come up with:

Tracer tracer{*finalized_config, std::make_shared<MockIDGenerator>()}; MockDictReader reader{test_case.extracted_headers}; MockDictWriter writer; REQUIRE_NOTHROW([&]() { auto span = tracer.extract_span(reader); REQUIRE(span); span->inject(writer); }()); REQUIRE(writer.items == test_case.expected_injected_headers);

It bothers me that now we're asserting the non-throwness of three statements instead of the one expression tracer.extract_span(reader).

Another way around this is to use a level of indirection:

Optional<Expected<Span>> maybe_span; REQUIRE_NOTHROW(maybe_span = tracer.extract_span(reader)); REQUIRE(maybe_span.has_value()); REQUIRE(*maybe_span); Span& span = **maybe_span;

There are probably other ways to put a REQUIRE_NOTHROW in there, but since that behavior is not what is under test, I'll leave the code as is.

dmehala · 2023-11-17T08:48:59Z

src/datadog/tracer.cpp

@@ -2,11 +2,13 @@

 #include <algorithm>
 #include <cassert>
+#include <sstream>


Good catch.

dmehala · 2023-11-17T11:11:35Z

src/datadog/extraction_util.cpp

+  if (result.style == PropagationStyle::W3C) {
+    return result;
+  }


It seems there is an implicit assumption on contexts order in the vector. From my understanding with std::vector<ExtractedData> context{w3c_extracted, datadog_extracted} with w3c_extracted and datatdog_extracted respectively being ExtractedData from extract_w3c() and extract_datadog() then datadog headers well not be merged. Here is a failing test case:

{__LINE__, "tracestate from subsequent style", {PropagationStyle::W3C, PropagationStyle::DATADOG}, ///< I switched W3C & DATADOG order {PropagationStyle::W3C}, {{"x-datadog-trace-id", "48"}, {"x-datadog-parent-id", "64"}, {"x-datadog-origin", "Kansas"}, {"x-datadog-sampling-priority", "2"}, {"traceparent", "00-00000000000000000000000000000030-0000000000000040-01"}, {"tracestate", "competitor=stuff,dd=o:Nebraska;s:1;ah:choo"}}, // origin is different {{"traceparent", "00-00000000000000000000000000000030-000000000000002a-01"}, {"tracestate", "dd=s:2;o:Kansas;ah:choo,competitor=stuff"}}},

I does work most of the time because extract_datadog is called before extract_w3c. Is it the intended behavior am I missing something?

In the case that you illustrate, the traceparent and tracestate headers will be accepted by the W3C style, which is listed first, and then all x-datadog-* headers will be ignored. This is indeed the intended behavior. The merge operation is not "symmetric" with respect to the relative ordering of W3C and non-W3C styles.

The sole initial goal of these changes, as I understand them, is to make sure that information from the tracestate header is always propagated when it is known to be related to whichever trace we decide to be part of.

In your example, the actual and expected injected headers would be traceparent unchanged, and then tracestate mostly unchanged (perhaps the dd items could be reordered and others could appear, and the dd= would be first).

dgoffredo

Thanks for the thoughtful review, Damien.

I haven't read the RFCs, but it seems that we wanted to process one type of propagation at a time, then now we decide to process several at the same time to the point they can be interchangeable. IMO it would be easier and more efficient to process and validate all propagation style in one pass.

The logic specified in the RFC, as I understand it, is that still we are trying styles in order, but there's a new caveat with regard to the tracestate header, under certain conditions.

DictReader has both lookup and visit because I anticipated a "look at all of the headers in one pass as they're parsed" use case, which is how nginx does things under the hood. However, Envoy exposes a map-like interface. The resulting compromise is that dd-trace-cpp uses a map-like interface, and the map is constructed upfront if necessary (e.g. in nginx).

So, there is little benefit to processing all styles in one pass, and doing so would obscure the fact that the relative priority of configured styles is still important.

Another thing, I noticed that the code to extract traceparent uses std::regex. Although very useful, this library is notoriously known to be very slow. Since traceparent format is easy, we can achieve the same result without it. Here is a quick benchmark.

On the contrary -- std::regex is not slow, it's fast. So that solves that problem!

I considered writing the traceparent parser without regex. tracestate, x-datadog-tags, and most other parsing done by the library is by-hand, without regex. I chose to use regex for traceparent for the following reasons:

The nine lines of commented regex pattern at the top of extract_traceparent describe the grammar of what is being parsed, and the parser has no choice but to comply with the grammar.
The numeric parsing done after the regex match does not need to be checked for errors, because the correctness of the hex encoding is enforced by the regular expression.
Every time I see a hand-written parser, I curse under my breath and reverse engineer the regular expression that it is implementing. There are usually bugs.

Most other parsing done by the library is of the kind "X-separated list of zero or more Y." This does not lend itself particularly well to regex, and is straightforward enough to parse with some careful looping. traceparent, on the other hand, has a known length prefix with a known structure, and so lends itself well to parsing via regex.

I appreciate your working benchmark and alternative non-regex implementation. The alternative implementation (and the adapted regex-based one) does not parse the hex integers, and so the notational burden of that error handling is omitted from your example. Also, your code (which is impressive) requires that I learn the simple machine model of the parser, explain to myself the numeric offsets, and simulate the parse for a variety of examples until I am confident that it does what I like.

Compare this to the regex version, which is pretty much just a single regex pattern.

If you can prove that Envoy or nginx are burdened by this use of std::regex, then I will happily accept your alternative parser. For now, though, let's use std::regex.

I have some immediate changes based on your review that I will push shortly.

Then I'll mull over your review a little longer to see what else I might change.

Then I'll check these changes against Datadog's internal system tests.

Then I'll merge.

dgoffredo · 2023-11-17T19:23:53Z

src/datadog/extracted_data.h

@@ -33,6 +34,13 @@ struct ExtractedData {
  // `additional_datadog_w3c_tracestate` is null.
  // `additional_datadog_w3c_tracestate` is used for the `W3C` injection style.
  Optional<std::string> additional_datadog_w3c_tracestate;
+  // `style` is the extraction style used to obtain this `ExtractedData`. It's
+  // for diagnostics.
+  Optional<PropagationStyle> style;


At first I had this non-Optional. Then I noticed the case where all of the ExtractedData had neither trace ID nor parent ID. What then is the style? You suggest NONE, but that denotes an explicit style that can be configured; it is not a placeholder for the absence of a style.

NONE's existence is questionable, but it is specified both in the current RFC that I am following, as well as in its earlier incarnation that I implemented previously (1, 2).

Maybe what you're noticing is the awkwardness of ExtractedData::style and ExtractedData::headers_examined. After all, they are not the extracted data.

Or maybe what you're noticing is the now dual purpose of ExtractedData: It used to be the result of examining headers for one propagation style. Now it is either that, or the result of mergeing zero or more styles. Perhaps ExtractedData is the wrong type for this latter notion.

This also speaks to your summarizing comment that this hybrid "one at a time, but not really" design is questionable. I see your point there, but I think that the "one at a time, with special merging logic" is precisely what is needed to satisfy the RFC. For what it's worth, it's also what Ruby does. As we continue to discover heterogeneous propagation scenarios "in the wild," we will likely modify this design again.

I'm going to leave this as is for this PR, but you make a good point. Let's improve this in a future PR.

dgoffredo · 2023-11-17T19:34:17Z

src/datadog/extraction_util.cpp

+  if (result.style == PropagationStyle::W3C) {
+    return result;
+  }


In the case that you illustrate, the traceparent and tracestate headers will be accepted by the W3C style, which is listed first, and then all x-datadog-* headers will be ignored. This is indeed the intended behavior. The merge operation is not "symmetric" with respect to the relative ordering of W3C and non-W3C styles.

The sole initial goal of these changes, as I understand them, is to make sure that information from the tracestate header is always propagated when it is known to be related to whichever trace we decide to be part of.

In your example, the actual and expected injected headers would be traceparent unchanged, and then tracestate mostly unchanged (perhaps the dd items could be reordered and others could appear, and the dd= would be first).

dgoffredo · 2023-11-17T19:35:43Z

src/datadog/extraction_util.h

+// `ExtractedData`. The order of the elements of `contexts` must correspond to
+// the order of the configured extraction propagation styles.


Note this last sentence in the function contract.

dgoffredo · 2023-11-17T19:36:08Z

src/datadog/tracer.cpp

@@ -2,11 +2,13 @@

 #include <algorithm>
 #include <cassert>
+#include <sstream>


Good catch.

dgoffredo · 2023-11-17T19:42:11Z

src/datadog/tracer.cpp

    }
+    extracted_contexts.push_back(std::move(*data));
+    extracted_contexts.back().headers_examined = audited_reader.entries_found;


That would be possible, yes. That was my initial design.

I talked myself into the current design because:

I didn't want to add code to the extractors saying "I looked at this header" when the DictReader could just do that for them.

Tracking the headers examined is for error messages. The extractors are responsible for creating their own error messages, but I did not want them to contribute information to the error messages of their caller.

As I read this, it doesn't sound very convincing. Suffice it to say, I looked into my c++ crystal ball and didn't like the code that notes the headers within the extractors, and so did this two-step instead.

Maybe adding style and headers_examined as members to ExtractedData was a mistake. Unnecessary, for sure. It doesn't bother me yet, though.

Maybe I just wanted an excuse to make further use of the DictReader interface... as it stands, the implementation of visit is yet more dead code.

You are free to change my mind about this. But again, I'll leave this as is for this PR.

dgoffredo · 2023-11-17T20:01:23Z

test/test_tracer.cpp

+     {{"traceparent", "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"},
+      {"tracestate", "dd=foo:bar,lol=wut"}},
+     {{"traceparent", "00-4bf92f3577b34da6a3ce929d0e0e4736-000000000000002a-01"},


The parent ID that was extracted is the ID of some remote span (in the calling service). Then we create a new span, with its own ID, and inject its information. So, the injected span ID is the ID that we generated for our one local span.

It's 0x2a because of MockIDGenerator.

dgoffredo · 2023-11-17T20:13:08Z

test/test_tracer.cpp

+  Tracer tracer{*finalized_config, std::make_shared<MockIDGenerator>()};
+
+  MockDictReader reader{test_case.extracted_headers};
+  auto span = tracer.extract_span(reader);


The assumption "this code won't generate a trace ID" is not what is under test here.

Yours is a good idea, but that code won't compile due to the definitions of Span and Expected.

Here's the best that I could come up with:

Tracer tracer{*finalized_config, std::make_shared<MockIDGenerator>()}; MockDictReader reader{test_case.extracted_headers}; MockDictWriter writer; REQUIRE_NOTHROW([&]() { auto span = tracer.extract_span(reader); REQUIRE(span); span->inject(writer); }()); REQUIRE(writer.items == test_case.expected_injected_headers);

It bothers me that now we're asserting the non-throwness of three statements instead of the one expression tracer.extract_span(reader).

Another way around this is to use a level of indirection:

Optional<Expected<Span>> maybe_span; REQUIRE_NOTHROW(maybe_span = tracer.extract_span(reader)); REQUIRE(maybe_span.has_value()); REQUIRE(*maybe_span); Span& span = **maybe_span;

There are probably other ways to put a REQUIRE_NOTHROW in there, but since that behavior is not what is under test, I'll leave the code as is.

dgoffredo · 2023-11-17T22:45:26Z

When I run the parametric system tests locally, I get 97 errors of the following type:

----------------------------------------------------------------------------------- Captured stdout setup ------------------------------------------------------------------------------------
[]
____________________________________________ ERROR at setup of Test_Span_Sampling.test_multi_rule_independent_rate_limiters_sss013[library_env0] _____________________________________________
[gw11] linux -- Python 3.10.12 /home/david/src/system-tests/venv/bin/python
Could not create docker network 'apm_shared_tests_network_6e867f', see the log file '/tmp/tmpzactx35z'

The referenced log file (/tmp/tmpzactx35z, in this case) never exists.

I'm going to proceed with a release of this code.

dgoffredo added 7 commits November 16, 2023 20:22

more context in extraction error messages

94b1a08

(unrelated) update library version in example

351f315

move extraction code into its own component

2248327

remove copy/pasted comments

9251602

eats, shoots, and leaves

7a625cc

bin/format should forward arguments to clang-format

1a1bd9d

This is used by bin/check to do a "dry run" format.

default propagation style is now [Datadog, W3C]

a0b50db

dgoffredo force-pushed the david.goffredo/heterogeneous-trace-context-extraction branch from 33c2926 to 89a565c Compare November 17, 2023 01:23

dgoffredo marked this pull request as ready for review November 17, 2023 01:32

dgoffredo requested a review from dmehala November 17, 2023 01:32

dgoffredo added 2 commits November 16, 2023 20:37

heterogeneous extraction, untested

ec3f06e

unit test for heterogeneous extraction

fa83395

dgoffredo force-pushed the david.goffredo/heterogeneous-trace-context-extraction branch from 89a565c to fa83395 Compare November 17, 2023 01:38

dmehala approved these changes Nov 17, 2023

View reviewed changes

dgoffredo added 3 commits November 17, 2023 14:52

missed a spot when moving code into extraction_util

95bb86d

don't lie to your teammates

d7ea3b6

trade a conditional jump for an addition

e69b0e8

dgoffredo commented Nov 17, 2023

View reviewed changes

dgoffredo merged commit 4080fa5 into main Nov 17, 2023
2 checks passed

dgoffredo deleted the david.goffredo/heterogeneous-trace-context-extraction branch November 17, 2023 22:45

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

heterogeneous trace context extraction #72

heterogeneous trace context extraction #72

dgoffredo commented Nov 16, 2023 •

edited

Loading

pr-commenter bot commented Nov 16, 2023 •

edited

Loading

dgoffredo commented Nov 17, 2023

dmehala left a comment •

edited

Loading

dmehala Nov 17, 2023 •

edited

Loading

dgoffredo Nov 17, 2023

dmehala Nov 17, 2023

dgoffredo Nov 17, 2023

dmehala Nov 17, 2023

dgoffredo Nov 17, 2023

dmehala Nov 17, 2023

dgoffredo Nov 17, 2023

dmehala Nov 17, 2023

dgoffredo Nov 17, 2023

dmehala Nov 17, 2023

dgoffredo Nov 17, 2023

dgoffredo left a comment

dgoffredo Nov 17, 2023

dgoffredo Nov 17, 2023

dgoffredo Nov 17, 2023

dgoffredo Nov 17, 2023

dgoffredo Nov 17, 2023

dgoffredo Nov 17, 2023

dgoffredo Nov 17, 2023

dgoffredo commented Nov 17, 2023

		// `ExtractedData`. The order of the elements of `contexts` must correspond to
		// the order of the configured extraction propagation styles.

heterogeneous trace context extraction #72

heterogeneous trace context extraction #72

Conversation

dgoffredo commented Nov 16, 2023 • edited Loading

pr-commenter bot commented Nov 16, 2023 • edited Loading

Benchmarks

scenario:BM_TraceTinyCCSource

dgoffredo commented Nov 17, 2023

dmehala left a comment • edited Loading

Choose a reason for hiding this comment

NOTES

dmehala Nov 17, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dgoffredo left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dgoffredo commented Nov 17, 2023

dgoffredo commented Nov 16, 2023 •

edited

Loading

pr-commenter bot commented Nov 16, 2023 •

edited

Loading

dmehala left a comment •

edited

Loading

dmehala Nov 17, 2023 •

edited

Loading