chore: add Open Telemetry attributes to grpc spans #698
Conversation
crates/block-producer/src/server.rs (Outdated)

)
.on_failure(
    |_error: GrpcFailureClass, _latency: Duration, _span: &tracing::Span| todo!(),
);
@Mirko-von-Leipzig I agree that this is probably the correct abstraction to be using here (as you mentioned in this comment).
Any suggestions on how to test this locally? TY
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's definitely a bit awkward to test. There is the text exporter which at least lets you debug/inspect them manually.
If we do want to unit/integration test these things, we'll probably have to write our own exporter - e.g. something to aggregate the data which we can then assert against. I'm unsure why this isn't already available tbh.
What I've been doing in the meantime is using https://www.honeycomb.io/ with a free account to test things out. There's some setup info in the wip guide here.
I essentially start a local node, configure otel to use the honeycomb endpoint, and then generate traffic by running the miden-client integration test against my local node.
This might be a pita though - let me know if you run into issues, or if you can think of something smarter :D
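For reference, a minimal sketch of wiring up that stdout/text exporter locally (assuming the opentelemetry_stdout crate; builder names vary a bit between opentelemetry_sdk versions):

use opentelemetry::trace::{Tracer, TracerProvider as _};
use opentelemetry_sdk::trace::TracerProvider;

fn main() {
    // Print finished spans to stdout so they can be eyeballed locally.
    let exporter = opentelemetry_stdout::SpanExporter::default();
    let provider = TracerProvider::builder()
        .with_simple_exporter(exporter)
        .build();

    let tracer = provider.tracer("local-debug");
    tracer.in_span("example-span", |_cx| {
        // Do some work here; the span is exported when it ends.
    });
}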
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually here we go: https://docs.rs/opentelemetry_sdk/latest/opentelemetry_sdk/testing/trace/index.html
They do actually have test infra :)
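E.g. something like this, if I'm reading the docs right (InMemorySpanExporter sits behind the opentelemetry_sdk testing feature; names may shift between versions):

use opentelemetry::trace::{Tracer, TracerProvider as _};
use opentelemetry_sdk::testing::trace::InMemorySpanExporter;
use opentelemetry_sdk::trace::TracerProvider;

#[test]
fn spans_are_recorded() {
    let exporter = InMemorySpanExporter::default();
    // Simple (synchronous) export so finished spans are available immediately.
    let provider = TracerProvider::builder()
        .with_simple_exporter(exporter.clone())
        .build();

    provider.tracer("test").in_span("work", |_cx| {});

    let spans = exporter.get_finished_spans().unwrap();
    assert_eq!(spans.len(), 1);
    assert_eq!(spans[0].name, "work");
}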
Thanks, just getting back to this now. Will have a crack at integrating those trace test capabilities into the unit tests.
unit tests
Or integration tests, rather - which would involve allowing the node to be configured with different SpanExporters so that the test one can be used.
Yeah I imagine this may take some more consideration, given that the exporters are (always?) global.
Might be good to simply explore the options and then discuss what is actually testable without much fuss.
Integration tests are a sore point at the moment - and likely will get worse before getting better, e.g. this thread. It's not quite trivial to just spin up a node at the moment, but depending on what you discover here maybe that's something we should aim at.
@Mirko-von-Leipzig This is how I think the testing would have to work:
- Update a few functions to take in an impl SpanExporter (roughly sketched below).
- Update node and faucet main.rs to construct either a test SpanExporter or an actual one based on flags or toml.
- Implement tests that call the main functionality and assert on the SpanData coming out of the test SpanExporter. These would probably be the existing integration tests but I haven't had a look yet.
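Roughly what that first step could look like (setup_tracing is a made-up name, and the SpanExporter trait path differs between opentelemetry_sdk versions):

use opentelemetry_sdk::export::trace::SpanExporter;
use opentelemetry_sdk::trace::TracerProvider;

// The tracing setup becomes generic over the exporter, so tests can pass
// the in-memory one while production wires up the real one.
fn setup_tracing(exporter: impl SpanExporter + 'static) -> TracerProvider {
    TracerProvider::builder()
        .with_simple_exporter(exporter)
        .build()
}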
But I don't think it is worthwhile. I have seen Rust projects do similar things, where they run unit or integration tests and assert against log lines produced by the stack. I have generally avoided this kind of thing in the past, but I can imagine situations where it's worth the maintenance cost and brittleness (tests being coupled to log lines).
I think it would become practical to just test trace output changes on the dev deployment/cluster once the trace scaffolding/impl stabilizes for the node, rather than testing every change locally or in CI.
I have connected to honeycomb successfully and eye-balled the info! output coming from my latest changes, e.g.:

2025-02-21T06:22:59.610587Z INFO block-producer.rpc/SubmitProvenTransaction: miden_node_block_producer::server: crates/block-producer/src/server.rs:218: request: POST /block_producer.Api/SubmitProvenTransaction {"te": "trailers", "content-type": "application/grpc", "traceparent": "00-20616281d08ea7970f0701e71dd2ef80-453f6f8340af0947-01", "tracestate": "", "user-agent": "tonic/0.12.3"}
Still getting my head around exactly what we want to achieve and whether the approach I've added so far fulfils that. The example above only has grpc related headers so I think I'll need to swap let trace_layer = TraceLayer::new_for_grpc() to let trace_layer = TraceLayer::new_for_http().
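Either way, this is roughly how the layer slots into the tonic server builder (service and address names here are placeholders):

use tower_http::trace::TraceLayer;

let trace_layer = TraceLayer::new_for_http();

tonic::transport::Server::builder()
    .layer(trace_layer)
    .add_service(block_producer_service)
    .serve(addr)
    .await?;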
But I don't think it is worthwhile. I have seen Rust projects do similar things, where they run unit or integration tests and assert against log lines produced by the stack. I have generally avoided this kind of thing in the past, but I can imagine situations where it's worth the maintenance cost and brittleness (tests being coupled to log lines).

I agree; it isn't worth it, especially not at our current project maturity level. I think I was hoping we could do something like:
#[tokio::test]
async fn block_builder_trace() {
    let store = mock_store();
    let exporter = TestSpanExporter::new()...;
    exporter.register_for_this_test_only();
    BlockBuilder::build_block(...).await;
    assert_eq!(exporter.spans, expected_spans);
}
But I suspect it's not trivially possible. Though maybe there is a way to have the global exporter registered as TestSpanExporter and somehow associate the spans we get from this test only.
The example above only has grpc related headers so I think I'll need to swap
Yeah. I think the important things to get in are the server/client IP addresses which will (probably) only be available for http I imagine? Long term we'll also want to add interesting headers and/or CORS information maybe, but I'm unsure.
The http fn only differs from the grpc one w.r.t. the error type handled by the error callback:

/// Create a new [`TraceLayer`] using [`ServerErrorsAsFailures`] which supports classifying
/// regular HTTP responses based on the status code.
pub fn new_for_http() -> Self {
    Self {
        make_classifier: SharedClassifier::new(ServerErrorsAsFailures::default()),
        // vs: make_classifier: SharedClassifier::new(GrpcErrorsAsFailures::default()),
Think the best we can do is uri.host. We won't be able to see the source/client IP unless it's put into a header by a proxy or some other part of this stack.
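E.g. if a proxy does inject it, recovering it would look something like this, given the http::Request in the callbacks below (the header name is deployment-dependent; just a sketch):

// Take the first (client-most) entry of a comma-separated forwarded list.
let client_ip = request
    .headers()
    .get("x-forwarded-for")
    .and_then(|value| value.to_str().ok())
    .and_then(|value| value.split(',').next())
    .map(str::trim);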
I suppose there is still value in the tower_http::trace::TraceLayer because we can register callbacks here:
.on_request(|request: &http::Request<_>, _span: &tracing::Span| {
    tracing::info!(
        "request: {} {} {} {:?}",
        request.method(),
        request.uri().host().unwrap_or("NOHOST"),
        request.uri().path(),
        request.headers()
    );
})
.on_response(
    |response: &http::Response<_>, latency: Duration, _span: &tracing::Span| {
        tracing::info!("response: {} {:?}", response.status(), latency);
    },
)
.on_failure(|error: GrpcFailureClass, _latency: Duration, _span: &tracing::Span| {
    tracing::error!("error: {}", error);
});
Here is the diff in error type for the on_failure callback:
/// The failure class for [`ServerErrorsAsFailures`].
#[derive(Debug)]
pub enum ServerErrorsFailureClass {
    /// A response was classified as a failure with the corresponding status.
    StatusCode(StatusCode),
    /// A response was classified as an error with the corresponding error description.
    Error(String),
}
...
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub struct StatusCode(NonZeroU16);
...

vs

/// The failure class for [`GrpcErrorsAsFailures`].
#[derive(Debug)]
pub enum GrpcFailureClass {
    /// A gRPC response was classified as a failure with the corresponding status.
    Code(std::num::NonZeroI32),
    /// A gRPC response was classified as an error with the corresponding error description.
    Error(String),
}
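Concretely, with new_for_http the failure callback would match on ServerErrorsFailureClass instead, e.g. (a sketch mirroring the callback fragment above):

use tower_http::classify::ServerErrorsFailureClass;

// ...
.on_failure(
    |error: ServerErrorsFailureClass, _latency: Duration, _span: &tracing::Span| {
        match error {
            // Classified as a failure via the HTTP status code.
            ServerErrorsFailureClass::StatusCode(code) => {
                tracing::error!("failure status: {code}");
            }
            // Classified as an error with a description.
            ServerErrorsFailureClass::Error(msg) => {
                tracing::error!("failure: {msg}");
            }
        }
    },
);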
So might as well go with the http one I guess.
I'll look at the unit test approach you mentioned above and see if it's possible.
Force-pushed from ee29fb9 to a8897a1.
        .count(),
    1,
);
}
@Mirko-von-Leipzig I added this unit test to illustrate the otel trace test capability. It doesn't relate to the functionality added in this PR. This was just the easiest unit test I could find to impl. The block producer rpc stack doesn't have mocks etc atm.
LMK if you want to keep/rm this test or move it to another PR with other tests etc.
Let's leave it for a separate PR.
Is it possible to separate by test itself - or do we need to run these kinds of tests sequentially to ensure we get the spans recorded that we expect?
As in, running tests in parallel probably muddles the span exporter unless we somehow mark the spans with a test ID?
But this looks promising; we can figure out the specifics in the other PR/issue.
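One thing that might sidestep the global-exporter muddle, sketched here: give each test its own provider and exporter, which would only help code paths that take a tracer/provider explicitly rather than using the global one:

use opentelemetry::trace::TracerProvider as _;
use opentelemetry_sdk::{testing::trace::InMemorySpanExporter, trace::TracerProvider};

// Each test owns its exporter, so parallel tests cannot see each other's spans.
let exporter = InMemorySpanExporter::default();
let provider = TracerProvider::builder()
    .with_simple_exporter(exporter.clone())
    .build();
let tracer = provider.tracer("this-test-only");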
Force-pushed from 73fbb93 to 2b2bc7e.
crates/utils/src/tracing/span_ext.rs (Outdated)

    .extensions()
    .get::<tonic::transport::server::TcpConnectInfo>()
    .and_then(tonic::transport::server::TcpConnectInfo::remote_addr);
OpenTelemetrySpanExt::set_attribute(self, "remote_addr", remote_addr);
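For reference, a few other semantic-convention attribute names that could be attached the same way (names are from the OTel semconv spec; the values here are placeholders):

// Hypothetical values, set via the same OpenTelemetrySpanExt plumbing.
span.set_attribute("rpc.system", "grpc");
span.set_attribute("rpc.service", "block_producer.Api");
span.set_attribute("rpc.method", "SubmitProvenTransaction");
span.set_attribute("server.port", 50051_i64);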
LMK any others you would like to add
Force-pushed from a248a8d to 9f3ede7.
    .get::<tonic::transport::server::TcpConnectInfo>()
    .and_then(tonic::transport::server::TcpConnectInfo::remote_addr);
if let Some(addr) = remote_addr {
    span.set_attribute("client.address", addr.ip());
This is the one that the spec says may be the DNS name if no DNS lookup is required.
Client address - domain name if available without reverse DNS lookup; otherwise, IP address or Unix domain socket name.
I'm wondering if we shouldn't use the text_map_propagator to cheat the system a bit, and on the client side inject block-producer and rpc as an additional property.
We could then use that here, and otherwise default to the IP if it's missing.
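A rough sketch of that idea using a plain custom metadata key (x-miden-caller is made up for illustration; a real version might instead ride along with the text_map_propagator):

// Client side: each component stamps outgoing requests with its name.
let mut request = tonic::Request::new(payload);
request
    .metadata_mut()
    .insert("x-miden-caller", "block-producer".parse().unwrap());

The server would then prefer that value for client.address and fall back to the connection IP when the key is absent.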
Could also be done as a follow-up PR - I imagine you might be very much over this back-and-forth :D
Could also be done as a follow-up PR - I imagine you might be very much over this back-and-forth :D
Not at all - happy to iterate on this PR as much as required.
I'm wondering if we shouldn't use the text_map_propagator to cheat the system a bit, and on the client side inject block-producer and rpc as an additional property.
We could then use that here, and otherwise default to the IP if it's missing.
I'm a bit dubious as to the utility of that functionality but happy to add that to this PR if you think it is worthwhile.
The main reason I'm considering it is because our components can (and currently do) run on the same server instance (and same process). Which means that for the inter-component comms, the client and server IP addresses will just be the same across all three components.
So I think it would help having the "caller" identifiable at least. Though maybe instead we should be hosting different api endpoints for this internal communication in any case, e.g. store.block-producer/get_block_inputs and store.rpc/get_state, which would do the same job and provide some additional decoupling.
I'm happy to just punt this to a separate issue instead.
Ah ok yea that makes sense. Thanks for elaborating
This is looking really good imo.
Couple more questions but really just minor things at this stage.
I'm actually happy to merge this now, if you're happy?
All good from my end 👍
Relates to #681.
- tower::TraceLayer in the tonic server builder for block producer rpc and store rpc