chore: add Open Telemetry attributes to grpc spans #698
Conversation
crates/block-producer/src/server.rs (Outdated)

)
.on_failure(
    |_error: GrpcFailureClass, _latency: Duration, _span: &tracing::Span| todo!(),
);
@Mirko-von-Leipzig I agree that this is probably the correct abstraction to be using here (as you mentioned in this comment).
Any suggestions on how to test this locally? TY
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's definitely a bit awkward to test. There is the text exporter which at least lets you debug/inspect them manually.
If we do want to unit/integration test these things, we'll probably have to write our own exporter - e.g. something to aggregate the data which we can then assert against. I'm unsure why this isn't already available tbh.
What I've been doing in the meantime is using https://www.honeycomb.io/ with a free account to test things out. There's some setup info in the wip guide here.
I essentially start a local node, configure otel to use the honeycomb endpoint, and then generate traffic by running the miden-client integration test against my local node.
This might be a pita though - let me know if you run into issues, or if you can think of something smarter :D
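For reference, a minimal sketch of wiring up that stdout/text exporter locally (assuming the opentelemetry_stdout crate; builder names vary a bit between opentelemetry_sdk versions):

use opentelemetry::trace::{Tracer, TracerProvider as _};
use opentelemetry_sdk::trace::TracerProvider;

fn main() {
    // Print finished spans to stdout so they can be eyeballed locally.
    let exporter = opentelemetry_stdout::SpanExporter::default();
    let provider = TracerProvider::builder()
        .with_simple_exporter(exporter)
        .build();

    let tracer = provider.tracer("local-debug");
    tracer.in_span("example-span", |_cx| {
        // Do some work here; the span is exported when it ends.
    });
}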
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually here we go: https://docs.rs/opentelemetry_sdk/latest/opentelemetry_sdk/testing/trace/index.html
They do actually have test infra :)
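E.g. something like this, if I'm reading the docs right (InMemorySpanExporter sits behind the opentelemetry_sdk testing feature; names may shift between versions):

use opentelemetry::trace::{Tracer, TracerProvider as _};
use opentelemetry_sdk::testing::trace::InMemorySpanExporter;
use opentelemetry_sdk::trace::TracerProvider;

#[test]
fn spans_are_recorded() {
    let exporter = InMemorySpanExporter::default();
    // Simple (synchronous) export so finished spans are available immediately.
    let provider = TracerProvider::builder()
        .with_simple_exporter(exporter.clone())
        .build();

    provider.tracer("test").in_span("work", |_cx| {});

    let spans = exporter.get_finished_spans().unwrap();
    assert_eq!(spans.len(), 1);
    assert_eq!(spans[0].name, "work");
}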
Thanks, just getting back to this now. Will have a crack at integrating those trace test capabilities into the unit tests.
unit tests
Or integration tests, rather - which would involve allowing the node to be configured with different SpanExporters so that the test one can be used.
Yeah I imagine this may take some more consideration, given that the exporters are (always?) global.
Might be good to simply explore the options and then discuss what is actually testable without much fuss.
Integration tests are a sore point at the moment - and likely will get worse before getting better, e.g. this thread. It's not quite trivial to just spin up a node at the moment, but depending on what you discover here maybe that's something we should aim at.
@Mirko-von-Leipzig This is how I think the testing would have to work:
- Update a few functions to take in an impl SpanExporter (roughly sketched below).
- Update node and faucet main.rs to construct either a test SpanExporter or an actual one based on flags or toml.
- Implement tests that call the main functionality and assert on the SpanData coming out of the test SpanExporter. These would probably be the existing integration tests but I haven't had a look yet.
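Roughly what that first step could look like (setup_tracing is a made-up name, and the SpanExporter trait path differs between opentelemetry_sdk versions):

use opentelemetry_sdk::export::trace::SpanExporter;
use opentelemetry_sdk::trace::TracerProvider;

// The tracing setup becomes generic over the exporter, so tests can pass
// the in-memory one while production wires up the real one.
fn setup_tracing(exporter: impl SpanExporter + 'static) -> TracerProvider {
    TracerProvider::builder()
        .with_simple_exporter(exporter)
        .build()
}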
But I don't think it is worthwhile. I have seen Rust projects do similar things, where they run unit or integration tests and assert against log lines produced by the stack. I have generally avoided this kind of thing in the past, but I can imagine situations where it's worth the maintenance cost and brittleness (tests being coupled to log lines).
I think it would become practical to just test trace output changes on the dev deployment/cluster once the trace scaffolding/impl stabilizes for the node, rather than testing every change locally or in CI.
I have connected to honeycomb successfully and eye-balled the info! output coming from my latest changes, e.g.:

2025-02-21T06:22:59.610587Z INFO block-producer.rpc/SubmitProvenTransaction: miden_node_block_producer::server: crates/block-producer/src/server.rs:218: request: POST /block_producer.Api/SubmitProvenTransaction {"te": "trailers", "content-type": "application/grpc", "traceparent": "00-20616281d08ea7970f0701e71dd2ef80-453f6f8340af0947-01", "tracestate": "", "user-agent": "tonic/0.12.3"}
Still getting my head around exactly what we want to achieve and whether the approach I've added so far fulfils that. The example above only has grpc related headers so I think I'll need to swap let trace_layer = TraceLayer::new_for_grpc() to let trace_layer = TraceLayer::new_for_http().
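Either way, this is roughly how the layer slots into the tonic server builder (service and address names here are placeholders):

use tower_http::trace::TraceLayer;

let trace_layer = TraceLayer::new_for_http();

tonic::transport::Server::builder()
    .layer(trace_layer)
    .add_service(block_producer_service)
    .serve(addr)
    .await?;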
But I don't think it is worthwhile. I have seen Rust projects do similar things, where they run unit or integration tests and assert against log lines produced by the stack. I have generally avoided this kind of thing in the past, but I can imagine situations where it's worth the maintenance cost and brittleness (tests being coupled to log lines).

I agree; it isn't worth it, especially not at our current project maturity level. I think I was hoping we could do something like:
#[tokio::test]
async fn block_builder_trace() {
    let store = mock_store();
    let exporter = TestSpanExporter::new()...;
    exporter.register_for_this_test_only();
    BlockBuilder::build_block(...).await;
    assert_eq!(exporter.spans, expected_spans);
}
But I suspect it's not trivially possible. Though maybe there is a way to have the global exporter registered as TestSpanExporter and somehow associate the spans we get from this test only.
The example above only has grpc related headers so I think I'll need to swap
Yeah. I think the important things to get in are the server/client IP addresses which will (probably) only be available for http I imagine? Long term we'll also want to add interesting headers and/or CORS information maybe, but I'm unsure.
The http fn only differs from the grpc one w.r.t. the error type handled by the error callback:

/// Create a new [`TraceLayer`] using [`ServerErrorsAsFailures`] which supports classifying
/// regular HTTP responses based on the status code.
pub fn new_for_http() -> Self {
    Self {
        make_classifier: SharedClassifier::new(ServerErrorsAsFailures::default()),
        // vs: make_classifier: SharedClassifier::new(GrpcErrorsAsFailures::default()),
Think the best we can do is uri.host. We won't be able to see the source/client IP unless it's put into a header by a proxy or some other part of this stack.
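E.g. if a proxy does inject it, recovering it would look something like this, given the http::Request in the callbacks below (the header name is deployment-dependent; just a sketch):

// Take the first (client-most) entry of a comma-separated forwarded list.
let client_ip = request
    .headers()
    .get("x-forwarded-for")
    .and_then(|value| value.to_str().ok())
    .and_then(|value| value.split(',').next())
    .map(str::trim);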
I suppose there is still value in the tower_http::trace::TraceLayer because we can register callbacks here:
.on_request(|request: &http::Request<_>, _span: &tracing::Span| {
    tracing::info!(
        "request: {} {} {} {:?}",
        request.method(),
        request.uri().host().unwrap_or("NOHOST"),
        request.uri().path(),
        request.headers()
    );
})
.on_response(
    |response: &http::Response<_>, latency: Duration, _span: &tracing::Span| {
        tracing::info!("response: {} {:?}", response.status(), latency);
    },
)
.on_failure(|error: GrpcFailureClass, _latency: Duration, _span: &tracing::Span| {
    tracing::error!("error: {}", error);
});
Here is the diff in error type for the on_failure callback:
/// The failure class for [`ServerErrorsAsFailures`].
#[derive(Debug)]
pub enum ServerErrorsFailureClass {
    /// A response was classified as a failure with the corresponding status.
    StatusCode(StatusCode),
    /// A response was classified as an error with the corresponding error description.
    Error(String),
}
...
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub struct StatusCode(NonZeroU16);
...

vs

/// The failure class for [`GrpcErrorsAsFailures`].
#[derive(Debug)]
pub enum GrpcFailureClass {
    /// A gRPC response was classified as a failure with the corresponding status.
    Code(std::num::NonZeroI32),
    /// A gRPC response was classified as an error with the corresponding error description.
    Error(String),
}
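Concretely, with new_for_http the failure callback would match on ServerErrorsFailureClass instead, e.g. (a sketch mirroring the callback fragment above):

use tower_http::classify::ServerErrorsFailureClass;

// ...
.on_failure(
    |error: ServerErrorsFailureClass, _latency: Duration, _span: &tracing::Span| {
        match error {
            // Classified as a failure via the HTTP status code.
            ServerErrorsFailureClass::StatusCode(code) => {
                tracing::error!("failure status: {code}");
            }
            // Classified as an error with a description.
            ServerErrorsFailureClass::Error(msg) => {
                tracing::error!("failure: {msg}");
            }
        }
    },
);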
So might as well go with the http one I guess.
I'll look at the unit test approach you mentioned above and see if it's possible.
Force-pushed from ee29fb9 to a8897a1.
        .count(),
    1,
);
}
@Mirko-von-Leipzig I added this unit test to illustrate the otel trace test capability. It doesn't relate to the functionality added in this PR. This was just the easiest unit test I could find to impl. The block producer rpc stack doesn't have mocks etc atm.
LMK if you want to keep/rm this test or move it to another PR with other tests etc.
Let's leave it for a separate PR.
Is it possible to separate by test itself - or do we need to run these kinds of tests sequentially to ensure we get the spans recorded that we expect?
As in, running tests in parallel probably muddles the span exporter unless we somehow mark the spans with a test ID?
But this looks promising; we can figure out the specifics in the other PR/issue.
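One thing that might sidestep the global-exporter muddle, sketched here: give each test its own provider and exporter, which would only help code paths that take a tracer/provider explicitly rather than using the global one:

use opentelemetry::trace::TracerProvider as _;
use opentelemetry_sdk::{testing::trace::InMemorySpanExporter, trace::TracerProvider};

// Each test owns its exporter, so parallel tests cannot see each other's spans.
let exporter = InMemorySpanExporter::default();
let provider = TracerProvider::builder()
    .with_simple_exporter(exporter.clone())
    .build();
let tracer = provider.tracer("this-test-only");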
Force-pushed from 73fbb93 to 2b2bc7e.
crates/utils/src/tracing/span_ext.rs (Outdated)

    .extensions()
    .get::<tonic::transport::server::TcpConnectInfo>()
    .and_then(tonic::transport::server::TcpConnectInfo::remote_addr);
OpenTelemetrySpanExt::set_attribute(self, "remote_addr", remote_addr);
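For reference, a few other semantic-convention attribute names that could be attached the same way (names are from the OTel semconv spec; the values here are placeholders):

// Hypothetical values, set via the same OpenTelemetrySpanExt plumbing.
span.set_attribute("rpc.system", "grpc");
span.set_attribute("rpc.service", "block_producer.Api");
span.set_attribute("rpc.method", "SubmitProvenTransaction");
span.set_attribute("server.port", 50051_i64);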
LMK any others you would like to add
Force-pushed from a248a8d to 9f3ede7.
    .get::<tonic::transport::server::TcpConnectInfo>()
    .and_then(tonic::transport::server::TcpConnectInfo::remote_addr);
if let Some(addr) = remote_addr {
    span.set_attribute("client.address", addr.ip());
This is the one that the spec says may be the DNS name if no DNS lookup is required.
Client address - domain name if available without reverse DNS lookup; otherwise, IP address or Unix domain socket name.
I'm wondering if we shouldn't use the text_map_propagator to cheat the system a bit, and on the client side inject block-producer and rpc as an additional property.
We could then use that here, and otherwise default to the IP if it's missing.
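A rough sketch of that idea using a plain custom metadata key (x-miden-caller is made up for illustration; a real version might instead ride along with the text_map_propagator):

// Client side: each component stamps outgoing requests with its name.
let mut request = tonic::Request::new(payload);
request
    .metadata_mut()
    .insert("x-miden-caller", "block-producer".parse().unwrap());

The server would then prefer that value for client.address and fall back to the connection IP when the key is absent.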
Could also be done as a follow-up PR - I imagine you might be very much over this back-and-forth :D
Could also be done as a follow-up PR - I imagine you might be very much over this back-and-forth :D
Not at all - happy to iterate on this PR as much as required.
I'm wondering if we shouldn't use the text_map_propagator to cheat the system a bit, and on the client side inject block-producer and rpc as an additional property.
We could then use that here, and otherwise default to the IP if it's missing.
I'm a bit dubious as to the utility of that functionality but happy to add that to this PR if you think it is worthwhile.
The main reason I'm considering it is because our components can (and currently do) run on the same server instance (and same process). Which means that for the inter-component comms, the client and server IP addresses will just be the same across all three components.
So I think it would help having the "caller" identifiable at least. Though maybe instead we should be hosting different api endpoints for this internal communication in any case, e.g. store.block-producer/get_block_inputs and store.rpc/get_state, which would do the same job and provide some additional decoupling.
I'm happy to just punt this to a separate issue instead.
Ah ok yea that makes sense. Thanks for elaborating
This is looking really good imo.
Couple more questions but really just minor things at this stage.
I'm actually happy to merge this now, if you're happy?
All good from my end 👍
Relates to #681.
- tower::TraceLayer in the tonic server builder for block producer rpc and store rpc