fix: remove custom tracing middleware #3723
Conversation
Force-pushed from 556bdd5 to b3b9c93
OK, we want to remove the broken tracing middleware. Can you clarify what we should replace it with? Can you explain how you intend to split your work and the PRs that will follow?
Thanks!
Yeah, I am trying to find a way to make this change that makes sense, but it's kind of a headache. The middleware we have right now interferes with other tracing. Would you prefer that I replace all of the tracing at once? I did a lot of testing and discovered that we can use the auto-instrumentation, but we need to apply it programmatically due to a known quirk of using OTel with uvicorn (a sketch of that pattern is below). This means telemetry would be installed and enabled by default, but it can be disabled with environment variables. I am beginning to stage those WIP changes here: #3733. How do we feel about this design pattern? I raised this in the community Discord as well; I am happy to link you.

My goal with this PR is to make the telemetry we have work well enough. Then we can migrate services to the new pattern one service at a time. Once that is done, we can deprecate the telemetry API. Once we merge this and I finish implementing what is in the next PR, I can file tickets upstream for each place we capture custom instrumentation and let you all help me with the migration. It's also an opportunity to scrutinize what we capture and make sure the custom info makes sense and isn't duplicated elsewhere.
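A minimal sketch of the programmatic auto-instrumentation pattern described above, assuming a FastAPI app served by uvicorn; the `setup_telemetry` helper and the opt-out check are illustrative, not the actual code staged in #3733:

```python
# Illustrative sketch only (not the code in #3733): configure the OTel SDK and
# instrument the FastAPI app programmatically, avoiding the known quirk with
# `opentelemetry-instrument` and uvicorn mentioned above.
import os

from fastapi import FastAPI
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor


def setup_telemetry(app: FastAPI) -> None:
    # Enabled by default; the standard OTEL_SDK_DISABLED variable opts out.
    if os.environ.get("OTEL_SDK_DISABLED", "").lower() == "true":
        return

    provider = TracerProvider()
    provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
    trace.set_tracer_provider(provider)

    # Instrument this app instance directly, regardless of how uvicorn is launched.
    FastAPIInstrumentor.instrument_app(app, tracer_provider=provider)
```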
# 2. If it has no parent (implicit root span from FastAPI instrumentation)
is_root_span = span.attributes.get(LOCAL_ROOT_SPAN_MARKER) or parent_span_id is None
root_span_id_value = span_id if is_root_span else None
@ehhuang take a look at this. I was able to get the integration test to work by doing this, but I am not 100% sure it's right. I'd appreciate it if you took a look and confirmed.
I don't either. Can we just kill this sqlite span processor altogether and add tests analogous to those in test_*_telemetry but against OTEL?
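For reference, a rough sketch of what such a test could look like, asserting against OTel's in-memory exporter instead of the sqlite span processor; the app, route, and attribute checked here are placeholders rather than the project's actual test fixtures:

```python
# A rough sketch (not the project's actual test suite): assert on spans captured
# by the OTel SDK's in-memory exporter instead of the sqlite span processor.
from fastapi import FastAPI
from fastapi.testclient import TestClient
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter


def test_http_request_emits_otel_spans():
    exporter = InMemorySpanExporter()
    provider = TracerProvider()
    provider.add_span_processor(SimpleSpanProcessor(exporter))

    app = FastAPI()

    @app.get("/health")
    def health():
        return {"status": "ok"}

    # Instrument this app against the test-local tracer provider.
    FastAPIInstrumentor.instrument_app(app, tracer_provider=provider)
    TestClient(app).get("/health")

    spans = exporter.get_finished_spans()
    # Auto-instrumentation should emit at least one server span whose attributes
    # follow the HTTP semantic conventions (e.g. http.route).
    assert any(span.attributes.get("http.route") == "/health" for span in spans)
```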
Force-pushed from 6d92d69 to f051458
this is looking good, but +1 on leaving out unrelated changes.
@iamemilio what's the status of this? Still in the pipeline? Can you rebase? Thanks!
I would like to get #3805 into shape first, so that we have a stable way to verify that we are still meeting an agreed-upon set of requirements for what telemetry data we collect and how it gets formatted. It should be ready to review!
Force-pushed from ab70029 to 58f587b
Force-pushed from 58f587b to 62651ce
This ensures that no matter how you deploy FastAPI, HTTP metrics get forwarded.
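A rough sketch of what that wiring could look like, assuming the programmatic setup discussed earlier; the exporter choice and configuration are assumptions, not this PR's exact code:

```python
# Illustrative sketch: register a MeterProvider so the HTTP metrics recorded by
# the FastAPI/ASGI instrumentation (e.g. request duration) are exported over
# OTLP no matter how uvicorn is launched.
from opentelemetry import metrics
from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader

reader = PeriodicExportingMetricReader(OTLPMetricExporter())
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))
```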
Would this change the existing trace logging at all?
I don't think it will change the logging, since that is set up using the OpenTelemetry console exporter, and this is just capturing more OpenTelemetry trace data. It may capture more or different data than before, though. That is harder to verify, since we don't really have detailed testing, but it does pass the basic test we set up earlier.
Ack. How attached are you to the old data collection style? The OTel library follows this naming convention: https://opentelemetry.io/docs/specs/semconv/http/http-spans/, and as a general principle OpenTelemetry tries to be minimal in what it captures to balance overhead against data capture, so it will not necessarily create a span for every function a trace passes through. How do you want to proceed here? What OpenTelemetry captures by default versus the old system seems like an apples-and-oranges comparison.
I think it's ok that the spans change; however, in this case it seems that the spans we explicitly added in our code are missing? Those spans have attributes that are useful; I think at the least we should have a similar level of information. BTW, I thought you added some tests that assert on span attributes like 'model'. Did they not run, or not cover this?
Yes, I am a little surprised none of this got flagged in the tests. I think it's best I take a step back here; there are pieces of the big picture I am missing, and a lot has changed since this was first proposed. I don't want to introduce issues to llama stack, and I was able to extract metrics in some of my tests, so I may want to reconsider my approach altogether before making more changes.







What does this PR do?
Removes the custom tracing middleware from llama stack core. This middleware duplicates what OpenTelemetry already does for FastAPI by default, but breaks tracing by mishandling W3C trace context headers (see the sketch below).
Feature: #3806
Depends On: #3723
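For context, a simplified illustration of the W3C trace context handling that OTel's propagator API already performs for each request; the `handle_request` wrapper and span name are illustrative only, not the removed middleware or the instrumentation's internals:

```python
# Illustrative only: OTel already extracts W3C `traceparent`/`tracestate`
# headers via the configured propagator and continues the caller's trace,
# which is what the removed middleware attempted to do by hand.
from opentelemetry import trace
from opentelemetry.propagate import extract

tracer = trace.get_tracer(__name__)


def handle_request(headers: dict[str, str]) -> None:
    # Pull the remote trace context out of the incoming HTTP headers.
    ctx = extract(headers)
    # Any span started under this context becomes a child of the caller's span.
    with tracer.start_as_current_span("http.server.request", context=ctx):
        ...  # request handling
```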
Test Plan
Tested standalone and with `opentelemetry-instrument`. This ensures that HTTP metrics always get captured when instrumentation is enabled.