
Conversation

@iamemilio
Contributor

@iamemilio iamemilio commented Oct 7, 2025

What does this PR do?

Removes the custom tracing middleware from llama stack core. This middleware duplicates what OTel already does for FastAPI by default, but breaks tracing by incorrectly handling W3C trace context headers.
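
For context, the FastAPI/ASGI instrumentation that OTel ships already extracts the incoming W3C traceparent header via the configured propagator; a minimal sketch of that mechanism (the header value below is the W3C spec's example, not taken from this PR):

from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator

# The incoming traceparent header is extracted into a Context, so the server
# span joins the caller's trace instead of starting a new one.
carrier = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
ctx = TraceContextTextMapPropagator().extract(carrier=carrier)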

Feature: #3806

Depends On: #3723

Test Plan

Tested standalone and with opentelemetry-instrument. This ensures that HTTP metrics always get captured when instrumentation is enabled.

@meta-cla meta-cla bot added the CLA Signed label Oct 7, 2025
@iamemilio iamemilio force-pushed the remove_broken_middleware branch 4 times, most recently from 556bdd5 to b3b9c93 Compare October 8, 2025 13:51
Collaborator

@leseb leseb left a comment

OK, we want to remove the broken tracing middleware. Can you clarify what we should replace it with? Can you explain how you intend to split your work and the PRs that will follow?

Thanks!

@iamemilio
Contributor Author

iamemilio commented Oct 8, 2025

OK, we want to remove the broken tracing middleware. Can you clarify what we should replace it with? Can you explain how you intend to split your work and the PRs that will follow?

Thanks!
@leseb Thanks for the review!

Yeah, I am trying to find a way to make this change that makes sense, but it's kind of a headache. The middleware we have right now interferes with other tracing. Would you prefer that I just replace all the tracing at once?

I did a lot of testing and discovered that we can use the auto-instrumentation, but we need to do it programmatically due to a known quirk of using OTel with uvicorn. This would mean that telemetry is installed and enabled by default, but it can be disabled with environment variables. I am beginning to stage those WIP changes here: #3733
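
For illustration, the programmatic setup could look roughly like this (a sketch; the env-var guard and wiring here are assumptions, not necessarily what #3733 implements):

import os

from fastapi import FastAPI
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

app = FastAPI()

# Instrument the app object directly instead of relying on the
# opentelemetry-instrument wrapper, which is the workaround for the uvicorn
# quirk mentioned above. The guard below is illustrative only.
if os.environ.get("OTEL_SDK_DISABLED", "false").lower() != "true":
    FastAPIInstrumentor.instrument_app(app)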

How do we feel about this design pattern? I made this comment in the community Discord as well; I am happy to link you.

My goal with this PR is to make the telemetry we have work well enough. Then we can migrate services to the new pattern, one service at a time. Once that is done, we can deprecate the telemetry API.

Once we merge this and I finish implementing what is in the next PR, I can file tickets upstream for each place we capture custom instrumentation and let you all help me with the migration. It's also an opportunity to scrutinize what we capture and make sure the custom info makes sense and isn't duplicated elsewhere.

# 2. If it has no parent (implicit root span from FastAPI instrumentation)
is_root_span = span.attributes.get(LOCAL_ROOT_SPAN_MARKER) or parent_span_id is None
root_span_id_value = span_id if is_root_span else None

Contributor Author

@iamemilio iamemilio Oct 8, 2025

@ehhuang take a look at this. I was able to get the integration test to work by doing this, but I am not 100% sure it's right. I'd appreciate it if you took a look and confirmed.

@ehhuang
Contributor

I don't either. Can we just kill this SQLite span processor altogether and add tests analogous to those in test_*_telemetry, but against OTel?
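
For reference, a test against OTel along those lines could use the SDK's in-memory span exporter; a sketch (the request step and the attribute checked are placeholders):

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter


def test_request_emits_otel_spans():
    # Route finished spans into memory so the test can assert on them.
    exporter = InMemorySpanExporter()
    provider = TracerProvider()
    provider.add_span_processor(SimpleSpanProcessor(exporter))
    trace.set_tracer_provider(provider)

    # ... exercise the instrumented server here, e.g. send a chat completion ...

    spans = exporter.get_finished_spans()
    assert spans, "expected at least one exported span"
    assert any("http.route" in s.attributes for s in spans)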

@iamemilio iamemilio force-pushed the remove_broken_middleware branch 2 times, most recently from 6d92d69 to f051458 Compare October 8, 2025 14:51
Contributor

@cdoern cdoern left a comment

this is looking good, but +1 on leaving out unrelated changes.

@iamemilio
Contributor Author

iamemilio commented Oct 8, 2025

[Screenshot 2025-10-08 at 12:54 PM]

Here is an example distributed trace with the changes in this PR, from a client that was also instrumented, sending a chat completion request to llama stack.

telemetry config:

  telemetry:
  - provider_id: meta-reference
    provider_type: inline::meta-reference
    config:
      service_name: llama-stack-server
      otel_exporter_otlp_endpoint: http://localhost:4318
      sinks:
        - console
        - otel_metric
        - otel_trace
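
For comparison, the endpoint and service name above map onto the OTel SDK roughly like this (a sketch, not the provider's actual code):

from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# service_name becomes the service.name resource attribute; OTLP/HTTP traces
# are posted to <otel_exporter_otlp_endpoint>/v1/traces.
provider = TracerProvider(resource=Resource.create({"service.name": "llama-stack-server"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces"))
)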

@iamemilio iamemilio requested review from cdoern and leseb October 8, 2025 17:41
@ehhuang
Contributor

ehhuang commented Oct 8, 2025

[trace screenshot]

did we lose some spans with these changes?

@leseb
Collaborator

leseb commented Oct 15, 2025

@iamemilio what's the status of this? Still in the pipeline? Can you rebase? Thanks!

@iamemilio
Contributor Author

iamemilio commented Oct 15, 2025

I would like to get #3805 into shape first so that we have a stable way to verify that we are still meeting an agreed-upon set of requirements for what telemetry data we collect and how it gets formatted. It should be ready to review!

@iamemilio iamemilio force-pushed the remove_broken_middleware branch from ab70029 to 58f587b Compare October 23, 2025 18:42
@iamemilio iamemilio force-pushed the remove_broken_middleware branch from 58f587b to 62651ce Compare October 23, 2025 18:45
@iamemilio
Contributor Author

[Screenshot 2025-10-23 at 2:42 PM]

@iamemilio
Contributor Author

This ensures that no matter how you deploy FastAPI, HTTP metrics get forwarded.

@ehhuang
Contributor

ehhuang commented Oct 23, 2025

Would this change the existing trace logging at all?

@iamemilio
Contributor Author

[Screenshot 2025-10-23 at 3:13 PM]

Captured running:

uv pip install opentelemetry-distro opentelemetry-exporter-otlp
uv run opentelemetry-bootstrap -a requirements | uv pip install --requirement -
uv run opentelemetry-instrument llama stack build --distro starter --image-type venv --run
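
(For context: opentelemetry-bootstrap -a requirements prints the instrumentation packages that match what is already installed, and opentelemetry-instrument enables auto-instrumentation for the wrapped command.)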

@iamemilio
Contributor Author

I don't think it will change the logging, since that is set up using the OpenTelemetry console exporter, and this change just captures more OpenTelemetry trace data. It may capture more or different data than before, though. That is harder to verify, since we don't really have detailed testing, but it does pass the basic test we set up earlier.

@ehhuang
Contributor

ehhuang commented Oct 23, 2025

I just tried locally; it seems some spans are lost/changed?

Before: [trace screenshot]
After: [trace screenshot]

@iamemilio
Contributor Author

iamemilio commented Oct 23, 2025

Ack. How attached are you to the old data collection style? The OTel library follows this naming convention: https://opentelemetry.io/docs/specs/semconv/http/http-spans/ and, as a general principle, OpenTelemetry tries to be minimal in what it captures to balance overhead against data coverage, so it will not necessarily create a span for every function a trace passes through. How do you want to proceed here? It seems like comparing what OpenTelemetry captures by default with the old system is an apples-and-oranges situation.
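
(Concretely, under those conventions an incoming chat completion would typically show up as a single server span named something like POST /v1/chat/completions, carrying attributes such as http.request.method, http.route, and http.response.status_code, rather than one span per internal function call.)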

@iamemilio iamemilio changed the title fix: remove broken tracing middleware fix: remove custom tracing middleware Oct 23, 2025
@ehhuang
Contributor

ehhuang commented Oct 23, 2025

Ack. How attached are you to the old data collection style? The OTel library follows this naming convention: https://opentelemetry.io/docs/specs/semconv/http/http-spans/ and, as a general principle, OpenTelemetry tries to be minimal in what it captures to balance overhead against data coverage, so it will not necessarily create a span for every function a trace passes through. How do you want to proceed here? It seems like comparing what OpenTelemetry captures by default with the old system is an apples-and-oranges situation.

I think it's OK that the spans change; however, in this case it seems that the spans we explicitly added in our code are missing. Those spans have attributes that are useful:
[screenshot of span attributes]

I think we should at least have a similar level of information.

BTW I thought you added some tests that assert on span attributes like 'model'. Did they not run or not cover this?
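
For reference, those attribute-rich spans could be kept alongside the auto-instrumentation with explicit spans via the OTel API; a sketch (the span and attribute names here are illustrative, not the existing ones):

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# Wrap the inference call in an explicit span that carries the attributes the
# custom middleware used to record, e.g. the model identifier.
with tracer.start_as_current_span("inference.chat_completion") as span:
    span.set_attribute("model", "llama-3.1-8b-instruct")
    # ... perform the chat completion ...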

@iamemilio
Contributor Author

Yes, I am a little surprised none of this got flagged by the tests. I think it's best I take a step back here; there are pieces of the big picture I am missing, and a lot has changed since this was first proposed. I don't want to introduce issues to llama stack, and I was able to extract metrics in some of my tests, so I may want to reconsider my approach altogether before making more changes.
