
Conversation

@iamemilio
Contributor

@iamemilio iamemilio commented Oct 7, 2025

What does this PR do?

Removes the custom tracing middleware from llama stack core. This middleware duplicates what OTel already does for FastAPI by default, but breaks tracing by incorrectly handling W3C trace context headers.
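
For context, the FastAPI/ASGI instrumentation that OTel ships already extracts the incoming W3C traceparent header via the configured propagator; a minimal sketch of that mechanism (the header value below is the W3C spec's example, not taken from this PR):

from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator

# The incoming traceparent header is extracted into a Context, so the server
# span joins the caller's trace instead of starting a new one.
carrier = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
ctx = TraceContextTextMapPropagator().extract(carrier=carrier)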

Feature: #3806

Depends On: #3723

Test Plan

Tested standalone and with opentelemetry-instrument. This ensures that HTTP metrics always get captured when instrumentation is enabled.

@meta-cla meta-cla bot added the CLA Signed label Oct 7, 2025
@iamemilio iamemilio force-pushed the remove_broken_middleware branch 4 times, most recently from 556bdd5 to b3b9c93 Compare October 8, 2025 13:51
Collaborator

@leseb leseb left a comment

OK, we want to remove the broken tracing middleware. Can you clarify what we should replace it with? Can you explain how you intend to split your work and the PRs that will follow?

Thanks!

@iamemilio
Contributor Author

iamemilio commented Oct 8, 2025

OK, we want to remove the broken tracing middleware. Can you clarify what we should replace it with? Can you explain how you intend to split your work and the PRs that will follow?

Thanks!
@leseb Thanks for the review!

Yeah, I am trying to find a way to make this change that makes sense, but it's kind of a headache. The middleware we have right now interferes with other tracing. Would you prefer that I just replace all the tracing at once?

I did a lot of testing and discovered that we can use the auto-instrumentation, but we need to do it programmatically due to a known quirk of using OTel with uvicorn. This would mean that telemetry is installed and enabled by default, but it can be disabled with environment variables. I am beginning to stage those WIP changes here: #3733
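
For illustration, the programmatic setup could look roughly like this (a sketch; the env-var guard and wiring here are assumptions, not necessarily what #3733 implements):

import os

from fastapi import FastAPI
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

app = FastAPI()

# Instrument the app object directly instead of relying on the
# opentelemetry-instrument wrapper, which is the workaround for the uvicorn
# quirk mentioned above. The guard below is illustrative only.
if os.environ.get("OTEL_SDK_DISABLED", "false").lower() != "true":
    FastAPIInstrumentor.instrument_app(app)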

How do we feel about this design pattern? I made this comment in the community Discord as well; I am happy to link you.

My goal with this PR is to make the telemetry we have work well enough. Then we can migrate services to the new pattern, one service at a time. Once that is done, we can deprecate the telemetry API.

Once we merge this and I finish implementing what is in the next PR, I can file tickets upstream for each place we capture custom instrumentation and let you all help me with the migration. It's also an opportunity to scrutinize what we capture and make sure the custom info makes sense and isn't duplicated elsewhere.

# 2. If it has no parent (implicit root span from FastAPI instrumentation)
is_root_span = span.attributes.get(LOCAL_ROOT_SPAN_MARKER) or parent_span_id is None
root_span_id_value = span_id if is_root_span else None

Contributor Author

@iamemilio iamemilio Oct 8, 2025

@ehhuang take a look at this. I was able to get the integration test to work by doing this, but I am not 100% sure it's right. I'd appreciate it if you took a look and confirmed.

@ehhuang
Contributor

I don't either. Can we just kill this SQLite span processor altogether and add tests analogous to those in test_*_telemetry, but against OTel?
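
For reference, a test against OTel along those lines could use the SDK's in-memory span exporter; a sketch (the request step and the attribute checked are placeholders):

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter


def test_request_emits_otel_spans():
    # Route finished spans into memory so the test can assert on them.
    exporter = InMemorySpanExporter()
    provider = TracerProvider()
    provider.add_span_processor(SimpleSpanProcessor(exporter))
    trace.set_tracer_provider(provider)

    # ... exercise the instrumented server here, e.g. send a chat completion ...

    spans = exporter.get_finished_spans()
    assert spans, "expected at least one exported span"
    assert any("http.route" in s.attributes for s in spans)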

@iamemilio iamemilio force-pushed the remove_broken_middleware branch 2 times, most recently from 6d92d69 to f051458 Compare October 8, 2025 14:51
Contributor

@cdoern cdoern left a comment

this is looking good, but +1 on leaving out unrelated changes.

@iamemilio
Contributor Author

iamemilio commented Oct 8, 2025

[Screenshot 2025-10-08 at 12:54 PM]

Here is an example distributed trace with the changes in this PR, from a client that was also instrumented, sending a chat completion request to llama stack.

telemetry config:

  telemetry:
  - provider_id: meta-reference
    provider_type: inline::meta-reference
    config:
      service_name: llama-stack-server
      otel_exporter_otlp_endpoint: http://localhost:4318
      sinks:
        - console
        - otel_metric
        - otel_trace
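
For comparison, the endpoint and service name above map onto the OTel SDK roughly like this (a sketch, not the provider's actual code):

from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# service_name becomes the service.name resource attribute; OTLP/HTTP traces
# are posted to <otel_exporter_otlp_endpoint>/v1/traces.
provider = TracerProvider(resource=Resource.create({"service.name": "llama-stack-server"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces"))
)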

@iamemilio iamemilio requested review from cdoern and leseb October 8, 2025 17:41
@ehhuang
Contributor

ehhuang commented Oct 8, 2025

[trace screenshot]

did we lose some spans with these changes?

@leseb
Collaborator

leseb commented Oct 15, 2025

@iamemilio what's the status of this? Still in the pipeline? Can you rebase? Thanks!

@iamemilio
Contributor Author

iamemilio commented Oct 15, 2025

I would like to get #3805 into shape first so that we have a stable way to verify that we are still meeting an agreed-upon set of requirements for what telemetry data we collect and how it gets formatted. It should be ready to review!

@iamemilio iamemilio force-pushed the remove_broken_middleware branch from ab70029 to 58f587b Compare October 23, 2025 18:42
@iamemilio iamemilio force-pushed the remove_broken_middleware branch from 58f587b to 62651ce Compare October 23, 2025 18:45
@iamemilio
Contributor Author

[Screenshot 2025-10-23 at 2:42 PM]

@iamemilio
Contributor Author

This ensures that no matter how you deploy FastAPI, HTTP metrics get forwarded.

@ehhuang
Contributor

ehhuang commented Oct 23, 2025

Would this change the existing trace logging at all?

@iamemilio
Contributor Author

[Screenshot 2025-10-23 at 3:13 PM]

Captured running:

uv pip install opentelemetry-distro opentelemetry-exporter-otlp
uv run opentelemetry-bootstrap -a requirements | uv pip install --requirement -
uv run opentelemetry-instrument llama stack build --distro starter --image-type venv --run
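
(For context: opentelemetry-bootstrap -a requirements prints the instrumentation packages that match what is already installed, and opentelemetry-instrument enables auto-instrumentation for the wrapped command.)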

@iamemilio
Contributor Author

I don't think it will change the logging, since that is set up using the OpenTelemetry console exporter, and this change just captures more OpenTelemetry trace data. It may capture more or different data than before, though. That is harder to verify, since we don't really have detailed testing, but it does pass the basic test we set up earlier.

@ehhuang
Contributor

ehhuang commented Oct 23, 2025

I just tried locally; it seems some spans are lost/changed?

Before: [trace screenshot]
After: [trace screenshot]

@iamemilio
Contributor Author

iamemilio commented Oct 23, 2025

Ack. How attached are you to the old data collection style? The OTel library follows this naming convention: https://opentelemetry.io/docs/specs/semconv/http/http-spans/ and, as a general principle, OpenTelemetry tries to be minimal in what it captures to balance overhead against data coverage, so it will not necessarily create a span for every function a trace passes through. How do you want to proceed here? It seems like comparing what OpenTelemetry captures by default with the old system is an apples-and-oranges situation.
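
(Concretely, under those conventions an incoming chat completion would typically show up as a single server span named something like POST /v1/chat/completions, carrying attributes such as http.request.method, http.route, and http.response.status_code, rather than one span per internal function call.)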

@iamemilio iamemilio changed the title fix: remove broken tracing middleware fix: remove custom tracing middleware Oct 23, 2025
@ehhuang
Contributor

ehhuang commented Oct 23, 2025

Ack. How attached are you to the old data collection style? The OTel library follows this naming convention: https://opentelemetry.io/docs/specs/semconv/http/http-spans/ and, as a general principle, OpenTelemetry tries to be minimal in what it captures to balance overhead against data coverage, so it will not necessarily create a span for every function a trace passes through. How do you want to proceed here? It seems like comparing what OpenTelemetry captures by default with the old system is an apples-and-oranges situation.

I think it's OK that the spans change; however, in this case it seems that the spans we explicitly added in our code are missing. Those spans have attributes that are useful:
[screenshot of span attributes]

I think we should at least have a similar level of information.

BTW I thought you added some tests that assert on span attributes like 'model'. Did they not run or not cover this?
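
For reference, those attribute-rich spans could be kept alongside the auto-instrumentation with explicit spans via the OTel API; a sketch (the span and attribute names here are illustrative, not the existing ones):

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# Wrap the inference call in an explicit span that carries the attributes the
# custom middleware used to record, e.g. the model identifier.
with tracer.start_as_current_span("inference.chat_completion") as span:
    span.set_attribute("model", "llama-3.1-8b-instruct")
    # ... perform the chat completion ...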

@iamemilio
Contributor Author

Yes, I am a little surprised none of this got flagged by the tests. I think it's best I take a step back here; there are pieces of the big picture I am missing, and a lot has changed since this was first proposed. I don't want to introduce issues to llama stack, and I was able to extract metrics in some of my tests, so I may want to reconsider my approach altogether before making more changes.
