
add usage statistics for inference API #894

Open · wants to merge 2 commits into main

Conversation

@dineshyv (Contributor)

What does this PR do?

Adds a new usage field to inference APIs to indicate token counts for prompt and completion.
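For illustration, a minimal sketch of what such a field could look like; the class and field names (UsageInfo, prompt_tokens, completion_tokens, total_tokens) are assumptions, not necessarily the exact schema this PR introduces:

```python
# Hypothetical sketch -- names are assumptions, not the PR's exact schema.
from typing import Optional

from pydantic import BaseModel, Field


class UsageInfo(BaseModel):
    """Token accounting for a single inference call."""

    prompt_tokens: int = Field(..., description="Tokens consumed by the prompt")
    completion_tokens: int = Field(..., description="Tokens generated in the completion")
    total_tokens: int = Field(..., description="prompt_tokens + completion_tokens")


class ChatCompletionResponse(BaseModel):
    """Inference response augmented with optional usage statistics."""

    completion_message: str
    usage: Optional[UsageInfo] = None  # populated when the provider reports token counts
```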

@dineshyv changed the title from "add usage statistics for chat completion non streaming" to "add usage statistics for inference API" on Jan 28, 2025
@vladimirivic (Contributor)

@dineshyv I think you also want to regenerate the OpenAPI spec so this gets updated:
https://github.com/meta-llama/llama-stack/blob/main/docs/resources/llama-stack-spec.yaml#L1965-L1988

I think you just need to run docs/openapi_generator/run_openapi_generator.sh

@vladimirivic (Contributor) left a review:

LGTM, approving to unblock, see my comment about the spec update.

@ashwinb (Contributor) commented on Jan 29, 2025:

@raghotham any comments on generality / future-facing stuff?

@dineshyv hold on for a couple hours before merging.

@ashwinb (Contributor) left a review:

This needs more discussion. Here is an alternate idea:

  • we think of this as an extension of our telemetry API -- or rather telemetry needs to just work for this too.
  • from that standpoint, we can make sure these stats get logged as "metrics" into our telemetry API and therefore can be queried
  • this of course works only if the distro has a telemetry provider

However, this is not sufficient, because it makes for a terrible dev-ex while debugging during development. When I want to understand usage, I want each API call's usage to also be available. So how about all Llama Stack API calls are augmented with metrics?
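To make the proposal concrete, a rough sketch of the direction (the `telemetry.log_metric` helper and the wrapper function here are hypothetical, not an existing Llama Stack API):

```python
# Rough sketch of the proposal: every inference call also emits its token counts
# through the telemetry provider, when one is configured. The telemetry helper
# (log_metric) is an assumed interface, not an existing Llama Stack API.

async def chat_completion_with_metrics(request, inference, telemetry=None):
    response = await inference.chat_completion(request)

    if telemetry is not None and response.usage is not None:
        # Record token counts as queryable metrics, tagged with the model id.
        await telemetry.log_metric(
            "prompt_tokens", response.usage.prompt_tokens, attributes={"model": request.model}
        )
        await telemetry.log_metric(
            "completion_tokens", response.usage.completion_tokens, attributes={"model": request.model}
        )

    # The per-call usage stays on the response, so the local dev experience is unchanged.
    return response
```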

@dineshyv (Contributor, Author)

@ashwinb Agree that we should send these metrics to telemetry as well. But retrieving/querying them would make us dependent on a specific telemetry sink that supports metrics retrieval.
Since we depend on OTEL to export the telemetry data, we can stay agnostic of the actual sink and give users the flexibility to configure different sinks, as long as they can process OTEL data.
Since OTEL does not support retrieval, we would have to depend on a specific sink.
For traces and spans, we maintain an independent sqlite store to support retrieval.
For metrics, I see two options:

  1. Have an explicit dependency on a third-party sink, like Jaeger, which supports metrics retrieval.
  2. Implement our own metrics retrieval in the sqlite sink.

(1) is not great since it gives us an explicit dependency on a third-party sink, while (2) sounds like reimplementing a whole metrics engine.
My preference would be to go with (1). But if we go with (1), since we are already dependent on the third-party sink and will need to use its API, I don't see the value in us supporting an API that just calls the sink's API, since developers can call into the sink's API directly.

Given this context, I think we should do the following:

  1. Only support metrics export through OTEL (see the sketch after this list). We give the developer the flexibility to choose the sink, and they can use the sink-specific API to query. I will work on this in a subsequent PR.
  2. I do see value in returning the usage statistics as part of the response, and for that purpose we keep this PR as is. We will not depend on the telemetry API. I think that's not such a bad thing, since the inference provider is the source of truth for this and we just return it as part of the response.
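For illustration, a minimal sketch of what OTEL-only metrics export could look like using the standard OpenTelemetry Python SDK; the endpoint, metric names, and `record_usage` helper are assumptions, not code from this PR:

```python
# Sketch of option (1): export per-request token counts as OTEL metrics to any
# OTLP-compatible sink; aggregation and querying then happen in the sink, not
# in Llama Stack. Endpoint and metric names are illustrative.
from opentelemetry import metrics
from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader

reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="http://localhost:4318/v1/metrics")
)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

meter = metrics.get_meter("llama_stack.inference")
prompt_tokens = meter.create_counter("prompt_tokens", unit="token")
completion_tokens = meter.create_counter("completion_tokens", unit="token")


def record_usage(model: str, usage) -> None:
    """Record one request's usage; any configured OTLP sink can query it later."""
    prompt_tokens.add(usage.prompt_tokens, attributes={"model": model})
    completion_tokens.add(usage.completion_tokens, attributes={"model": model})
```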

Thoughts? Do you think there is anything I am missing?
