add usage statistics for inference API #894
Conversation
Force-pushed from ead6cb9 to 6609362
@dineshyv I think you also want to regenerate the OpenAPI spec so this gets updated: I think you just need to run
LGTM, approving to unblock, see my comment about the spec update.
@raghotham any comments on generality / future-facing stuff? @dineshyv hold on for a couple hours before merging.
This needs more discussion. Here is an alternate idea:
- we think of this as an extension of our telemetry API -- or rather telemetry needs to just work for this too.
- from that standpoint, we can make sure these stats get logged as "metrics" into our telemetry API and therefore can be queried
- this of course works only if the distro has a telemetry provider
However, this is not sufficient because it makes for a terrible dev-ex when debugging during development. When I want to understand usage, I want each API call's usage to also be available. So how about all Llama Stack API calls are augmented with metrics?
@ashwinb Agree that we should send these metrics to telemetry as well. But retrieving/querying them would cause us to be dependent on a specific telemetry sink that supports metrics retrieval.
That leaves two options: (1) depend on a specific third-party sink that supports metrics retrieval, or (2) build metrics retrieval into Llama Stack ourselves. (1) is not great since it causes us to have an explicit dependency on a third-party sink, while (2) sounds like reimplementing a whole metrics engine. Given this context, I think we should keep returning the usage field in the inference response and send these stats to telemetry as well.
Thoughts? Do you think there is anything I am missing?
What does this PR do?
Adds a new usage field to inference APIs to indicate token counts for prompt and completion.