Change some reported metrics from Prometheus' `Counter` to `Histogram` #5458

mfuntowicz · 2021-06-29T08:53:20Z

Currently, Triton reports all of its tracked metrics as Counters, which increase monotonically.

Still, for some of them, especially the latency metrics, it would be more useful to report a distribution than a single cumulative value:

nv_inference_request_duration_us => Cumulative inference request duration in microseconds
nv_inference_queue_duration_us => Cumulative inference queuing duration in microseconds
nv_inference_compute_input_duration_us => Cumulative compute input duration in microseconds
nv_inference_compute_infer_duration_us => Cumulative compute inference duration in microseconds
nv_inference_compute_output_duration_us => Cumulative inference compute output duration in microseconds

Looking at those, the cumulative value captures only part of the latency picture, whereas a Histogram would give more insight into how latency is distributed on Triton.

In fact, a Histogram would still report the cumulative sum, just as the Counter does today.

From the Prometheus doc:

A histogram samples observations (usually things like request durations or response sizes) and counts them in configurable buckets. It also provides a sum of all observed values.
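
To make the quoted definition concrete, here is a minimal sketch of a cumulative-bucket histogram in plain Python. It is only an illustration of the concept, not Triton's or prometheus-cpp's implementation; the `Histogram` class and the bucket bounds are hypothetical.

```python
import bisect
import math


class Histogram:
    """Toy cumulative-bucket histogram (illustrative only)."""

    def __init__(self, upper_bounds):
        # Sorted bucket upper bounds; a +Inf bucket is always appended last.
        self.upper_bounds = sorted(upper_bounds) + [math.inf]
        self.bucket_counts = [0] * len(self.upper_bounds)  # per bucket, not cumulative
        self.sum = 0.0
        self.count = 0

    def observe(self, value):
        # First bucket whose upper bound is >= value ("le" semantics).
        index = bisect.bisect_left(self.upper_bounds, value)
        self.bucket_counts[index] += 1
        self.sum += value
        self.count += 1

    def cumulative_buckets(self):
        # Buckets are exposed cumulatively: each one includes all smaller ones.
        running = 0
        result = []
        for bound, n in zip(self.upper_bounds, self.bucket_counts):
            running += n
            result.append((bound, running))
        return result


# Hypothetical bucket bounds and observations, in microseconds.
request_duration_us = Histogram([500, 1000, 5000])
for duration in [320, 480, 950, 1200, 7300]:
    request_duration_us.observe(duration)

print(request_duration_us.cumulative_buckets())
print(request_duration_us.sum, request_duration_us.count)
```

The `sum` field carries the same information as today's cumulative Counter, while the buckets add the shape of the distribution.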

If it's something which makes sense, I'm happy to contribute to these changes.

🤗

jbkyang-nvi · 2021-06-30T22:13:46Z

Collaborator

Sorry for the late response and thanks for the suggestions! I have filed an enhancement ticket for the team.

dzier · 2021-07-06T23:07:17Z

Maintainer

@mfuntowicz, we would be happy to have your contribution. We just ask you to first fill out the CCLA form found on the top level of the main branch. After that, you can submit a PR from your cloned repo.

RichiH · 2021-07-26T10:36:27Z

Prometheus team member here.

For latencies and such, histograms are recommended. Ideally, you would expose seconds, not us, as we generally try to use SI base units. This makes correlation etc a lot easier. Our values are float anyway, so that shouldn't be an issue.

rmccorm4 · 2023-03-07T00:24:30Z

Collaborator

Hi @mfuntowicz @RichiH ,

I am looking into this now (sorry it has taken so long 🙂).

I'd like your feedback on the real world implications of these metrics being converted to histograms.

1. If these latency metrics were Histograms or Summaries today, which would you prefer? What would you want to see when querying them? Can you share some example queries that would be useful to you as a Triton user?
2. It looks like you have to define the buckets up front when creating each histogram metric, and they cannot be changed dynamically.
   - (a) Is a histogram with only a single bucket of <= Infinity of any use (in this case of durations/latencies)? Since bucket values have to be carefully chosen based on model execution time (and therefore depend on model, GPU, system configuration, etc.), it would be tough to pick acceptable default bucket values without a first round of profiling before startup. Therefore, a sensible default may be the single bucket of <= Infinity, which will functionally act as a counter unless the user provides their own bucket values. Then we could eventually phase out the counter metric in favor of this one.
   - (b) Maybe a user already has an idea in mind for the buckets they're interested in based on their models and use case. In fact, these can be profiled beforehand with tools like Perf Analyzer. Some users have very strict latency requirements, and defining their own buckets may be the best solution for them.
   - (c) Would the requirement of defining your own bucket values for every model / server / configuration be an acceptable UX tradeoff for the flexibility? Or would it be too cumbersome in practice to make users define them vs. just letting Summaries compute set quantiles (p50, p90, p95, p99, etc.)?
   - (d) Is there anything crucial that users would need to see from having buckets exposed rather than just having quantiles exposed to them? I would argue that if most users request histogram buckets for the purpose of computing quantiles, we should just use Summaries to improve the user experience.
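
To illustrate why the bucket choice matters for quantiles, here is a simplified sketch of estimating a quantile from cumulative buckets by linear interpolation. It is loosely modeled on what PromQL's `histogram_quantile()` does, with edge cases handled only roughly; the function name `estimate_quantile` and the bucket values are assumptions made for this example.

```python
import math


def estimate_quantile(q, cumulative_buckets):
    """Estimate the q-quantile from sorted (upper_bound, cumulative_count) pairs."""
    if len(cumulative_buckets) < 2:
        # Only the +Inf bucket exists: there is nothing to interpolate between.
        return math.nan
    total = cumulative_buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in cumulative_buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # Beyond the largest finite bucket (or an empty bucket):
                # the best available answer is the previous upper bound.
                return prev_bound
            # Assume observations are spread uniformly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical data: five requests, the slowest of which took 7300 us.
default_only = [(math.inf, 5)]
user_buckets = [(500, 2), (1000, 3), (5000, 4), (math.inf, 5)]

p50 = estimate_quantile(0.5, user_buckets)
p99 = estimate_quantile(0.99, user_buckets)
p99_default = estimate_quantile(0.99, default_only)
print(p50, p99, p99_default)
```

With only the <= Infinity bucket no quantile can be estimated at all, and even with user-defined buckets the p99 estimate is capped at the largest finite bound (5000 here, although the slowest request took 7300).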

Given (2a) above and the Prometheus best practices doc, it sounds like Summaries may be more generically viable in Triton for the sake of accurately exposing quantile data regardless of the models involved.

choose a histogram if you have an idea of the range and distribution of values that will be observed. Choose a summary if you need an accurate quantile, no matter what the range and distribution of the values is.

Using Summary will incur some additional overhead in Triton itself, but judging from the prometheus-cpp benchmarks, the overhead seems reasonable for the UX improvement of not having to select acceptable buckets for every model, GPU, system configuration, etc. on each deployment:

BM_Counter_Increment                                13 ns         12 ns   55616469
BM_Counter_Collect                                   7 ns          7 ns   99823170
...
BM_Histogram_Observe/0                             134 ns        124 ns    5665310
BM_Histogram_Collect/0                             103 ns        102 ns    5654829
...
BM_Summary_Observe/0/iterations:262144            9351 ns       9305 ns     262144
BM_Summary_Collect/0/0                              31 ns         30 ns   24873766

However, Summary will come with the desired quantile information out of the box, without having to preprocess and estimate bucket values for each use case. For example:

# HELP nv_inference_request_duration_us Cumulative inference request duration in microseconds (includes cached requests)
# TYPE nv_inference_request_duration_us summary
nv_inference_request_duration_us_count{model="double",version="1"} 251
nv_inference_request_duration_us_sum{model="double",version="1"} 0.213827615
nv_inference_request_duration_us{model="double",version="1",quantile="0.5"} 0.000590498
nv_inference_request_duration_us{model="double",version="1",quantile="0.9"} 0.000682057
nv_inference_request_duration_us{model="double",version="1",quantile="0.99"} 0.000809368
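
As a rough sketch of how a client could consume output like this, the snippet below parses the sample lines with a small regular expression and derives the mean from `_sum` and `_count`. The parser is a simplification written for this example (it does not handle escaped label values or timestamps); `parse_samples` is a hypothetical helper, and the values are copied verbatim from the sample above.

```python
import re

EXPOSITION = """\
# TYPE nv_inference_request_duration_us summary
nv_inference_request_duration_us_count{model="double",version="1"} 251
nv_inference_request_duration_us_sum{model="double",version="1"} 0.213827615
nv_inference_request_duration_us{model="double",version="1",quantile="0.5"} 0.000590498
nv_inference_request_duration_us{model="double",version="1",quantile="0.9"} 0.000682057
nv_inference_request_duration_us{model="double",version="1",quantile="0.99"} 0.000809368
"""

SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Return (name, labels, value) tuples, skipping comments and blank lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


totals = {}
quantiles = {}
for name, labels, value in parse_samples(EXPOSITION):
    if "quantile" in labels:
        quantiles[float(labels["quantile"])] = value
    else:
        totals[name] = value

mean = (
    totals["nv_inference_request_duration_us_sum"]
    / totals["nv_inference_request_duration_us_count"]
)
print(quantiles, mean)
```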

Any other feedback, thoughts, or ideas you have are welcome.

CC @GuanLuo @Tabrizian @jbkyang-nvi @MarkMoTrin

1 reply

guptap11 Mar 13, 2023

One disadvantage of Summary is that when you have multiple Triton server instances and want a combined quantile metric, you cannot compute it from the summaries alone, whereas with histograms that share common buckets you can aggregate.
Even with this limitation, I still feel Summary is better from an ease-of-use perspective.
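
A small sketch of the aggregation problem, using made-up latencies for two hypothetical instances: averaging per-instance quantiles gives a misleading number, while summing bucket counts from histograms that share the same bounds still locates the right bucket. All names and values here are assumptions for illustration.

```python
import math


def nearest_rank_percentile(values, percent):
    """Exact nearest-rank percentile of raw observations."""
    ordered = sorted(values)
    rank = max(1, (percent * len(ordered) + 99) // 100)  # integer ceil
    return ordered[rank - 1]


def cumulative_counts(values, upper_bounds):
    """Cumulative count per upper bound, as a histogram would expose it."""
    return [sum(1 for v in values if v <= bound) for bound in upper_bounds]


def percentile_bucket(counts, upper_bounds, percent):
    """Upper bound of the first bucket that contains the requested rank."""
    rank = max(1, (percent * counts[-1] + 99) // 100)
    for bound, count in zip(upper_bounds, counts):
        if count >= rank:
            return bound
    return upper_bounds[-1]


BOUNDS_US = [1000, 2000, 4000, 8000, math.inf]  # shared bucket bounds
instance_a = [1000] * 90 + [2000] * 10  # busy, fast instance
instance_b = [8000] * 10                # lightly loaded, slow instance

# What you would get by averaging the p90 reported by each Summary.
averaged_p90 = (
    nearest_rank_percentile(instance_a, 90) + nearest_rank_percentile(instance_b, 90)
) / 2

# The real p90 over all requests from both instances.
true_p90 = nearest_rank_percentile(instance_a + instance_b, 90)

# Histograms with common buckets can simply be summed bucket by bucket.
merged = [
    a + b
    for a, b in zip(
        cumulative_counts(instance_a, BOUNDS_US),
        cumulative_counts(instance_b, BOUNDS_US),
    )
]
merged_p90_bucket = percentile_bucket(merged, BOUNDS_US, 90)
print(averaged_p90, true_p90, merged_p90_bucket)
```

In this example the averaged p90 is 4500 us, while the true combined p90 is 2000 us, which is also the bucket the merged histogram points to.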

dhaval24 · 2023-03-11T05:12:38Z

Having the quantile information is very important to set the right alerts and help us determine and estimate the bare-minimum resources needed to serve a traffic profile for the model in production without degraded experience.
It also helps us better understand whether we need to scale up the resources.

Quantiles, p50, p90, p95, p99, p99.99 are reasonable and it would save a lot of cost.

Also, I want to note that I haven't found a reasonable way to compute quantiles accurately from the counters using Prometheus queries.

This is a crucial ask for using Triton in a production environment. Can you please prioritize it, @rmccorm4?

2 replies

rmccorm4 Mar 17, 2023
Collaborator

Hi @dhaval24 @guptap11,

What's the minimal or most useful set of metrics you would want to see these quantiles for? Could you perhaps rank them for your own use case with the motivations?

For example, I can see the nv_inference_request_duration_us and nv_inference_queue_duration_us quantiles being particularly useful when tracking your service's overall latency requirements, SLAs, etc. However, the nv_compute_{input,infer,output}_* quantiles may be less useful in this regard, since they'd be included within the overall nv_inference_request_duration_us times.

guptap11 Mar 27, 2023

Hi @rmccorm4 ,
nv_inference_compute_infer_duration_us, nv_inference_compute_input_duration_us, and nv_inference_compute_output_duration_us are also useful for ensemble models, because they help identify which model is consuming the most compute. It would be useful to get quantiles for all of them.

kelkarn · 2024-06-24T02:00:31Z

Hi @rmccorm4 - I just wanted to follow up on this; is there an update on when we can expect these metrics to be exposed as Histogram instead of cumulative Counters?

With Counters, the best we can get is an approximate average latency (e.g. rate(nv_inference_request_duration_us) / rate(nv_inference_exec_count)), which is not great for setting alerts. As @dhaval24 mentioned above, p50, p90, and p99 numbers are usually more desirable here (ideally aggregated across the multiple nodes on which Triton is running).
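
A minimal sketch of why an average derived from cumulative counters is weak for alerting: two very different latency profiles can produce identical counter deltas. The dictionary keys and the latencies are hypothetical stand-ins for a duration counter and a count counter scraped at two points in time.

```python
def average_latency_us(prev, curr):
    """Mean latency between two scrapes of cumulative counters, or None if idle."""
    delta_duration = curr["duration_us"] - prev["duration_us"]
    delta_count = curr["count"] - prev["count"]
    if delta_count <= 0:
        return None
    return delta_duration / delta_count


def scrape_after(latencies_us, prev):
    """Counter values after serving the given requests on top of a previous scrape."""
    return {
        "duration_us": prev["duration_us"] + sum(latencies_us),
        "count": prev["count"] + len(latencies_us),
    }


start = {"duration_us": 0, "count": 0}

steady = [2000] * 100           # every request takes 2 ms
spiky = [1000] * 99 + [101000]  # one request takes 101 ms

steady_avg = average_latency_us(start, scrape_after(steady, start))
spiky_avg = average_latency_us(start, scrape_after(spiky, start))
print(steady_avg, spiky_avg, max(steady), max(spiky))
```

Both windows report an average of 2000 us, so the 101 ms outlier in the second one is invisible without buckets or quantiles.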

1 reply

rmccorm4 Jul 8, 2024
Collaborator

Hi @kelkarn, thanks for reaching out about this, sorry for the delay as I was on vacation.

We currently have Summary metrics implemented as well. These may help alleviate some of the issues you described today, but may not be aggregatable across multiple Triton deployments.

As for supporting histogram types for the current latency metrics, this is something we'll need to plan and prioritize. We are open to contributions here as well if you are interested; here is the current reference for the Summary implementation of the latency metrics. We would similarly need to expose histogram buckets (similar to config_.quantiles_) to users to support the various bucket thresholds that different use cases would require.
