Metrics interface(s) #355

bdittmer · 2020-09-24T03:43:35Z

Regarding metrics collection, will you consider a PR that replaces the hard dependency on prometheus with a dependency on one or more metrics interfaces? It will keep prometheus as the default/example implementation. Background:

We have a fork of bazel-remote to wire in our internal services (app config, logging, health checks, etc.). We also try (aspire?) to keep our fork as close to the upstream as possible. This is somewhat manageable with a separate main package and the available interfaces.

Internally we don't officially support prometheus. With the existing metrics implementation I think there are two options to avoid modifying upstream code:

Enable prometheus and run a sidecar (or goroutine) to periodically collect and transform metrics to our format.
Wrap cache handlers and inspect request/responses for basic things like cache hits/misses. I don't know if this is possible with the gRPC server.

I have a general idea what this looks like but want to gauge interest before committing to a PR. Thanks!

The text was updated successfully, but these errors were encountered:

mostynb · 2020-09-24T06:45:35Z

Hi @bdittmer.

@ulrfa has proposed some changes to the metrics in #350 - as part of that effort maybe we can move to a metrics interface that would minimise the maintenance cost of your fork? It would be good to have your input on that (both on the interface/config and also the metrics that are being collected). That might give you another option: a local patch to implement your metrics implementation, but with a cleaner integration point.

This is a draft, not ready to be merged. I wait with test cases until getting feedback about the main functionality. Same functionality as in buchgr#350 - Prometheus counter for cache hit ratio of only AC requests. - Support for prometheus labels based on custom HTTP and gRPC headers. but implemented in an alternative way: - disk.io implements two interfaces: cache.CasAcCache, cache.Stats - cache.metricsdecorator is decorator for cache.CasAcCache and provides prometheus metrics. Pros with this alternative implementation: - Should allow non-prometheus metrics as requested in buchgr#355 - Avoid the question about if the counters should be placed in disk.go or http.go/grpc*.go. If placing in disk.go, there are issues about how to avoid incrementing counter twice for the same request (both in Get and in GetValidatedActionResult) and at the same time count found AC but missing CAS, as cache miss. - Makes headers available also for logic in disk.go, which could be useful for other functionality in the future. - Metrics can be separated from logging, and still not require injecting counter increment invocations in tons of places. Incrementing from a few well defined places minimize the risk for bugs in the metrics.

ulrfa · 2020-09-26T23:23:11Z

Hi @bdittmer,

Would #358 be useful for you? It is an alternative implementation of #350.

bdittmer · 2020-09-28T20:29:11Z

Hi @ulrfa!

#358 is very interesting and I like the direction. Interfaces for the AC and CAS caches is very useful and makes layering functionality on top of disk.Cache a lot easier...for instance, creating an in-memory cache backed by a disk.Cache. I also think RequestContext is a solid introduction and can help with routing requests to different cache.Proxy backends.

But for metrics, I was thinking of something like this:

type MetricsCollector interface {
  Incr(name string, ctx cache.RequestContext)
  Guage(name string, value float64, ctx cache.RequestContext)
}

with concrete implementations in different packages (we shouldn't force users to pull in Prometheus dependencies if they don't use Prometheus). main.go can call then create a new Prometheus MetricsCollector and pass it around as needed. It's also useful for debugging (and testing) as you can create a logging metrics collector or a test implementation and assert against collected values.

Thoughts?

ulrfa · 2020-09-29T15:30:59Z

I’m afraid such interface would have to be more complex. Like handling labels, other metric types than counter and gauge, perhaps controlling when the counters are created, etc.

What about github.com/slok/go-http-metrics/middleware and grpc_prometheus.StreamServerInterceptor in main.go, that are depending om prometheus?

bdittmer · 2020-09-29T15:43:23Z

Yeah, that was a some what contrived and simple example.

To expand on it, I think each each client of a MetricsCollector should export string constants (or a type Counter string, type Gauge string, etc) for the different metrics they provide so implementations can handle things like labels by creating a map of metrics names to labels, metric names to help values, etc. By passing the context, a MetricsCollector implementation can do things like create and manage unique counters, gauges, etc. for different values provided in the HTTP headers.

I admittedly haven't investigated the implications for the gRPC metrics (we aren't using it) but will look into that.

Would a proof of concept PR with a reference prometheus implementation help?

ulrfa · 2020-09-30T13:01:39Z

Personally I prefer using prometheus official golang client library directly. I don’t fully understand your idea, but got the feeling that it would add complexity and overhead. And bazel-remote is using third party libraries github.com/slok/go-http-metrics/middleware and grpc_prometheus.StreamServerInterceptor that are using Prometheus. But it is not me that you need to convince. ;-)

bdittmer · 2020-10-03T18:36:47Z

Hi @ulrfa, I just opened #362 so you can get a better idea of what I'm thinking. I don't think it adds complexity or overhead, and for those of us who aren't fortunate enough to have infrastructure for supporting prometheus metrics I think this provides a nice middle ground.

Let me know what you think, thanks!

ulrfa · 2020-10-05T10:17:00Z

#362 implements only a subset of the Prometheus functionality. I’m lacking for example the Histogram type and Labels for all types. Prometheus has many methods for configuring labels in different ways. Others might lack other functionality. I belive it would become complex when considering all that.

I think I prefer having knowledge about counters as local as possible, e.g. in the file where they are incremented. Not also listing counters in a central file like:

bazel-remote/metric/prometheus/prometheus.go

Lines 19 to 29 in 94d39cb

    
           var help = map[string]string{ 
        
           	"bazel_remote_disk_cache_hits":                    "The total number of disk backend cache hits", 
        
           	"bazel_remote_disk_cache_misses":                  "The total number of disk backend cache misses", 
        
           	"bazel_remote_disk_cache_size_bytes":              "The current number of bytes in the disk backend", 
        
           	"bazel_remote_disk_cache_evicted_bytes_total":     "The total number of bytes evicted from disk backend, due to full cache", 
        
           	"bazel_remote_disk_cache_overwritten_bytes_total": "The total number of bytes removed from disk backend, due to put of already existing key", 
        
           	"bazel_remote_http_cache_hits":                    "The total number of HTTP backend cache hits", 
        
           	"bazel_remote_http_cache_misses":                  "The total number of HTTP backend cache misses", 
        
           	"bazel_remote_s3_cache_hits":                      "The total number of s3 backend cache hits", 
        
           	"bazel_remote_s3_cache_misses":                    "The total number of s3 backend cache misses", 
        
           }

mostynb · 2020-10-07T23:22:45Z

So far, I think #358 looks like the best way forward. It looks flexible, and I think it would give a clean integration point for downstream forks by replacing the metricsdecorator package with custom metrics support. @bdittmer - can you confirm this? Would there be any serious downsides for you?

bdittmer · 2020-10-08T00:58:09Z

Sorry, I have been busy with other work and haven't had an opportunity to respond to @ulrfa regarding his comments on my metrics abstraction proof of concept, #362.

My main issue is unresolved in #358: we don't use Prometheus for metrics collection and we don't need the hard dependency on Prometheus. Users shouldn't be forced to patch the upstream or depend on Prometheus if they don't use it. I think parts of #358 are good and other parts are not. It adds a lot of complexity that can be implemented by users in their own fork by implementing a simple interface. #362 is the best of both worlds: if a user wants to collect all their metrics on the cache side (we do not and don't think others should) than they can create their own metrics.Collector and either patch main.go or create their own main package. I am very interested to know how many users are actually interested in the functionality #358 introduces and how many are collecting metrics via their cache.

I'll comment on #358 tomorrow and respond to @ulrfa opinions tomorrow, thanks for keeping this on your radar @mostynb!

mostynb · 2020-10-12T21:50:37Z

My main issue is unresolved in #358: we don't use Prometheus for metrics collection and we don't need the hard dependency on Prometheus. Users shouldn't be forced to patch the upstream or depend on Prometheus if they don't use it.

From what I can see the only references to prometheus in #358 are in the metricsdecorator package (if we were to remove the preexisting metrics code), and that package is only referenced from main. Would that make it easy to skip this code and the prometheus dependency in your fork?

I am very interested to know how many users are actually interested in the functionality #358 introduces and how many are collecting metrics via their cache.

I don't have much insight into how bazel-remote is used in the wild, unfortunately.

bdittmer · 2020-10-14T20:48:45Z

If the Prometheus metrics are removed from other places (e.g. https://github.com/buchgr/bazel-remote/pull/358/files#diff-2a4bb11e9a9fb8613e7bb7dd4637b2e750894407c702f8bbf2584775f0127444L23-L24) and isolated to a single package that would help and that's exactly what #362 is trying to accomplish. The difference is #362 leaves the current metrics in place and makes introducing new metrics collection systems (for instance, statsd) or more complicated metrics logic (like applying labels) trivial. A user can create a new package with their metrics implementation and either patch it into main.go or roll their own main package.

mostynb · 2020-10-15T21:48:34Z

If #358 were to land, we would remove the preexisting metrics. Hopefully we can find a setup that works for both scenarios (prometheus with optional fancy labels, and drop-in replacements for other systems).

@bdittmer: is statsd what you're using?

kragniz · 2021-01-24T15:46:57Z

Is this still blocking patches for AC hit rate metrics?

I'm personally not sure allowing drop-in replacements for various instrumentation systems is worth the complexity. Could downstream integrations be based off the prometheus endpoint instead (like a reverse statsd_exporter)?

mostynb · 2021-01-26T08:09:25Z

@kragniz: I think the main blocker is that discussion stalled in #358 and I haven't nad the mental bandwidth to drive this. But your comment here seems to have restarted discussion in #358, so I'll try to re-read through it again tomorrow to refresh my memory.

ulrfa mentioned this issue Sep 25, 2020

Add AC hit rate metrics with prometheus labels #350

Draft

ulrfa mentioned this issue Sep 26, 2020

Add AC hit rate metrics with prometheus labels (alternative implementation with decorator) #358

Open

bdittmer mentioned this issue Oct 3, 2020

Introduce metric.Collector interface #362

Closed

bdittmer closed this as completed Mar 11, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Metrics interface(s) #355

Metrics interface(s) #355

bdittmer commented Sep 24, 2020

mostynb commented Sep 24, 2020

ulrfa commented Sep 26, 2020

bdittmer commented Sep 28, 2020

ulrfa commented Sep 29, 2020

bdittmer commented Sep 29, 2020 •

edited

Loading

ulrfa commented Sep 30, 2020

bdittmer commented Oct 3, 2020

ulrfa commented Oct 5, 2020

mostynb commented Oct 7, 2020

bdittmer commented Oct 8, 2020 •

edited

Loading

mostynb commented Oct 12, 2020

bdittmer commented Oct 14, 2020

mostynb commented Oct 15, 2020

kragniz commented Jan 24, 2021

mostynb commented Jan 26, 2021

Metrics interface(s) #355

Metrics interface(s) #355

Comments

bdittmer commented Sep 24, 2020

mostynb commented Sep 24, 2020

ulrfa commented Sep 26, 2020

bdittmer commented Sep 28, 2020

ulrfa commented Sep 29, 2020

bdittmer commented Sep 29, 2020 • edited Loading

ulrfa commented Sep 30, 2020

bdittmer commented Oct 3, 2020

ulrfa commented Oct 5, 2020

mostynb commented Oct 7, 2020

bdittmer commented Oct 8, 2020 • edited Loading

mostynb commented Oct 12, 2020

bdittmer commented Oct 14, 2020

mostynb commented Oct 15, 2020

kragniz commented Jan 24, 2021

mostynb commented Jan 26, 2021

bdittmer commented Sep 29, 2020 •

edited

Loading

bdittmer commented Oct 8, 2020 •

edited

Loading