Add AC hit rate metrics with prometheus labels #350

This is a draft, not ready to be merged. Summary of what is added by this commit: - Prometheus counter for cache hit ratio of only AC requests. - Support for prometheus labels based on custom HTTP and gRPC headers. Cache hit ratio for CAS entries is easily misinterpreted. Example: A typical action cache hit often involves 3 or more HTTP requests: GET AC 200 GET CAS 200 (.o file) GET CAS 200 (.d file) ... But a cache miss for the same action is typically a single HTTP request: GET AC 404 The ratio between all HTTP GET 200 vs HTTP GET 404 above does not represent the cache hit ratio experienced by the user for actions. The ratio of only AC requests is easier to reason about, especially when AC requests checks existence of CAS dependencies. The number of AC hits and misses can be directly compared against numbers printed in the end of each build by bazel client. And against other prometheus counters produced by remote execution systems for executed actions. An understanding about the reason for cache misses is necessary to improve the cache hit ratio. It could be that the system has been configured in a way that prevent artifacts from being reused between different OS. Or that the cache is only populated by CI jobs on master, potentially resulting in cache misses for other users, etc. It becomes easier to notice such patterns, if cache hit ratio could be calculated for different categories of builds. Such categories can be set as custom headers via bazel flags --remote_header=branch=master and applied as prometheus labels. Mapping of headers to prometheus labels are controlled in bazel-remote's config file. The ratio between cache uploads and cache misses is also relevant, as an view about which categories are not uploading their result. The ratio of cache uploads can also indicate if much is uploaded but seldom requested. E.g. does it make sense to populate central caches from interactive builds or only from CI? Categories and custom headers, could also be set for an overview about: - Bazel versions using a cache instance? - How much separate organizations are using a cache instance? - From which network traffic originates? - Which products are built using the cache? - If the traffic comes via proxy adding its own headers? - Distinguish dummy requests for monitoring the cache, from real requests? - ...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add AC hit rate metrics with prometheus labels #350

Add AC hit rate metrics with prometheus labels #350

Commits on Sep 16, 2020

Add AC hit rate metrics with prometheus labels #350

Are you sure you want to change the base?

Add AC hit rate metrics with prometheus labels #350

Commits on Sep 16, 2020