This is a draft, not ready to be merged.
Summary of what is added by this commit:
- A Prometheus counter for the cache hit ratio of AC requests only.
- Support for Prometheus labels based on custom HTTP and gRPC headers.
Cache hit ratio for CAS entries is easily misinterpreted. Example:
A typical action cache hit often involves 3 or more HTTP requests:
GET AC 200
GET CAS 200 (.o file)
GET CAS 200 (.d file)
...
But a cache miss for the same action is typically a single HTTP request:
GET AC 404
The ratio of HTTP GET 200 to HTTP GET 404 responses above does not represent
the cache hit ratio experienced by the user for actions. The ratio of only
AC requests is easier to reason about, especially when AC requests check for
the existence of CAS dependencies.
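To make the difference concrete, here is a minimal Go sketch that computes both ratios for the request pattern above: one cached action (an AC hit plus two CAS downloads) and one uncached action. The `request` type and the `hitRatio`, `all` and `onlyAC` helpers are hypothetical names used for illustration only; they are not part of bazel-remote.

```go
package main

import "fmt"

// request is a simplified, hypothetical record of one cache GET request.
// Kind is "AC" or "CAS"; Status is the HTTP status code.
type request struct {
	Kind   string
	Status int
}

// hitRatio returns hits/(hits+misses) over the requests accepted by keep,
// counting 200 as a hit and 404 as a miss. It returns 0 if nothing matches.
func hitRatio(reqs []request, keep func(request) bool) float64 {
	var hits, misses int
	for _, r := range reqs {
		if !keep(r) {
			continue
		}
		switch r.Status {
		case 200:
			hits++
		case 404:
			misses++
		}
	}
	if hits+misses == 0 {
		return 0
	}
	return float64(hits) / float64(hits+misses)
}

// all accepts every request; onlyAC accepts action cache requests only.
func all(request) bool      { return true }
func onlyAC(r request) bool { return r.Kind == "AC" }

func main() {
	reqs := []request{
		{"AC", 200}, {"CAS", 200}, {"CAS", 200}, // cached action
		{"AC", 404}, // uncached action
	}
	fmt.Printf("all GET requests: %.2f\n", hitRatio(reqs, all))    // 0.75
	fmt.Printf("AC requests only: %.2f\n", hitRatio(reqs, onlyAC)) // 0.50
}
```

Half of the actions were served from the cache, which is what the AC-only ratio reports; the ratio over all requests overstates it.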
The number of AC hits and misses can be directly compared against the numbers
printed at the end of each build by the bazel client, and against other
Prometheus counters produced by remote execution systems for executed actions.
Improving the cache hit ratio requires understanding the reasons for cache
misses. It could be that the system has been configured in a way that prevents
artifacts from being reused between different operating systems. Or that the
cache is only populated by CI jobs on master, potentially resulting in cache
misses for other users, etc. Such patterns become easier to notice if the
cache hit ratio can be calculated for different categories of builds.
Such categories can be set as custom headers via bazel flags such as
--remote_header=branch=master and applied as Prometheus labels. The mapping of
headers to Prometheus labels is controlled in bazel-remote's config file.
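The header-to-label step could look roughly like the Go sketch below. It is an illustration under assumptions, not the code or config schema of this commit: the `labelMapping` type, its fields, and `labelsFromLookup` are hypothetical. The lookup is passed as a function so the same logic can serve HTTP headers and gRPC metadata. Restricting values to an allowed set with a default is a design choice of the sketch, made to keep the number of distinct label values bounded.

```go
package main

import (
	"fmt"
	"net/http"
)

// labelMapping is a hypothetical rule mapping one request header to one
// Prometheus label. It is not bazel-remote's actual config schema.
type labelMapping struct {
	Header  string   // header to read, e.g. "branch"
	Label   string   // label name to emit
	Allowed []string // header values accepted as label values
	Default string   // used when the header is absent or not allowed
}

// labelsFromLookup resolves every rule using get, which returns the value of
// a named header (or "" if absent). Unknown values fall back to the default.
func labelsFromLookup(get func(name string) string, rules []labelMapping) map[string]string {
	labels := make(map[string]string, len(rules))
	for _, rule := range rules {
		value := rule.Default
		got := get(rule.Header)
		for _, allowed := range rule.Allowed {
			if got == allowed {
				value = got
				break
			}
		}
		labels[rule.Label] = value
	}
	return labels
}

func main() {
	rules := []labelMapping{
		{Header: "branch", Label: "branch", Allowed: []string{"master"}, Default: "other"},
	}
	h := http.Header{}
	h.Set("branch", "master") // as sent with --remote_header=branch=master
	fmt.Println(labelsFromLookup(h.Get, rules))
}
```

The resulting map could then be passed as label values to a labelled counter when the AC hit or miss is recorded.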
The ratio between cache uploads and cache misses is also relevant, as an
indication of which categories are not uploading their results. The ratio of
cache uploads can also indicate whether much is uploaded but seldom requested.
E.g. does it make sense to populate central caches from interactive builds, or
only from CI?
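As a rough illustration of reading these ratios per category, the Go sketch below compares two categories. The `counts` type, its methods, and all numbers are made up for the example; in practice the values would come from the labelled Prometheus counters.

```go
package main

import "fmt"

// counts holds hypothetical per-category AC counters.
type counts struct {
	Hits, Misses, Uploads int
}

// ratio returns num/den, or 0 when den is 0.
func ratio(num, den int) float64 {
	if den == 0 {
		return 0
	}
	return float64(num) / float64(den)
}

// hitRatio is the share of AC requests that were hits.
func (c counts) hitRatio() float64 { return ratio(c.Hits, c.Hits+c.Misses) }

// uploadsPerMiss shows whether misses are followed by uploads.
func (c counts) uploadsPerMiss() float64 { return ratio(c.Uploads, c.Misses) }

func main() {
	// Made-up numbers, for illustration only.
	byBranch := map[string]counts{
		"master": {Hits: 90, Misses: 10, Uploads: 10},
		"other":  {Hits: 20, Misses: 80, Uploads: 0},
	}
	for _, name := range []string{"master", "other"} {
		c := byBranch[name]
		fmt.Printf("%s: hit ratio %.2f, uploads per miss %.2f\n",
			name, c.hitRatio(), c.uploadsPerMiss())
	}
}
```

In this made-up data, the "other" category misses often and never uploads, which is the kind of pattern the labels are meant to expose.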
Categories and custom headers could also be used to get an overview of:
- Which Bazel versions are using a cache instance?
- How much are separate organizations using a cache instance?
- Which network does the traffic originate from?
- Which products are built using the cache?
- Does the traffic come via a proxy that adds its own headers?
- Which requests are dummy requests for monitoring the cache, and which are real?
- ...