Add AC hit rate metrics with prometheus labels #350

Draft: wants to merge 1 commit into base: master

Commits on Sep 16, 2020

  1. Add AC hit rate metrics with prometheus labels

    This is a draft, not ready to be merged.
    
    Summary of what is added by this commit:
    
     - Prometheus counter for the cache hit ratio of AC requests only.
     - Support for prometheus labels based on custom HTTP and gRPC headers.
    
    Cache hit ratio for CAS entries is easily misinterpreted. Example:
    
      A typical action cache hit often involves 3 or more HTTP requests:
    
        GET AC 200
        GET CAS 200 (.o file)
        GET CAS 200 (.d file)
        ...
    
      But a cache miss for the same action is typically a single HTTP request:
    
        GET AC 404
    
    The ratio of HTTP GET 200 to HTTP GET 404 responses above does not represent
    the cache hit ratio experienced by the user for actions. The ratio for AC
    requests alone is easier to reason about, especially when AC requests check
    for the existence of CAS dependencies.
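
The difference between the two ratios can be made concrete with a small worked example. This is only a sketch with a hypothetical request log (10 actions, 5 hits and 5 misses, each hit modelled as 1 AC request plus 2 CAS requests); the names `requests` and `hit_ratio` are made up for illustration and are not part of bazel-remote.

```python
from collections import Counter

# Hypothetical request log for 10 actions: 5 cache hits and 5 cache misses.
# Each hit is modelled as 1 AC request plus 2 CAS requests (.o and .d file);
# each miss is a single AC request, as in the example above.
hit = [("AC", 200), ("CAS", 200), ("CAS", 200)]
miss = [("AC", 404)]
requests = hit * 5 + miss * 5


def hit_ratio(log, kinds):
    """Return the share of HTTP 200 responses among requests of the given kinds."""
    counts = Counter(status for kind, status in log if kind in kinds)
    total = counts[200] + counts[404]
    return counts[200] / total if total else 0.0


all_requests_ratio = hit_ratio(requests, {"AC", "CAS"})  # 15 of 20 requests
ac_only_ratio = hit_ratio(requests, {"AC"})              # 5 of 10 actions

print(f"all GET requests: {all_requests_ratio:.2f}")
print(f"AC requests only: {ac_only_ratio:.2f}")
```

In this made-up log the user experienced a hit for half of the actions, which is what the AC-only ratio reports, while the ratio over all GET requests is inflated by the extra CAS downloads that follow each hit.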
    
    The number of AC hits and misses can be compared directly against the numbers
    printed at the end of each build by the bazel client, and against other
    prometheus counters produced by remote execution systems for executed actions.
    
    Understanding the reasons for cache misses is necessary to improve the
    cache hit ratio. It could be that the system has been configured in a way
    that prevents artifacts from being reused between different operating
    systems, or that the cache is only populated by CI jobs on master,
    potentially resulting in cache misses for other users, etc. Such patterns
    become easier to notice if the cache hit ratio can be calculated for
    different categories of builds. Such categories can be set as custom headers
    via bazel flags such as --remote_header=branch=master and applied as
    prometheus labels. The mapping of headers to prometheus labels is controlled
    in bazel-remote's config file.
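
The idea of turning request headers into counter labels can be sketched roughly as follows. Everything here is an assumption made for illustration: `LABEL_CONFIG`, `labels_from_headers` and `record_ac_request` are hypothetical names, the config shape is not bazel-remote's actual config syntax, and folding unlisted values into "other" is one possible way to keep the number of label combinations bounded, not necessarily what this commit does. A plain `Counter` stands in for a labelled prometheus counter.

```python
from collections import Counter

# Hypothetical mapping: header name -> label values that are tracked.
# This is NOT the real syntax of bazel-remote's config file.
LABEL_CONFIG = {
    "branch": {"master", "release"},
    "os": {"linux", "macos"},
}


def labels_from_headers(headers, config=LABEL_CONFIG):
    """Derive a tuple of (label, value) pairs from request headers."""
    # HTTP header names are case-insensitive, so compare them in lower case.
    lowered = {name.lower(): value for name, value in headers.items()}
    labels = []
    for name, allowed in sorted(config.items()):
        value = lowered.get(name, "")
        # Assumption: missing or unlisted values are folded into "other".
        labels.append((name, value if value in allowed else "other"))
    return tuple(labels)


# Stand-in for a labelled counter: one count per label combination.
ac_requests = Counter()


def record_ac_request(headers, found):
    """Count one AC request under its labels plus a hit/miss status label."""
    status = ("status", "hit" if found else "miss")
    ac_requests[labels_from_headers(headers) + (status,)] += 1


record_ac_request({"branch": "master", "os": "linux"}, found=True)
record_ac_request({"Branch": "feature-x"}, found=False)

for labels, count in sorted(ac_requests.items()):
    print(dict(labels), count)
```

With counts kept per label combination, a hit ratio can be computed separately for, say, builds on master and all other builds.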
    
    The ratio between cache uploads and cache misses is also relevant, as it
    shows which categories are not uploading their results. The ratio of cache
    uploads can also indicate whether much is uploaded but seldom requested.
    E.g. does it make sense to populate central caches from interactive builds
    or only from CI?
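
As a rough illustration of the upload-to-miss comparison, the sketch below computes the ratio per category from made-up counts. The category names, the numbers and the function `upload_to_miss_ratio` are all hypothetical; real values would come from the prometheus counters.

```python
def upload_to_miss_ratio(counts):
    """Map each category to uploads / misses, or None when it had no misses."""
    ratios = {}
    for category, c in counts.items():
        misses = c.get("miss", 0)
        ratios[category] = c.get("upload", 0) / misses if misses else None
    return ratios


# Hypothetical per-category counts, for illustration only.
counts = {
    "ci-master": {"miss": 100, "upload": 100},
    "interactive": {"miss": 400, "upload": 20},
    "monitoring": {"miss": 0, "upload": 0},
}

ratios = upload_to_miss_ratio(counts)
for category, ratio in ratios.items():
    print(category, "n/a" if ratio is None else f"{ratio:.2f}")
```

Under these assumed numbers, a ratio close to 1 suggests that most misses in a category are followed by an upload, while a ratio close to 0 suggests that the category reads from the cache but rarely populates it.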
    
    Categories and custom headers could also be set to get an overview of:
     - Which bazel versions are using a cache instance?
     - How much do separate organizations use a cache instance?
     - From which network does the traffic originate?
     - Which products are built using the cache?
     - Does the traffic come via a proxy adding its own headers?
     - Which requests are dummy requests for monitoring the cache, as opposed
       to real requests?
     - ...
    ulrfa committed Sep 16, 2020
    25e244e