Introduce generic metrics for caches #1804

aaronmondal · 2025-05-27T04:20:44Z

The naming scheme tries to follow the OpenTelemetry semantic conventions closely, but we might need to adjust it slightly in the future. Potentially expensive attribute allocation is zero-cost at runtime, and despite the fairly heavy duration measurements there doesn't seem to be a relevant performance impact, as all operations are in the nanosecond range when working with a memory-backed `EvictingMap`.

As an example, we reinstrument the `EvictingMap` and slightly change its implementation to more clearly reflect the metrics we care about.
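
For readers unfamiliar with the pattern, here is a minimal sketch of what generic cache instrumentation can look like. It is not the code from this PR: `CacheMetrics`, `TinyCache`, and all field names are hypothetical, plain atomics stand in for OpenTelemetry instruments, and a simple FIFO policy stands in for the real eviction logic of `EvictingMap`.

```rust
use std::collections::{HashMap, VecDeque};
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Instant;

/// Hypothetical metrics bundle (illustrative names, not the PR's API).
#[derive(Default)]
pub struct CacheMetrics {
    pub hits: AtomicU64,
    pub misses: AtomicU64,
    pub evictions: AtomicU64,
    /// Total wall-clock nanoseconds spent in read operations.
    pub read_nanos: AtomicU64,
}

impl CacheMetrics {
    /// Runs `op`, adds its duration to `read_nanos`, and returns its result.
    fn timed_read<T>(&self, op: impl FnOnce() -> T) -> T {
        let start = Instant::now();
        let out = op();
        let nanos = start.elapsed().as_nanos() as u64;
        self.read_nanos.fetch_add(nanos, Ordering::Relaxed);
        out
    }
}

/// Toy in-memory cache that evicts the oldest inserted key first (FIFO).
pub struct TinyCache {
    capacity: usize,
    entries: HashMap<String, Vec<u8>>,
    order: VecDeque<String>,
    pub metrics: CacheMetrics,
}

impl TinyCache {
    pub fn new(capacity: usize) -> Self {
        Self {
            capacity,
            entries: HashMap::new(),
            order: VecDeque::new(),
            metrics: CacheMetrics::default(),
        }
    }

    /// Looks up `key`, recording the lookup duration and a hit or a miss.
    pub fn get(&self, key: &str) -> Option<&Vec<u8>> {
        let found = self.metrics.timed_read(|| self.entries.get(key));
        let counter = if found.is_some() {
            &self.metrics.hits
        } else {
            &self.metrics.misses
        };
        counter.fetch_add(1, Ordering::Relaxed);
        found
    }

    /// Inserts `key`, then evicts the oldest keys until the cache fits.
    pub fn insert(&mut self, key: String, value: Vec<u8>) {
        // Only track insertion order for keys that were not present before.
        if self.entries.insert(key.clone(), value).is_none() {
            self.order.push_back(key);
        }
        while self.entries.len() > self.capacity {
            match self.order.pop_front() {
                Some(oldest) => {
                    self.entries.remove(&oldest);
                    self.metrics.evictions.fetch_add(1, Ordering::Relaxed);
                }
                None => break,
            }
        }
    }
}
```

Inserting three distinct keys into `TinyCache::new(2)` evicts the oldest one, so a later `get` on that key is counted as a miss and the eviction counter reads 1. Relaxed atomics are enough here because each counter is independent and only ever incremented.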


aaronmondal

+@jaroeichler cc @kubevalet @luis-munoz

Reviewable status: 0 of 1 LGTMs obtained, and 0 of 9 files reviewed, and pending CI: Analyze (javascript-typescript), Analyze (python), Bazel Dev / macos-15, Bazel Dev / ubuntu-24.04, Cargo Dev / macos-15, Cargo Dev / ubuntu-24.04, Coverage, Installation / macos-14, Installation / macos-15, Installation / ubuntu-22.04, Installation / ubuntu-24.04, Local / bazel / ubuntu-24.04, Local / lre-cc / ubuntu-24.04, Local / lre-rs / macos-15, Local / lre-rs / ubuntu-24.04, NativeLink.com Cloud / Remote Cache / macos-15, NativeLink.com Cloud / Remote Cache / ubuntu-24.04, Publish image, Publish nativelink-worker-init, Publish nativelink-worker-lre-cc, Remote / lre-cc / xlarge-ubuntu-24.04, Remote / lre-rs / xlarge-ubuntu-24.04, asan / ubuntu-24.04, buildstream, integration-tests (24.04), macos-15, pre-commit-checks, ubuntu-24.04, ubuntu-24.04 / stable, windows-2022 / stable (waiting on @jaroeichler)

aaronmondal

@jaroeichler FYI, you can apply the following manifests before you apply the gitrepo. You'd want these to be online before the NL deployment. Note that tempo likely crashes, but two instances are good enough:

---
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMSingle
metadata:
  name: vmsingle
  namespace: default
spec:
  retentionPeriod: 1d
---
apiVersion: operator.victoriametrics.com/v1beta1
kind: VLogs
metadata:
  name: vlogs
  namespace: default
spec:
  retentionPeriod: 1d
  image: 
    repository: victoriametrics/victoria-logs
    tag: v1.18.0-victorialogs
---
apiVersion: tempo.grafana.com/v1alpha1
kind: TempoMonolithic
metadata:
  name: tempo
  namespace: default
spec:
  observability:
    grafana:
      dataSource:
        enabled: false
  jaegerui:
    enabled: true
    ingress:
      enabled: true
    resources:
      limits:
        cpu: '2'
        memory: 2Gi
  resources:
    limits:
      cpu: '2'
      memory: 2Gi
  storage:
    traces:
      backend: memory
---
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-collector
  # TODO(aaronmondal): Should probably run in a dedicated namespace.
  namespace: default
spec:
  mode: daemonset
  image: otel/opentelemetry-collector-contrib:0.123.0
  # Need to understand the required resources first...
  # resources:
  #   limits:
  #     memory: 2Gi
  #     cpu: 4
  observability:
    metrics:
      enableMetrics: true
  config:
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
            max_recv_msg_size_mib: 16
            read_buffer_size: 1048576    # 1 MiB
    processors:
      memory_limiter:
        check_interval: 1s
        limit_percentage: 75
        spike_limit_percentage: 10

      batch/metrics:
        send_batch_size: 2000
        timeout: 5s
        send_batch_max_size: 4000

      batch/logs:
        send_batch_size: 1000
        timeout: 3s
        send_batch_max_size: 2000

      batch/traces:
        send_batch_size: 500
        timeout: 2s
        send_batch_max_size: 1000

      batch:
        send_batch_size: 512
        timeout: 5s
        send_batch_max_size: 1024
    exporters:
      prometheusremotewrite:
        endpoint: "http://vmsingle-vmsingle.default.svc:8429/prometheus/api/v1/write"
        tls:
          insecure: true
      otlphttp:
        logs_endpoint: "http://vlogs-vlogs.default.svc.cluster.local:9428/insert/opentelemetry/v1/logs"
        tls:
          insecure: true
        sending_queue:
          queue_size: 3000
      otlp:
        endpoint: "http://tempo-tempo.default.svc.cluster.local:4317"
        compression: zstd
        tls:
          insecure: true
        sending_queue:
          queue_size: 3000
    service:
      pipelines:
        metrics:
          receivers: [otlp]
          processors: [memory_limiter, batch/metrics]
          exporters: [prometheusremotewrite]
        logs:
          receivers: [otlp]
          processors: [memory_limiter, batch/logs]
          exporters: [otlphttp]
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch/traces]
          exporters: [otlp]
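
As a rough guide to the `memory_limiter` percentages above: per the processor's documentation, `limit_percentage` sets the hard limit as a share of the total available memory, and the soft limit is that value minus the share given by `spike_limit_percentage`. The sketch below only works through that arithmetic. The function `memory_limits` and the 2048 MiB total are assumptions for illustration, since the resource limits in the manifest are still commented out.

```rust
/// Hard and soft limits in MiB, derived from percentage settings.
#[derive(Debug, PartialEq)]
pub struct Limits {
    pub hard_mib: u64,
    pub soft_mib: u64,
}

/// Hypothetical helper mirroring the documented calculation:
/// hard = total * limit_percentage / 100,
/// soft = hard - total * spike_limit_percentage / 100.
pub fn memory_limits(total_mib: u64, limit_pct: u64, spike_pct: u64) -> Limits {
    let hard_mib = total_mib * limit_pct / 100;
    let spike_mib = total_mib * spike_pct / 100;
    Limits {
        hard_mib,
        // The spike limit must stay below the hard limit; saturate so a bad
        // input yields 0 instead of an underflow panic.
        soft_mib: hard_mib.saturating_sub(spike_mib),
    }
}
```

Calling `memory_limits(2048, 75, 10)` gives a hard limit of 1536 MiB and, with integer division, a soft limit of 1332 MiB.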

And then this to port-forward:

#!/usr/bin/env bash
set -euo pipefail

forwards=(
  "svc/vlogs-vlogs 9428:9428"
  "svc/vmsingle-vmsingle 8429:8429"
  "svc/tempo-tempo-jaegerui 16686:16686"
)

i=0
pids=()
labels=()

cleanup() {
  echo
  echo "🛑  Shutting down..."
  for pid in "${pids[@]}"; do kill "$pid" 2>/dev/null || true; done
  wait
  exit
}
trap cleanup SIGINT SIGTERM

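for entry in "${forwards[@]}"; do
  # Use $(( )) rather than (( i++ )), which returns non-zero when i is 0
  # and would trip `set -e`.
  i=$(( i + 1 ))

  read -r resource ports <<<"$entry"
  label="fwd#$i"
  labels+=("$label")

  echo "▶️  [$label] kubectl port-forward $resource $ports"
  # Route output through process substitution instead of a pipe so that
  # $! is the PID of kubectl itself (not of sed) and cleanup can kill it.
  kubectl port-forward "$resource" "$ports" \
    > >(sed "s/^/[$label] /") 2>&1 &
  pids+=("$!")
done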

wait


aaronmondal temporarily deployed to production May 27, 2025 04:21 — with GitHub Actions Inactive

aaronmondal assigned jaroeichler May 27, 2025
