Introduce generic metrics for caches #1804


Open · wants to merge 1 commit into main

Conversation

@aaronmondal (Contributor) commented May 27, 2025

The naming scheme tries to follow the OpenTelemetry semantic conventions closely, but we might need to adjust it slightly in the future. Potentially expensive attribute allocation is zero-cost at runtime, and despite the fairly heavy duration measurements there doesn't seem to be a relevant performance impact: all operations are in the nanosecond range when working with a memory-backed `EvictingMap`.

As an example, we reinstrument the `EvictingMap` and slightly change its implementation to more clearly reflect the metrics we care about.


@aaronmondal (Author) left a comment:
+@jaroeichler cc @kubevalet @luis-munoz

Reviewable status: 0 of 1 LGTMs obtained, and 0 of 9 files reviewed, and pending CI: Analyze (javascript-typescript), Analyze (python), Bazel Dev / macos-15, Bazel Dev / ubuntu-24.04, Cargo Dev / macos-15, Cargo Dev / ubuntu-24.04, Coverage, Installation / macos-14, Installation / macos-15, Installation / ubuntu-22.04, Installation / ubuntu-24.04, Local / bazel / ubuntu-24.04, Local / lre-cc / ubuntu-24.04, Local / lre-rs / macos-15, Local / lre-rs / ubuntu-24.04, NativeLink.com Cloud / Remote Cache / macos-15, NativeLink.com Cloud / Remote Cache / ubuntu-24.04, Publish image, Publish nativelink-worker-init, Publish nativelink-worker-lre-cc, Remote / lre-cc / xlarge-ubuntu-24.04, Remote / lre-rs / xlarge-ubuntu-24.04, asan / ubuntu-24.04, buildstream, integration-tests (24.04), macos-15, pre-commit-checks, ubuntu-24.04, ubuntu-24.04 / stable, windows-2022 / stable (waiting on @jaroeichler)

@aaronmondal (Author) left a comment:

@jaroeichler FYI, you can use this before you apply the gitrepo (i.e. you'd want these to be online before the NativeLink deployment; note that Tempo likely crashes, but two instances are good enough):

```yaml
---
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMSingle
metadata:
  name: vmsingle
  namespace: default
spec:
  retentionPeriod: 1d
---
apiVersion: operator.victoriametrics.com/v1beta1
kind: VLogs
metadata:
  name: vlogs
  namespace: default
spec:
  retentionPeriod: 1d
  image:
    repository: victoriametrics/victoria-logs
    tag: v1.18.0-victorialogs
---
apiVersion: tempo.grafana.com/v1alpha1
kind: TempoMonolithic
metadata:
  name: tempo
  namespace: default
spec:
  observability:
    grafana:
      dataSource:
        enabled: false
  jaegerui:
    enabled: true
    ingress:
      enabled: true
    resources:
      limits:
        cpu: '2'
        memory: 2Gi
  resources:
    limits:
      cpu: '2'
      memory: 2Gi
  storage:
    traces:
      backend: memory
---
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-collector
  # TODO(aaronmondal): Should probably run in a dedicated namespace.
  namespace: default
spec:
  mode: daemonset
  image: otel/opentelemetry-collector-contrib:0.123.0
  # Need to understand required resources first...
  # resources:
  #   limits:
  #     memory: 2Gi
  #     cpu: 4
  observability:
    metrics:
      enableMetrics: true
  config:
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
            max_recv_msg_size_mib: 16
            read_buffer_size: 1048576    # 1MB
    processors:
      memory_limiter:
        check_interval: 1s
        limit_percentage: 75
        spike_limit_percentage: 10

      batch/metrics:
        send_batch_size: 2000
        timeout: 5s
        send_batch_max_size: 4000

      batch/logs:
        send_batch_size: 1000
        timeout: 3s
        send_batch_max_size: 2000

      batch/traces:
        send_batch_size: 500
        timeout: 2s
        send_batch_max_size: 1000

      batch:
        send_batch_size: 512
        timeout: 5s
        send_batch_max_size: 1024
    exporters:
      prometheusremotewrite:
        endpoint: "http://vmsingle-vmsingle.default.svc:8429/prometheus/api/v1/write"
        tls:
          insecure: true
      otlphttp:
        logs_endpoint: "http://vlogs-vlogs.default.svc.cluster.local:9428/insert/opentelemetry/v1/logs"
        tls:
          insecure: true
        sending_queue:
          queue_size: 3000
      otlp:
        endpoint: "http://tempo-tempo.default.svc.cluster.local:4317"
        compression: zstd
        tls:
          insecure: true
        sending_queue:
          queue_size: 3000
    service:
      pipelines:
        metrics:
          receivers: [otlp]
          processors: [memory_limiter, batch/metrics]
          exporters: [prometheusremotewrite]
        logs:
          receivers: [otlp]
          processors: [memory_limiter, batch/logs]
          exporters: [otlphttp]
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch/traces]
          exporters: [otlp]
```
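If you want a quick readiness probe on the collector, the contrib image also ships the `health_check` extension. A sketch of the fragment you could merge into the collector's `config:` section above (13133 is the extension's default port; treat the exact placement as an assumption to verify against your operator version):

```yaml
# Sketch only: merge into the OpenTelemetryCollector `config:` above.
extensions:
  health_check:
    endpoint: 0.0.0.0:13133
service:
  extensions: [health_check]
```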

And then use this to port-forward:

```bash
#!/usr/bin/env bash
set -euo pipefail

forwards=(
  "svc/vlogs-vlogs 9428:9428"
  "svc/vmsingle-vmsingle 8429:8429"
  "svc/tempo-tempo-jaegerui 16686:16686"
)

i=0
pids=()

cleanup() {
  echo
  echo "🛑  Shutting down..."
  for pid in "${pids[@]}"; do kill "$pid" 2>/dev/null || true; done
  wait
  exit
}
trap cleanup SIGINT SIGTERM

for entry in "${forwards[@]}"; do
  # POSIX-safe increment:
  i=$(( i + 1 ))

  # Split "resource ports" into two variables on whitespace.
  read -r resource ports <<<"$entry"
  label="fwd#$i"

  echo "▶️  [$label] kubectl port-forward $resource $ports"
  kubectl port-forward "$resource" "$ports" \
    2>&1 | sed "s/^/[$label] /" &
  pids+=("$!")
done

wait
```

