Introduce generic metrics for caches #1804
The naming scheme tries to follow the OpenTelemetry semantic conventions closely, but we might need to adjust it slightly in the future. The potentially expensive attribute allocation is zero-cost at runtime, and despite the fairly heavy duration measurements there doesn't seem to be a relevant performance impact: all operations stay in the nanosecond range when working with a memory-backed `EvictingMap`. As an example, we reinstrument the `EvictingMap` and slightly change its implementation to reflect the metrics we care about more clearly.
+@jaroeichler cc @kubevalet @luis-munoz
Reviewable status: 0 of 1 LGTMs obtained, and 0 of 9 files reviewed, and pending CI: Analyze (javascript-typescript), Analyze (python), Bazel Dev / macos-15, Bazel Dev / ubuntu-24.04, Cargo Dev / macos-15, Cargo Dev / ubuntu-24.04, Coverage, Installation / macos-14, Installation / macos-15, Installation / ubuntu-22.04, Installation / ubuntu-24.04, Local / bazel / ubuntu-24.04, Local / lre-cc / ubuntu-24.04, Local / lre-rs / macos-15, Local / lre-rs / ubuntu-24.04, NativeLink.com Cloud / Remote Cache / macos-15, NativeLink.com Cloud / Remote Cache / ubuntu-24.04, Publish image, Publish nativelink-worker-init, Publish nativelink-worker-lre-cc, Remote / lre-cc / xlarge-ubuntu-24.04, Remote / lre-rs / xlarge-ubuntu-24.04, asan / ubuntu-24.04, buildstream, integration-tests (24.04), macos-15, pre-commit-checks, ubuntu-24.04, ubuntu-24.04 / stable, windows-2022 / stable (waiting on @jaroeichler)
@jaroeichler FYI you can use this before you apply the gitrepo (i.e. you'd want these to be online before the NL deployment - note that tempo likely crashes but two instances are good enough):
---
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMSingle
metadata:
  name: vmsingle
  namespace: default
spec:
  retentionPeriod: 1d
---
apiVersion: operator.victoriametrics.com/v1beta1
kind: VLogs
metadata:
  name: vlogs
  namespace: default
spec:
  retentionPeriod: 1d
  image:
    repository: victoriametrics/victoria-logs
    tag: v1.18.0-victorialogs
---
apiVersion: tempo.grafana.com/v1alpha1
kind: TempoMonolithic
metadata:
  name: tempo
  namespace: default
spec:
  observability:
    grafana:
      dataSource:
        enabled: false
  jaegerui:
    enabled: true
    ingress:
      enabled: true
    resources:
      limits:
        cpu: '2'
        memory: 2Gi
  resources:
    limits:
      cpu: '2'
      memory: 2Gi
  storage:
    traces:
      backend: memory
---
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-collector
  # TODO(aaronmondal): Should probably run in a dedicated namespace.
  namespace: default
spec:
  mode: daemonset
  image: otel/opentelemetry-collector-contrib:0.123.0
  # Need to understand required resources first...
  # resources:
  #   limits:
  #     memory: 2Gi
  #     cpu: 4
  observability:
    metrics:
      enableMetrics: true
  config:
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
            max_recv_msg_size_mib: 16
            read_buffer_size: 1048576 # 1 MiB
    processors:
      memory_limiter:
        check_interval: 1s
        limit_percentage: 75
        spike_limit_percentage: 10
      batch/metrics:
        send_batch_size: 2000
        timeout: 5s
        send_batch_max_size: 4000
      batch/logs:
        send_batch_size: 1000
        timeout: 3s
        send_batch_max_size: 2000
      batch/traces:
        send_batch_size: 500
        timeout: 2s
        send_batch_max_size: 1000
      batch:
        send_batch_size: 512
        timeout: 5s
        send_batch_max_size: 1024
    exporters:
      prometheusremotewrite:
        endpoint: "http://vmsingle-vmsingle.default.svc:8429/prometheus/api/v1/write"
        tls:
          insecure: true
      otlphttp:
        logs_endpoint: "http://vlogs-vlogs.default.svc.cluster.local:9428/insert/opentelemetry/v1/logs"
        tls:
          insecure: true
        sending_queue:
          queue_size: 3000
      otlp:
        endpoint: "http://tempo-tempo.default.svc.cluster.local:4317"
        compression: zstd
        tls:
          insecure: true
        sending_queue:
          queue_size: 3000
    service:
      pipelines:
        metrics:
          receivers: [otlp]
          processors: [memory_limiter, batch/metrics]
          exporters: [prometheusremotewrite]
        logs:
          receivers: [otlp]
          processors: [memory_limiter, batch/logs]
          exporters: [otlphttp]
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch/traces]
          exporters: [otlp]
And then this to port-forward:
#!/usr/bin/env bash
set -euo pipefail

forwards=(
  "svc/vlogs-vlogs 9428:9428"
  "svc/vmsingle-vmsingle 8429:8429"
  "svc/tempo-tempo-jaegerui 16686:16686"
)
i=0
pids=()
labels=()

cleanup() {
  echo
  echo "🛑 Shutting down..."
  for pid in "${pids[@]}"; do kill "$pid" 2>/dev/null || true; done
  wait
  exit
}
trap cleanup SIGINT SIGTERM

for entry in "${forwards[@]}"; do
  # POSIX-safe increment:
  i=$(( i + 1 ))
  echo "DEBUG: now i is $i"
  read -r resource ports <<<"$entry"
  label="fwd#$i"
  labels+=("$label")
  echo "▶️ [$label] kubectl port-forward $resource $ports"
  kubectl port-forward "$resource" "$ports" \
    2>&1 | sed "s/^/[$label] /" &
  pids+=("$!")
done

wait
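Once the forwards are running, a quick smoke test can confirm that each backend answers on its local port. The following is only a sketch and not part of the PR: `local_port` and `check_forward` are hypothetical helper names, the `/health` path is assumed to be served by the VictoriaMetrics and VictoriaLogs instances, and the Jaeger UI is simply probed at `/`.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Same entries as in the port-forward script above.
forwards=(
  "svc/vlogs-vlogs 9428:9428"
  "svc/vmsingle-vmsingle 8429:8429"
  "svc/tempo-tempo-jaegerui 16686:16686"
)

# Print the local half of a "resource local:remote" entry.
local_port() {
  local resource ports
  read -r resource ports <<<"$1"
  echo "${ports%%:*}"
}

# Probe one forwarded port; succeeds only on a 2xx/3xx response.
check_forward() {
  local port="$1" path="$2"
  curl --silent --fail --max-time 2 --output /dev/null \
    "http://localhost:${port}${path}"
}

main() {
  local entry port path
  for entry in "${forwards[@]}"; do
    port="$(local_port "$entry")"
    # Assumption: /health for the VictoriaMetrics components, / for the Jaeger UI.
    case "$port" in
      16686) path="/" ;;
      *) path="/health" ;;
    esac
    if check_forward "$port" "$path"; then
      echo "ok   localhost:${port}${path}"
    else
      echo "FAIL localhost:${port}${path}"
    fi
  done
}

# Only probe when executed directly, so the helpers can also be sourced.
if [[ "${BASH_SOURCE[0]:-}" == "${0}" ]]; then
  main "$@"
fi
```

Run it in a second terminal while the port-forward script is active. A `FAIL` line for the Jaeger UI may just mean that Tempo is still restarting.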