add prometheus metrics by Rob-Johnson · Pull Request #16 · Yelp/nerve

Rob-Johnson · 2026-02-06T15:23:20Z

I've tried to split into smaller commits to make this easier to review.

adds a /metrics prometheus endpoint to nerve. this means running a webserver in nerve, in a background thread.

I've added an initial set of metrics that should give us some insights on nerve itself, and its interactions with zookeeper. also added the build_info metric so we can compare performance on new releases (we'll be iterating on this loads now right?)

the prom server is off by default - enabled by a feature toggle

nemacysts

we might wanna keep an eye on nerve memory use after this to make sure this isn't ballooning in size - but seems fine to me otherwise!

nemacysts · 2026-02-06T16:09:34Z

lib/nerve/reporter/zookeeper.rb

          statsd.increment("nerve.reporter.zk.client.created", tags: ["zk_cluster:#{@zk_cluster}"])
        end
        @zk = @@zk_pool[@zk_connection_string]
+        prom_set(:zk_pool_size, @@zk_pool_count[@zk_connection_string], labels: {zk_cluster: @zk_cluster})


ruby is a magical language

(and by that i mean: what in the perl is ruby doing)

nemacysts · 2026-02-06T16:12:58Z

lib/nerve/reporter/zookeeper.rb

        statsd.time("nerve.reporter.zk.delete.elapsed_time", tags: ["zk_cluster:#{@zk_cluster}"]) do
          @zk.delete(@full_key, ignore: :no_node)
        end
+        prom_observe(:zk_operation_duration_seconds, Time.now - prom_start, labels: {zk_cluster: @zk_cluster, operation: "delete"})


i dunno the right ruby words, but i sorta wonder if we should see how hard it'd be to make some sort of context-manager-like thing (like what i assume statsd.time is) so that we don't need to repeat the Time.now bits everywhere?

yeah, great catch. fixed
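
For reference, the context-manager-like idea maps onto a Ruby method that takes a block. What follows is a minimal sketch, not the helper that landed in this PR: `timed` and the `sink` callable are hypothetical names standing in for whatever actually records the observation.

```ruby
# Hypothetical helper: `timed` and `sink` are illustrative names, not this PR's API.
# Runs the block, then reports elapsed seconds to `sink`, even if the block raises.
def timed(sink, labels: {})
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
ensure
  sink.call(Process.clock_gettime(Process::CLOCK_MONOTONIC) - start, labels)
end

observations = []
sink = ->(seconds, labels) { observations << [seconds, labels] }

# The caller no longer handles start/stop timestamps itself.
result = timed(sink, labels: { operation: "delete" }) { :deleted }
```

Because the reporting sits in `ensure`, a failed operation is still timed, and the method still returns the block's value on success.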

Add PrometheusMetrics module with WEBrick HTTP server, registry, helper methods (prom_inc/prom_set/prom_observe), and build_info gauge. Wire into Nerve, ServiceWatcher, and Reporter::Base. Configuration via "prometheus" block in nerve config (enabled, port, bind, histogram_buckets_zk, histogram_buckets_main_loop). Server startup deferred until after --check-config early exit. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add watcher gauges (desired/running/up/down, repeated_report_failures_max), counters (config_reloads, watcher_launches/stops/throttled, report_results, reporter_ping_results), and main_loop_duration histogram. Expose repeated_report_failures attr_reader on ServiceWatcher for main loop aggregation. update_prom_gauges recomputes aggregate state each iteration. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add zk_connected gauge (1/0 per cluster), zk_pool_size gauge, zk_write_failures_total counter (primary alerting metric), and zk_operation_duration_seconds histogram for create/save/delete ops. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
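
The first commit message names the keys of the "prometheus" config block. A rough sketch of how such a block might be read with defaults is below. Only "off by default" comes from this PR; the port and bind values are placeholders, and `PROM_DEFAULTS` and `prometheus_settings` are hypothetical names.

```ruby
# Hypothetical defaults. Only "enabled" => false is stated in the PR description;
# the port and bind values are placeholders, not nerve's real defaults.
PROM_DEFAULTS = {
  "enabled" => false,
  "port"    => 9999,
  "bind"    => "127.0.0.1",
}.freeze

# Merge the optional "prometheus" block from a parsed config hash over the defaults.
# Histogram bucket keys are passed through untouched when present.
def prometheus_settings(config)
  PROM_DEFAULTS.merge(config.fetch("prometheus", {}))
end

settings = prometheus_settings("prometheus" => { "enabled" => true, "port" => 1234 })
```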

giuli007

@Rob-Johnson late but worth mentioning

1. a nit about time.now reported inline
2. crash if port is used, also reported by llm

[High] Enabling Prometheus can crash the Nerve main loop if the HTTP server fails to bind (for example, port already in use). PrometheusMetrics.configure calls WEBrick::HTTPServer.new without rescue and the exception bubbles up, and Nerve#run re-raises, terminating the process. Consider rescuing Errno::EADDRINUSE (and similar) and either disabling metrics or logging and continuing. lib/nerve/prometheus_metrics.rb:32-46, lib/nerve.rb:171-216.

I guess we might decide to just fail-fast and not follow the advice though?

giuli007 · 2026-02-16T18:25:18Z

lib/nerve.rb


      statsd.time("nerve.main_loop.elapsed_time") do
        until $EXIT
+          main_loop_start = Time.now


an llm pointed out to me

[Low] main_loop_duration_seconds uses Time.now for elapsed time, which is wall‑clock and can jump backward/forward (NTP or clock changes), skewing histogram values. Use Process.clock_gettime(Process::CLOCK_MONOTONIC) like prom_time does. lib/nerve.rb:97,210, lib/nerve/prometheus_metrics.rb:197-201.

which sounds reasonable.
I remember always pointing this out when people were using time.now() in python.
Maybe it's not the end of the world as we might see hiccups only during daylight savings but if it's easy to change maybe we should?

yeah good catch - I'll open a PR
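
A small sketch of the suggested change, using only Ruby's standard `Process.clock_gettime`. The loop body and the `durations` array are stand-ins for nerve's main loop and its histogram, and `monotonic_seconds` is a hypothetical helper name.

```ruby
# Illustrative only: `monotonic_seconds` is a hypothetical helper name.
# CLOCK_MONOTONIC is not affected by wall-clock adjustments, so differences
# between two readings are never negative.
def monotonic_seconds
  Process.clock_gettime(Process::CLOCK_MONOTONIC)
end

durations = []
3.times do
  loop_start = monotonic_seconds
  sleep 0.001 # stand-in for one main loop iteration
  durations << (monotonic_seconds - loop_start)
end
```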

Rob-Johnson · 2026-02-16T19:00:15Z

@Rob-Johnson late but worth mentioning

1. a nit about time.now reported inline

2. crash if port is used, also reported by llm

[High] Enabling Prometheus can crash the Nerve main loop if the HTTP server fails to bind (for example, port already in use). PrometheusMetrics.configure calls WEBrick::HTTPServer.new without rescue and the exception bubbles up, and Nerve#run re-raises, terminating the process. Consider rescuing Errno::EADDRINUSE (and similar) and either disabling metrics or logging and continuing. lib/nerve/prometheus_metrics.rb:32-46, lib/nerve.rb:171-216.

I guess we might decide to just fail-fast and not follow the advice though?

I think if we've configured prometheus but we can't bind to the port, we should fail-fast rather than just continuing without prom enabled
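
One way to keep fail-fast while making the failure easier to diagnose is to wrap the bind error in a more descriptive one. This is a sketch under assumptions: it uses a plain `TCPServer` from the standard library rather than the WEBrick server from the PR, and `MetricsBindError` and `bind_metrics_listener` are hypothetical names.

```ruby
require "socket"

# Hypothetical error class, not part of nerve.
class MetricsBindError < StandardError; end

# Hypothetical helper: opens the listening socket or raises a descriptive error.
# The original exception stays available through `#cause`.
def bind_metrics_listener(bind, port)
  TCPServer.new(bind, port)
rescue SystemCallError => e
  raise MetricsBindError, "prometheus enabled but cannot bind #{bind}:#{port} (#{e.class})"
end
```

The process still terminates if the caller does not rescue `MetricsBindError`, which matches the fail-fast preference above.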

Rob-Johnson requested review from cuza, giuli007, ilkinmammadzada and nemacysts February 6, 2026 15:23

Rob-Johnson force-pushed the u/robj/metrics branch 2 times, most recently from b8cb592 to f3f3859 Compare February 6, 2026 15:37

nemacysts approved these changes Feb 6, 2026

Rob-Johnson force-pushed the u/robj/metrics branch from e620917 to 4421db4 Compare February 9, 2026 14:36

nemacysts approved these changes Feb 9, 2026

Rob-Johnson force-pushed the u/robj/metrics branch from 4421db4 to f5ec07d Compare February 16, 2026 12:43

ilkinmammadzada approved these changes Feb 16, 2026

Rob-Johnson force-pushed the u/robj/metrics branch from f5ec07d to 4fca524 Compare February 16, 2026 13:01

ilkinmammadzada approved these changes Feb 16, 2026

Rob-Johnson and others added 3 commits February 16, 2026 08:01

Rob-Johnson force-pushed the u/robj/metrics branch from 8c4780c to cbfb774 Compare February 16, 2026 16:04

Rob-Johnson merged commit 5d791bd into master Feb 16, 2026
2 checks passed

giuli007 reviewed Feb 16, 2026

Rob-Johnson mentioned this pull request Mar 4, 2026

add cli config options #23

Merged
