Skip to content

add prometheus metrics#16

Merged
Rob-Johnson merged 3 commits intomasterfrom
u/robj/metrics
Feb 16, 2026
Merged

add prometheus metrics#16
Rob-Johnson merged 3 commits intomasterfrom
u/robj/metrics

Conversation

@Rob-Johnson
Copy link

I've tried to split into smaller commits to make this easier to review.

adds a /metrics prometheus endpoint to nerve. this means running a webserver in nerve, in a background thread.

I've added an initial set of metrics that should give us some insights on nerve itself, and its interactions with zookeeper. also added the build_info metric so we can compare performance on new releases (we'll be iterating on this loads now right?)

the prom server is off by default - enabled by a feature toggle

@Rob-Johnson Rob-Johnson force-pushed the u/robj/metrics branch 2 times, most recently from b8cb592 to f3f3859 Compare February 6, 2026 15:37
Copy link
Member

@nemacysts nemacysts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we might wanna keep an eye on nerve memory use after this to make sure this isn't ballooning in size - but seems fine to me otherwise!

statsd.increment("nerve.reporter.zk.client.created", tags: ["zk_cluster:#{@zk_cluster}"])
end
@zk = @@zk_pool[@zk_connection_string]
prom_set(:zk_pool_size, @@zk_pool_count[@zk_connection_string], labels: {zk_cluster: @zk_cluster})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ruby is a magical language

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(and by that i mean: what in the perl is ruby doing)

statsd.time("nerve.reporter.zk.delete.elapsed_time", tags: ["zk_cluster:#{@zk_cluster}"]) do
@zk.delete(@full_key, ignore: :no_node)
end
prom_observe(:zk_operation_duration_seconds, Time.now - prom_start, labels: {zk_cluster: @zk_cluster, operation: "delete"})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i dunno the right ruby words: but i sorta wonder if we should see how hard it'd be to make some sort of context-manager-like thing (like what i assume statsd.time) is so that we don't need to repeat the Time.now bits everywhere?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, great catch. fixed

Rob-Johnson and others added 3 commits February 16, 2026 08:01
Add PrometheusMetrics module with WEBrick HTTP server, registry,
helper methods (prom_inc/prom_set/prom_observe), and build_info
gauge. Wire into Nerve, ServiceWatcher, and Reporter::Base.

Configuration via "prometheus" block in nerve config (enabled, port,
bind, histogram_buckets_zk, histogram_buckets_main_loop). Server
startup deferred until after --check-config early exit.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add watcher gauges (desired/running/up/down, repeated_report_failures_max),
counters (config_reloads, watcher_launches/stops/throttled,
report_results, reporter_ping_results), and main_loop_duration histogram.

Expose repeated_report_failures attr_reader on ServiceWatcher for
main loop aggregation. update_prom_gauges recomputes aggregate state
each iteration.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add zk_connected gauge (1/0 per cluster), zk_pool_size gauge,
zk_write_failures_total counter (primary alerting metric), and
zk_operation_duration_seconds histogram for create/save/delete ops.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Rob-Johnson Rob-Johnson merged commit 5d791bd into master Feb 16, 2026
2 checks passed
Copy link

@giuli007 giuli007 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Rob-Johnson late but worth mentioning

  1. a nit about time.now reported inline
  2. crash if port is used, also reported by llm
 [High] Enabling Prometheus can crash the Nerve main loop if the HTTP server fails to bind (for example, port already in use). PrometheusMetrics.configure calls WEBrick::HTTPServer.new without rescue and the
     exception bubbles up, and Nerve#run re-raises, terminating the process. Consider rescuing Errno::EADDRINUSE (and similar) and either disabling metrics or logging and continuing. lib/nerve/
     prometheus_metrics.rb:32-46, lib/nerve.rb:171-216.

I guess we might decide to just fail-fast and not follow the advice though?


statsd.time("nerve.main_loop.elapsed_time") do
until $EXIT
main_loop_start = Time.now

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

an llm pointed out to me

[Low] main_loop_duration_seconds uses Time.now for elapsed time, which is wall‑clock and can jump backward/forward (NTP or clock changes), skewing histogram values. Use
     Process.clock_gettime(Process::CLOCK_MONOTONIC) like prom_time does. lib/nerve.rb:97,210, lib/nerve/prometheus_metrics.rb:197-201.

which sounds reasonable.
I remember always pointing this out when people were using time.now() in python.
Maybe it's not the end of the world as we might see hiccups only during daylight savings but if it's easy to change maybe we should?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah good catch - I'll open a PR

@Rob-Johnson
Copy link
Author

@Rob-Johnson late but worth mentioning

1. a nit about time.now reported inline

2. crash if port is used, also reported by llm
 [High] Enabling Prometheus can crash the Nerve main loop if the HTTP server fails to bind (for example, port already in use). PrometheusMetrics.configure calls WEBrick::HTTPServer.new without rescue and the
     exception bubbles up, and Nerve#run re-raises, terminating the process. Consider rescuing Errno::EADDRINUSE (and similar) and either disabling metrics or logging and continuing. lib/nerve/
     prometheus_metrics.rb:32-46, lib/nerve.rb:171-216.

I guess we might decide to just fail-fast and not follow the advice though?

I think if we've configured prometheus but we can't bind to the port, we should fail-fast rather than just continuing without prom enabled

@Rob-Johnson Rob-Johnson mentioned this pull request Mar 4, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants