Conversation
b8cb592 to
f3f3859
Compare
nemacysts
left a comment
There was a problem hiding this comment.
we might wanna keep an eye on nerve memory use after this to make sure this isn't ballooning in size - but seems fine to me otherwise!
| statsd.increment("nerve.reporter.zk.client.created", tags: ["zk_cluster:#{@zk_cluster}"]) | ||
| end | ||
| @zk = @@zk_pool[@zk_connection_string] | ||
| prom_set(:zk_pool_size, @@zk_pool_count[@zk_connection_string], labels: {zk_cluster: @zk_cluster}) |
There was a problem hiding this comment.
(and by that i mean: what in the perl is ruby doing)
lib/nerve/reporter/zookeeper.rb
Outdated
| statsd.time("nerve.reporter.zk.delete.elapsed_time", tags: ["zk_cluster:#{@zk_cluster}"]) do | ||
| @zk.delete(@full_key, ignore: :no_node) | ||
| end | ||
| prom_observe(:zk_operation_duration_seconds, Time.now - prom_start, labels: {zk_cluster: @zk_cluster, operation: "delete"}) |
There was a problem hiding this comment.
i dunno the right ruby words: but i sorta wonder if we should see how hard it'd be to make some sort of context-manager-like thing (like what i assume statsd.time) is so that we don't need to repeat the Time.now bits everywhere?
e620917 to
4421db4
Compare
4421db4 to
f5ec07d
Compare
f5ec07d to
4fca524
Compare
Add PrometheusMetrics module with WEBrick HTTP server, registry, helper methods (prom_inc/prom_set/prom_observe), and build_info gauge. Wire into Nerve, ServiceWatcher, and Reporter::Base. Configuration via "prometheus" block in nerve config (enabled, port, bind, histogram_buckets_zk, histogram_buckets_main_loop). Server startup deferred until after --check-config early exit. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add watcher gauges (desired/running/up/down, repeated_report_failures_max), counters (config_reloads, watcher_launches/stops/throttled, report_results, reporter_ping_results), and main_loop_duration histogram. Expose repeated_report_failures attr_reader on ServiceWatcher for main loop aggregation. update_prom_gauges recomputes aggregate state each iteration. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add zk_connected gauge (1/0 per cluster), zk_pool_size gauge, zk_write_failures_total counter (primary alerting metric), and zk_operation_duration_seconds histogram for create/save/delete ops. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
8c4780c to
cbfb774
Compare
There was a problem hiding this comment.
@Rob-Johnson late but worth mentioning
- a nit about time.now reported inline
- crash if port is used, also reported by llm
[High] Enabling Prometheus can crash the Nerve main loop if the HTTP server fails to bind (for example, port already in use). PrometheusMetrics.configure calls WEBrick::HTTPServer.new without rescue and the
exception bubbles up, and Nerve#run re-raises, terminating the process. Consider rescuing Errno::EADDRINUSE (and similar) and either disabling metrics or logging and continuing. lib/nerve/
prometheus_metrics.rb:32-46, lib/nerve.rb:171-216.
I guess we might decide to just fail-fast and not follow the advice though?
|
|
||
| statsd.time("nerve.main_loop.elapsed_time") do | ||
| until $EXIT | ||
| main_loop_start = Time.now |
There was a problem hiding this comment.
an llm pointed out to me
[Low] main_loop_duration_seconds uses Time.now for elapsed time, which is wall‑clock and can jump backward/forward (NTP or clock changes), skewing histogram values. Use
Process.clock_gettime(Process::CLOCK_MONOTONIC) like prom_time does. lib/nerve.rb:97,210, lib/nerve/prometheus_metrics.rb:197-201.
which sounds reasonable.
I remember always pointing this out when people were using time.now() in python.
Maybe it's not the end of the world as we might see hiccups only during daylight savings but if it's easy to change maybe we should?
There was a problem hiding this comment.
yeah good catch - I'll open a PR
I think if we've configured prometheus but we can't bind to the port, we should fail-fast rather than just continuing without prom enabled |
I've tried to split into smaller commits to make this easier to review.
adds a /metrics prometheus endpoint to nerve. this means running a webserver in nerve, in a background thread.
I've added an initial set of metrics that should give us some insights on nerve itself, and its interactions with zookeeper. also added the build_info metric so we can compare performance on new releases (we'll be iterating on this loads now right?)
the prom server is off by default - enabled by a feature toggle