Skip to content

feat(telemetry): Add Prometheus metrics endpoint for node monitoring#552

Closed
dvansari65 wants to merge 5 commits intosolana-foundation:mainfrom
dvansari65:feat/prometheus-metrics
Closed

feat(telemetry): Add Prometheus metrics endpoint for node monitoring#552
dvansari65 wants to merge 5 commits intosolana-foundation:mainfrom
dvansari65:feat/prometheus-metrics

Conversation

@dvansari65
Copy link
Contributor

--Add Prometheus metrics endpoint at /metrics (default: 0.0.0.0:9000)
--Track key node metrics: slot, epoch, transaction counts, WebSocket subscriptions
--Record metrics on every transaction via event-driven architecture
--Include feature flag prometheus with zero overhead when disabled
--Add CLI flag --metrics-enabled for easy activation
--Expose metrics: surfpool_slot, surfpool_epoch, surfpool_slot_index, surfpool_transactions_count, surfpool_transactions_processed_total, surfpool_uptime_seconds, surfpool_ws_subscriptions_total, surfpool_ws_signature_subscriptions, surfpool_ws_account_subscriptions, surfpool_ws_slot_subscriptions, surfpool_ws_logs_subscriptions
--Enable real-time monitoring and Prometheus integration for production observability

Copy link
Collaborator

@MicaiahReid MicaiahReid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @dvansari65! I can tell a lot of work went into this!

I've got some minor comments now, and a general architectural question. In the coming days I'll test and likely have some more questions/comments.

But for now: it seems like you went with a "half-way" feature-gated approach. We always initialize the prometheus mod, but if the feature isn't enabled it's a no-op/warn. We always receive the SimnetEvent::MetricsData event, but if the feature isn't enabled it's a no-op again.

Why not just fully feature gate? The event doesn't exist/isn't received/isn't emitted, the telemetry mod isn't imported, and the CLI flags aren't there at all, unless the feature is enabled. Let me know if you've got thoughts, and thanks again!

Comment on lines +148 to +149
// Create prometheus exporter using the new 0.28 API
// In 0.28, opentelemetry-prometheus uses a different approach
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Create prometheus exporter using the new 0.28 API
// In 0.28, opentelemetry-prometheus uses a different approach

Doesn't seem like a helpful comment :)

}
};

// Build resource using 0.28 API
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Build resource using 0.28 API

.with_attributes(vec![KeyValue::new("service.name", service_name_owned)])
.build();

// Build meter provider using 0.28 API
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Build meter provider using 0.28 API

Comment on lines +175 to +183
if METER_PROVIDER.set(provider).is_err() {
result = Err("Meter provider already initialized".into());
return;
}

if METRICS.set(metrics).is_err() {
result = Err("Metrics already initialized".into());
return;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these the only reasons for an error here? Why .is_err rather than if let Err(e) ... and returning the actual error?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will change it

rt.block_on(async {
let registry_clone = registry.clone();

// Build axum 0.8 router with new path syntax
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Build axum 0.8 router with new path syntax

@dvansari65
Copy link
Contributor Author

dvansari65 commented Mar 5, 2026

@MicaiahReid means , when feature flag is disabled then the code of my implementation should not exist at all , am I right?

@dvansari65
Copy link
Contributor Author

just tell me end to end workflow , I will try to add!

…lemetryConfig, CLI flags, and telemetry mod behind prometheus feature
@dvansari65 dvansari65 requested a review from MicaiahReid March 6, 2026 05:32
 - remove unneeded channel usage for metrics
 - reduce number of metrics to only leave those which may be actionable
Copy link
Collaborator

@MicaiahReid MicaiahReid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @dvansari65, thanks for the changes! I pushed some changes to push things a bit further forward:

  • removed the simnet events for updating prometheus - we can just update directly from svm files to avoid that round trip
  • removed some promtheus metrics - a lot of what we had would just be noise for a prometheus server. we can add more actionable items in a subsequent PR

Please take a look at my changes and lmk if you have any concerns!

@dvansari65
Copy link
Contributor Author

@MicaiahReid ok , so should i remove my all metrics and go with your metrics #584

@MicaiahReid
Copy link
Collaborator

@MicaiahReid ok , so should i remove my all metrics and go with your metrics #584

I've already removed most - feel free to take some of #584!

@MicaiahReid MicaiahReid requested a review from lgalabru March 20, 2026 16:00
@dvansari65 dvansari65 closed this Mar 20, 2026
@dvansari65 dvansari65 deleted the feat/prometheus-metrics branch March 20, 2026 18:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants