Health check failings in /metrics categorized per subnet #1574

erwindassen · 2023-06-02T09:30:55Z

I wanted to open an issue with this request, but since this is the first time, I thought it would be more courteous to start a discussion first. I'll follow the issue template here to make it easier to move the request there after the discussion.

Context and scope

In v1.10.0, avalanchego added support for passing subnetIDs as an argument to the /health endpoint, which was a very welcome addition. Unfortunately, in the /metrics endpoint we can still only see the total number of failing health checks via the avalanche_health_checks_failing metric (although there are health checks specific to the P, X, and C chains). I would like to propose adding health check metrics per subnet, for example by tags:

avalanche_health_checks_failing{subnet_id="11111111111111111111111111111111LpoYY"} 0
avalanche_health_checks_failing{subnet_id="2wLe8Ma7YcUmxMJ57JVWETMSHz1mjXmJc5gmssvKm3Pw8GkcFq"} 1
avalanche_health_checks_failing{subnet_id="YDLrMpW9pkHPaRgRZR5fj883YUkJEoTc7XH28L8QBCY9v8FtV"} 0
...

This would allow infra to build custom alerts from Prometheus metrics that identify issues with particular subnets. In general, it is a good pattern to build alerts from Prometheus, which is solely responsible for scraping endpoints. In contrast, having to add custom logic that calls JSON-RPC methods to build alerts is generally an anti-pattern.
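To make the proposed format concrete, here is a minimal Python sketch that picks the failing subnets out of exposition lines like the ones above. In practice a Prometheus alert rule would do this filtering; the function name `failing_subnets` and the `SAMPLE` text are hypothetical, and the parser only handles the single-label form shown in this post.

```python
import re

# Matches only the single-label form proposed above, e.g.
# avalanche_health_checks_failing{subnet_id="..."} 1
SAMPLE_RE = re.compile(
    r'^avalanche_health_checks_failing\{subnet_id="([^"]+)"\}\s+(\S+)\s*$'
)

def failing_subnets(metrics_text):
    """Return subnet IDs whose failing-check count is greater than zero."""
    failing = []
    for line in metrics_text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if match is None:
            continue  # comments, other metrics, unlabeled samples
        subnet_id, value = match.groups()
        if float(value) > 0:
            failing.append(subnet_id)
    return failing

# Hypothetical scrape output, copied from the example lines in this post.
SAMPLE = """\
avalanche_health_checks_failing{subnet_id="11111111111111111111111111111111LpoYY"} 0
avalanche_health_checks_failing{subnet_id="2wLe8Ma7YcUmxMJ57JVWETMSHz1mjXmJc5gmssvKm3Pw8GkcFq"} 1
avalanche_health_checks_failing{subnet_id="YDLrMpW9pkHPaRgRZR5fj883YUkJEoTc7XH28L8QBCY9v8FtV"} 0
"""

print(failing_subnets(SAMPLE))
```

With the unlabeled total alone, this kind of per-subnet filtering is not possible, which is the gap the proposal is meant to close.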

Discussion and alternatives

Add a metric with the appropriate tag (subnet_id) when a subnet registers itself with the /health endpoint.
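A rough sketch of that idea, written in Python for brevity (avalanchego itself is Go, and this is not its implementation): a hypothetical `HealthRegistry` records each check's tags at registration time and renders one sample per tag. Every name here, including the `subnetA` and `subnetB` placeholder IDs, is an assumption made for illustration.

```python
from collections import defaultdict

class HealthRegistry:
    """Hypothetical registry: tracks health checks and the tags they carry."""

    def __init__(self):
        # check name -> (tags, callable returning True when healthy)
        self._checks = {}

    def register(self, name, check, tags=()):
        # Tags (e.g. a subnet ID) are recorded when the check registers.
        self._checks[name] = (tuple(tags), check)

    def failing_by_tag(self):
        """Count failing checks per tag; every known tag appears, even at 0."""
        counts = defaultdict(int)
        for tags, check in self._checks.values():
            healthy = check()
            for tag in tags:
                counts[tag] += 0 if healthy else 1
        return dict(counts)

    def render(self):
        """Render the counts as Prometheus-style text lines, sorted by tag."""
        return "\n".join(
            f'avalanche_health_checks_failing{{subnet_id="{tag}"}} {n}'
            for tag, n in sorted(self.failing_by_tag().items())
        )

registry = HealthRegistry()
registry.register("subnetA-bootstrapped", lambda: True, tags=["subnetA"])
registry.register("subnetB-bootstrapped", lambda: False, tags=["subnetB"])
print(registry.render())
```

Emitting a sample for healthy tags as well (value 0) keeps the series present, so an alert can distinguish "healthy" from "not tracked".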

Open questions

None at the moment.

StephenButtolph · 2023-06-02T20:51:36Z

Maintainer

Hmmm, interesting. Currently the health checks allow a list of tags (not just the subnetID). Would there be an easy way to map that into the Prometheus metrics format?

3 replies

StephenButtolph Jun 2, 2023
Maintainer

I suppose we could just have a metric for each tag and call it a day.

aaronbuchwald Jun 2, 2023
Maintainer

This seems like a good use case for Prometheus labels: https://prometheus.io/docs/practices/naming/#labels.

Curious what @erwindassen thinks would be the easiest format to consume.

aaronbuchwald Jun 2, 2023
Maintainer

Ah, I see that's exactly how you implemented it 🤝

StephenButtolph · 2023-06-02T21:36:32Z

StephenButtolph
Jun 2, 2023
Maintainer

How does #1579 look?

3 replies

erwindassen Jun 3, 2023
Author

Perfect! Even better. But I assume there will be one metric per tracked subnet, right?

erwindassen Jun 5, 2023
Author

Thanks, I see this will be merged eventually! 🙂

StephenButtolph Jun 8, 2023
Maintainer

#1579 was merged, so I'm going to close this discussion. Thanks for the recommendation!
