From b7fd81d7f8080097eb49df27b915024edeb93e1d Mon Sep 17 00:00:00 2001
From: haroldsphinx
Date: Wed, 12 Jul 2023 12:22:13 +0100
Subject: [PATCH 01/12] Monitoring & Alerting guide

Signed-off-by: haroldsphinx
---
 docs/int/quickstart/monitoring.md | 94 +++++++++++++++++++++++++++++++
 1 file changed, 94 insertions(+)
 create mode 100644 docs/int/quickstart/monitoring.md

diff --git a/docs/int/quickstart/monitoring.md b/docs/int/quickstart/monitoring.md
new file mode 100644
index 0000000000..bd7152f3a1
--- /dev/null
+++ b/docs/int/quickstart/monitoring.md
@@ -0,0 +1,94 @@
+# Getting Started Monitoring your Node
+
+Welcome to this comprehensive guide, designed to assist you in effectively monitoring your Charon cluster and nodes, and setting up alerts based on specified parameters.
+
+## Pre-requisites
+
+Ensure the following software is installed:
+
+- Docker: Find the installation guide for Ubuntu **[here](https://docs.docker.com/engine/install/ubuntu/)**
+- Prometheus: You can install it using the guide available **[here](https://prometheus.io/docs/prometheus/latest/installation/)**
+- Grafana: Follow this **[link](https://grafana.com/docs/grafana/latest/setup-grafana/installation/)** to install Grafana
+
+## Import Pre-Configured Charon Dashboards
+
+- Navigate to the **[repository](https://github.com/ObolNetwork/terraform-modules/tree/main/grafana-dashboards/dashboards)** that contains a variety of Grafana dashboards. For this demonstration, we will utilize the Charon Dashboard json.
+- In your Grafana interface, create a new dashboard and select the import option.
+
+![Screenshot 2023-06-26 at 1.00.05 PM.png](https://s3-us-west-2.amazonaws.com/secure.notion-static.com/2bba3f52-ff32-452e-811b-f2ac7a4905fb/Screenshot_2023-06-26_at_1.00.05_PM.png)
+
+- Copy the content of the Charon Dashboard json from the repository and paste it into the import box in Grafana. Click "Load" to proceed.
+
+![Screenshot 2023-06-26 at 1.03.08 PM.png](https://s3-us-west-2.amazonaws.com/secure.notion-static.com/6790e67a-eb51-4bfb-b7b1-df14f214b72d/Screenshot_2023-06-26_at_1.03.08_PM.png)
+
+- Finalize the import by clicking on the "Import" button. At this point, your dashboard should begin displaying metrics. Ensure your Charon client and Prometheus are operational for this to occur.
+
+![Screenshot 2023-06-26 at 1.16.27 PM.png](https://s3-us-west-2.amazonaws.com/secure.notion-static.com/cc0b4a9e-c21c-4ce4-b613-9c3f84e696ed/Screenshot_2023-06-26_at_1.16.27_PM.png)
+
+## Example alerting rules
+
+- Alerts for Node-Exporter can be created using the sample rules provided here:
+
+[Awesome Prometheus alerts](https://samber.github.io/awesome-prometheus-alerts/rules.html#host-and-hardware)
+
+- For Charon/Alpha alerts, refer to the alerting rules available here:
+
+[monitoring/alerting-rules at main · ObolNetwork/monitoring](https://github.com/ObolNetwork/monitoring/tree/main/alerting-rules)
+
+## Understanding Alert rules
+
+1. `AlphaClusterBeaconNodeDown`: This alert is activated when the beacon node in a specified Alpha cluster is offline. The beacon node is crucial for validating transactions and producing new blocks. Its unavailability could disrupt the overall functionality of the cluster.
+2. `AlphaClusterBeaconNodeSyncing`: This alert indicates that the beacon node in a specified Alpha cluster is synchronizing, i.e., catching up with the latest blocks in the cluster.
+3. `AlphaClusterNodeDown`: This alert is activated when a node in a specified Alpha cluster is offline.
+4. `AlphaClusterMissedAttestations`: This alert indicates that there have been missed attestations in a specified Alpha cluster. Missed attestations may suggest that validators are not operating correctly, compromising the security and efficiency of the cluster.
+5. `AlphaClusterInUnknownStatus`: This alert is designed to activate when a node within the "Alpha M1 Cluster #1" is detected to be in an unknown state. The condition is evaluated by checking whether the maximum of the `app_monitoring_readyz` metric is 0.
+6. `AlphaClusterInsufficientPeers`: This alert is set to activate when the number of peers for a node in the Alpha M1 Cluster #1 is insufficient. The condition is evaluated by checking whether the maximum of the **`app_monitoring_readyz`** metric equals 4.
+7. `AlphaClusterFailureRate`: This alert is activated when the failure rate of the Alpha M1 Cluster #1 exceeds a certain threshold.
+8. `AlphaClusterVCMissingValidators`: This alert is activated if any validators in the Alpha M1 Cluster #1 are missing.
+9. `AlphaClusterHighPctFailedSyncMsgDuty`: This alert is activated if a high percentage of sync message duties failed in the "Alpha M1 Cluster #1". It fires when the increase in failed duties tagged "sync_message" over the last hour, divided by the increase in total duties tagged "sync_message" over the same hour, is greater than 0.1.
+10. `AlphaClusterNumConnectedRelays`: This alert is activated if the number of connected relays in the "Alpha M1 Cluster #1" falls to 0.
+11. `PeerPingLatency`: This alert is activated if the 90th percentile of the ping latency to the peers in a cluster exceeds 500ms within 2 minutes.
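+
+As an illustration, the two `app_monitoring_readyz` conditions above can be expressed as Prometheus alerting rules. The following is a minimal sketch; the group name, durations, and severity labels are placeholders to adapt to your own setup:
+
+```yaml
+groups:
+  - name: charon-cluster-alerts
+    rules:
+      - alert: AlphaClusterInUnknownStatus
+        # Node reports an unknown state (readyz metric equals 0).
+        expr: max(app_monitoring_readyz) == 0
+        for: 5m
+        labels:
+          severity: critical
+      - alert: AlphaClusterInsufficientPeers
+        # Node reports insufficient peers (readyz metric equals 4).
+        expr: max(app_monitoring_readyz) == 4
+        for: 5m
+        labels:
+          severity: warning
+```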
+
+## Best Practices for Monitoring Charon Nodes & Cluster
+
+- **Establish Baselines**: Familiarize yourself with normal operating metrics like CPU, memory, and network usage. This will help you detect anomalies.
+- **Define Key Metrics**: Set up alerts for essential metrics, encompassing both system-level and Charon-specific ones.
+- **Configure Alerts**: Based on these metrics, set up actionable alerts.
+- **Monitor Network**: Regularly assess the connectivity between nodes and the network.
+- **Perform Regular Health Checks**: Consistently evaluate the status of your nodes and clusters.
+- **Monitor System Logs**: Keep an eye on logs for error messages or unusual activities.
+- **Assess Resource Usage**: Ensure your nodes are neither over- nor under-utilized.
+- **Automate Monitoring**: Use automation to ensure no issues go undetected.
+- **Conduct Drills**: Regularly simulate failure scenarios to fine-tune your setup.
+- **Update Regularly**: Keep your nodes and clusters updated with the latest software versions.
+
+## Third-Party Services for Uptime Testing
+
+- [updown.io](https://updown.io/)
+- [Grafana Synthetic Monitoring](https://grafana.com/blog/2022/03/10/best-practices-for-alerting-on-synthetic-monitoring-metrics-in-grafana-cloud/)
+
+## Key metrics to watch to verify node health based on jobs
+
+**node_exporter:**
+
+**CPU Usage**: High or spiking CPU usage can be a sign of a process demanding more resources than it should.
+
+**Memory Usage**: If a node is consistently running out of memory, it could be due to a memory leak or simply under-provisioning.
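+
+For instance, the CPU and memory checks above map to node-exporter rules along these lines (a sketch adapted from the Awesome Prometheus alerts collection; these are fragments to place under a rule group's `rules` list, and the thresholds and windows should be tuned to your own baseline):
+
+```yaml
+- alert: HostHighCpuLoad
+  # Average CPU utilisation across all cores above 80% for 10 minutes.
+  expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 80
+  for: 10m
+- alert: HostOutOfMemory
+  # Less than 10% of memory still available.
+  expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
+  for: 5m
+```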
+ +**Disk I/O**: Slow disk operations can cause applications to hang or delay responses. High disk I/O can indicate storage performance issues or a sign of high load on the system. + +**Network Usage**: High network traffic or packet loss can signal network configuration issues, or that a service is being overwhelmed by requests. + +**Disk Space**: Running out of disk space can lead to application errors and data loss. + +**Uptime**: The amount of time a system has been up without any restarts. Frequent restarts can indicate instability in the system. + +**Error Rates**: The number of errors encountered by your application. This could be 4xx/5xx HTTP errors, exceptions, or any other kind of error your application may log. + +**Latency**: The delay before a transfer of data begins following an instruction for its transfer. + +It is also important to check: + +- NTP clock skew +- Process restarts and failures (eg. through `node_systemd`) +- alert on high error and panic log counts. \ No newline at end of file From 0304cad66f22692a892f491ffd83441afec9b970 Mon Sep 17 00:00:00 2001 From: haroldsphinx Date: Tue, 25 Jul 2023 10:09:16 +0100 Subject: [PATCH 02/12] Revert changes made to version docs Signed-off-by: haroldsphinx --- .../advanced/monitoring-credentials.md | 97 +------------------ 1 file changed, 1 insertion(+), 96 deletions(-) diff --git a/versioned_docs/version-v0.16.0/int/quickstart/advanced/monitoring-credentials.md b/versioned_docs/version-v0.16.0/int/quickstart/advanced/monitoring-credentials.md index b2ae72f552..046aacc41f 100644 --- a/versioned_docs/version-v0.16.0/int/quickstart/advanced/monitoring-credentials.md +++ b/versioned_docs/version-v0.16.0/int/quickstart/advanced/monitoring-credentials.md @@ -2,17 +2,6 @@ sidebar_position: 4 description: Add monitoring credentials to help the Obol Team monitor the health of your cluster --- -# Getting Started Monitoring your Node - -Welcome to this comprehensive guide, designed to assist you in effectively monitoring your Charon cluster and nodes, and setting up alerts based on specified parameters. - -## Pre-requisites - -Ensure the following software are installed: - -- Docker: Find the installation guide for Ubuntu **[here](https://docs.docker.com/engine/install/ubuntu/)** -- Prometheus: You can install it using the guide available **[here](https://prometheus.io/docs/prometheus/latest/installation/)** -- Grafana: Follow this **[link](https://grafana.com/docs/grafana/latest/setup-grafana/installation/)** to install Grafana # Push metrics to Obol Monitoring @@ -48,88 +37,4 @@ scrape_configs: - job_name: 'node-exporter' static_configs: - targets: ['node-exporter:9100'] -``` - -## Import Pre-Configured Charon Dashboards - -- Navigate to the **[repository](https://github.com/ObolNetwork/monitoring/tree/main/dashboards)** that contains a variety of Grafana dashboards. For this demonstration, we will utilize the Charon Dashboard json. -- In your Grafana interface, create a new dashboard and select the import option. - -- Copy the content of the Charon Dashboard json from the repository and paste it into the import box in Grafana. Click "Load" to proceed. - -- Finalize the import by clicking on the "Import" button. At this point, your dashboard should begin displaying metrics. Ensure your Charon client and Prometheus are operational for this to occur. - -## Example alerting rules - -To create alerts for Node-Exporter, follow these steps based on the sample rules provided on the "Awesome Prometheus alerts" page: - -1. 
Visit the **[Awesome Prometheus alerts](https://samber.github.io/awesome-prometheus-alerts/rules.html#host-and-hardware)** page. Here, you will find lists of Prometheus alerting rules categorized by hardware, system, and services. - -2. Depending on your need, select the category of alerts. For example, if you want to set up alerts for your system's CPU usage, click on the 'CPU' under the 'Host & Hardware' category. - -3. On the selected page, you'll find specific alert rules like 'High CPU Usage'. Each rule will provide the PromQL expression, alert name, and a brief description of what the alert does. You can copy these rules. - -4. Paste the copied rules into your Prometheus configuration file under the `rules` section. Make sure you understand each rule before adding it to avoid unnecessary alerts. - -5. Finally, save and apply the configuration file. Prometheus should now trigger alerts based on these rules. - - -For alerts specific to Charon/Alpha, refer to the alerting rules available on this [ObolNetwork/monitoring](https://github.com/ObolNetwork/monitoring/tree/main/alerting-rules). - -## Understanding Alert rules - -1. `ClusterBeaconNodeDown`This alert is activated when the beacon node in a specified Alpha cluster is offline. The beacon node is crucial for validating transactions and producing new blocks. Its unavailability could disrupt the overall functionality of the cluster. -2. `ClusterBeaconNodeSyncing`This alert indicates that the beacon node in a specified Alpha cluster is synchronizing, i.e., catching up with the latest blocks in the cluster. -3. `ClusterNodeDown`This alert is activated when a node in a specified Alpha cluster is offline. -4. `ClusterMissedAttestations`:This alert indicates that there have been missed attestations in a specified Alpha cluster. Missed attestations may suggest that validators are not operating correctly, compromising the security and efficiency of the cluster. -5. `ClusterInUnknownStatus`: This alert is designed to activate when a node within the cluster is detected to be in an unknown state. The condition is evaluated by checking whether the maximum of the app_monitoring_readyz metric is 0. -6. `ClusterInsufficientPeers`:This alert is set to activate when the number of peers for a node in the Alpha M1 Cluster #1 is insufficient. The condition is evaluated by checking whether the maximum of the **`app_monitoring_readyz`** equals 4. -7. `ClusterFailureRate`: This alert is activated when the failure rate of the Alpha M1 Cluster #1 exceeds a certain threshold. -8. `ClusterVCMissingValidators`: This alert is activated if any validators in the Alpha M1 Cluster #1 are missing. -9. `ClusterHighPctFailedSyncMsgDuty`: This alert is activated if a high percentage of sync message duties failed in the cluster. The alert is activated if the sum of the increase in failed duties tagged with "sync_message" in the last hour divided by the sum of the increase in total duties tagged with "sync_message" in the last hour is greater than 0.1. -10. `ClusterNumConnectedRelays`: This alert is activated if the number of connected relays in the cluster falls to 0. -11. PeerPingLatency: 1. This alert is activated if the 90th percentile of the ping latency to the peers in a cluster exceeds 500ms within 2 minutes. - -## Best Practices for Monitoring Charon Nodes & Cluster - -- **Establish Baselines**: Familiarize yourself with the normal operation metrics like CPU, memory, and network usage. This will help you detect anomalies. 
-- **Define Key Metrics**: Set up alerts for essential metrics, encompassing both system-level and Charon-specific ones. -- **Configure Alerts**: Based on these metrics, set up actionable alerts. -- **Monitor Network**: Regularly assess the connectivity between nodes and the network. -- **Perform Regular Health Checks**: Consistently evaluate the status of your nodes and clusters. -- **Monitor System Logs**: Keep an eye on logs for error messages or unusual activities. -- **Assess Resource Usage**: Ensure your nodes are neither over- nor under-utilized. -- **Automate Monitoring**: Use automation to ensure no issues go undetected. -- **Conduct Drills**: Regularly simulate failure scenarios to fine-tune your setup. -- **Update Regularly**: Keep your nodes and clusters updated with the latest software versions. - -## Third-Party Services for Uptime Testing - -- [updown.io](https://updown.io/) -- [Grafana synthetic Monitoring](https://grafana.com/grafana/plugins/grafana-synthetic-monitoring-app/) - -## Key metrics to watch to verify node health based on jobs - -### Node Exporter: - -**CPU Usage**: High or spiking CPU usage can be a sign of a process demanding more resources than it should. - -**Memory Usage**: If a node is consistently running out of memory, it could be due to a memory leak or simply under-provisioning. - -**Disk I/O**: Slow disk operations can cause applications to hang or delay responses. High disk I/O can indicate storage performance issues or a sign of high load on the system. - -**Network Usage**: High network traffic or packet loss can signal network configuration issues, or that a service is being overwhelmed by requests. - -**Disk Space**: Running out of disk space can lead to application errors and data loss. - -**Uptime**: The amount of time a system has been up without any restarts. Frequent restarts can indicate instability in the system. - -**Error Rates**: The number of errors encountered by your application. This could be 4xx/5xx HTTP errors, exceptions, or any other kind of error your application may log. - -**Latency**: The delay before a transfer of data begins following an instruction for its transfer. - -It is also important to check: - -- NTP clock skew -- Process restarts and failures (eg. through `node_systemd`) -- alert on high error and panic log counts. \ No newline at end of file +``` \ No newline at end of file From a73c5a8e79f9a49e5e4cfd38e7cdaca688b69aaa Mon Sep 17 00:00:00 2001 From: haroldsphinx Date: Tue, 25 Jul 2023 10:13:24 +0100 Subject: [PATCH 03/12] fix sidebars Signed-off-by: haroldsphinx --- docs/int/quickstart/advanced/monitoring-credentials.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/int/quickstart/advanced/monitoring-credentials.md b/docs/int/quickstart/advanced/monitoring-credentials.md index b2ae72f552..56630303c8 100644 --- a/docs/int/quickstart/advanced/monitoring-credentials.md +++ b/docs/int/quickstart/advanced/monitoring-credentials.md @@ -14,7 +14,7 @@ Ensure the following software are installed: - Prometheus: You can install it using the guide available **[here](https://prometheus.io/docs/prometheus/latest/installation/)** - Grafana: Follow this **[link](https://grafana.com/docs/grafana/latest/setup-grafana/installation/)** to install Grafana -# Push metrics to Obol Monitoring +## Push metrics to Obol Monitoring :::info This is **optional** and does not confer any special privileges within the Obol Network. 
@@ -110,7 +110,7 @@ For alerts specific to Charon/Alpha, refer to the alerting rules available on th ## Key metrics to watch to verify node health based on jobs -### Node Exporter: +- Node Exporter: **CPU Usage**: High or spiking CPU usage can be a sign of a process demanding more resources than it should. From 05a7263e6c4892abc3c77276eff48a5158793d2d Mon Sep 17 00:00:00 2001 From: Maeliosa Date: Thu, 10 Aug 2023 14:26:59 +0100 Subject: [PATCH 04/12] punctuation updated --- .../quickstart/advanced/monitoring-credentials.md | 6 +++--- docs/int/quickstart/monitoring.md | 12 ++++++------ 2 files changed, 9 insertions(+), 9 deletions(-) diff --git a/docs/int/quickstart/advanced/monitoring-credentials.md b/docs/int/quickstart/advanced/monitoring-credentials.md index 56630303c8..9e65fcdf6b 100644 --- a/docs/int/quickstart/advanced/monitoring-credentials.md +++ b/docs/int/quickstart/advanced/monitoring-credentials.md @@ -14,7 +14,7 @@ Ensure the following software are installed: - Prometheus: You can install it using the guide available **[here](https://prometheus.io/docs/prometheus/latest/installation/)** - Grafana: Follow this **[link](https://grafana.com/docs/grafana/latest/setup-grafana/installation/)** to install Grafana -## Push metrics to Obol Monitoring +## Push Metrics to Obol Monitoring :::info This is **optional** and does not confer any special privileges within the Obol Network. @@ -59,7 +59,7 @@ scrape_configs: - Finalize the import by clicking on the "Import" button. At this point, your dashboard should begin displaying metrics. Ensure your Charon client and Prometheus are operational for this to occur. -## Example alerting rules +## Example Alerting Rules To create alerts for Node-Exporter, follow these steps based on the sample rules provided on the "Awesome Prometheus alerts" page: @@ -76,7 +76,7 @@ To create alerts for Node-Exporter, follow these steps based on the sample rules For alerts specific to Charon/Alpha, refer to the alerting rules available on this [ObolNetwork/monitoring](https://github.com/ObolNetwork/monitoring/tree/main/alerting-rules). -## Understanding Alert rules +## Understanding Alert Rules 1. `ClusterBeaconNodeDown`This alert is activated when the beacon node in a specified Alpha cluster is offline. The beacon node is crucial for validating transactions and producing new blocks. Its unavailability could disrupt the overall functionality of the cluster. 2. `ClusterBeaconNodeSyncing`This alert indicates that the beacon node in a specified Alpha cluster is synchronizing, i.e., catching up with the latest blocks in the cluster. diff --git a/docs/int/quickstart/monitoring.md b/docs/int/quickstart/monitoring.md index bd7152f3a1..e8bd5fe473 100644 --- a/docs/int/quickstart/monitoring.md +++ b/docs/int/quickstart/monitoring.md @@ -37,17 +37,17 @@ Ensure the following software are installed: ## Understanding Alert rules -1. `AlphaClusterBeaconNodeDown`This alert is activated when the beacon node in a specified Alpha cluster is offline. The beacon node is crucial for validating transactions and producing new blocks. Its unavailability could disrupt the overall functionality of the cluster. -2. `AlphaClusterBeaconNodeSyncing`This alert indicates that the beacon node in a specified Alpha cluster is synchronizing, i.e., catching up with the latest blocks in the cluster. -3. `AlphaClusterNodeDown`This alert is activated when a node in a specified Alpha cluster is offline. -4. 
`AlphaClusterMissedAttestations`:This alert indicates that there have been missed attestations in a specified Alpha cluster. Missed attestations may suggest that validators are not operating correctly, compromising the security and efficiency of the cluster. +1. `AlphaClusterBeaconNodeDown`: This alert is activated when the beacon node in a specified Alpha cluster is offline. The beacon node is crucial for validating transactions and producing new blocks. Its unavailability could disrupt the overall functionality of the cluster. +2. `AlphaClusterBeaconNodeSyncing`: This alert indicates that the beacon node in a specified Alpha cluster is synchronizing, i.e., catching up with the latest blocks in the cluster. +3. `AlphaClusterNodeDown`: This alert is activated when a node in a specified Alpha cluster is offline. +4. `AlphaClusterMissedAttestations`: This alert indicates that there have been missed attestations in a specified Alpha cluster. Missed attestations may suggest that validators are not operating correctly, compromising the security and efficiency of the cluster. 5. `AlphaClusterInUnknownStatus`: This alert is designed to activate when a node within the "Alpha M1 Cluster #1" is detected to be in an unknown state. The condition is evaluated by checking whether the maximum of the app_monitoring_readyz metric is 0. -6. `AlphaClusterInsufficientPeers`:This alert is set to activate when the number of peers for a node in the Alpha M1 Cluster #1 is insufficient. The condition is evaluated by checking whether the maximum of the **`app_monitoring_readyz`** equals 4. +6. `AlphaClusterInsufficientPeers`: This alert is set to activate when the number of peers for a node in the Alpha M1 Cluster #1 is insufficient. The condition is evaluated by checking whether the maximum of the **`app_monitoring_readyz`** equals 4. 7. `AlphaClusterFailureRate`: This alert is activated when the failure rate of the Alpha M1 Cluster #1 exceeds a certain threshold. 8. `AlphaClusterVCMissingValidators`: This alert is activated if any validators in the Alpha M1 Cluster #1 are missing. 9. `AlphaClusterHighPctFailedSyncMsgDuty`: This alert is activated if a high percentage of sync message duties failed in the "Alpha M1 Cluster #1". The alert is activated if the sum of the increase in failed duties tagged with "sync_message" in the last hour divided by the sum of the increase in total duties tagged with "sync_message" in the last hour is greater than 0.1. 10. `AlphaClusterNumConnectedRelays`: This alert is activated if the number of connected relays in the "Alpha M1 Cluster #1" falls to 0. -11. PeerPingLatency: 1. This alert is activated if the 90th percentile of the ping latency to the peers in a cluster exceeds 500ms within 2 minutes. +11. `PeerPingLatency: 1`: This alert is activated if the 90th percentile of the ping latency to the peers in a cluster exceeds 500ms within 2 minutes. 
## ****Best Practices for Monitoring Charon Nodes & Cluster**** From 738d05deccfc3288bfd826dbd1feb126d14a4a78 Mon Sep 17 00:00:00 2001 From: Maeliosa Date: Wed, 16 Aug 2023 17:28:44 +0100 Subject: [PATCH 05/12] updated sidebar, new page for push metrics added --- docs/int/quickstart/advanced/push-metrics | 40 +++++++++++++++++++++++ 1 file changed, 40 insertions(+) create mode 100644 docs/int/quickstart/advanced/push-metrics diff --git a/docs/int/quickstart/advanced/push-metrics b/docs/int/quickstart/advanced/push-metrics new file mode 100644 index 0000000000..8d9e0ceca1 --- /dev/null +++ b/docs/int/quickstart/advanced/push-metrics @@ -0,0 +1,40 @@ +--- +sidebar_position: 5 +description: Add monitoring credentials to help the Obol Team monitor the health of your cluster +--- + +# Push Metrics to Obol Monitoring + +:::info +This is **optional** and does not confer any special privileges within the Obol Network. +::: + +You may have been provided with **Monitoring Credentials** used to push distributed validator metrics to Obol's central prometheus cluster to monitor, analyze, and improve your Distributed Validator Cluster's performance. + +The provided credentials needs to be added in `prometheus/prometheus.yml` replacing `$PROM_REMOTE_WRITE_TOKEN` and will look like: +``` +obol20!tnt8U!C... +``` + +The updated `prometheus/prometheus.yml` file should look like: +``` +global: + scrape_interval: 30s # Set the scrape interval to every 30 seconds. + evaluation_interval: 30s # Evaluate rules every 30 seconds. + +remote_write: + - url: https://vm.monitoring.gcp.obol.tech/write + authorization: + credentials: obol20!tnt8U!C... + +scrape_configs: + - job_name: 'charon' + static_configs: + - targets: ['charon:3620'] + - job_name: "lodestar" + static_configs: + - targets: [ "lodestar:5064" ] + - job_name: 'node-exporter' + static_configs: + - targets: ['node-exporter:9100'] +``` \ No newline at end of file From 0f382560d3a6e464e6dbc2b8d509e9f90ad4bfdb Mon Sep 17 00:00:00 2001 From: Maeliosa Date: Wed, 16 Aug 2023 17:29:06 +0100 Subject: [PATCH 06/12] updated sidebar and new page for push metrics added --- .../quickstart/advanced/adv-docker-configs.md | 2 +- .../{advanced => }/monitoring-credentials.md | 1 + docs/int/quickstart/monitoring.md | 94 ------------------- .../quickstart/advanced/adv-docker-configs.md | 2 +- .../int/quickstart/advanced/prysm-vc.md | 2 +- .../int/quickstart/advanced/self-relay.md | 2 +- 6 files changed, 5 insertions(+), 98 deletions(-) rename docs/int/quickstart/{advanced => }/monitoring-credentials.md (99%) delete mode 100644 docs/int/quickstart/monitoring.md diff --git a/docs/int/quickstart/advanced/adv-docker-configs.md b/docs/int/quickstart/advanced/adv-docker-configs.md index 8a85d9f122..d14de53e8b 100644 --- a/docs/int/quickstart/advanced/adv-docker-configs.md +++ b/docs/int/quickstart/advanced/adv-docker-configs.md @@ -1,5 +1,5 @@ --- -sidebar_position: 5 +sidebar_position: 8 description: Use advanced docker-compose features to have more flexibility and power to change the default configuration. 
--- diff --git a/docs/int/quickstart/advanced/monitoring-credentials.md b/docs/int/quickstart/monitoring-credentials.md similarity index 99% rename from docs/int/quickstart/advanced/monitoring-credentials.md rename to docs/int/quickstart/monitoring-credentials.md index 9e65fcdf6b..9ce2929f9e 100644 --- a/docs/int/quickstart/advanced/monitoring-credentials.md +++ b/docs/int/quickstart/monitoring-credentials.md @@ -53,6 +53,7 @@ scrape_configs: ## Import Pre-Configured Charon Dashboards - Navigate to the **[repository](https://github.com/ObolNetwork/monitoring/tree/main/dashboards)** that contains a variety of Grafana dashboards. For this demonstration, we will utilize the Charon Dashboard json. + - In your Grafana interface, create a new dashboard and select the import option. - Copy the content of the Charon Dashboard json from the repository and paste it into the import box in Grafana. Click "Load" to proceed. diff --git a/docs/int/quickstart/monitoring.md b/docs/int/quickstart/monitoring.md deleted file mode 100644 index e8bd5fe473..0000000000 --- a/docs/int/quickstart/monitoring.md +++ /dev/null @@ -1,94 +0,0 @@ -# Getting Started Monitoring your Node - -Welcome to this comprehensive guide, designed to assist you in effectively monitoring your Charon cluster and nodes, and setting up alerts based on specified parameters. - -## Pre-requisites - -Ensure the following software are installed: - -- Docker: Find the installation guide for Ubuntu **[here](https://docs.docker.com/engine/install/ubuntu/)** -- Prometheus: You can install it using the guide available **[here](https://prometheus.io/docs/prometheus/latest/installation/)** -- Grafana: Follow this **[link](https://grafana.com/docs/grafana/latest/setup-grafana/installation/)** to install Grafana - -## Import Pre-Configured Charon Dashboards - -- Navigate to the **[repository](https://github.com/ObolNetwork/terraform-modules/tree/main/grafana-dashboards/dashboards)** that contains a variety of Grafana dashboards. For this demonstration, we will utilize the Charon Dashboard json. -- In your Grafana interface, create a new dashboard and select the import option. - -![Screenshot 2023-06-26 at 1.00.05 PM.png](https://s3-us-west-2.amazonaws.com/secure.notion-static.com/2bba3f52-ff32-452e-811b-f2ac7a4905fb/Screenshot_2023-06-26_at_1.00.05_PM.png) - -- Copy the content of the Charon Dashboard json from the repository and paste it into the import box in Grafana. Click "Load" to proceed. - -![Screenshot 2023-06-26 at 1.03.08 PM.png](https://s3-us-west-2.amazonaws.com/secure.notion-static.com/6790e67a-eb51-4bfb-b7b1-df14f214b72d/Screenshot_2023-06-26_at_1.03.08_PM.png) - -- Finalize the import by clicking on the "Import" button. At this point, your dashboard should begin displaying metrics. Ensure your Charon client and Prometheus are operational for this to occur. - -![Screenshot 2023-06-26 at 1.16.27 PM.png](https://s3-us-west-2.amazonaws.com/secure.notion-static.com/cc0b4a9e-c21c-4ce4-b613-9c3f84e696ed/Screenshot_2023-06-26_at_1.16.27_PM.png) - -## Example alerting rules - -- Alerts for Node-Exporter can be created using the sample rules provided here - -[Awesome Prometheus alerts](https://samber.github.io/awesome-prometheus-alerts/rules.html#host-and-hardware) - -- For Charon/Alpha alerts, refer to the alerting rules available - -[monitoring/alerting-rules at main · ObolNetwork/monitoring](https://github.com/ObolNetwork/monitoring/tree/main/alerting-rules) - -## Understanding Alert rules - -1. 
`AlphaClusterBeaconNodeDown`: This alert is activated when the beacon node in a specified Alpha cluster is offline. The beacon node is crucial for validating transactions and producing new blocks. Its unavailability could disrupt the overall functionality of the cluster. -2. `AlphaClusterBeaconNodeSyncing`: This alert indicates that the beacon node in a specified Alpha cluster is synchronizing, i.e., catching up with the latest blocks in the cluster. -3. `AlphaClusterNodeDown`: This alert is activated when a node in a specified Alpha cluster is offline. -4. `AlphaClusterMissedAttestations`: This alert indicates that there have been missed attestations in a specified Alpha cluster. Missed attestations may suggest that validators are not operating correctly, compromising the security and efficiency of the cluster. -5. `AlphaClusterInUnknownStatus`: This alert is designed to activate when a node within the "Alpha M1 Cluster #1" is detected to be in an unknown state. The condition is evaluated by checking whether the maximum of the app_monitoring_readyz metric is 0. -6. `AlphaClusterInsufficientPeers`: This alert is set to activate when the number of peers for a node in the Alpha M1 Cluster #1 is insufficient. The condition is evaluated by checking whether the maximum of the **`app_monitoring_readyz`** equals 4. -7. `AlphaClusterFailureRate`: This alert is activated when the failure rate of the Alpha M1 Cluster #1 exceeds a certain threshold. -8. `AlphaClusterVCMissingValidators`: This alert is activated if any validators in the Alpha M1 Cluster #1 are missing. -9. `AlphaClusterHighPctFailedSyncMsgDuty`: This alert is activated if a high percentage of sync message duties failed in the "Alpha M1 Cluster #1". The alert is activated if the sum of the increase in failed duties tagged with "sync_message" in the last hour divided by the sum of the increase in total duties tagged with "sync_message" in the last hour is greater than 0.1. -10. `AlphaClusterNumConnectedRelays`: This alert is activated if the number of connected relays in the "Alpha M1 Cluster #1" falls to 0. -11. `PeerPingLatency: 1`: This alert is activated if the 90th percentile of the ping latency to the peers in a cluster exceeds 500ms within 2 minutes. - -## ****Best Practices for Monitoring Charon Nodes & Cluster**** - -- **Establish Baselines**: Familiarize yourself with the normal operation metrics like CPU, memory, and network usage. This will help you detect anomalies. -- **Define Key Metrics**: Set up alerts for essential metrics, encompassing both system-level and Charon-specific ones. -- **Configure Alerts**: Based on these metrics, set up actionable alerts. -- **Monitor Network**: Regularly assess the connectivity between nodes and the network. -- **Perform Regular Health Checks**: Consistently evaluate the status of your nodes and clusters. -- **Monitor System Logs**: Keep an eye on logs for error messages or unusual activities. -- **Assess Resource Usage**: Ensure your nodes are neither over- nor under-utilized. -- **Automate Monitoring**: Use automation to ensure no issues go undetected. -- **Conduct Drills**: Regularly simulate failure scenarios to fine-tune your setup. -- **Update Regularly**: Keep your nodes and clusters updated with the latest software versions. 
- -## ****Third-Party Services for Uptime Testing**** - -- [updown.io](https://updown.io/) -- [Grafana synthetic Monitoring](https://grafana.com/blog/2022/03/10/best-practices-for-alerting-on-synthetic-monitoring-metrics-in-grafana-cloud/?src=ggl-s&mdm=cpc&camp=nb-synthetic-monitoring-pm&cnt=130224525351&trm=grafana%20synthetic%20monitoring&device=c&gclid=CjwKCAjwzJmlBhBBEiwAEJyLu4A0quHdic_UAyYuJgqUntwGTq6DKIFq0rfPkp9fxt4lK8VMgYmo4BoCO3EQAvD_BwE) - -## **Key metrics to watch to verify node health based on jobs** - -**node_exporter:** - -**CPU Usage**: High or spiking CPU usage can be a sign of a process demanding more resources than it should. - -**Memory Usage**: If a node is consistently running out of memory, it could be due to a memory leak or simply under-provisioning. - -**Disk I/O**: Slow disk operations can cause applications to hang or delay responses. High disk I/O can indicate storage performance issues or a sign of high load on the system. - -**Network Usage**: High network traffic or packet loss can signal network configuration issues, or that a service is being overwhelmed by requests. - -**Disk Space**: Running out of disk space can lead to application errors and data loss. - -**Uptime**: The amount of time a system has been up without any restarts. Frequent restarts can indicate instability in the system. - -**Error Rates**: The number of errors encountered by your application. This could be 4xx/5xx HTTP errors, exceptions, or any other kind of error your application may log. - -**Latency**: The delay before a transfer of data begins following an instruction for its transfer. - -It is also important to check: - -- NTP clock skew -- Process restarts and failures (eg. through `node_systemd`) -- alert on high error and panic log counts. \ No newline at end of file diff --git a/versioned_docs/version-v0.16.0/int/quickstart/advanced/adv-docker-configs.md b/versioned_docs/version-v0.16.0/int/quickstart/advanced/adv-docker-configs.md index c58bae2359..7c99383c22 100644 --- a/versioned_docs/version-v0.16.0/int/quickstart/advanced/adv-docker-configs.md +++ b/versioned_docs/version-v0.16.0/int/quickstart/advanced/adv-docker-configs.md @@ -1,5 +1,5 @@ --- -sidebar_position: 5 +sidebar_position: 6 description: Use advanced docker-compose features to have more flexibility and power to change the default configuration. 
--- diff --git a/versioned_docs/version-v0.16.0/int/quickstart/advanced/prysm-vc.md b/versioned_docs/version-v0.16.0/int/quickstart/advanced/prysm-vc.md index 50e9b349fe..c79b57b374 100644 --- a/versioned_docs/version-v0.16.0/int/quickstart/advanced/prysm-vc.md +++ b/versioned_docs/version-v0.16.0/int/quickstart/advanced/prysm-vc.md @@ -1,5 +1,5 @@ --- -sidebar_position: 6 +sidebar_position: 7 description: Run Prysm VCs in a DV --- diff --git a/versioned_docs/version-v0.16.0/int/quickstart/advanced/self-relay.md b/versioned_docs/version-v0.16.0/int/quickstart/advanced/self-relay.md index ae157214b7..dfe6042d23 100644 --- a/versioned_docs/version-v0.16.0/int/quickstart/advanced/self-relay.md +++ b/versioned_docs/version-v0.16.0/int/quickstart/advanced/self-relay.md @@ -1,5 +1,5 @@ --- -sidebar_position: 7 +sidebar_position: 8 description: Self-host a relay --- From 58e5de8d57c92f31bc96367bf7003398febe05d4 Mon Sep 17 00:00:00 2001 From: Maeliosa Date: Thu, 17 Aug 2023 09:53:25 +0100 Subject: [PATCH 07/12] push-metrics updated to push-metrics.md --- docs/int/quickstart/advanced/{push-metrics => push-metrics.md} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename docs/int/quickstart/advanced/{push-metrics => push-metrics.md} (100%) diff --git a/docs/int/quickstart/advanced/push-metrics b/docs/int/quickstart/advanced/push-metrics.md similarity index 100% rename from docs/int/quickstart/advanced/push-metrics rename to docs/int/quickstart/advanced/push-metrics.md From 1b6f8e52b3fa004de708244af3a5855f66bcaae0 Mon Sep 17 00:00:00 2001 From: Maeliosa Date: Thu, 17 Aug 2023 09:55:12 +0100 Subject: [PATCH 08/12] updated location of monitoring credentials --- docs/int/quickstart/{ => advanced}/monitoring-credentials.md | 0 docs/int/quickstart/advanced/quickstart-combine.md | 2 +- 2 files changed, 1 insertion(+), 1 deletion(-) rename docs/int/quickstart/{ => advanced}/monitoring-credentials.md (100%) diff --git a/docs/int/quickstart/monitoring-credentials.md b/docs/int/quickstart/advanced/monitoring-credentials.md similarity index 100% rename from docs/int/quickstart/monitoring-credentials.md rename to docs/int/quickstart/advanced/monitoring-credentials.md diff --git a/docs/int/quickstart/advanced/quickstart-combine.md b/docs/int/quickstart/advanced/quickstart-combine.md index ce6058e593..8d1025c641 100644 --- a/docs/int/quickstart/advanced/quickstart-combine.md +++ b/docs/int/quickstart/advanced/quickstart-combine.md @@ -1,5 +1,5 @@ --- -sidebar_position: 8 +sidebar_position: 9 description: Combine distributed validator private key shares to recover the validator private key. 
--- From 608a518b6d4c975ea31298ec2e7cbcae4d5c0249 Mon Sep 17 00:00:00 2001 From: Maeliosa Date: Thu, 17 Aug 2023 10:13:03 +0100 Subject: [PATCH 09/12] push metrics section removed from monitoring page --- .../advanced/monitoring-credentials.md | 36 ------------------- 1 file changed, 36 deletions(-) diff --git a/docs/int/quickstart/advanced/monitoring-credentials.md b/docs/int/quickstart/advanced/monitoring-credentials.md index 9ce2929f9e..fdbec169b9 100644 --- a/docs/int/quickstart/advanced/monitoring-credentials.md +++ b/docs/int/quickstart/advanced/monitoring-credentials.md @@ -14,42 +14,6 @@ Ensure the following software are installed: - Prometheus: You can install it using the guide available **[here](https://prometheus.io/docs/prometheus/latest/installation/)** - Grafana: Follow this **[link](https://grafana.com/docs/grafana/latest/setup-grafana/installation/)** to install Grafana -## Push Metrics to Obol Monitoring - -:::info -This is **optional** and does not confer any special privileges within the Obol Network. -::: - -You may have been provided with **Monitoring Credentials** used to push distributed validator metrics to Obol's central prometheus cluster to monitor, analyze, and improve your Distributed Validator Cluster's performance. - -The provided credentials needs to be added in `prometheus/prometheus.yml` replacing `$PROM_REMOTE_WRITE_TOKEN` and will look like: -``` -obol20!tnt8U!C... -``` - -The updated `prometheus/prometheus.yml` file should look like: -``` -global: - scrape_interval: 30s # Set the scrape interval to every 30 seconds. - evaluation_interval: 30s # Evaluate rules every 30 seconds. - -remote_write: - - url: https://vm.monitoring.gcp.obol.tech/write - authorization: - credentials: obol20!tnt8U!C... - -scrape_configs: - - job_name: 'charon' - static_configs: - - targets: ['charon:3620'] - - job_name: "lodestar" - static_configs: - - targets: [ "lodestar:5064" ] - - job_name: 'node-exporter' - static_configs: - - targets: ['node-exporter:9100'] -``` - ## Import Pre-Configured Charon Dashboards - Navigate to the **[repository](https://github.com/ObolNetwork/monitoring/tree/main/dashboards)** that contains a variety of Grafana dashboards. For this demonstration, we will utilize the Charon Dashboard json. 
From a3085d28c2823b7f78a49deb23a455f1e7703c52 Mon Sep 17 00:00:00 2001 From: Maeliosa Date: Thu, 31 Aug 2023 15:07:02 +0100 Subject: [PATCH 10/12] Push metric page URL slug changed to 'monitoring' --- docs/int/quickstart/advanced/{push-metrics.md => monitoring.md} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename docs/int/quickstart/advanced/{push-metrics.md => monitoring.md} (100%) diff --git a/docs/int/quickstart/advanced/push-metrics.md b/docs/int/quickstart/advanced/monitoring.md similarity index 100% rename from docs/int/quickstart/advanced/push-metrics.md rename to docs/int/quickstart/advanced/monitoring.md From 0d4905f53c3271608ac04799858ad17693b8ebfe Mon Sep 17 00:00:00 2001 From: Maeliosa Date: Wed, 6 Sep 2023 11:45:43 +0100 Subject: [PATCH 11/12] slugs changed for monitoring docs --- .../advanced/monitoring-credentials.md | 100 ------------- docs/int/quickstart/advanced/monitoring.md | 132 +++++++++++++----- .../quickstart/advanced/obol-monitoring.md | 40 ++++++ 3 files changed, 136 insertions(+), 136 deletions(-) delete mode 100644 docs/int/quickstart/advanced/monitoring-credentials.md create mode 100644 docs/int/quickstart/advanced/obol-monitoring.md diff --git a/docs/int/quickstart/advanced/monitoring-credentials.md b/docs/int/quickstart/advanced/monitoring-credentials.md deleted file mode 100644 index fdbec169b9..0000000000 --- a/docs/int/quickstart/advanced/monitoring-credentials.md +++ /dev/null @@ -1,100 +0,0 @@ ---- -sidebar_position: 4 -description: Add monitoring credentials to help the Obol Team monitor the health of your cluster ---- -# Getting Started Monitoring your Node - -Welcome to this comprehensive guide, designed to assist you in effectively monitoring your Charon cluster and nodes, and setting up alerts based on specified parameters. - -## Pre-requisites - -Ensure the following software are installed: - -- Docker: Find the installation guide for Ubuntu **[here](https://docs.docker.com/engine/install/ubuntu/)** -- Prometheus: You can install it using the guide available **[here](https://prometheus.io/docs/prometheus/latest/installation/)** -- Grafana: Follow this **[link](https://grafana.com/docs/grafana/latest/setup-grafana/installation/)** to install Grafana - -## Import Pre-Configured Charon Dashboards - -- Navigate to the **[repository](https://github.com/ObolNetwork/monitoring/tree/main/dashboards)** that contains a variety of Grafana dashboards. For this demonstration, we will utilize the Charon Dashboard json. - -- In your Grafana interface, create a new dashboard and select the import option. - -- Copy the content of the Charon Dashboard json from the repository and paste it into the import box in Grafana. Click "Load" to proceed. - -- Finalize the import by clicking on the "Import" button. At this point, your dashboard should begin displaying metrics. Ensure your Charon client and Prometheus are operational for this to occur. - -## Example Alerting Rules - -To create alerts for Node-Exporter, follow these steps based on the sample rules provided on the "Awesome Prometheus alerts" page: - -1. Visit the **[Awesome Prometheus alerts](https://samber.github.io/awesome-prometheus-alerts/rules.html#host-and-hardware)** page. Here, you will find lists of Prometheus alerting rules categorized by hardware, system, and services. - -2. Depending on your need, select the category of alerts. For example, if you want to set up alerts for your system's CPU usage, click on the 'CPU' under the 'Host & Hardware' category. - -3. 
On the selected page, you'll find specific alert rules like 'High CPU Usage'. Each rule will provide the PromQL expression, alert name, and a brief description of what the alert does. You can copy these rules. - -4. Paste the copied rules into your Prometheus configuration file under the `rules` section. Make sure you understand each rule before adding it to avoid unnecessary alerts. - -5. Finally, save and apply the configuration file. Prometheus should now trigger alerts based on these rules. - - -For alerts specific to Charon/Alpha, refer to the alerting rules available on this [ObolNetwork/monitoring](https://github.com/ObolNetwork/monitoring/tree/main/alerting-rules). - -## Understanding Alert Rules - -1. `ClusterBeaconNodeDown`This alert is activated when the beacon node in a specified Alpha cluster is offline. The beacon node is crucial for validating transactions and producing new blocks. Its unavailability could disrupt the overall functionality of the cluster. -2. `ClusterBeaconNodeSyncing`This alert indicates that the beacon node in a specified Alpha cluster is synchronizing, i.e., catching up with the latest blocks in the cluster. -3. `ClusterNodeDown`This alert is activated when a node in a specified Alpha cluster is offline. -4. `ClusterMissedAttestations`:This alert indicates that there have been missed attestations in a specified Alpha cluster. Missed attestations may suggest that validators are not operating correctly, compromising the security and efficiency of the cluster. -5. `ClusterInUnknownStatus`: This alert is designed to activate when a node within the cluster is detected to be in an unknown state. The condition is evaluated by checking whether the maximum of the app_monitoring_readyz metric is 0. -6. `ClusterInsufficientPeers`:This alert is set to activate when the number of peers for a node in the Alpha M1 Cluster #1 is insufficient. The condition is evaluated by checking whether the maximum of the **`app_monitoring_readyz`** equals 4. -7. `ClusterFailureRate`: This alert is activated when the failure rate of the Alpha M1 Cluster #1 exceeds a certain threshold. -8. `ClusterVCMissingValidators`: This alert is activated if any validators in the Alpha M1 Cluster #1 are missing. -9. `ClusterHighPctFailedSyncMsgDuty`: This alert is activated if a high percentage of sync message duties failed in the cluster. The alert is activated if the sum of the increase in failed duties tagged with "sync_message" in the last hour divided by the sum of the increase in total duties tagged with "sync_message" in the last hour is greater than 0.1. -10. `ClusterNumConnectedRelays`: This alert is activated if the number of connected relays in the cluster falls to 0. -11. PeerPingLatency: 1. This alert is activated if the 90th percentile of the ping latency to the peers in a cluster exceeds 500ms within 2 minutes. - -## Best Practices for Monitoring Charon Nodes & Cluster - -- **Establish Baselines**: Familiarize yourself with the normal operation metrics like CPU, memory, and network usage. This will help you detect anomalies. -- **Define Key Metrics**: Set up alerts for essential metrics, encompassing both system-level and Charon-specific ones. -- **Configure Alerts**: Based on these metrics, set up actionable alerts. -- **Monitor Network**: Regularly assess the connectivity between nodes and the network. -- **Perform Regular Health Checks**: Consistently evaluate the status of your nodes and clusters. 
-- **Monitor System Logs**: Keep an eye on logs for error messages or unusual activities. -- **Assess Resource Usage**: Ensure your nodes are neither over- nor under-utilized. -- **Automate Monitoring**: Use automation to ensure no issues go undetected. -- **Conduct Drills**: Regularly simulate failure scenarios to fine-tune your setup. -- **Update Regularly**: Keep your nodes and clusters updated with the latest software versions. - -## Third-Party Services for Uptime Testing - -- [updown.io](https://updown.io/) -- [Grafana synthetic Monitoring](https://grafana.com/grafana/plugins/grafana-synthetic-monitoring-app/) - -## Key metrics to watch to verify node health based on jobs - -- Node Exporter: - -**CPU Usage**: High or spiking CPU usage can be a sign of a process demanding more resources than it should. - -**Memory Usage**: If a node is consistently running out of memory, it could be due to a memory leak or simply under-provisioning. - -**Disk I/O**: Slow disk operations can cause applications to hang or delay responses. High disk I/O can indicate storage performance issues or a sign of high load on the system. - -**Network Usage**: High network traffic or packet loss can signal network configuration issues, or that a service is being overwhelmed by requests. - -**Disk Space**: Running out of disk space can lead to application errors and data loss. - -**Uptime**: The amount of time a system has been up without any restarts. Frequent restarts can indicate instability in the system. - -**Error Rates**: The number of errors encountered by your application. This could be 4xx/5xx HTTP errors, exceptions, or any other kind of error your application may log. - -**Latency**: The delay before a transfer of data begins following an instruction for its transfer. - -It is also important to check: - -- NTP clock skew -- Process restarts and failures (eg. through `node_systemd`) -- alert on high error and panic log counts. \ No newline at end of file diff --git a/docs/int/quickstart/advanced/monitoring.md b/docs/int/quickstart/advanced/monitoring.md index 8d9e0ceca1..fdbec169b9 100644 --- a/docs/int/quickstart/advanced/monitoring.md +++ b/docs/int/quickstart/advanced/monitoring.md @@ -1,40 +1,100 @@ --- -sidebar_position: 5 +sidebar_position: 4 description: Add monitoring credentials to help the Obol Team monitor the health of your cluster --- +# Getting Started Monitoring your Node -# Push Metrics to Obol Monitoring - -:::info -This is **optional** and does not confer any special privileges within the Obol Network. -::: - -You may have been provided with **Monitoring Credentials** used to push distributed validator metrics to Obol's central prometheus cluster to monitor, analyze, and improve your Distributed Validator Cluster's performance. - -The provided credentials needs to be added in `prometheus/prometheus.yml` replacing `$PROM_REMOTE_WRITE_TOKEN` and will look like: -``` -obol20!tnt8U!C... -``` - -The updated `prometheus/prometheus.yml` file should look like: -``` -global: - scrape_interval: 30s # Set the scrape interval to every 30 seconds. - evaluation_interval: 30s # Evaluate rules every 30 seconds. - -remote_write: - - url: https://vm.monitoring.gcp.obol.tech/write - authorization: - credentials: obol20!tnt8U!C... 
-scrape_configs:
-  - job_name: 'charon'
-    static_configs:
-      - targets: ['charon:3620']
-  - job_name: "lodestar"
-    static_configs:
-      - targets: [ "lodestar:5064" ]
-  - job_name: 'node-exporter'
-    static_configs:
-      - targets: ['node-exporter:9100']
-```
\ No newline at end of file
+# Getting Started Monitoring your Node
+
+Welcome to this comprehensive guide, designed to assist you in effectively monitoring your Charon cluster and nodes, and setting up alerts based on specified parameters.
+
+## Pre-requisites
+
+Ensure the following software is installed:
+
+- Docker: Find the installation guide for Ubuntu **[here](https://docs.docker.com/engine/install/ubuntu/)**
+- Prometheus: You can install it using the guide available **[here](https://prometheus.io/docs/prometheus/latest/installation/)**
+- Grafana: Follow this **[link](https://grafana.com/docs/grafana/latest/setup-grafana/installation/)** to install Grafana
+
+## Import Pre-Configured Charon Dashboards
+
+- Navigate to the **[repository](https://github.com/ObolNetwork/monitoring/tree/main/dashboards)** that contains a variety of Grafana dashboards. For this demonstration, we will utilize the Charon Dashboard json.
+
+- In your Grafana interface, create a new dashboard and select the import option.
+
+- Copy the content of the Charon Dashboard json from the repository and paste it into the import box in Grafana. Click "Load" to proceed.
+
+- Finalize the import by clicking on the "Import" button. At this point, your dashboard should begin displaying metrics. Ensure your Charon client and Prometheus are operational for this to occur.
+
+## Example Alerting Rules
+
+To create alerts for Node-Exporter, follow these steps based on the sample rules provided on the "Awesome Prometheus alerts" page:
+
+1. Visit the **[Awesome Prometheus alerts](https://samber.github.io/awesome-prometheus-alerts/rules.html#host-and-hardware)** page. Here, you will find lists of Prometheus alerting rules categorized by hardware, system, and services.
+
+2. Depending on your need, select the category of alerts. For example, if you want to set up alerts for your system's CPU usage, click on the 'CPU' under the 'Host & Hardware' category.
+
+3. On the selected page, you'll find specific alert rules like 'High CPU Usage'. Each rule will provide the PromQL expression, alert name, and a brief description of what the alert does. You can copy these rules.
+
+4. Paste the copied rules into a rules file referenced by the `rule_files` section of your Prometheus configuration. Make sure you understand each rule before adding it to avoid unnecessary alerts.
+
+5. Finally, save and apply the configuration file. Prometheus should now trigger alerts based on these rules.
+
+
+For alerts specific to Charon/Alpha, refer to the alerting rules available in the [ObolNetwork/monitoring](https://github.com/ObolNetwork/monitoring/tree/main/alerting-rules) repository.
+
+## Understanding Alert Rules
+
+1. `ClusterBeaconNodeDown`: This alert is activated when the beacon node in a specified Alpha cluster is offline. The beacon node is crucial for validating transactions and producing new blocks. Its unavailability could disrupt the overall functionality of the cluster.
+2. `ClusterBeaconNodeSyncing`: This alert indicates that the beacon node in a specified Alpha cluster is synchronizing, i.e., catching up with the latest blocks in the cluster.
+3. `ClusterNodeDown`: This alert is activated when a node in a specified Alpha cluster is offline.
+4. `ClusterMissedAttestations`: This alert indicates that there have been missed attestations in a specified Alpha cluster. Missed attestations may suggest that validators are not operating correctly, compromising the security and efficiency of the cluster.
+5. `ClusterInUnknownStatus`: This alert is designed to activate when a node within the cluster is detected to be in an unknown state. The condition is evaluated by checking whether the maximum of the `app_monitoring_readyz` metric is 0.
+6. `ClusterInsufficientPeers`: This alert is set to activate when the number of peers for a node in the cluster is insufficient. The condition is evaluated by checking whether the maximum of the **`app_monitoring_readyz`** metric equals 4.
+7. `ClusterFailureRate`: This alert is activated when the failure rate of the cluster exceeds a certain threshold.
+8. `ClusterVCMissingValidators`: This alert is activated if any validators in the cluster are missing.
+9. `ClusterHighPctFailedSyncMsgDuty`: This alert is activated if a high percentage of sync message duties failed in the cluster. It fires when the increase in failed duties tagged "sync_message" over the last hour, divided by the increase in total duties tagged "sync_message" over the same hour, is greater than 0.1.
+10. `ClusterNumConnectedRelays`: This alert is activated if the number of connected relays in the cluster falls to 0.
+11. `PeerPingLatency`: This alert is activated if the 90th percentile of the ping latency to the peers in a cluster exceeds 500ms within 2 minutes.
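+
+To make the duty-failure ratio in rule 9 concrete, here is a sketch of the corresponding PromQL expression as a rule fragment. The metric names `core_tracker_failed_duties_total` and `core_tracker_duty_total` are illustrative placeholders; substitute the duty counters your Charon version actually exports:
+
+```yaml
+- alert: ClusterHighPctFailedSyncMsgDuty
+  # Failed "sync_message" duties over the last hour divided by all
+  # "sync_message" duties over the same hour (illustrative metric names).
+  expr: |
+    sum(increase(core_tracker_failed_duties_total{duty="sync_message"}[1h]))
+    /
+    sum(increase(core_tracker_duty_total{duty="sync_message"}[1h])) > 0.1
+```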
+
+## Best Practices for Monitoring Charon Nodes & Cluster
+
+- **Establish Baselines**: Familiarize yourself with normal operating metrics like CPU, memory, and network usage. This will help you detect anomalies.
+- **Define Key Metrics**: Set up alerts for essential metrics, encompassing both system-level and Charon-specific ones.
+- **Configure Alerts**: Based on these metrics, set up actionable alerts.
+- **Monitor Network**: Regularly assess the connectivity between nodes and the network.
+- **Perform Regular Health Checks**: Consistently evaluate the status of your nodes and clusters.
+- **Monitor System Logs**: Keep an eye on logs for error messages or unusual activities.
+- **Assess Resource Usage**: Ensure your nodes are neither over- nor under-utilized.
+- **Automate Monitoring**: Use automation to ensure no issues go undetected.
+- **Conduct Drills**: Regularly simulate failure scenarios to fine-tune your setup.
+- **Update Regularly**: Keep your nodes and clusters updated with the latest software versions.
+
+## Third-Party Services for Uptime Testing
+
+- [updown.io](https://updown.io/)
+- [Grafana Synthetic Monitoring](https://grafana.com/grafana/plugins/grafana-synthetic-monitoring-app/)
+
+## Key metrics to watch to verify node health based on jobs
+
+### Node Exporter
+
+**CPU Usage**: High or spiking CPU usage can be a sign of a process demanding more resources than it should.
+
+**Memory Usage**: If a node is consistently running out of memory, it could be due to a memory leak or simply under-provisioning.
+
+**Disk I/O**: Slow disk operations can cause applications to hang or delay responses. High disk I/O can indicate storage performance issues or high load on the system.
+
+**Network Usage**: High network traffic or packet loss can signal network configuration issues, or that a service is being overwhelmed by requests.
+
+**Disk Space**: Running out of disk space can lead to application errors and data loss.
+
+**Uptime**: The amount of time a system has been up without any restarts.
Frequent restarts can indicate instability in the system. + +**Error Rates**: The number of errors encountered by your application. This could be 4xx/5xx HTTP errors, exceptions, or any other kind of error your application may log. + +**Latency**: The delay before a transfer of data begins following an instruction for its transfer. + +It is also important to check: + +- NTP clock skew +- Process restarts and failures (eg. through `node_systemd`) +- alert on high error and panic log counts. \ No newline at end of file diff --git a/docs/int/quickstart/advanced/obol-monitoring.md b/docs/int/quickstart/advanced/obol-monitoring.md new file mode 100644 index 0000000000..8d9e0ceca1 --- /dev/null +++ b/docs/int/quickstart/advanced/obol-monitoring.md @@ -0,0 +1,40 @@ +--- +sidebar_position: 5 +description: Add monitoring credentials to help the Obol Team monitor the health of your cluster +--- + +# Push Metrics to Obol Monitoring + +:::info +This is **optional** and does not confer any special privileges within the Obol Network. +::: + +You may have been provided with **Monitoring Credentials** used to push distributed validator metrics to Obol's central prometheus cluster to monitor, analyze, and improve your Distributed Validator Cluster's performance. + +The provided credentials needs to be added in `prometheus/prometheus.yml` replacing `$PROM_REMOTE_WRITE_TOKEN` and will look like: +``` +obol20!tnt8U!C... +``` + +The updated `prometheus/prometheus.yml` file should look like: +``` +global: + scrape_interval: 30s # Set the scrape interval to every 30 seconds. + evaluation_interval: 30s # Evaluate rules every 30 seconds. + +remote_write: + - url: https://vm.monitoring.gcp.obol.tech/write + authorization: + credentials: obol20!tnt8U!C... + +scrape_configs: + - job_name: 'charon' + static_configs: + - targets: ['charon:3620'] + - job_name: "lodestar" + static_configs: + - targets: [ "lodestar:5064" ] + - job_name: 'node-exporter' + static_configs: + - targets: ['node-exporter:9100'] +``` \ No newline at end of file From adf04f867549d167185b37f89af426bc8492c754 Mon Sep 17 00:00:00 2001 From: thomasheremans Date: Mon, 25 Sep 2023 13:21:58 +0100 Subject: [PATCH 12/12] resolve v0.17 missing --- ...onitoring-credentials.md => monitoring.md} | 0 .../quickstart/advanced/obol-monitoring.md | 0 .../int/quickstart/advanced/push-metrics.md | 40 ------------------- 3 files changed, 40 deletions(-) rename versioned_docs/version-v0.17.0/int/quickstart/advanced/{monitoring-credentials.md => monitoring.md} (100%) rename docs/int/quickstart/advanced/push-metrics.md => versioned_docs/version-v0.17.0/int/quickstart/advanced/obol-monitoring.md (100%) delete mode 100644 versioned_docs/version-v0.17.0/int/quickstart/advanced/push-metrics.md diff --git a/versioned_docs/version-v0.17.0/int/quickstart/advanced/monitoring-credentials.md b/versioned_docs/version-v0.17.0/int/quickstart/advanced/monitoring.md similarity index 100% rename from versioned_docs/version-v0.17.0/int/quickstart/advanced/monitoring-credentials.md rename to versioned_docs/version-v0.17.0/int/quickstart/advanced/monitoring.md diff --git a/docs/int/quickstart/advanced/push-metrics.md b/versioned_docs/version-v0.17.0/int/quickstart/advanced/obol-monitoring.md similarity index 100% rename from docs/int/quickstart/advanced/push-metrics.md rename to versioned_docs/version-v0.17.0/int/quickstart/advanced/obol-monitoring.md diff --git a/versioned_docs/version-v0.17.0/int/quickstart/advanced/push-metrics.md 
b/versioned_docs/version-v0.17.0/int/quickstart/advanced/push-metrics.md deleted file mode 100644 index 8d9e0ceca1..0000000000 --- a/versioned_docs/version-v0.17.0/int/quickstart/advanced/push-metrics.md +++ /dev/null @@ -1,40 +0,0 @@ ---- -sidebar_position: 5 -description: Add monitoring credentials to help the Obol Team monitor the health of your cluster ---- - -# Push Metrics to Obol Monitoring - -:::info -This is **optional** and does not confer any special privileges within the Obol Network. -::: - -You may have been provided with **Monitoring Credentials** used to push distributed validator metrics to Obol's central prometheus cluster to monitor, analyze, and improve your Distributed Validator Cluster's performance. - -The provided credentials needs to be added in `prometheus/prometheus.yml` replacing `$PROM_REMOTE_WRITE_TOKEN` and will look like: -``` -obol20!tnt8U!C... -``` - -The updated `prometheus/prometheus.yml` file should look like: -``` -global: - scrape_interval: 30s # Set the scrape interval to every 30 seconds. - evaluation_interval: 30s # Evaluate rules every 30 seconds. - -remote_write: - - url: https://vm.monitoring.gcp.obol.tech/write - authorization: - credentials: obol20!tnt8U!C... - -scrape_configs: - - job_name: 'charon' - static_configs: - - targets: ['charon:3620'] - - job_name: "lodestar" - static_configs: - - targets: [ "lodestar:5064" ] - - job_name: 'node-exporter' - static_configs: - - targets: ['node-exporter:9100'] -``` \ No newline at end of file