From b7fd81d7f8080097eb49df27b915024edeb93e1d Mon Sep 17 00:00:00 2001
From: haroldsphinx
Date: Wed, 12 Jul 2023 12:22:13 +0100
Subject: [PATCH 01/12] Monitoring & Alerting guide

Signed-off-by: haroldsphinx
---
 docs/int/quickstart/monitoring.md | 94 +++++++++++++++++++++++++++++++
 1 file changed, 94 insertions(+)
 create mode 100644 docs/int/quickstart/monitoring.md

diff --git a/docs/int/quickstart/monitoring.md b/docs/int/quickstart/monitoring.md
new file mode 100644
index 0000000000..bd7152f3a1
--- /dev/null
+++ b/docs/int/quickstart/monitoring.md
@@ -0,0 +1,94 @@
+# Getting Started Monitoring your Node
+
+Welcome to this comprehensive guide, designed to assist you in effectively monitoring your Charon cluster and nodes, and setting up alerts based on specified parameters.
+
+## Pre-requisites
+
+Ensure the following software is installed:
+
+- Docker: Find the installation guide for Ubuntu **[here](https://docs.docker.com/engine/install/ubuntu/)**
+- Prometheus: You can install it using the guide available **[here](https://prometheus.io/docs/prometheus/latest/installation/)**
+- Grafana: Follow this **[link](https://grafana.com/docs/grafana/latest/setup-grafana/installation/)** to install Grafana
+
+## Import Pre-Configured Charon Dashboards
+
+- Navigate to the **[repository](https://github.com/ObolNetwork/terraform-modules/tree/main/grafana-dashboards/dashboards)** that contains a variety of Grafana dashboards. For this demonstration, we will utilize the Charon Dashboard json.
+- In your Grafana interface, create a new dashboard and select the import option.
+
+![Screenshot 2023-06-26 at 1.00.05 PM.png](https://s3-us-west-2.amazonaws.com/secure.notion-static.com/2bba3f52-ff32-452e-811b-f2ac7a4905fb/Screenshot_2023-06-26_at_1.00.05_PM.png)
+
+- Copy the content of the Charon Dashboard json from the repository and paste it into the import box in Grafana. Click "Load" to proceed.
+
+![Screenshot 2023-06-26 at 1.03.08 PM.png](https://s3-us-west-2.amazonaws.com/secure.notion-static.com/6790e67a-eb51-4bfb-b7b1-df14f214b72d/Screenshot_2023-06-26_at_1.03.08_PM.png)
+
+- Finalize the import by clicking on the "Import" button. At this point, your dashboard should begin displaying metrics. Ensure your Charon client and Prometheus are operational for this to occur.
+
+![Screenshot 2023-06-26 at 1.16.27 PM.png](https://s3-us-west-2.amazonaws.com/secure.notion-static.com/cc0b4a9e-c21c-4ce4-b613-9c3f84e696ed/Screenshot_2023-06-26_at_1.16.27_PM.png)
+
+## Example alerting rules
+
+- Alerts for Node-Exporter can be created using the sample rules provided here:
+
+[Awesome Prometheus alerts](https://samber.github.io/awesome-prometheus-alerts/rules.html#host-and-hardware)
+
+- For Charon/Alpha alerts, refer to the alerting rules available here:
+
+[monitoring/alerting-rules at main · ObolNetwork/monitoring](https://github.com/ObolNetwork/monitoring/tree/main/alerting-rules)
+
+## Understanding Alert rules
+
+1. `AlphaClusterBeaconNodeDown`: This alert is activated when the beacon node in a specified Alpha cluster is offline. The beacon node is crucial for validating transactions and producing new blocks. Its unavailability could disrupt the overall functionality of the cluster.
+2. `AlphaClusterBeaconNodeSyncing`: This alert indicates that the beacon node in a specified Alpha cluster is synchronizing, i.e., catching up with the latest blocks in the cluster.
+3. `AlphaClusterNodeDown`: This alert is activated when a node in a specified Alpha cluster is offline.
+4. `AlphaClusterMissedAttestations`: This alert indicates that there have been missed attestations in a specified Alpha cluster. Missed attestations may suggest that validators are not operating correctly, compromising the security and efficiency of the cluster.
+5. `AlphaClusterInUnknownStatus`: This alert is designed to activate when a node within the "Alpha M1 Cluster #1" is detected to be in an unknown state. The condition is evaluated by checking whether the maximum of the `app_monitoring_readyz` metric is 0.
+6. `AlphaClusterInsufficientPeers`: This alert is set to activate when the number of peers for a node in the Alpha M1 Cluster #1 is insufficient. The condition is evaluated by checking whether the maximum of the **`app_monitoring_readyz`** metric equals 4.
+7. `AlphaClusterFailureRate`: This alert is activated when the failure rate of the Alpha M1 Cluster #1 exceeds a certain threshold.
+8. `AlphaClusterVCMissingValidators`: This alert is activated if any validators in the Alpha M1 Cluster #1 are missing.
+9. `AlphaClusterHighPctFailedSyncMsgDuty`: This alert is activated if a high percentage of sync message duties failed in the "Alpha M1 Cluster #1". It fires when the increase in failed duties tagged "sync_message" over the last hour, divided by the increase in total duties tagged "sync_message" over the same hour, is greater than 0.1.
+10. `AlphaClusterNumConnectedRelays`: This alert is activated if the number of connected relays in the "Alpha M1 Cluster #1" falls to 0.
+11. `PeerPingLatency`: This alert is activated if the 90th percentile of the ping latency to the peers in a cluster exceeds 500ms within 2 minutes.
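+
+As an illustration, the two `app_monitoring_readyz` conditions above can be expressed as Prometheus alerting rules. The following is a minimal sketch; the group name, durations, and severity labels are placeholders to adapt to your own setup:
+
+```yaml
+groups:
+  - name: charon-cluster-alerts
+    rules:
+      - alert: AlphaClusterInUnknownStatus
+        # Node reports an unknown state (readyz metric equals 0).
+        expr: max(app_monitoring_readyz) == 0
+        for: 5m
+        labels:
+          severity: critical
+      - alert: AlphaClusterInsufficientPeers
+        # Node reports insufficient peers (readyz metric equals 4).
+        expr: max(app_monitoring_readyz) == 4
+        for: 5m
+        labels:
+          severity: warning
+```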
+
+## Best Practices for Monitoring Charon Nodes & Cluster
+
+- **Establish Baselines**: Familiarize yourself with normal operating metrics like CPU, memory, and network usage. This will help you detect anomalies.
+- **Define Key Metrics**: Set up alerts for essential metrics, encompassing both system-level and Charon-specific ones.
+- **Configure Alerts**: Based on these metrics, set up actionable alerts.
+- **Monitor Network**: Regularly assess the connectivity between nodes and the network.
+- **Perform Regular Health Checks**: Consistently evaluate the status of your nodes and clusters.
+- **Monitor System Logs**: Keep an eye on logs for error messages or unusual activities.
+- **Assess Resource Usage**: Ensure your nodes are neither over- nor under-utilized.
+- **Automate Monitoring**: Use automation to ensure no issues go undetected.
+- **Conduct Drills**: Regularly simulate failure scenarios to fine-tune your setup.
+- **Update Regularly**: Keep your nodes and clusters updated with the latest software versions.
+
+## Third-Party Services for Uptime Testing
+
+- [updown.io](https://updown.io/)
+- [Grafana Synthetic Monitoring](https://grafana.com/blog/2022/03/10/best-practices-for-alerting-on-synthetic-monitoring-metrics-in-grafana-cloud/)
+
+## Key metrics to watch to verify node health based on jobs
+
+**node_exporter:**
+
+**CPU Usage**: High or spiking CPU usage can be a sign of a process demanding more resources than it should.
+
+**Memory Usage**: If a node is consistently running out of memory, it could be due to a memory leak or simply under-provisioning.
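+
+For instance, the CPU and memory checks above map to node-exporter rules along these lines (a sketch adapted from the Awesome Prometheus alerts collection; these are fragments to place under a rule group's `rules` list, and the thresholds and windows should be tuned to your own baseline):
+
+```yaml
+- alert: HostHighCpuLoad
+  # Average CPU utilisation across all cores above 80% for 10 minutes.
+  expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 80
+  for: 10m
+- alert: HostOutOfMemory
+  # Less than 10% of memory still available.
+  expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
+  for: 5m
+```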
+ +**Disk I/O**: Slow disk operations can cause applications to hang or delay responses. High disk I/O can indicate storage performance issues or a sign of high load on the system. + +**Network Usage**: High network traffic or packet loss can signal network configuration issues, or that a service is being overwhelmed by requests. + +**Disk Space**: Running out of disk space can lead to application errors and data loss. + +**Uptime**: The amount of time a system has been up without any restarts. Frequent restarts can indicate instability in the system. + +**Error Rates**: The number of errors encountered by your application. This could be 4xx/5xx HTTP errors, exceptions, or any other kind of error your application may log. + +**Latency**: The delay before a transfer of data begins following an instruction for its transfer. + +It is also important to check: + +- NTP clock skew +- Process restarts and failures (eg. through `node_systemd`) +- alert on high error and panic log counts. \ No newline at end of file From 0304cad66f22692a892f491ffd83441afec9b970 Mon Sep 17 00:00:00 2001 From: haroldsphinx Date: Tue, 25 Jul 2023 10:09:16 +0100 Subject: [PATCH 02/12] Revert changes made to version docs Signed-off-by: haroldsphinx --- .../advanced/monitoring-credentials.md | 97 +------------------ 1 file changed, 1 insertion(+), 96 deletions(-) diff --git a/versioned_docs/version-v0.16.0/int/quickstart/advanced/monitoring-credentials.md b/versioned_docs/version-v0.16.0/int/quickstart/advanced/monitoring-credentials.md index b2ae72f552..046aacc41f 100644 --- a/versioned_docs/version-v0.16.0/int/quickstart/advanced/monitoring-credentials.md +++ b/versioned_docs/version-v0.16.0/int/quickstart/advanced/monitoring-credentials.md @@ -2,17 +2,6 @@ sidebar_position: 4 description: Add monitoring credentials to help the Obol Team monitor the health of your cluster --- -# Getting Started Monitoring your Node - -Welcome to this comprehensive guide, designed to assist you in effectively monitoring your Charon cluster and nodes, and setting up alerts based on specified parameters. - -## Pre-requisites - -Ensure the following software are installed: - -- Docker: Find the installation guide for Ubuntu **[here](https://docs.docker.com/engine/install/ubuntu/)** -- Prometheus: You can install it using the guide available **[here](https://prometheus.io/docs/prometheus/latest/installation/)** -- Grafana: Follow this **[link](https://grafana.com/docs/grafana/latest/setup-grafana/installation/)** to install Grafana # Push metrics to Obol Monitoring @@ -48,88 +37,4 @@ scrape_configs: - job_name: 'node-exporter' static_configs: - targets: ['node-exporter:9100'] -``` - -## Import Pre-Configured Charon Dashboards - -- Navigate to the **[repository](https://github.com/ObolNetwork/monitoring/tree/main/dashboards)** that contains a variety of Grafana dashboards. For this demonstration, we will utilize the Charon Dashboard json. -- In your Grafana interface, create a new dashboard and select the import option. - -- Copy the content of the Charon Dashboard json from the repository and paste it into the import box in Grafana. Click "Load" to proceed. - -- Finalize the import by clicking on the "Import" button. At this point, your dashboard should begin displaying metrics. Ensure your Charon client and Prometheus are operational for this to occur. - -## Example alerting rules - -To create alerts for Node-Exporter, follow these steps based on the sample rules provided on the "Awesome Prometheus alerts" page: - -1. 
Visit the **[Awesome Prometheus alerts](https://samber.github.io/awesome-prometheus-alerts/rules.html#host-and-hardware)** page. Here, you will find lists of Prometheus alerting rules categorized by hardware, system, and services. - -2. Depending on your need, select the category of alerts. For example, if you want to set up alerts for your system's CPU usage, click on the 'CPU' under the 'Host & Hardware' category. - -3. On the selected page, you'll find specific alert rules like 'High CPU Usage'. Each rule will provide the PromQL expression, alert name, and a brief description of what the alert does. You can copy these rules. - -4. Paste the copied rules into your Prometheus configuration file under the `rules` section. Make sure you understand each rule before adding it to avoid unnecessary alerts. - -5. Finally, save and apply the configuration file. Prometheus should now trigger alerts based on these rules. - - -For alerts specific to Charon/Alpha, refer to the alerting rules available on this [ObolNetwork/monitoring](https://github.com/ObolNetwork/monitoring/tree/main/alerting-rules). - -## Understanding Alert rules - -1. `ClusterBeaconNodeDown`This alert is activated when the beacon node in a specified Alpha cluster is offline. The beacon node is crucial for validating transactions and producing new blocks. Its unavailability could disrupt the overall functionality of the cluster. -2. `ClusterBeaconNodeSyncing`This alert indicates that the beacon node in a specified Alpha cluster is synchronizing, i.e., catching up with the latest blocks in the cluster. -3. `ClusterNodeDown`This alert is activated when a node in a specified Alpha cluster is offline. -4. `ClusterMissedAttestations`:This alert indicates that there have been missed attestations in a specified Alpha cluster. Missed attestations may suggest that validators are not operating correctly, compromising the security and efficiency of the cluster. -5. `ClusterInUnknownStatus`: This alert is designed to activate when a node within the cluster is detected to be in an unknown state. The condition is evaluated by checking whether the maximum of the app_monitoring_readyz metric is 0. -6. `ClusterInsufficientPeers`:This alert is set to activate when the number of peers for a node in the Alpha M1 Cluster #1 is insufficient. The condition is evaluated by checking whether the maximum of the **`app_monitoring_readyz`** equals 4. -7. `ClusterFailureRate`: This alert is activated when the failure rate of the Alpha M1 Cluster #1 exceeds a certain threshold. -8. `ClusterVCMissingValidators`: This alert is activated if any validators in the Alpha M1 Cluster #1 are missing. -9. `ClusterHighPctFailedSyncMsgDuty`: This alert is activated if a high percentage of sync message duties failed in the cluster. The alert is activated if the sum of the increase in failed duties tagged with "sync_message" in the last hour divided by the sum of the increase in total duties tagged with "sync_message" in the last hour is greater than 0.1. -10. `ClusterNumConnectedRelays`: This alert is activated if the number of connected relays in the cluster falls to 0. -11. PeerPingLatency: 1. This alert is activated if the 90th percentile of the ping latency to the peers in a cluster exceeds 500ms within 2 minutes. - -## Best Practices for Monitoring Charon Nodes & Cluster - -- **Establish Baselines**: Familiarize yourself with the normal operation metrics like CPU, memory, and network usage. This will help you detect anomalies. 
-- **Define Key Metrics**: Set up alerts for essential metrics, encompassing both system-level and Charon-specific ones. -- **Configure Alerts**: Based on these metrics, set up actionable alerts. -- **Monitor Network**: Regularly assess the connectivity between nodes and the network. -- **Perform Regular Health Checks**: Consistently evaluate the status of your nodes and clusters. -- **Monitor System Logs**: Keep an eye on logs for error messages or unusual activities. -- **Assess Resource Usage**: Ensure your nodes are neither over- nor under-utilized. -- **Automate Monitoring**: Use automation to ensure no issues go undetected. -- **Conduct Drills**: Regularly simulate failure scenarios to fine-tune your setup. -- **Update Regularly**: Keep your nodes and clusters updated with the latest software versions. - -## Third-Party Services for Uptime Testing - -- [updown.io](https://updown.io/) -- [Grafana synthetic Monitoring](https://grafana.com/grafana/plugins/grafana-synthetic-monitoring-app/) - -## Key metrics to watch to verify node health based on jobs - -### Node Exporter: - -**CPU Usage**: High or spiking CPU usage can be a sign of a process demanding more resources than it should. - -**Memory Usage**: If a node is consistently running out of memory, it could be due to a memory leak or simply under-provisioning. - -**Disk I/O**: Slow disk operations can cause applications to hang or delay responses. High disk I/O can indicate storage performance issues or a sign of high load on the system. - -**Network Usage**: High network traffic or packet loss can signal network configuration issues, or that a service is being overwhelmed by requests. - -**Disk Space**: Running out of disk space can lead to application errors and data loss. - -**Uptime**: The amount of time a system has been up without any restarts. Frequent restarts can indicate instability in the system. - -**Error Rates**: The number of errors encountered by your application. This could be 4xx/5xx HTTP errors, exceptions, or any other kind of error your application may log. - -**Latency**: The delay before a transfer of data begins following an instruction for its transfer. - -It is also important to check: - -- NTP clock skew -- Process restarts and failures (eg. through `node_systemd`) -- alert on high error and panic log counts. \ No newline at end of file +``` \ No newline at end of file From a73c5a8e79f9a49e5e4cfd38e7cdaca688b69aaa Mon Sep 17 00:00:00 2001 From: haroldsphinx Date: Tue, 25 Jul 2023 10:13:24 +0100 Subject: [PATCH 03/12] fix sidebars Signed-off-by: haroldsphinx --- docs/int/quickstart/advanced/monitoring-credentials.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/int/quickstart/advanced/monitoring-credentials.md b/docs/int/quickstart/advanced/monitoring-credentials.md index b2ae72f552..56630303c8 100644 --- a/docs/int/quickstart/advanced/monitoring-credentials.md +++ b/docs/int/quickstart/advanced/monitoring-credentials.md @@ -14,7 +14,7 @@ Ensure the following software are installed: - Prometheus: You can install it using the guide available **[here](https://prometheus.io/docs/prometheus/latest/installation/)** - Grafana: Follow this **[link](https://grafana.com/docs/grafana/latest/setup-grafana/installation/)** to install Grafana -# Push metrics to Obol Monitoring +## Push metrics to Obol Monitoring :::info This is **optional** and does not confer any special privileges within the Obol Network. 
@@ -110,7 +110,7 @@ For alerts specific to Charon/Alpha, refer to the alerting rules available on th ## Key metrics to watch to verify node health based on jobs -### Node Exporter: +- Node Exporter: **CPU Usage**: High or spiking CPU usage can be a sign of a process demanding more resources than it should. From 05a7263e6c4892abc3c77276eff48a5158793d2d Mon Sep 17 00:00:00 2001 From: Maeliosa Date: Thu, 10 Aug 2023 14:26:59 +0100 Subject: [PATCH 04/12] punctuation updated --- .../quickstart/advanced/monitoring-credentials.md | 6 +++--- docs/int/quickstart/monitoring.md | 12 ++++++------ 2 files changed, 9 insertions(+), 9 deletions(-) diff --git a/docs/int/quickstart/advanced/monitoring-credentials.md b/docs/int/quickstart/advanced/monitoring-credentials.md index 56630303c8..9e65fcdf6b 100644 --- a/docs/int/quickstart/advanced/monitoring-credentials.md +++ b/docs/int/quickstart/advanced/monitoring-credentials.md @@ -14,7 +14,7 @@ Ensure the following software are installed: - Prometheus: You can install it using the guide available **[here](https://prometheus.io/docs/prometheus/latest/installation/)** - Grafana: Follow this **[link](https://grafana.com/docs/grafana/latest/setup-grafana/installation/)** to install Grafana -## Push metrics to Obol Monitoring +## Push Metrics to Obol Monitoring :::info This is **optional** and does not confer any special privileges within the Obol Network. @@ -59,7 +59,7 @@ scrape_configs: - Finalize the import by clicking on the "Import" button. At this point, your dashboard should begin displaying metrics. Ensure your Charon client and Prometheus are operational for this to occur. -## Example alerting rules +## Example Alerting Rules To create alerts for Node-Exporter, follow these steps based on the sample rules provided on the "Awesome Prometheus alerts" page: @@ -76,7 +76,7 @@ To create alerts for Node-Exporter, follow these steps based on the sample rules For alerts specific to Charon/Alpha, refer to the alerting rules available on this [ObolNetwork/monitoring](https://github.com/ObolNetwork/monitoring/tree/main/alerting-rules). -## Understanding Alert rules +## Understanding Alert Rules 1. `ClusterBeaconNodeDown`This alert is activated when the beacon node in a specified Alpha cluster is offline. The beacon node is crucial for validating transactions and producing new blocks. Its unavailability could disrupt the overall functionality of the cluster. 2. `ClusterBeaconNodeSyncing`This alert indicates that the beacon node in a specified Alpha cluster is synchronizing, i.e., catching up with the latest blocks in the cluster. diff --git a/docs/int/quickstart/monitoring.md b/docs/int/quickstart/monitoring.md index bd7152f3a1..e8bd5fe473 100644 --- a/docs/int/quickstart/monitoring.md +++ b/docs/int/quickstart/monitoring.md @@ -37,17 +37,17 @@ Ensure the following software are installed: ## Understanding Alert rules -1. `AlphaClusterBeaconNodeDown`This alert is activated when the beacon node in a specified Alpha cluster is offline. The beacon node is crucial for validating transactions and producing new blocks. Its unavailability could disrupt the overall functionality of the cluster. -2. `AlphaClusterBeaconNodeSyncing`This alert indicates that the beacon node in a specified Alpha cluster is synchronizing, i.e., catching up with the latest blocks in the cluster. -3. `AlphaClusterNodeDown`This alert is activated when a node in a specified Alpha cluster is offline. -4. 
`AlphaClusterMissedAttestations`:This alert indicates that there have been missed attestations in a specified Alpha cluster. Missed attestations may suggest that validators are not operating correctly, compromising the security and efficiency of the cluster. +1. `AlphaClusterBeaconNodeDown`: This alert is activated when the beacon node in a specified Alpha cluster is offline. The beacon node is crucial for validating transactions and producing new blocks. Its unavailability could disrupt the overall functionality of the cluster. +2. `AlphaClusterBeaconNodeSyncing`: This alert indicates that the beacon node in a specified Alpha cluster is synchronizing, i.e., catching up with the latest blocks in the cluster. +3. `AlphaClusterNodeDown`: This alert is activated when a node in a specified Alpha cluster is offline. +4. `AlphaClusterMissedAttestations`: This alert indicates that there have been missed attestations in a specified Alpha cluster. Missed attestations may suggest that validators are not operating correctly, compromising the security and efficiency of the cluster. 5. `AlphaClusterInUnknownStatus`: This alert is designed to activate when a node within the "Alpha M1 Cluster #1" is detected to be in an unknown state. The condition is evaluated by checking whether the maximum of the app_monitoring_readyz metric is 0. -6. `AlphaClusterInsufficientPeers`:This alert is set to activate when the number of peers for a node in the Alpha M1 Cluster #1 is insufficient. The condition is evaluated by checking whether the maximum of the **`app_monitoring_readyz`** equals 4. +6. `AlphaClusterInsufficientPeers`: This alert is set to activate when the number of peers for a node in the Alpha M1 Cluster #1 is insufficient. The condition is evaluated by checking whether the maximum of the **`app_monitoring_readyz`** equals 4. 7. `AlphaClusterFailureRate`: This alert is activated when the failure rate of the Alpha M1 Cluster #1 exceeds a certain threshold. 8. `AlphaClusterVCMissingValidators`: This alert is activated if any validators in the Alpha M1 Cluster #1 are missing. 9. `AlphaClusterHighPctFailedSyncMsgDuty`: This alert is activated if a high percentage of sync message duties failed in the "Alpha M1 Cluster #1". The alert is activated if the sum of the increase in failed duties tagged with "sync_message" in the last hour divided by the sum of the increase in total duties tagged with "sync_message" in the last hour is greater than 0.1. 10. `AlphaClusterNumConnectedRelays`: This alert is activated if the number of connected relays in the "Alpha M1 Cluster #1" falls to 0. -11. PeerPingLatency: 1. This alert is activated if the 90th percentile of the ping latency to the peers in a cluster exceeds 500ms within 2 minutes. +11. `PeerPingLatency: 1`: This alert is activated if the 90th percentile of the ping latency to the peers in a cluster exceeds 500ms within 2 minutes. 
## ****Best Practices for Monitoring Charon Nodes & Cluster**** From 738d05deccfc3288bfd826dbd1feb126d14a4a78 Mon Sep 17 00:00:00 2001 From: Maeliosa Date: Wed, 16 Aug 2023 17:28:44 +0100 Subject: [PATCH 05/12] updated sidebar, new page for push metrics added --- docs/int/quickstart/advanced/push-metrics | 40 +++++++++++++++++++++++ 1 file changed, 40 insertions(+) create mode 100644 docs/int/quickstart/advanced/push-metrics diff --git a/docs/int/quickstart/advanced/push-metrics b/docs/int/quickstart/advanced/push-metrics new file mode 100644 index 0000000000..8d9e0ceca1 --- /dev/null +++ b/docs/int/quickstart/advanced/push-metrics @@ -0,0 +1,40 @@ +--- +sidebar_position: 5 +description: Add monitoring credentials to help the Obol Team monitor the health of your cluster +--- + +# Push Metrics to Obol Monitoring + +:::info +This is **optional** and does not confer any special privileges within the Obol Network. +::: + +You may have been provided with **Monitoring Credentials** used to push distributed validator metrics to Obol's central prometheus cluster to monitor, analyze, and improve your Distributed Validator Cluster's performance. + +The provided credentials needs to be added in `prometheus/prometheus.yml` replacing `$PROM_REMOTE_WRITE_TOKEN` and will look like: +``` +obol20!tnt8U!C... +``` + +The updated `prometheus/prometheus.yml` file should look like: +``` +global: + scrape_interval: 30s # Set the scrape interval to every 30 seconds. + evaluation_interval: 30s # Evaluate rules every 30 seconds. + +remote_write: + - url: https://vm.monitoring.gcp.obol.tech/write + authorization: + credentials: obol20!tnt8U!C... + +scrape_configs: + - job_name: 'charon' + static_configs: + - targets: ['charon:3620'] + - job_name: "lodestar" + static_configs: + - targets: [ "lodestar:5064" ] + - job_name: 'node-exporter' + static_configs: + - targets: ['node-exporter:9100'] +``` \ No newline at end of file From 0f382560d3a6e464e6dbc2b8d509e9f90ad4bfdb Mon Sep 17 00:00:00 2001 From: Maeliosa Date: Wed, 16 Aug 2023 17:29:06 +0100 Subject: [PATCH 06/12] updated sidebar and new page for push metrics added --- .../quickstart/advanced/adv-docker-configs.md | 2 +- .../{advanced => }/monitoring-credentials.md | 1 + docs/int/quickstart/monitoring.md | 94 ------------------- .../quickstart/advanced/adv-docker-configs.md | 2 +- .../int/quickstart/advanced/prysm-vc.md | 2 +- .../int/quickstart/advanced/self-relay.md | 2 +- 6 files changed, 5 insertions(+), 98 deletions(-) rename docs/int/quickstart/{advanced => }/monitoring-credentials.md (99%) delete mode 100644 docs/int/quickstart/monitoring.md diff --git a/docs/int/quickstart/advanced/adv-docker-configs.md b/docs/int/quickstart/advanced/adv-docker-configs.md index 8a85d9f122..d14de53e8b 100644 --- a/docs/int/quickstart/advanced/adv-docker-configs.md +++ b/docs/int/quickstart/advanced/adv-docker-configs.md @@ -1,5 +1,5 @@ --- -sidebar_position: 5 +sidebar_position: 8 description: Use advanced docker-compose features to have more flexibility and power to change the default configuration. 
--- diff --git a/docs/int/quickstart/advanced/monitoring-credentials.md b/docs/int/quickstart/monitoring-credentials.md similarity index 99% rename from docs/int/quickstart/advanced/monitoring-credentials.md rename to docs/int/quickstart/monitoring-credentials.md index 9e65fcdf6b..9ce2929f9e 100644 --- a/docs/int/quickstart/advanced/monitoring-credentials.md +++ b/docs/int/quickstart/monitoring-credentials.md @@ -53,6 +53,7 @@ scrape_configs: ## Import Pre-Configured Charon Dashboards - Navigate to the **[repository](https://github.com/ObolNetwork/monitoring/tree/main/dashboards)** that contains a variety of Grafana dashboards. For this demonstration, we will utilize the Charon Dashboard json. + - In your Grafana interface, create a new dashboard and select the import option. - Copy the content of the Charon Dashboard json from the repository and paste it into the import box in Grafana. Click "Load" to proceed. diff --git a/docs/int/quickstart/monitoring.md b/docs/int/quickstart/monitoring.md deleted file mode 100644 index e8bd5fe473..0000000000 --- a/docs/int/quickstart/monitoring.md +++ /dev/null @@ -1,94 +0,0 @@ -# Getting Started Monitoring your Node - -Welcome to this comprehensive guide, designed to assist you in effectively monitoring your Charon cluster and nodes, and setting up alerts based on specified parameters. - -## Pre-requisites - -Ensure the following software are installed: - -- Docker: Find the installation guide for Ubuntu **[here](https://docs.docker.com/engine/install/ubuntu/)** -- Prometheus: You can install it using the guide available **[here](https://prometheus.io/docs/prometheus/latest/installation/)** -- Grafana: Follow this **[link](https://grafana.com/docs/grafana/latest/setup-grafana/installation/)** to install Grafana - -## Import Pre-Configured Charon Dashboards - -- Navigate to the **[repository](https://github.com/ObolNetwork/terraform-modules/tree/main/grafana-dashboards/dashboards)** that contains a variety of Grafana dashboards. For this demonstration, we will utilize the Charon Dashboard json. -- In your Grafana interface, create a new dashboard and select the import option. - -![Screenshot 2023-06-26 at 1.00.05 PM.png](https://s3-us-west-2.amazonaws.com/secure.notion-static.com/2bba3f52-ff32-452e-811b-f2ac7a4905fb/Screenshot_2023-06-26_at_1.00.05_PM.png) - -- Copy the content of the Charon Dashboard json from the repository and paste it into the import box in Grafana. Click "Load" to proceed. - -![Screenshot 2023-06-26 at 1.03.08 PM.png](https://s3-us-west-2.amazonaws.com/secure.notion-static.com/6790e67a-eb51-4bfb-b7b1-df14f214b72d/Screenshot_2023-06-26_at_1.03.08_PM.png) - -- Finalize the import by clicking on the "Import" button. At this point, your dashboard should begin displaying metrics. Ensure your Charon client and Prometheus are operational for this to occur. - -![Screenshot 2023-06-26 at 1.16.27 PM.png](https://s3-us-west-2.amazonaws.com/secure.notion-static.com/cc0b4a9e-c21c-4ce4-b613-9c3f84e696ed/Screenshot_2023-06-26_at_1.16.27_PM.png) - -## Example alerting rules - -- Alerts for Node-Exporter can be created using the sample rules provided here - -[Awesome Prometheus alerts](https://samber.github.io/awesome-prometheus-alerts/rules.html#host-and-hardware) - -- For Charon/Alpha alerts, refer to the alerting rules available - -[monitoring/alerting-rules at main · ObolNetwork/monitoring](https://github.com/ObolNetwork/monitoring/tree/main/alerting-rules) - -## Understanding Alert rules - -1. 
`AlphaClusterBeaconNodeDown`: This alert is activated when the beacon node in a specified Alpha cluster is offline. The beacon node is crucial for validating transactions and producing new blocks. Its unavailability could disrupt the overall functionality of the cluster. -2. `AlphaClusterBeaconNodeSyncing`: This alert indicates that the beacon node in a specified Alpha cluster is synchronizing, i.e., catching up with the latest blocks in the cluster. -3. `AlphaClusterNodeDown`: This alert is activated when a node in a specified Alpha cluster is offline. -4. `AlphaClusterMissedAttestations`: This alert indicates that there have been missed attestations in a specified Alpha cluster. Missed attestations may suggest that validators are not operating correctly, compromising the security and efficiency of the cluster. -5. `AlphaClusterInUnknownStatus`: This alert is designed to activate when a node within the "Alpha M1 Cluster #1" is detected to be in an unknown state. The condition is evaluated by checking whether the maximum of the app_monitoring_readyz metric is 0. -6. `AlphaClusterInsufficientPeers`: This alert is set to activate when the number of peers for a node in the Alpha M1 Cluster #1 is insufficient. The condition is evaluated by checking whether the maximum of the **`app_monitoring_readyz`** equals 4. -7. `AlphaClusterFailureRate`: This alert is activated when the failure rate of the Alpha M1 Cluster #1 exceeds a certain threshold. -8. `AlphaClusterVCMissingValidators`: This alert is activated if any validators in the Alpha M1 Cluster #1 are missing. -9. `AlphaClusterHighPctFailedSyncMsgDuty`: This alert is activated if a high percentage of sync message duties failed in the "Alpha M1 Cluster #1". The alert is activated if the sum of the increase in failed duties tagged with "sync_message" in the last hour divided by the sum of the increase in total duties tagged with "sync_message" in the last hour is greater than 0.1. -10. `AlphaClusterNumConnectedRelays`: This alert is activated if the number of connected relays in the "Alpha M1 Cluster #1" falls to 0. -11. `PeerPingLatency: 1`: This alert is activated if the 90th percentile of the ping latency to the peers in a cluster exceeds 500ms within 2 minutes. - -## ****Best Practices for Monitoring Charon Nodes & Cluster**** - -- **Establish Baselines**: Familiarize yourself with the normal operation metrics like CPU, memory, and network usage. This will help you detect anomalies. -- **Define Key Metrics**: Set up alerts for essential metrics, encompassing both system-level and Charon-specific ones. -- **Configure Alerts**: Based on these metrics, set up actionable alerts. -- **Monitor Network**: Regularly assess the connectivity between nodes and the network. -- **Perform Regular Health Checks**: Consistently evaluate the status of your nodes and clusters. -- **Monitor System Logs**: Keep an eye on logs for error messages or unusual activities. -- **Assess Resource Usage**: Ensure your nodes are neither over- nor under-utilized. -- **Automate Monitoring**: Use automation to ensure no issues go undetected. -- **Conduct Drills**: Regularly simulate failure scenarios to fine-tune your setup. -- **Update Regularly**: Keep your nodes and clusters updated with the latest software versions. 
- -## ****Third-Party Services for Uptime Testing**** - -- [updown.io](https://updown.io/) -- [Grafana synthetic Monitoring](https://grafana.com/blog/2022/03/10/best-practices-for-alerting-on-synthetic-monitoring-metrics-in-grafana-cloud/?src=ggl-s&mdm=cpc&camp=nb-synthetic-monitoring-pm&cnt=130224525351&trm=grafana%20synthetic%20monitoring&device=c&gclid=CjwKCAjwzJmlBhBBEiwAEJyLu4A0quHdic_UAyYuJgqUntwGTq6DKIFq0rfPkp9fxt4lK8VMgYmo4BoCO3EQAvD_BwE) - -## **Key metrics to watch to verify node health based on jobs** - -**node_exporter:** - -**CPU Usage**: High or spiking CPU usage can be a sign of a process demanding more resources than it should. - -**Memory Usage**: If a node is consistently running out of memory, it could be due to a memory leak or simply under-provisioning. - -**Disk I/O**: Slow disk operations can cause applications to hang or delay responses. High disk I/O can indicate storage performance issues or a sign of high load on the system. - -**Network Usage**: High network traffic or packet loss can signal network configuration issues, or that a service is being overwhelmed by requests. - -**Disk Space**: Running out of disk space can lead to application errors and data loss. - -**Uptime**: The amount of time a system has been up without any restarts. Frequent restarts can indicate instability in the system. - -**Error Rates**: The number of errors encountered by your application. This could be 4xx/5xx HTTP errors, exceptions, or any other kind of error your application may log. - -**Latency**: The delay before a transfer of data begins following an instruction for its transfer. - -It is also important to check: - -- NTP clock skew -- Process restarts and failures (eg. through `node_systemd`) -- alert on high error and panic log counts. \ No newline at end of file diff --git a/versioned_docs/version-v0.16.0/int/quickstart/advanced/adv-docker-configs.md b/versioned_docs/version-v0.16.0/int/quickstart/advanced/adv-docker-configs.md index c58bae2359..7c99383c22 100644 --- a/versioned_docs/version-v0.16.0/int/quickstart/advanced/adv-docker-configs.md +++ b/versioned_docs/version-v0.16.0/int/quickstart/advanced/adv-docker-configs.md @@ -1,5 +1,5 @@ --- -sidebar_position: 5 +sidebar_position: 6 description: Use advanced docker-compose features to have more flexibility and power to change the default configuration. 
--- diff --git a/versioned_docs/version-v0.16.0/int/quickstart/advanced/prysm-vc.md b/versioned_docs/version-v0.16.0/int/quickstart/advanced/prysm-vc.md index 50e9b349fe..c79b57b374 100644 --- a/versioned_docs/version-v0.16.0/int/quickstart/advanced/prysm-vc.md +++ b/versioned_docs/version-v0.16.0/int/quickstart/advanced/prysm-vc.md @@ -1,5 +1,5 @@ --- -sidebar_position: 6 +sidebar_position: 7 description: Run Prysm VCs in a DV --- diff --git a/versioned_docs/version-v0.16.0/int/quickstart/advanced/self-relay.md b/versioned_docs/version-v0.16.0/int/quickstart/advanced/self-relay.md index ae157214b7..dfe6042d23 100644 --- a/versioned_docs/version-v0.16.0/int/quickstart/advanced/self-relay.md +++ b/versioned_docs/version-v0.16.0/int/quickstart/advanced/self-relay.md @@ -1,5 +1,5 @@ --- -sidebar_position: 7 +sidebar_position: 8 description: Self-host a relay --- From 58e5de8d57c92f31bc96367bf7003398febe05d4 Mon Sep 17 00:00:00 2001 From: Maeliosa Date: Thu, 17 Aug 2023 09:53:25 +0100 Subject: [PATCH 07/12] push-metrics updated to push-metrics.md --- docs/int/quickstart/advanced/{push-metrics => push-metrics.md} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename docs/int/quickstart/advanced/{push-metrics => push-metrics.md} (100%) diff --git a/docs/int/quickstart/advanced/push-metrics b/docs/int/quickstart/advanced/push-metrics.md similarity index 100% rename from docs/int/quickstart/advanced/push-metrics rename to docs/int/quickstart/advanced/push-metrics.md From 1b6f8e52b3fa004de708244af3a5855f66bcaae0 Mon Sep 17 00:00:00 2001 From: Maeliosa Date: Thu, 17 Aug 2023 09:55:12 +0100 Subject: [PATCH 08/12] updated location of monitoring credentials --- docs/int/quickstart/{ => advanced}/monitoring-credentials.md | 0 docs/int/quickstart/advanced/quickstart-combine.md | 2 +- 2 files changed, 1 insertion(+), 1 deletion(-) rename docs/int/quickstart/{ => advanced}/monitoring-credentials.md (100%) diff --git a/docs/int/quickstart/monitoring-credentials.md b/docs/int/quickstart/advanced/monitoring-credentials.md similarity index 100% rename from docs/int/quickstart/monitoring-credentials.md rename to docs/int/quickstart/advanced/monitoring-credentials.md diff --git a/docs/int/quickstart/advanced/quickstart-combine.md b/docs/int/quickstart/advanced/quickstart-combine.md index ce6058e593..8d1025c641 100644 --- a/docs/int/quickstart/advanced/quickstart-combine.md +++ b/docs/int/quickstart/advanced/quickstart-combine.md @@ -1,5 +1,5 @@ --- -sidebar_position: 8 +sidebar_position: 9 description: Combine distributed validator private key shares to recover the validator private key. 
--- From 608a518b6d4c975ea31298ec2e7cbcae4d5c0249 Mon Sep 17 00:00:00 2001 From: Maeliosa Date: Thu, 17 Aug 2023 10:13:03 +0100 Subject: [PATCH 09/12] push metrics section removed from monitoring page --- .../advanced/monitoring-credentials.md | 36 ------------------- 1 file changed, 36 deletions(-) diff --git a/docs/int/quickstart/advanced/monitoring-credentials.md b/docs/int/quickstart/advanced/monitoring-credentials.md index 9ce2929f9e..fdbec169b9 100644 --- a/docs/int/quickstart/advanced/monitoring-credentials.md +++ b/docs/int/quickstart/advanced/monitoring-credentials.md @@ -14,42 +14,6 @@ Ensure the following software are installed: - Prometheus: You can install it using the guide available **[here](https://prometheus.io/docs/prometheus/latest/installation/)** - Grafana: Follow this **[link](https://grafana.com/docs/grafana/latest/setup-grafana/installation/)** to install Grafana -## Push Metrics to Obol Monitoring - -:::info -This is **optional** and does not confer any special privileges within the Obol Network. -::: - -You may have been provided with **Monitoring Credentials** used to push distributed validator metrics to Obol's central prometheus cluster to monitor, analyze, and improve your Distributed Validator Cluster's performance. - -The provided credentials needs to be added in `prometheus/prometheus.yml` replacing `$PROM_REMOTE_WRITE_TOKEN` and will look like: -``` -obol20!tnt8U!C... -``` - -The updated `prometheus/prometheus.yml` file should look like: -``` -global: - scrape_interval: 30s # Set the scrape interval to every 30 seconds. - evaluation_interval: 30s # Evaluate rules every 30 seconds. - -remote_write: - - url: https://vm.monitoring.gcp.obol.tech/write - authorization: - credentials: obol20!tnt8U!C... - -scrape_configs: - - job_name: 'charon' - static_configs: - - targets: ['charon:3620'] - - job_name: "lodestar" - static_configs: - - targets: [ "lodestar:5064" ] - - job_name: 'node-exporter' - static_configs: - - targets: ['node-exporter:9100'] -``` - ## Import Pre-Configured Charon Dashboards - Navigate to the **[repository](https://github.com/ObolNetwork/monitoring/tree/main/dashboards)** that contains a variety of Grafana dashboards. For this demonstration, we will utilize the Charon Dashboard json. 
From a3085d28c2823b7f78a49deb23a455f1e7703c52 Mon Sep 17 00:00:00 2001 From: Maeliosa Date: Thu, 31 Aug 2023 15:07:02 +0100 Subject: [PATCH 10/12] Push metric page URL slug changed to 'monitoring' --- docs/int/quickstart/advanced/{push-metrics.md => monitoring.md} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename docs/int/quickstart/advanced/{push-metrics.md => monitoring.md} (100%) diff --git a/docs/int/quickstart/advanced/push-metrics.md b/docs/int/quickstart/advanced/monitoring.md similarity index 100% rename from docs/int/quickstart/advanced/push-metrics.md rename to docs/int/quickstart/advanced/monitoring.md From 0d4905f53c3271608ac04799858ad17693b8ebfe Mon Sep 17 00:00:00 2001 From: Maeliosa Date: Wed, 6 Sep 2023 11:45:43 +0100 Subject: [PATCH 11/12] slugs changed for monitoring docs --- .../advanced/monitoring-credentials.md | 100 ------------- docs/int/quickstart/advanced/monitoring.md | 132 +++++++++++++----- .../quickstart/advanced/obol-monitoring.md | 40 ++++++ 3 files changed, 136 insertions(+), 136 deletions(-) delete mode 100644 docs/int/quickstart/advanced/monitoring-credentials.md create mode 100644 docs/int/quickstart/advanced/obol-monitoring.md diff --git a/docs/int/quickstart/advanced/monitoring-credentials.md b/docs/int/quickstart/advanced/monitoring-credentials.md deleted file mode 100644 index fdbec169b9..0000000000 --- a/docs/int/quickstart/advanced/monitoring-credentials.md +++ /dev/null @@ -1,100 +0,0 @@ ---- -sidebar_position: 4 -description: Add monitoring credentials to help the Obol Team monitor the health of your cluster ---- -# Getting Started Monitoring your Node - -Welcome to this comprehensive guide, designed to assist you in effectively monitoring your Charon cluster and nodes, and setting up alerts based on specified parameters. - -## Pre-requisites - -Ensure the following software are installed: - -- Docker: Find the installation guide for Ubuntu **[here](https://docs.docker.com/engine/install/ubuntu/)** -- Prometheus: You can install it using the guide available **[here](https://prometheus.io/docs/prometheus/latest/installation/)** -- Grafana: Follow this **[link](https://grafana.com/docs/grafana/latest/setup-grafana/installation/)** to install Grafana - -## Import Pre-Configured Charon Dashboards - -- Navigate to the **[repository](https://github.com/ObolNetwork/monitoring/tree/main/dashboards)** that contains a variety of Grafana dashboards. For this demonstration, we will utilize the Charon Dashboard json. - -- In your Grafana interface, create a new dashboard and select the import option. - -- Copy the content of the Charon Dashboard json from the repository and paste it into the import box in Grafana. Click "Load" to proceed. - -- Finalize the import by clicking on the "Import" button. At this point, your dashboard should begin displaying metrics. Ensure your Charon client and Prometheus are operational for this to occur. - -## Example Alerting Rules - -To create alerts for Node-Exporter, follow these steps based on the sample rules provided on the "Awesome Prometheus alerts" page: - -1. Visit the **[Awesome Prometheus alerts](https://samber.github.io/awesome-prometheus-alerts/rules.html#host-and-hardware)** page. Here, you will find lists of Prometheus alerting rules categorized by hardware, system, and services. - -2. Depending on your need, select the category of alerts. For example, if you want to set up alerts for your system's CPU usage, click on the 'CPU' under the 'Host & Hardware' category. - -3. 
On the selected page, you'll find specific alert rules like 'High CPU Usage'. Each rule will provide the PromQL expression, alert name, and a brief description of what the alert does. You can copy these rules. - -4. Paste the copied rules into your Prometheus configuration file under the `rules` section. Make sure you understand each rule before adding it to avoid unnecessary alerts. - -5. Finally, save and apply the configuration file. Prometheus should now trigger alerts based on these rules. - - -For alerts specific to Charon/Alpha, refer to the alerting rules available on this [ObolNetwork/monitoring](https://github.com/ObolNetwork/monitoring/tree/main/alerting-rules). - -## Understanding Alert Rules - -1. `ClusterBeaconNodeDown`This alert is activated when the beacon node in a specified Alpha cluster is offline. The beacon node is crucial for validating transactions and producing new blocks. Its unavailability could disrupt the overall functionality of the cluster. -2. `ClusterBeaconNodeSyncing`This alert indicates that the beacon node in a specified Alpha cluster is synchronizing, i.e., catching up with the latest blocks in the cluster. -3. `ClusterNodeDown`This alert is activated when a node in a specified Alpha cluster is offline. -4. `ClusterMissedAttestations`:This alert indicates that there have been missed attestations in a specified Alpha cluster. Missed attestations may suggest that validators are not operating correctly, compromising the security and efficiency of the cluster. -5. `ClusterInUnknownStatus`: This alert is designed to activate when a node within the cluster is detected to be in an unknown state. The condition is evaluated by checking whether the maximum of the app_monitoring_readyz metric is 0. -6. `ClusterInsufficientPeers`:This alert is set to activate when the number of peers for a node in the Alpha M1 Cluster #1 is insufficient. The condition is evaluated by checking whether the maximum of the **`app_monitoring_readyz`** equals 4. -7. `ClusterFailureRate`: This alert is activated when the failure rate of the Alpha M1 Cluster #1 exceeds a certain threshold. -8. `ClusterVCMissingValidators`: This alert is activated if any validators in the Alpha M1 Cluster #1 are missing. -9. `ClusterHighPctFailedSyncMsgDuty`: This alert is activated if a high percentage of sync message duties failed in the cluster. The alert is activated if the sum of the increase in failed duties tagged with "sync_message" in the last hour divided by the sum of the increase in total duties tagged with "sync_message" in the last hour is greater than 0.1. -10. `ClusterNumConnectedRelays`: This alert is activated if the number of connected relays in the cluster falls to 0. -11. PeerPingLatency: 1. This alert is activated if the 90th percentile of the ping latency to the peers in a cluster exceeds 500ms within 2 minutes. - -## Best Practices for Monitoring Charon Nodes & Cluster - -- **Establish Baselines**: Familiarize yourself with the normal operation metrics like CPU, memory, and network usage. This will help you detect anomalies. -- **Define Key Metrics**: Set up alerts for essential metrics, encompassing both system-level and Charon-specific ones. -- **Configure Alerts**: Based on these metrics, set up actionable alerts. -- **Monitor Network**: Regularly assess the connectivity between nodes and the network. -- **Perform Regular Health Checks**: Consistently evaluate the status of your nodes and clusters. 
-- **Monitor System Logs**: Keep an eye on logs for error messages or unusual activities. -- **Assess Resource Usage**: Ensure your nodes are neither over- nor under-utilized. -- **Automate Monitoring**: Use automation to ensure no issues go undetected. -- **Conduct Drills**: Regularly simulate failure scenarios to fine-tune your setup. -- **Update Regularly**: Keep your nodes and clusters updated with the latest software versions. - -## Third-Party Services for Uptime Testing - -- [updown.io](https://updown.io/) -- [Grafana synthetic Monitoring](https://grafana.com/grafana/plugins/grafana-synthetic-monitoring-app/) - -## Key metrics to watch to verify node health based on jobs - -- Node Exporter: - -**CPU Usage**: High or spiking CPU usage can be a sign of a process demanding more resources than it should. - -**Memory Usage**: If a node is consistently running out of memory, it could be due to a memory leak or simply under-provisioning. - -**Disk I/O**: Slow disk operations can cause applications to hang or delay responses. High disk I/O can indicate storage performance issues or a sign of high load on the system. - -**Network Usage**: High network traffic or packet loss can signal network configuration issues, or that a service is being overwhelmed by requests. - -**Disk Space**: Running out of disk space can lead to application errors and data loss. - -**Uptime**: The amount of time a system has been up without any restarts. Frequent restarts can indicate instability in the system. - -**Error Rates**: The number of errors encountered by your application. This could be 4xx/5xx HTTP errors, exceptions, or any other kind of error your application may log. - -**Latency**: The delay before a transfer of data begins following an instruction for its transfer. - -It is also important to check: - -- NTP clock skew -- Process restarts and failures (eg. through `node_systemd`) -- alert on high error and panic log counts. \ No newline at end of file diff --git a/docs/int/quickstart/advanced/monitoring.md b/docs/int/quickstart/advanced/monitoring.md index 8d9e0ceca1..fdbec169b9 100644 --- a/docs/int/quickstart/advanced/monitoring.md +++ b/docs/int/quickstart/advanced/monitoring.md @@ -1,40 +1,100 @@ --- -sidebar_position: 5 +sidebar_position: 4 description: Add monitoring credentials to help the Obol Team monitor the health of your cluster --- +# Getting Started Monitoring your Node -# Push Metrics to Obol Monitoring - -:::info -This is **optional** and does not confer any special privileges within the Obol Network. -::: - -You may have been provided with **Monitoring Credentials** used to push distributed validator metrics to Obol's central prometheus cluster to monitor, analyze, and improve your Distributed Validator Cluster's performance. - -The provided credentials needs to be added in `prometheus/prometheus.yml` replacing `$PROM_REMOTE_WRITE_TOKEN` and will look like: -``` -obol20!tnt8U!C... -``` - -The updated `prometheus/prometheus.yml` file should look like: -``` -global: - scrape_interval: 30s # Set the scrape interval to every 30 seconds. - evaluation_interval: 30s # Evaluate rules every 30 seconds. - -remote_write: - - url: https://vm.monitoring.gcp.obol.tech/write - authorization: - credentials: obol20!tnt8U!C... 
-scrape_configs:
-  - job_name: 'charon'
-    static_configs:
-      - targets: ['charon:3620']
-  - job_name: "lodestar"
-    static_configs:
-      - targets: [ "lodestar:5064" ]
-  - job_name: 'node-exporter'
-    static_configs:
-      - targets: ['node-exporter:9100']
-```
\ No newline at end of file
+# Getting Started Monitoring your Node
+
+Welcome to this comprehensive guide, designed to assist you in effectively monitoring your Charon cluster and nodes, and setting up alerts based on specified parameters.
+
+## Pre-requisites
+
+Ensure the following software is installed:
+
+- Docker: Find the installation guide for Ubuntu **[here](https://docs.docker.com/engine/install/ubuntu/)**
+- Prometheus: You can install it using the guide available **[here](https://prometheus.io/docs/prometheus/latest/installation/)**
+- Grafana: Follow this **[link](https://grafana.com/docs/grafana/latest/setup-grafana/installation/)** to install Grafana
+
+## Import Pre-Configured Charon Dashboards
+
+- Navigate to the **[repository](https://github.com/ObolNetwork/monitoring/tree/main/dashboards)** that contains a variety of Grafana dashboards. For this demonstration, we will utilize the Charon Dashboard json.
+
+- In your Grafana interface, create a new dashboard and select the import option.
+
+- Copy the content of the Charon Dashboard json from the repository and paste it into the import box in Grafana. Click "Load" to proceed.
+
+- Finalize the import by clicking on the "Import" button. At this point, your dashboard should begin displaying metrics. Ensure your Charon client and Prometheus are operational for this to occur.
+
+## Example Alerting Rules
+
+To create alerts for Node-Exporter, follow these steps based on the sample rules provided on the "Awesome Prometheus alerts" page:
+
+1. Visit the **[Awesome Prometheus alerts](https://samber.github.io/awesome-prometheus-alerts/rules.html#host-and-hardware)** page. Here, you will find lists of Prometheus alerting rules categorized by hardware, system, and services.
+
+2. Depending on your need, select the category of alerts. For example, if you want to set up alerts for your system's CPU usage, click on the 'CPU' under the 'Host & Hardware' category.
+
+3. On the selected page, you'll find specific alert rules like 'High CPU Usage'. Each rule will provide the PromQL expression, alert name, and a brief description of what the alert does. You can copy these rules.
+
+4. Paste the copied rules into a rules file referenced by the `rule_files` section of your Prometheus configuration. Make sure you understand each rule before adding it to avoid unnecessary alerts.
+
+5. Finally, save and apply the configuration file. Prometheus should now trigger alerts based on these rules.
+
+
+For alerts specific to Charon/Alpha, refer to the alerting rules available in the [ObolNetwork/monitoring](https://github.com/ObolNetwork/monitoring/tree/main/alerting-rules) repository.
+
+## Understanding Alert Rules
+
+1. `ClusterBeaconNodeDown`: This alert is activated when the beacon node in a specified Alpha cluster is offline. The beacon node is crucial for validating transactions and producing new blocks. Its unavailability could disrupt the overall functionality of the cluster.
+2. `ClusterBeaconNodeSyncing`: This alert indicates that the beacon node in a specified Alpha cluster is synchronizing, i.e., catching up with the latest blocks in the cluster.
+3. `ClusterNodeDown`: This alert is activated when a node in a specified Alpha cluster is offline.
+4. `ClusterMissedAttestations`: This alert indicates that there have been missed attestations in a specified Alpha cluster. Missed attestations may suggest that validators are not operating correctly, compromising the security and efficiency of the cluster.
+5. `ClusterInUnknownStatus`: This alert is designed to activate when a node within the cluster is detected to be in an unknown state. The condition is evaluated by checking whether the maximum of the `app_monitoring_readyz` metric is 0.
+6. `ClusterInsufficientPeers`: This alert is set to activate when the number of peers for a node in the cluster is insufficient. The condition is evaluated by checking whether the maximum of the **`app_monitoring_readyz`** metric equals 4.
+7. `ClusterFailureRate`: This alert is activated when the failure rate of the cluster exceeds a certain threshold.
+8. `ClusterVCMissingValidators`: This alert is activated if any validators in the cluster are missing.
+9. `ClusterHighPctFailedSyncMsgDuty`: This alert is activated if a high percentage of sync message duties failed in the cluster. It fires when the increase in failed duties tagged "sync_message" over the last hour, divided by the increase in total duties tagged "sync_message" over the same hour, is greater than 0.1.
+10. `ClusterNumConnectedRelays`: This alert is activated if the number of connected relays in the cluster falls to 0.
+11. `PeerPingLatency`: This alert is activated if the 90th percentile of the ping latency to the peers in a cluster exceeds 500ms within 2 minutes.
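+
+To make the duty-failure ratio in rule 9 concrete, here is a sketch of the corresponding PromQL expression as a rule fragment. The metric names `core_tracker_failed_duties_total` and `core_tracker_duty_total` are illustrative placeholders; substitute the duty counters your Charon version actually exports:
+
+```yaml
+- alert: ClusterHighPctFailedSyncMsgDuty
+  # Failed "sync_message" duties over the last hour divided by all
+  # "sync_message" duties over the same hour (illustrative metric names).
+  expr: |
+    sum(increase(core_tracker_failed_duties_total{duty="sync_message"}[1h]))
+    /
+    sum(increase(core_tracker_duty_total{duty="sync_message"}[1h])) > 0.1
+```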
+
+## Best Practices for Monitoring Charon Nodes & Cluster
+
+- **Establish Baselines**: Familiarize yourself with normal operating metrics like CPU, memory, and network usage. This will help you detect anomalies.
+- **Define Key Metrics**: Set up alerts for essential metrics, encompassing both system-level and Charon-specific ones.
+- **Configure Alerts**: Based on these metrics, set up actionable alerts.
+- **Monitor Network**: Regularly assess the connectivity between nodes and the network.
+- **Perform Regular Health Checks**: Consistently evaluate the status of your nodes and clusters.
+- **Monitor System Logs**: Keep an eye on logs for error messages or unusual activities.
+- **Assess Resource Usage**: Ensure your nodes are neither over- nor under-utilized.
+- **Automate Monitoring**: Use automation to ensure no issues go undetected.
+- **Conduct Drills**: Regularly simulate failure scenarios to fine-tune your setup.
+- **Update Regularly**: Keep your nodes and clusters updated with the latest software versions.
+
+## Third-Party Services for Uptime Testing
+
+- [updown.io](https://updown.io/)
+- [Grafana Synthetic Monitoring](https://grafana.com/grafana/plugins/grafana-synthetic-monitoring-app/)
+
+## Key metrics to watch to verify node health based on jobs
+
+### Node Exporter
+
+**CPU Usage**: High or spiking CPU usage can be a sign of a process demanding more resources than it should.
+
+**Memory Usage**: If a node is consistently running out of memory, it could be due to a memory leak or simply under-provisioning.
+
+**Disk I/O**: Slow disk operations can cause applications to hang or delay responses. High disk I/O can indicate storage performance issues or high load on the system.
+
+**Network Usage**: High network traffic or packet loss can signal network configuration issues, or that a service is being overwhelmed by requests.
+
+**Disk Space**: Running out of disk space can lead to application errors and data loss.
+
+**Uptime**: The amount of time a system has been up without any restarts.
Frequent restarts can indicate instability in the system. + +**Error Rates**: The number of errors encountered by your application. This could be 4xx/5xx HTTP errors, exceptions, or any other kind of error your application may log. + +**Latency**: The delay before a transfer of data begins following an instruction for its transfer. + +It is also important to check: + +- NTP clock skew +- Process restarts and failures (eg. through `node_systemd`) +- alert on high error and panic log counts. \ No newline at end of file diff --git a/docs/int/quickstart/advanced/obol-monitoring.md b/docs/int/quickstart/advanced/obol-monitoring.md new file mode 100644 index 0000000000..8d9e0ceca1 --- /dev/null +++ b/docs/int/quickstart/advanced/obol-monitoring.md @@ -0,0 +1,40 @@ +--- +sidebar_position: 5 +description: Add monitoring credentials to help the Obol Team monitor the health of your cluster +--- + +# Push Metrics to Obol Monitoring + +:::info +This is **optional** and does not confer any special privileges within the Obol Network. +::: + +You may have been provided with **Monitoring Credentials** used to push distributed validator metrics to Obol's central prometheus cluster to monitor, analyze, and improve your Distributed Validator Cluster's performance. + +The provided credentials needs to be added in `prometheus/prometheus.yml` replacing `$PROM_REMOTE_WRITE_TOKEN` and will look like: +``` +obol20!tnt8U!C... +``` + +The updated `prometheus/prometheus.yml` file should look like: +``` +global: + scrape_interval: 30s # Set the scrape interval to every 30 seconds. + evaluation_interval: 30s # Evaluate rules every 30 seconds. + +remote_write: + - url: https://vm.monitoring.gcp.obol.tech/write + authorization: + credentials: obol20!tnt8U!C... + +scrape_configs: + - job_name: 'charon' + static_configs: + - targets: ['charon:3620'] + - job_name: "lodestar" + static_configs: + - targets: [ "lodestar:5064" ] + - job_name: 'node-exporter' + static_configs: + - targets: ['node-exporter:9100'] +``` \ No newline at end of file From adf04f867549d167185b37f89af426bc8492c754 Mon Sep 17 00:00:00 2001 From: thomasheremans Date: Mon, 25 Sep 2023 13:21:58 +0100 Subject: [PATCH 12/12] resolve v0.17 missing --- ...onitoring-credentials.md => monitoring.md} | 0 .../quickstart/advanced/obol-monitoring.md | 0 .../int/quickstart/advanced/push-metrics.md | 40 ------------------- 3 files changed, 40 deletions(-) rename versioned_docs/version-v0.17.0/int/quickstart/advanced/{monitoring-credentials.md => monitoring.md} (100%) rename docs/int/quickstart/advanced/push-metrics.md => versioned_docs/version-v0.17.0/int/quickstart/advanced/obol-monitoring.md (100%) delete mode 100644 versioned_docs/version-v0.17.0/int/quickstart/advanced/push-metrics.md diff --git a/versioned_docs/version-v0.17.0/int/quickstart/advanced/monitoring-credentials.md b/versioned_docs/version-v0.17.0/int/quickstart/advanced/monitoring.md similarity index 100% rename from versioned_docs/version-v0.17.0/int/quickstart/advanced/monitoring-credentials.md rename to versioned_docs/version-v0.17.0/int/quickstart/advanced/monitoring.md diff --git a/docs/int/quickstart/advanced/push-metrics.md b/versioned_docs/version-v0.17.0/int/quickstart/advanced/obol-monitoring.md similarity index 100% rename from docs/int/quickstart/advanced/push-metrics.md rename to versioned_docs/version-v0.17.0/int/quickstart/advanced/obol-monitoring.md diff --git a/versioned_docs/version-v0.17.0/int/quickstart/advanced/push-metrics.md 
b/versioned_docs/version-v0.17.0/int/quickstart/advanced/push-metrics.md deleted file mode 100644 index 8d9e0ceca1..0000000000 --- a/versioned_docs/version-v0.17.0/int/quickstart/advanced/push-metrics.md +++ /dev/null @@ -1,40 +0,0 @@ ---- -sidebar_position: 5 -description: Add monitoring credentials to help the Obol Team monitor the health of your cluster ---- - -# Push Metrics to Obol Monitoring - -:::info -This is **optional** and does not confer any special privileges within the Obol Network. -::: - -You may have been provided with **Monitoring Credentials** used to push distributed validator metrics to Obol's central prometheus cluster to monitor, analyze, and improve your Distributed Validator Cluster's performance. - -The provided credentials needs to be added in `prometheus/prometheus.yml` replacing `$PROM_REMOTE_WRITE_TOKEN` and will look like: -``` -obol20!tnt8U!C... -``` - -The updated `prometheus/prometheus.yml` file should look like: -``` -global: - scrape_interval: 30s # Set the scrape interval to every 30 seconds. - evaluation_interval: 30s # Evaluate rules every 30 seconds. - -remote_write: - - url: https://vm.monitoring.gcp.obol.tech/write - authorization: - credentials: obol20!tnt8U!C... - -scrape_configs: - - job_name: 'charon' - static_configs: - - targets: ['charon:3620'] - - job_name: "lodestar" - static_configs: - - targets: [ "lodestar:5064" ] - - job_name: 'node-exporter' - static_configs: - - targets: ['node-exporter:9100'] -``` \ No newline at end of file