Releases: m-lab/prometheus-support
New alerts for collectd-mlab metrics.
We recently added a metric for whether collectd-mlab is healthy on nodes. This release simply adds two new alerts for when collectd-mlab is either down or missing.
Monitor blackbox_exporter instances (correctly)
A previous release attempted to implement monitoring of our blackbox_exporter instances, but did so incorrectly. This release fixes that implementation. Additionally, it turns off service auto-discovery for the BBE instance running in the prometheus-federation k8s cluster in favor of manually specifying a target rule in the Prom configs. Doing this makes the configurations for the IPv4 and IPv6 BBE instances more or less the same, since auto-discovery won't work for the IPv6 instance running on a Linode VM.
Alerts for experiment metrics, alerts for BBE, bugfix in 1 dashboard
The bulk of this release is new Prometheus alerts. We now have alerting for:
- blackbox_exporter metrics that mlab-ns now relies on.
- the blackbox_exporter instances themselves.
- a new instance of node_exporter that is running on eb.measurementlab.net.
Additionally this release contains:
- A new Prometheus scrape job which will scrape a node_exporter instance on EB.
- A bugfix to the Ops_PlatformOverview Grafana dashboard.
Weekly release: Retire status.* URLs for mlab-oti
This release completes the turndown of the legacy status.* URLs for the Prometheus, Grafana, and Alertmanager stack.
All projects and all services should now be accessed via their TLS names.
Weekly release: Add Ops Overview dashboard & Alertmanager with Basic Auth
This release includes:
Grafana Updates / Fixes:
- Add dashboard -- Ops: Platform Overview
- Increase the nginx ingress's proxy-connect-timeout to exceed the Prometheus query timeout. Should fix "Gateway Timeout" errors.
Alerts Changes
- Alertmanager links sent to Slack will have basic auth credentials embedded, so clicking on those links should "just work" without prompting for a username / password.
- Adds new alert case for NagiosExporterUnavailable
- Updates ParserDailyVolumeTooLow to only count rows that use status="ok".
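As a rough illustration of what "basic auth credentials embedded" in a link means, the sketch below places a username and password in the userinfo part of a URL. This is not the mechanism used in this repository; the function name `embed_basic_auth`, the host `alertmanager.example.org`, and the credentials are all hypothetical.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def embed_basic_auth(url: str, username: str, password: str) -> str:
    """Return `url` with `username:password@` placed before the host.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(url)
    # Percent-encode the credentials so characters such as "@" or ":"
    # cannot be mistaken for URL delimiters.
    userinfo = f"{quote(username, safe='')}:{quote(password, safe='')}"
    # Drop any userinfo already present, keeping only host[:port].
    host = parts.netloc.rpartition("@")[2]
    return urlunsplit(parts._replace(netloc=f"{userinfo}@{host}"))


if __name__ == "__main__":
    # Example host and credentials are made up.
    print(embed_basic_auth("https://alertmanager.example.org/#/alerts", "ops", "p@ss:word"))
    # prints https://ops:p%40ss%3Aword@alertmanager.example.org/#/alerts
```

Anyone who can read such a link can also read the credentials, so this approach only suits channels whose members are already trusted with them.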
Weekly release: TLS & Basic Auth
- Add TLS & Basic Auth to Grafana, Prometheus, and Alertmanager
- Add monitoring for data-processing-cluster and minimal alerts on etl-gardener
- Fix table names for bigquery-exporter queries
- Update nagios exporter alerts to cover both deployments of nagios
- Adds additional IPv6 targets to prometheus
IPv6 monitoring
The principal change in this release is the addition of IPv6 monitoring. Since GCP doesn't currently support IPv6 for most applications, monitoring is enabled via a remote (Linode in this case) VM running several Docker instances (one for each GCP project) of the Prometheus blackbox_exporter.
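For readers unfamiliar with how a remote blackbox_exporter is queried, here is a minimal sketch of building a probe request URL. The `/probe` endpoint and its `target` and `module` query parameters are part of blackbox_exporter; the exporter address, the module name `tcp_v6`, and the probed address are assumptions chosen for illustration and are not taken from this repository's configuration.

```python
from urllib.parse import urlencode


def probe_url(exporter: str, target: str, module: str) -> str:
    """Build the URL Prometheus would scrape to probe `target` via a blackbox_exporter.

    `exporter` is the host:port of the blackbox_exporter instance.
    """
    query = urlencode({"target": target, "module": module})
    return f"http://{exporter}/probe?{query}"


if __name__ == "__main__":
    # Hypothetical exporter on the remote VM and a documentation-range IPv6 target.
    print(probe_url("bbe-ipv6.example.net:9115", "[2001:db8::1]:443", "tcp_v6"))
```

With one exporter instance per GCP project, each project's Prometheus would presumably point at a different `exporter` address, but the exact layout is not described in these notes.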
Blackbox_exporter probes now timeout at 9s instead of 5s
This is a small release that does one principal thing: it changes the timeout for all blackbox_exporter probes to 9s. Previously, many were at 5s, which is likely not enough for some of our less well-provisioned sites in far-flung places. Indeed, for some of those, 9s might not even be enough, but nearly doubling the current timeout will be an improvement.
The one other small change is that the "Ops: Platform Overview" Grafana dashboard link was updated in alerts.yml.
Weekly release: 2018-02-06 to 2018-02-13
The release brings three new changes:
- Scraping of the Prom node_exporter instances running in the script_exporter and snmp_exporter GCE instances.
- New alerts for the script_exporter job and metrics.
- Imports the JSON for the Grafana dashboard "Ops: Switch Overview".
For the full list of changes, see the diff between this release and the last.
Weekly release: 2018-01-29 to 2018-02-06
- Adds a new JSON Grafana dashboard for the paris traceroute pipeline: Pipeline_PT.json
- Adds 4 new Prometheus recording rules for switch discard metrics.
- Re-merges the script-exporter scrape jobs such that all script-exporter targets will get scraped at 1m intervals again. For the ndt_e2e script, the mitigation to avoid end-to-end testing every minute is for the ndt_e2e script to cache the result and only refresh the cache every 10 minutes, unless the service is down, in which case it will retest every probe (i.e., every minute).
- Adds an explicit version to the gcp-service-discovery Docker image (v.1.0)
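The ndt_e2e caching behavior described above can be sketched roughly as follows. This is an illustration of the stated policy (reuse a passing result for 10 minutes, retest on every probe while the service is down), not the actual ndt_e2e script; the class name `CachedProbe` and its parameters are hypothetical.

```python
import time
from typing import Callable, Optional


class CachedProbe:
    """Cache a test result, refreshing it only when stale or when the last run failed."""

    def __init__(
        self,
        run_test: Callable[[], bool],
        ttl: float = 600.0,  # 10 minutes, per the release note
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._run_test = run_test
        self._ttl = ttl
        self._clock = clock
        self._last_ok: Optional[bool] = None  # None means "never tested"
        self._last_time = 0.0

    def probe(self) -> bool:
        """Return the cached result, rerunning the test if required."""
        now = self._clock()
        expired = (now - self._last_time) >= self._ttl
        # Retest when there is no result yet, the last result was a
        # failure, or the cached result is older than the TTL.
        if self._last_ok is None or not self._last_ok or expired:
            self._last_ok = self._run_test()
            self._last_time = now
        return self._last_ok
```

Called once per 1m scrape, a healthy service would be tested end to end about once every 10 minutes, while a failing one would be retested on every scrape until it recovers.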
See the change details by viewing the CS between the previous release and this one.