Releases: m-lab/prometheus-support

New alerts for collectd-mlab metrics.

15 May 17:05
e5b8404

We recently added a metric for whether collectd-mlab is healthy on nodes. This release simply adds two new alerts for when collectd-mlab is either down or missing.
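As a rough illustration, a "down" and a "missing" alert for such a health metric could be written as a Prometheus 2.x rule group like the one below. The metric name `collectd_mlab_success`, the alert names, the `for` durations, and the `severity` label are all assumptions made for this sketch, not the names used in the repository.

```yaml
groups:
  - name: collectd-mlab            # hypothetical group name
    rules:
      # "Down": the health metric exists but reports an unhealthy state.
      - alert: CollectdMlabDown    # hypothetical alert name
        expr: collectd_mlab_success == 0   # hypothetical metric name
        for: 30m
        labels:
          severity: ticket
        annotations:
          summary: "collectd-mlab is unhealthy on {{ $labels.instance }}"

      # "Missing": no time series for the health metric exists at all.
      - alert: CollectdMlabMissing # hypothetical alert name
        expr: absent(collectd_mlab_success)
        for: 30m
        labels:
          severity: ticket
        annotations:
          summary: "No collectd-mlab health metric is being reported"
```

The two cases need separate expressions because a comparison such as `== 0` returns nothing when the series does not exist, so it cannot detect a missing metric on its own.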

Monitor blackbox_exporter instances (correctly)

07 May 17:44
30505e0

A previous release attempted to implement monitoring of our blackbox_exporter (BBE) instances, but did so incorrectly. This release fixes that implementation. Additionally, it turns off service auto-discovery for the BBE instance running in the prometheus-federation k8s cluster in favor of manually specifying a target rule in the Prom configs. Doing this makes the configurations for the IPv4 and IPv6 BBE instances more or less the same, since auto-discovery won't work for the IPv6 instance running on a Linode VM.
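A manually specified target for an exporter is typically a `static_configs` entry in the Prometheus scrape configuration. The sketch below assumes hypothetical job names and hostnames; only port 9115, the blackbox_exporter default, is a known value.

```yaml
scrape_configs:
  # Scrape the exporter's own /metrics endpoint (not /probe), so that
  # Prometheus can tell whether the exporter itself is up.
  - job_name: 'blackbox-exporter-ipv4'        # hypothetical job name
    static_configs:
      - targets: ['bbe-ipv4.example.net:9115']  # hypothetical host

  - job_name: 'blackbox-exporter-ipv6'        # hypothetical job name
    static_configs:
      - targets: ['bbe-ipv6.example.net:9115']  # hypothetical host
```

With both instances declared the same way, an expression such as `up{job=~"blackbox-exporter-.*"} == 0` would cover either one.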

Alerts for experiment metrics, alerts for BBE, bugfix in 1 dashboard

01 May 21:58
7592ef1

The bulk of this release is new Prometheus alerts. We now have alerting for:

  • blackbox_exporter metrics that mlab-ns now relies on.
  • the blackbox_exporter instances themselves.
  • a new instance of node_exporter that is running on eb.measurementlab.net.

Additionally this release contains:

  • A new Prometheus scrape job which will scrape a node_exporter instance on EB.
  • A bugfix to the Ops_PlatformOverview Grafana dashboard.
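For illustration, a scrape job for a single node_exporter instance could look like the following. The job name is hypothetical, and the port is an assumption: 9100 is the node_exporter default, but the release notes do not say which port EB uses.

```yaml
scrape_configs:
  - job_name: 'eb-node-exporter'     # hypothetical job name
    static_configs:
      # Hostname from the release notes; port 9100 is assumed (the default).
      - targets: ['eb.measurementlab.net:9100']
```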

Weekly release: Retire status.* URLs for mlab-oti

16 Apr 17:16
f50c3f6

This release completes the turndown of the legacy status.* URLs for the Prometheus, Grafana, and Alertmanager stack.

All projects and all services should now be accessed via their TLS names.

Weekly release: Add Ops Overview dashboard & Alertmanager with Basic Auth

09 Apr 15:53
d07ebce

This release includes:

Grafana Updates / Fixes:

  • Add dashboard -- Ops: Platform Overview
  • Increase the nginx ingress's proxy-connect-timeout to exceed the Prometheus query timeout. Should fix "Gateway Timeout" errors.

Alerts Changes

  • Alertmanager links sent to Slack will have basic auth credentials embedded, so clicking on those links should "just work" without prompting for a username / password.
  • Adds new alert case for NagiosExporterUnavailable
  • Updates ParserDailyVolumeTooLow to only count rows that use status="ok".
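The ParserDailyVolumeTooLow change amounts to adding a label matcher to the alert expression. The sketch below shows the general shape only: the metric name `etl_test_count`, the threshold, and the `for` duration are assumptions, not the values used in the repository.

```yaml
groups:
  - name: etl-parser                 # hypothetical group name
    rules:
      - alert: ParserDailyVolumeTooLow
        # Count only rows whose status label is "ok"; rows with any other
        # status no longer contribute to the daily total.
        expr: sum(increase(etl_test_count{status="ok"}[1d])) < 1000000
        for: 1h
        labels:
          severity: ticket
        annotations:
          summary: "Parser processed too few successful rows in the last day"
```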

Weekly release: TLS & Basic Auth

02 Apr 19:35
5b2f02f
  • Add TLS & Basic Auth to Grafana, Prometheus, and Alertmanager
  • Add monitoring for data-processing-cluster and minimal alerts on etl-gardener
  • Fix table names for bigquery-exporter queries
  • Update nagios exporter alerts to cover both deployments of nagios
  • Add additional IPv6 targets to Prometheus

IPv6 monitoring

26 Mar 17:08
146d998

The principal change in this release is the addition of IPv6 monitoring. Since GCP doesn't currently support IPv6 for most applications, monitoring is enabled via a remote (Linode in this case) VM running several Docker instances (one for each GCP project) of the Prometheus blackbox_exporter.
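Probing through a remote blackbox_exporter is usually configured with the relabeling pattern from the blackbox_exporter documentation: the listed targets become the `target` URL parameter, and the scrape itself is sent to the exporter's address. The sketch below assumes a hypothetical exporter hostname, probe target, job name, and module name (`icmp_v6`, which would have to be defined in the exporter's own config with an IPv6 preference).

```yaml
scrape_configs:
  - job_name: 'blackbox-ipv6'          # hypothetical job name
    metrics_path: /probe
    params:
      module: [icmp_v6]                # hypothetical module name
    static_configs:
      - targets: ['ndt.example.net']   # hypothetical probe target
    relabel_configs:
      # Pass the original target to the exporter as ?target=...
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed host as the instance label.
      - source_labels: [__param_target]
        target_label: instance
      # Send the actual scrape to the remote exporter on the IPv6-capable VM.
      - target_label: __address__
        replacement: 'bbe-ipv6.example.net:9115'   # hypothetical host
```

With one exporter container per GCP project, each project's Prometheus would point `__address__` at its own container, presumably distinguished by port or hostname.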

Blackbox_exporter probes now time out at 9s instead of 5s

12 Mar 20:54
705024b

This is a small release that does one principal thing: it changes the timeout for all blackbox_exporter probes to 9s. Previously, many were at 5s, which is likely not enough for some of our less well-provisioned sites in far-flung places. Indeed, for some of those sites even 9s might not be enough, but nearly doubling the current timeout will be an improvement.
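In blackbox_exporter, the probe timeout is set per module in the exporter's config file. A minimal sketch of a module with a 9s timeout follows; the module name and prober are assumptions chosen for illustration.

```yaml
modules:
  http_2xx_example:        # hypothetical module name
    prober: http
    timeout: 9s            # previously 5s for many probes
    http:
      preferred_ip_protocol: ip4
```

Note that the effective timeout is also bounded by the `scrape_timeout` of the Prometheus job that triggers the probe, so that value needs to be at least as large for the 9s setting to take full effect.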

The one other small change is that the "Ops: Platform Overview" Grafana dashboard link was updated in alerts.yml.

Weekly release: 2018-02-06 to 2018-02-13

13 Feb 18:25
fd908e2

This release brings three changes:

  • Scraping of the Prom node_exporter instances running in the script_exporter and snmp_exporter GCE instances.

  • New alerts for the script_exporter job and metrics.

  • Imports the JSON for the Grafana dashboard "Ops: Switch Overview".

For the full list of changes, see the diff between this release and the last.

Weekly release: 2018-01-29 to 2018-02-06

06 Feb 16:16
49f1671
  • Adds a new JSON Grafana dashboard for the paris traceroute pipeline: Pipeline_PT.json
  • Adds 4 new Prometheus recording rules for switch discard metrics.
  • Re-merges the script-exporter scrape jobs such that all script-exporter targets will get scraped at 1m intervals again. For the ndt_e2e script, the mitigation to avoid end-to-end testing every minute is for the ndt_e2e script to cache the result and only refresh the cache every 10 minutes, unless the service is down, in which case it will retest every probe (i.e., every minute).
  • Adds an explicit version to the gcp-service-discovery Docker image (v.1.0)

See the change details by viewing the CS between the previous release and this one.
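For readers unfamiliar with recording rules, the switch discard rules mentioned above would take roughly the following form. This is a sketch only: the rule name, the `site` label, and the use of the IF-MIB counter `ifOutDiscards` (as typically exposed by snmp_exporter) are assumptions, and the release notes do not list the four actual rules.

```yaml
groups:
  - name: switch-discards            # hypothetical group name
    rules:
      # Precompute the per-site discard rate so dashboards and alerts
      # can query one cheap series instead of re-evaluating the rate.
      - record: site:ifOutDiscards:rate5m        # hypothetical rule name
        expr: sum by (site) (rate(ifOutDiscards[5m]))
```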