Releases: m-lab/prometheus-support

Weekly Release

07 Nov 13:58
bf35449

Changes include:

  • Increases RAM for the Prometheus servers in sandbox and staging. Production should be unchanged.
  • Adds a cAdvisor DaemonSet to the prometheus-federation cluster.
  • Adds a local Alertmanager configuration for development and testing.
  • Adds a much-improved alert template for Slack (including dashboard, GitHub, and silence buttons).
  • Improves dashboards for Gardener, Batch pipeline, and Prometheus self-monitoring.
  • Auto-deploys dashboards upon merges to master.
  • Makes several additional alerts conditional on GMX metrics.

All production dashboards are static; Adds new Gardener alert

17 Oct 18:26
a6611c5

This release includes static renderings of all dashboards in mlab-oti. Going forward, all dashboards in mlab-oti should be static.

This release also includes a new alert for Gardener processing: Gardener_ParseTimeDifferenceTooOld

Dashboard updates in production

10 Oct 21:29
2f13f02

Adds two dashboards:

  • Pipeline: Gardener
  • Pipeline: Parser

It also fixes NDT: Early Warning.

Upgrades Prometheus to 2.4.x

09 Oct 16:02
15de345

The major feature of this release is an upgrade from the 1.8.x version of Prometheus to 2.4.x.

There are also a number of smaller changes and improvements. A couple of notable changes:

  • Adds blackbox_exporter probes of the services running behind our k8s nginx ingress, such as Prometheus, Grafana, Alertmanager, and GMX.
  • configmap-reloader now uses internal service names instead of public ones.

Moves (most) everything into the default namespace

19 Sep 20:02
70a0a11

There was an apparent issue with the external-dns deployment: the deployment was in the default namespace, but the RBAC roles were in the external-dns namespace, which caused the deployment to fail. That failure in turn caused the GitHub Maintenance Exporter deployment to fail, because no DNS records were being created. The consensus seems to be that, for now, namespaces are more trouble than they are worth in the prometheus-federation cluster. This PR puts most everything in the default namespace.
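
The kind of mismatch described above can be caught with a small check before deploying. The following is a minimal sketch, not the project's actual tooling: it assumes the RBAC objects were bindings whose subjects name a ServiceAccount in a specific namespace, and the manifests, the `service_account_is_bound` helper, and the `external-dns` account name are all hypothetical illustrations.

```python
def service_account_is_bound(deployment, bindings):
    """Return True if some binding names the deployment's ServiceAccount
    in the deployment's own namespace (hypothetical helper)."""
    namespace = deployment["metadata"].get("namespace", "default")
    pod_spec = deployment["spec"]["template"]["spec"]
    account = pod_spec.get("serviceAccountName", "default")
    for binding in bindings:
        for subject in binding.get("subjects", []):
            if (subject.get("kind") == "ServiceAccount"
                    and subject.get("name") == account
                    and subject.get("namespace") == namespace):
                return True
    return False


# Illustrative manifests only; the real ones may look different.
deployment = {
    "kind": "Deployment",
    "metadata": {"name": "external-dns", "namespace": "default"},
    "spec": {"template": {"spec": {"serviceAccountName": "external-dns"}}},
}

# Binding that still points at the old namespace: the account is not covered.
mismatched = [{
    "kind": "ClusterRoleBinding",
    "subjects": [{"kind": "ServiceAccount", "name": "external-dns",
                  "namespace": "external-dns"}],
}]

# Binding moved to the default namespace alongside the deployment.
fixed = [{
    "kind": "ClusterRoleBinding",
    "subjects": [{"kind": "ServiceAccount", "name": "external-dns",
                  "namespace": "default"}],
}]

print(service_account_is_bound(deployment, mismatched))  # False
print(service_account_is_bound(deployment, fixed))       # True
```

Keeping the deployment and its RBAC objects in one namespace, as this release does, makes such a check trivially pass.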

One-off bug fix release

18 Sep 17:04
0cb0fba

This release fixes a bug introduced in the previous release that was causing floods of spurious alerts to fire for the LameDuckMetricMissingForNode alert.

Weekly release: 2018-09-10 to 2018-09-18

18 Sep 16:03
db47ce2

This release introduces a new k8s deployment, service, and ingress for the GitHub Maintenance Exporter.

Weekly release: 2018-08-28 to 2018-09-10

10 Sep 16:51
82fab1b

This release features:

  • A number of improvements to alerting, including fixes that make some existing alerts less noisy, plus some new alerts.
  • An update of data-processing-cluster's Prometheus instance to v2.3.2.
  • A new BQ exporter query to check for completeness of NDT test annotations.

Weekly release: 2018-08-23 to 2018-08-28

28 Aug 13:54
49ea91f

This release includes a typo fix for the ParserDailyVolumeTooLow dashboard and increases the timeout for the SnmpScrapingDownAtSite alert to 60m.

Weekly release: 2018-08-14 to 2018-08-23

23 Aug 18:22
0625ff0

This release increases the default RAM allocated to Prometheus in mlab-oti and increases the cache index flag parameters to improve interactive query support.

In addition, the ParserDailyVolumeTooLow alert is now built on a recording rule, which should make its evaluation much more efficient.