Releases: m-lab/prometheus-support
Weekly Release
Changes include:
- Increases RAM for Prometheus servers in sandbox & staging. Production should be unchanged.
- Adds cAdvisor DaemonSet to prometheus federation cluster
- Adds local alertmanager configuration for development & testing
- Adds a much-improved alert template for Slack (including dashboard, GitHub, and silence buttons)
- Dashboard improvements for Gardener, Batch pipeline, and Prometheus self-monitoring.
- Dashboards are auto-deployed upon merges to master.
- Several additional alerts are now conditional on GMX metrics.
All production dashboards are static; Adds new Gardener alert
This release includes static renderings of all dashboards in mlab-oti. Going forward, all dashboards in mlab-oti should be static.
This release also includes a new alert for Gardener processing: Gardener_ParseTimeDifferenceTooOld.
Dashboard updates in production
Adds two dashboards:
- Pipeline: Gardener
- Pipeline: Parser
Also fixes the NDT: Early Warning dashboard.
Upgrades Prometheus to 2.4.x
The major feature of this release is an upgrade from the 1.8.x version of Prometheus to 2.4.x.
There are also a number of smaller changes and improvements. A couple of notable changes:
- Adds blackbox_exporter probes of the services running behind our k8s nginx ingress, such as prometheus, grafana, alertmanager and GMX.
- configmap-reloader now uses internal service names instead of public ones.
Moves (most) everything into the default namespace
There was an apparent issue with the external-dns deployment: the deployment was in the default namespace but the RBAC roles were in the external-dns namespace, causing the deployment to fail. That failure in turn caused the GitHub Maintenance Exporter deployment to fail, because no DNS records were being created. Consensus seems to be that, for now, namespaces are more trouble than they are worth in the prometheus-federation cluster. This PR puts most everything in the default namespace.
One-off bug fix release
This release fixes a bug introduced in the previous release that was causing floods of spurious LameDuckMetricMissingForNode alerts to fire.
Weekly release: 2018-09-10 to 2018-09-18
This release introduces a new k8s deployment, service, and ingress for the GitHub Maintenance Exporter.
Weekly release: 2018-08-28 to 2018-09-10
This release features:
- A number of improvements to alerting, including fixes for some existing alerts to make them less noisy, plus some new alerts.
- Updates data-processing-cluster's Prometheus instance to v2.3.2.
- Adds a new BQ exporter query to check for completeness of NDT test annotations.
Weekly release: 2018-08-23 to 2018-08-28
Includes a typo fix for the ParserDailyVolumeTooLow dashboard.
Increases the timeout for the SnmpScrapingDownAtSite alert to 60m.
Weekly release: 2018-08-14 to 2018-08-23
This release increases the default RAM allocated to prometheus in mlab-oti and increases the cache index flag parameters to improve interactive query support.
In addition, the ParserDailyVolumeTooLow alert is now built on a recording rule, which should make its evaluation much more efficient.
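For readers unfamiliar with the pattern, here is a minimal sketch of an alert built on a recording rule. It uses Prometheus 2.x rule-file syntax, which may differ from what this release actually deployed. The group name, recorded series name, source metric, threshold, `for` duration, and label values are all hypothetical placeholders, not the project's real configuration.

```yaml
groups:
  - name: parser_volume_example          # hypothetical group name
    rules:
      # Recording rule: the expensive 1-day aggregation is evaluated once per
      # rule interval and stored as a new, cheap-to-query series.
      - record: example:parser_tests:increase1d          # hypothetical series name
        expr: sum(increase(example_parser_test_total[1d]))  # hypothetical metric

      # The alert compares the precomputed series against a threshold instead
      # of recomputing the long-range aggregation on every evaluation.
      - alert: ParserDailyVolumeTooLow
        expr: example:parser_tests:increase1d < 1000000  # hypothetical threshold
        for: 1h                                          # hypothetical duration
        labels:
          severity: ticket                               # hypothetical label
        annotations:
          summary: Parser daily volume is lower than expected.
```

The benefit of this arrangement is that dashboards and alerts can both read the recorded series, so the costly range query is not repeated by each consumer.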