Monitoring and Alerting
This section explains how monitoring and alerting work in the Cloud Platform.
The hale-platform collects alerts in a single Slack channel, #website-builder-alerts.
Pingdom is set up for all live sites hosted on the hale-platform. The list of monitored URLs is managed with Terraform and can be found here.
To add a site to Pingdom, you need the following information:
- The name identifying the site
- The host of the site, e.g. "google.co.uk"
- The URL path of the site, e.g. "/images" (for the root, use "/")
Open pingdom.tf, add a new line at the bottom of the locals block, and fill it in like the example below.
"magistrates-recruitment" = { host = "magistrates.judiciary.uk", url = "/" }
To remove a site from Pingdom, delete the line that corresponds to that site. Raise a PR with the Cloud Platform team for your changes to take effect.
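As a rough sketch of how an entry sits inside the locals block, the fragment below shows the existing example alongside one new site. The local name `pingdom_checks` and the `example-site` entry are hypothetical placeholders, not taken from the real pingdom.tf, so match whatever names the file actually uses.

```hcl
# Illustrative sketch only: `pingdom_checks` and "example-site" are
# hypothetical; use the names already present in pingdom.tf.
locals {
  pingdom_checks = {
    # Existing entry from the example above.
    "magistrates-recruitment" = { host = "magistrates.judiciary.uk", url = "/" }

    # New site added at the bottom of the block:
    # "<name>" = { host = "<hostname, no scheme>", url = "<path>" }
    "example-site" = { host = "www.example.com", url = "/" }
  }
}
```

Removing a site is the reverse: delete its line from the block and raise the PR.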
We currently use Prometheus to monitor the infrastructure that runs the service hosted in Cloud Platform. We monitor each namespace for disruption to its pods, such as crash looping, insufficient memory or CPU, or a mismatched deployment. If any of these alerts is triggered, a message is sent to the Slack channel detailing the affected namespace and pod for investigation.
To configure the alerts in Prometheus, each namespace in the Cloud Platform contains a Prometheus YAML file that specifies each event that is monitored. New events can be added to this file; once a PR is raised with the platform, Terraform will apply the new alerts.
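For orientation, a new event in such a file might look roughly like the fragment below. It assumes the file is a Prometheus Operator `PrometheusRule` resource, which this page does not confirm. The resource name, namespace, `severity` value, threshold, and alert name are all hypothetical placeholders, so copy the conventions of the existing file in your namespace instead.

```yaml
# Illustrative sketch only: assumes a PrometheusRule resource. Every name,
# label value, and threshold below is a hypothetical placeholder.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-alert-rules          # hypothetical
  namespace: example-namespace       # hypothetical
spec:
  groups:
    - name: example-pod-health
      rules:
        - alert: ExamplePodCrashLooping
          # True while the container restart counter has been increasing
          # within the last 10 minutes; must hold for 5 minutes to fire.
          expr: rate(kube_pod_container_status_restarts_total{namespace="example-namespace"}[10m]) > 0
          for: 5m
          labels:
            severity: example-team   # assumed label used for Slack routing
          annotations:
            message: "Pod {{ $labels.pod }} in {{ $labels.namespace }} is restarting frequently."
```

The `expr` field is an ordinary PromQL expression, so other monitored events such as memory or CPU pressure would follow the same shape with a different expression.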
We use GOV.UK Notify to send emails from the platform, for example password resets. It is essential that we learn of any issues with this service promptly.
GOV.UK Notify can send alerts via a webhook, so we created a new webhook and provided it to GOV.UK Notify. If there is any service disruption, we now get an alert in our alerts channel.