lvmbeat provides two different utilities:
- As an actor, it functions as a middleware for "heartbeat" messages between different elements within the LVM operations system and the ECP dome system.
- It also includes a service that runs outside LCO and receives a heartbeat from the observatory, alerting when it does not, which indicates a problem with LCO connecting to the internet or a major networking issue.
The dome PLC includes a heartbeat feature. If a certain Modbus variable has not been set for a minute by an external source, the PLC understand that the high level software is not running properly and it will close the dome.
lvmbeat provides a middleware that monitors all the heartbeats from critical components that must be running for the dome to be operated in robotic mode (currently only the Overwatcher, but potentially others like the lvm-api). Every few seconds, lvmbeat reviews all the heartbeats that it has received in the last and, if all the necessary heartbeats are present, it emits an lvmecp heartbeat command that will write the PLC heartbeat command, keeping the dome open. The following diagram shows the architecture of the system:
Note that the lvmbeat actor does not have a loop that emits the heartbeat to the ECP on a loop. Heartbeats are emitted when the lvmbeat actor receives a set command from one of the clients. A cut-off time is in place to avoid sending commands to lvmecp too frequently.
In addition to this, lvmbeat regularly calls an API route outside LCO to indicate that it is alive and that the connection from the observatory to the internet is working.
The configuration file config.yaml looks like
actor:
name: lvmbeat
host: localhost
port: 5672
log_dir: /data/logs/lvmbeat
timeout: 30
outside_monitor:
url: null
interval: 15
send_email_after: 300
heartbeats:
- name: overwatcher
critical: truetimeout sets how long after one of the critical heartbeat has not been received, the lvmbeat actor will stop emitting commands to lvmecp.
To add a new heartbeat, simply add it to the heartbeats section and then have the client responsible for it send lvmbeat set <heartbeat_name> every few seconds (but more frequently than timeout).
The outside_monitor section defines the outside URL that the actor will call every outside_monitor.interval. send_email_after indicates how long the external monitor service will wait after it has not received a heartbeat to send an alert email. The url value must be replaced with the URL of the external monitor service. The URL can also be defined by setting the environment variable LVMBEAT_OUTSIDE_MONITOR_URL.
lvmbeat status
16:32:29.736 lvmbeat : {
"heartbeats": {
"overwatcher": {
"current": false,
"last_seen": null
}
},
"last_emitted_ecp": null,
"last_emitted_outside": "2024-12-26T16:32:25.935047Z",
"network": {
"lco": true,
"outside": true
}
}The actor accepts a status command that returns the last time each one of the heartbeats was received, the last time heartbeats were emitted to lvmecp and to the external monitor. Finally, it reports whether the connection to the observatory (from the LVM subnet) and the outside world is working.
The external monitor runs outside LCO. It provides a simple API that lets the lvmbeat actor running at LCO indicate that it is alive and can connect. When this connection has not been established for a certain amount of time, the external monitor sends an email to the LVM team alerting that the connection to LCO may be down.
The external monitor must run in such a way that it can provide a semi-static ingress point to the application, and must have access to an SMTP server to send the emails. We are currently deploying the monitor as an Azure container app.
The following routes are available in the API:
GET /heartbeat: sets the internal heartbeat. Usually called by thelvmbeatactor from LCO.GET /status: returns the last time the heartbeat was received, whether the alert is active, and whether the monitor is enabled.GET /enable: enables the monitor.GET /disable: disables the monitor. When disabled, the monitor will keep track of the heartbeats but will not send an alert email.GET /email-test: sends a test alert email. Useful to test the SMTP server configuration.GET /version: returns the version of thelvmbeatcode.
There are three elements in the heartbeat ecosystem that need to be deployed: the lvmbeat actor, the lvmbeat external monitor, and the lvmecp actor. In principle they can be deployed in any order, but unless done quickly this is likely to cause false alert emails and the dome to close (if open).
The following sequence is the recommended deployment order:
- Close the dome.
- Deploy the
lvmecpactor. - Deploy the
lvmbeatactor. - Deploy the external monitor service.
- If the ingress host to the external monitor does not match the one in the
lvmbeatconfiguration, siable the external monitor, update the configuration, restart the actor, and enable the external monitor again.
The actors are deployed in the LVM Kubernetes cluster like other actors and no special considerations must be taken into account. However, lvmecp will use lvmopstools to emit Slack notifications and the correct value for the $LVMOPSTOOLS_CONFIG_FILE environment variable must be set in the deployment.
This instructions assume that we are deploying the external monitor in an Azure cloud environment. The app can also be deployed in a self-hosted environment with the same procedure, as long as an SMTP server is available.
Before deploying the app, a number of services need to be set up in Azure. This should be a one-off operation. The majority of these operations can be performed either through the Azure portal or the Azure CLI.
-
Create a resource group. This is a logical container for resources in Azure. The following command creates a resource group named
sdssin thewestusregion:az group create --name sdss --location westus
-
Create a communication email resource with an Azure-provided email domain, add a communication resource, and register an application. All that is explained here. Note that you'll also need to follow the steps in the prerequisites section (the client secret is not necessary) and to create a secret in the Microsoft Entra application. After this, the SMTP details will be:
- Host:
smtp.azurecomm.net - Port: 587
- TLS: yes
- Username:
<Azure Communication Services Resource name>.<Microsoft Entra Application ID>.<Microsoft Entra Tenant ID> - Password: the value of the secret created in the Microsoft Entra application
- Host:
-
Build the container image. The
lvmbeatrepository workflows build two images:lvmbeatandlvmbeat-monitor. Thelvmbeatimage is the actor that runs at LCO and thelvmbeat-monitoris the external monitor. Both are pushed the theghcr.io/sdssregistry. -
Deploy the application as a container app in Azure. You can do it from the portal, but this command should work:
az containerapp up \ --name lco-monitor \ --image ghcr.io/sdss/lvmbeat-monitor \ --ingress external \ --target-port 80 \ --resource-group <resource-group> \ --env-vars LVMBEAT_EMAIL_RECIPIENTS="LVM Critical Alerts <lvm-critical@sdss.org>" \ LVMBEAT_EMAIL_FROM_ADDRESS="LVM Critical Alerts <do-not-reply-address>" \ LVMBEAT_EMAIL_REPLY_TO="lvm-critical@sdss.org" \ LVMBEAT_EMAIL_HOST="smtp.azurecomm.net" \ LVMBEAT_EMAIL_PORT="587" \ LVMBEAT_EMAIL_TLS="1" \ LVMBEAT_EMAIL_USERNAME="<smtp-username>" \ LVMBEAT_EMAIL_PASSWORD="<smtp-password>" \ LVMBEAT_SEND_EMAIL_AFTER="300"where you must replace
<smtp-username>and<smtp-password>with the values obtained from the Azure Communication Services resource. The<do-not-reply-address>email can be found in theMailFrom addressessections of the provisioned domain in the email communication service.When the service finishes deploying it will return the ingress URL. It can also be found in the Ingress section of the container app in the Azure portal.
-
Test the service by calling the
/statusroute:curl -L -X GET http://<ingress-url>/status
The response should be something like
{ "last_seen": "2024-12-26T16:32:25.935047Z", "enabled": true, "active": false }You can temporarily disable the monitor by calling the
/disableroute. -
Test the SMTP configuration by calling the
/email-testroute:curl -L -X GET http://<ingress-url>/email-test
You should receive an email via the
lvm-critical@sdss.orgmailing list. -
Make sure the monitor is enabled and that the
lvmbeatactor at LCO is calling the/heartbeatroute. -
You may want to use the Azure portal to make sure the deployment can only create one replica of the service, although it never seems to create more than one.
If you need to redeploy the app, you can simply delete the container app with
az containerapp delete --name lco-monitor --resource-group <resource-group>and then redeploy it with the az containerapp up command. Alternatively you can use the update command which will create a new revision and then switch the traffic to it when ready.
az containerapp update --name lco-monitor --image ghcr.io/sdss/lvmbeat-monitor:latest --resource-group <resource-group> --max-replicas 1This will work if you're chaning something in your deployment (for example changing the tag of the image), but if you're just trying to re-pull the latest image, the container won't be restarted. To do that you'll need to go to the Azure portal and restart the revision, or do
az containerapp revision list -n lco-monitor -g <resource-group> -o tableto get the revision number and then
az containerapp revision restart -n lco-monitor -g <resource-group> --revision <revision-number>More details on how to manage revisions can be found here.
The ingress URL remains constant only as long as the revision does not change, which means that in most cases when something is actually changed in the deployment, the URL will change.
