This page should document what monitoring exists and, perhaps more importantly, what monitoring is missing and should be added. When you find a gap in monitoring, create an issue in the operations repo.
A few availability tests run against the production API cluster, but I could not find anywhere that they trigger a notification when something is broken.
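As a rough illustration of what wiring an availability test to a notification could look like, here is a minimal sketch. The function names, the `notify` callback, and any URL passed in are hypothetical; nothing here reflects how the existing availability tests are implemented.

```python
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for a GET request, or 0 if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (4xx/5xx).
        return err.code
    except OSError:
        # Covers URLError, timeouts, and other connection failures.
        return 0


def check_availability(
    url: str,
    notify: Callable[[str], None],
    fetch: Callable[[str], int] = http_status,
) -> bool:
    """Probe `url` and call `notify` with a message if it is not healthy.

    `notify` is a hypothetical hook: it could post to a chat channel,
    send an email, or open an issue in the operations repo.
    """
    status = fetch(url)
    healthy = 200 <= status < 300
    if not healthy:
        notify(f"Availability check failed for {url} (status {status})")
    return healthy
```

Passing `fetch` as a parameter keeps the check testable without network access; in a scheduled job the default `http_status` would be used.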
- Monitor Azure Search indexer for errors
- Monitor Redis for max capacity
- Monitor CosmosDB for any problems or errors
- Monitor crawlers for high usage of disk space and CPU
- Monitor the number of messages in the queues
- Monitor for failed curation PR checks
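
Most of the items above come down to comparing a reported value against a limit. The sketch below shows one way such checks could be expressed. The metric names and limits are hypothetical placeholders, not values taken from the running system, and collecting the metrics themselves is out of scope here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str       # hypothetical metric name
    limit: float      # alert when the reported value exceeds this
    description: str  # human-readable reason for the alert


# Hypothetical limits; tune these against real usage before relying on them.
THRESHOLDS = [
    Threshold("search_indexer_errors", 0, "Azure Search indexer reported errors"),
    Threshold("redis_memory_used_pct", 90.0, "Redis is close to max capacity"),
    Threshold("cosmosdb_errors", 0, "CosmosDB reported errors"),
    Threshold("crawler_disk_used_pct", 85.0, "Crawler disk usage is high"),
    Threshold("crawler_cpu_pct", 90.0, "Crawler CPU usage is high"),
    Threshold("queue_message_count", 10000, "Queue backlog is large"),
    Threshold("failed_curation_pr_checks", 0, "Curation PR checks failed"),
]


def find_breaches(metrics: dict, thresholds: list = THRESHOLDS) -> list:
    """Return one alert message per breached or missing metric."""
    alerts = []
    for t in thresholds:
        value = metrics.get(t.metric)
        if value is None:
            # A metric that is not reported is itself a monitoring gap.
            alerts.append(f"{t.metric}: no data reported")
        elif value > t.limit:
            alerts.append(f"{t.metric}: {t.description} ({value} > {t.limit})")
    return alerts
```

Treating a missing metric as an alert is a deliberate choice in this sketch: it surfaces gaps in monitoring instead of silently passing.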