
Alert somehow when there are cron failures in managed cloud #19930

Open
roperzh opened this issue Jun 21, 2024 · 22 comments
Labels
~engineering-initiated Engineering-initiated story, such as a bug, refactor, or contributor experience improvement. #g-orchestration Orchestration product group story A user story defining an entire feature

Comments

@roperzh
Contributor

roperzh commented Jun 21, 2024

Problem

Cron failures in managed cloud do not report to the #help-p1 or #help-p2 channels, which means we're not aware of them unless issues are reported by the customer.

Solution

High-level: add a new errors column to the cron_stats table, populate it with any errors that occur during a cron run if they bubble up to the job runner level (i.e., they cause the job to fail), and then use the existing monitor Lambda to report on these errors hourly.

Details

On the server side:

  • Add errors field to cron_stats table (json DEFAULT NULL)
  • Add errors var to Schedule struct to collect errors from jobs
    • Errors should be mapped by job, since a schedule can run several jobs
  • In RunAllJobs, collect err from job (same place we currently log) into new errors var
  • Update Schedule.updateStats and CronStats.UpdateCronStats to accept errors argument
  • If provided, update the errors field of the cron_stats table (sketch below)
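
A minimal Go sketch of the error collection described above, assuming simplified `Job`/`Schedule` types (the real Fleet schedule package differs in detail; the names here are illustrative only):

```go
package schedule

import (
	"context"
	"encoding/json"
	"sync"
)

// Job and Schedule are simplified stand-ins for the real types in Fleet's
// schedule package; jobErrors is the proposed new field.
type Job struct {
	ID  string
	Run func(ctx context.Context) error
}

type Schedule struct {
	mu        sync.Mutex
	jobs      []Job
	jobErrors map[string]string // keyed by job ID, since one schedule can run several jobs
}

// runAllJobs runs every job and records any error it returns, in the same
// place the error is currently logged.
func (s *Schedule) runAllJobs(ctx context.Context) {
	s.mu.Lock()
	s.jobErrors = make(map[string]string)
	s.mu.Unlock()

	for _, job := range s.jobs {
		if err := job.Run(ctx); err != nil {
			s.mu.Lock()
			s.jobErrors[job.ID] = err.Error()
			s.mu.Unlock()
		}
	}
}

// errorsJSON marshals the collected errors for the cron_stats.errors column,
// returning nil when every job succeeded so the column stays NULL.
func (s *Schedule) errorsJSON() ([]byte, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if len(s.jobErrors) == 0 {
		return nil, nil
	}
	return json.Marshal(s.jobErrors)
}
```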

On the monitor side:

  • Add new SQL query to look for all completed schedules since last run with non-null errors
    • Send SNS with job ID, name, errors
  • “Since last run” could be just now - monitor interval (i.e. in the last hour), but ideally we query AWS for last successful run of the Lambda so we don't miss anything if the Lambda stalls or fails for some reason.
  • Add new SNS topic to Terraform to allow sending messages to #help-p2, since these are not likely to require immediate action (see the monitor-side sketch after this list).
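
A hedged sketch of the monitor-side query and SNS publish, assuming the errors column described above; the helper name and use of the AWS SDK for Go v1 SNS client are illustrative, and the real Lambda's wiring may differ:

```go
package monitor

import (
	"database/sql"
	"fmt"
	"log"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/sns"
	_ "github.com/go-sql-driver/mysql"
)

// alertOnCronErrors finds schedules that completed since the last monitor run
// but recorded errors, and publishes one SNS message per schedule.
func alertOnCronErrors(db *sql.DB, client *sns.SNS, topicARN string, since time.Time) error {
	rows, err := db.Query(
		`SELECT id, name, errors, updated_at
		   FROM cron_stats
		  WHERE status = 'completed' AND errors IS NOT NULL AND updated_at >= ?`, since)
	if err != nil {
		return err
	}
	defer rows.Close()

	for rows.Next() {
		var (
			id        int
			name      string
			errsJSON  string
			updatedAt time.Time
		)
		if err := rows.Scan(&id, &name, &errsJSON, &updatedAt); err != nil {
			return err
		}
		msg := fmt.Sprintf("Fleet cron '%s' (job %d, last updated %s) raised errors during its run:\n%s",
			name, id, updatedAt, errsJSON)
		if _, err := client.Publish(&sns.PublishInput{
			TopicArn: aws.String(topicARN),
			Message:  aws.String(msg),
		}); err != nil {
			// Don't abort the whole sweep because one publish failed.
			log.Printf("publish alert for %q: %v", name, err)
		}
	}
	return rows.Err()
}
```

As noted above, the `since` window would ideally come from the Lambda's last successful run rather than a fixed now - interval.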

Other notes:

  • Errors will continue to be logged to CloudWatch as they happen (with a “cron” key in the log)
  • We had discussed a new failed status for jobs, but now that I know a schedule can have multiple jobs, where some may error and some may not, I don’t think this would be as helpful.
@roperzh roperzh added the #g-customer-success Customer success issue. label Jun 21, 2024
@roperzh
Contributor Author

roperzh commented Jun 21, 2024

cc: @lukeheath @georgekarrv

@roperzh roperzh added ~engineering-initiated Engineering-initiated story, such as a bug, refactor, or contributor experience improvement. story A user story defining an entire feature labels Jun 21, 2024
@lukeheath
Member

@roperzh Thanks for submitting this; good idea, and it seems necessary. I'm prioritizing it to the drafting board for estimation.

@lukeheath lukeheath added #g-mdm MDM product group :product Product Design department (shows up on 🦢 Drafting board) #g-endpoint-ops Endpoint ops product group #g-customer-success Customer success issue. and removed #g-customer-success Customer success issue. labels Jun 21, 2024
@lukeheath lukeheath removed #g-endpoint-ops Endpoint ops product group #g-customer-success Customer success issue. labels Jun 21, 2024
@lukeheath
Member

@georgekarrv I'm prioritizing this for estimation. If it requires assistance from infra, please loop in the necessary folks.

@sharon-fdm Do y'all have any cron job failures that should be generating #help-p1 alerts? If so, please add them to this issue description so they can be included.

@rfairburn Heads up that you may get some questions about this. My primary goal is just to make sure that any cron alerts that should generate #help-p1 alerts (like SCEP certificate) do generate alerts.

Thanks all!

@rfairburn
Contributor

This should be possible, but an entirely different mechanism will need to be added to the Terraform monitoring module.

I am thinking the pattern would look like this:

CloudWatch subscription filter -> Lambda function (to process) -> SNS topic (alert #help-p1 Slack)

I'd need to flesh the specifics out, but I think it is definitely possible as long as we have easy-to-match patterns that don't have ambiguity in what I am matching.

This would be separate from our existing cron monitoring which just checks specific cron jobs for their completed status in the db.
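
To illustrate that pattern (a sketch of the subscription-filter approach, not what was ultimately built), a Go Lambda could decode the CloudWatch Logs events delivered by the filter and forward matching lines to SNS; the topic ARN and match strings below are placeholders:

```go
package main

import (
	"context"
	"log"
	"strings"

	"github.com/aws/aws-lambda-go/events"
	"github.com/aws/aws-lambda-go/lambda"
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/sns"
)

// topicARN would come from Terraform via an environment variable; this value
// is a placeholder.
const topicARN = "arn:aws:sns:us-east-1:000000000000:help-p1"

func handler(ctx context.Context, event events.CloudwatchLogsEvent) error {
	// The subscription filter delivers gzipped, base64-encoded log data;
	// Parse() handles the decoding.
	data, err := event.AWSLogs.Parse()
	if err != nil {
		return err
	}
	client := sns.New(session.Must(session.NewSession()))
	for _, e := range data.LogEvents {
		// Only forward lines that unambiguously look like cron errors.
		if strings.Contains(e.Message, `"cron"`) && strings.Contains(e.Message, `"err"`) {
			if _, err := client.Publish(&sns.PublishInput{
				TopicArn: aws.String(topicARN),
				Message:  aws.String(e.Message),
			}); err != nil {
				log.Printf("publish failed: %v", err)
			}
		}
	}
	return nil
}

func main() {
	lambda.Start(handler)
}
```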

@sharon-fdm
Collaborator

@lukeheath I am not aware of any alert mechanism for any of our cron jobs. @getvictor @mostlikelee @lucasmrod do you know otherwise?

Also, @mostlikelee do we have any failure notification when one of our vuln repos fails to do its job?

@lukeheath
Member

@sharon-fdm Yeah, I expect there isn't any hooked up yet. MDM ran into a case where the Fleet server knew the SCEP certificate was expired, but there was no mechanism to alert beyond server logs (from my 50k' view). Since that broke MDM functionality, we would have wanted to know about it in #help-p1.

I'm wondering if EO has anything like that, where an error that would otherwise be logged to the server is something that we'd actually like to know about ASAP in #help-p1. There may not be anything like that for EO since y'all don't deal with as many certs as MDM.

@getvictor
Member

@sharon-fdm Normal GitHub Actions workflows can do a Slack notification on failure, like:

- name: Slack Notification

@roperzh roperzh changed the title Errors for important tasks that happen in cron jobs are not alerted Surface errors from important cron jobs in Fleet Cloud Jun 24, 2024
@roperzh roperzh changed the title Surface errors from important cron jobs in Fleet Cloud Alert errors from important cron jobs in Fleet Cloud in #help-p1 Jun 24, 2024
@roperzh
Contributor Author

roperzh commented Jun 24, 2024

Hey folks, sorry for not being clear (I created the issue in a hurry)

I updated the title now; as Luke mentions, this is to surface cron errors that happen in Cloud, since they currently don't have proper visibility (so we can't use GitHub Actions, etc.)

If you have any crons that are mission critical and want to surface alerts, this is the place to ask for them :)

@mostlikelee
Contributor

Also, @mostlikelee do we have any failure notification when one of our vuln repos fails to do its job?

Yes, this alerts to #help-p2.

I can't think of anything critical in EO, but I think it would be a good metric to monitor cron failure counts.


@noahtalerman noahtalerman removed the :product Product Design department (shows up on 🦢 Drafting board) label Jul 16, 2024
@georgekarrv georgekarrv removed the :demo label Aug 2, 2024
@iansltx
Member

iansltx commented Aug 12, 2024

Based on @ksatter in https://fleetdm.slack.com/archives/C051QJU3D0V/p1723414521461399?thread_ts=1723413693.587369&cid=C051QJU3D0V, we need to mark jobs as something other than "completed" (e.g. "failed") when they fail, at which point we can filter on that + cron in CloudWatch Logs Insights to get something that we can trigger alerting on.

If we had had that alerting on the vulnerabilities cron (including marking as failed), I think we would've caught #21239 Friday afternoon. If that's enough reason to pull this into the sprint, happy to take it as a way to learn more about that part of the codebase.

@lukeheath lukeheath changed the title Alert errors from important cron jobs in Fleet Cloud in #help-p1 Alert #help-p1 or #help-p2 when there are cron failures in managed cloud Nov 5, 2024
@sgress454
Contributor

After reading the comments so far and looking at the current implementation, I have a slightly different proposal. I like the idea of a "failed" status for the cron jobs, but instead of trying to correlate a failed job with a log message, how about adding a new field to the cron_stats table (e.g. "failed_reason" or "notes")? Then we can update the existing monitoring Lambda to check for new failed jobs, and send an SNS message directly from the Lambda using the persisted failure reason. This Lambda already reads from the table and sends SNS messages, so the new functionality would be in keeping with the existing use.

The only gap I see is that the current setup only provides for configuring a single SNS topic, which is directed to #help-p1, so if we wanted failures from certain jobs to go to #help-p2 we'd need to set up a new topic in Terraform and account for it in the Lambda code.
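
A tiny sketch of what that topic routing could look like in the Lambda, with a hypothetical map of which schedules alert to #help-p2; the map, function, and configuration mechanism are illustrative only (the real ARNs would presumably come from Terraform variables):

```go
package monitor

// p2Schedules lists schedules whose failures go to #help-p2 rather than
// #help-p1. Which jobs belong here is a policy decision, not implied by the
// current code.
var p2Schedules = map[string]bool{
	"vulnerabilities": true, // example: important but not page-worthy
}

// topicForSchedule falls back to the existing p1 topic when no p2 topic is
// configured, so current deployments keep working unchanged.
func topicForSchedule(name, p1TopicARN, p2TopicARN string) string {
	if p2TopicARN != "" && p2Schedules[name] {
		return p2TopicARN
	}
	return p1TopicARN
}
```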

@rfairburn
Contributor

I like the idea of the db table recording the status. It would make processing in the lambda much easier.

We should still also include failures in the logs for those outside of cloud/AWS not running the Lambda, however.

@lukeheath
Member

From call with @sgress454:

  • Finding that there are a lot of cron errors that are going off constantly.
  • Most of these are not real errors, but things like no-ops that fail.
  • So we need to go through and review all the cron jobs and make sure they only error when something has really failed (sketch below).
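
For illustration, the kind of change needed is to treat expected no-ops as success rather than returning an error; the sentinel error and helper below are hypothetical, not existing Fleet code:

```go
package schedule

import (
	"context"
	"errors"
	"log"
)

// errNothingToDo marks an expected no-op (e.g. a feature not configured for
// this instance); it is a hypothetical sentinel, not an existing Fleet error.
var errNothingToDo = errors.New("nothing to do")

// runJob reports success for expected no-ops so cron_stats records no error
// and the monitor Lambda stays quiet; real failures still bubble up.
func runJob(ctx context.Context, work func(context.Context) error) error {
	err := work(ctx)
	if errors.Is(err, errNothingToDo) {
		log.Printf("job skipped: %v", err)
		return nil
	}
	return err
}
```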

@lukeheath lukeheath changed the title Alert #help-p1 or #help-p2 when there are cron failures in managed cloud Alert somehow when there are cron failures in managed cloud Nov 19, 2024
sgress454 added a commit that referenced this issue Dec 19, 2024
for #19930 

# Checklist for submitter

- [X] Changes file added for user-visible changes in `changes/`,
`orbit/changes/` or `ee/fleetd-chrome/changes`.
- [X] Input data is properly validated, `SELECT *` is avoided, SQL
injection is prevented (using placeholders for values in statements)
- [X] Added/updated tests
- [X] If database migrations are included, checked table schema to
confirm autoupdate
- [X] Manual QA for all new/changed functionality

# Details

This PR adds a new feature to the existing monitoring add-on. The add-on
will now send an SNS alert whenever a scheduled job like
"vulnerabilities" or "apple_mdm_apns_pusher" exits early due to errors.
The alert contains the job type and the set of errors (there can be
multiple, since jobs can have multiple sub-jobs). By default the SNS
topic for this new alert is the same as the one for the existing cron
system alerts, but it can be configured to use a separate topic (e.g.
dogfood instance will post to a separate Slack channel).

The actual changes are:

**On the server side:**

- Add errors field to cron_stats table (json DEFAULT NULL)
- Added errors var to `Schedule` struct to collect errors from jobs
- In `RunAllJobs`, collect err from job into new errors var
- Update `Schedule.updateStats` and `CronStats.UpdateCronStats` to accept
errors argument
- If provided, update errors field of cron_stats table

**On the monitor side:**

- Add new SQL query to look for all completed schedules since last run
with non-null errors
- send SNS with job ID, name, errors

# Testing

New automated testing was added for the functional code that gathers and
stores errors from cron runs in the database. To test the actual Lambda,
I added a row in my `cron_stats` table with errors, then compiled and
ran the Lambda executable locally, pointing it to my local mysql and
localstack instances:

```
2024/12/03 14:43:54 main.go:258: Lambda execution environment not found.  Falling back to local execution.
2024/12/03 14:43:54 main.go:133: Connected to database!
2024/12/03 14:43:54 main.go:161: Row vulnerabilities last updated at 2024-11-27 03:30:03 +0000 UTC
2024/12/03 14:43:54 main.go:163: *** 1h hasn't updated in more than vulnerabilities, alerting! (status completed)
2024/12/03 14:43:54 main.go:70: Sending SNS Message
2024/12/03 14:43:54 main.go:74: Sending 'Environment: dev
Message: Fleet cron 'vulnerabilities' hasn't updated in more than 1h. Last status was 'completed' at 2024-11-27 03:30:03 +0000 UTC.' to 'arn:aws:sns:us-east-1:000000000000:topic1'
2024/12/03 14:43:54 main.go:82: {
  MessageId: "260864ff-4cc9-4951-acea-cef883b2de5f"
}
2024/12/03 14:43:54 main.go:198: *** mdm_apple_profile_manager job had errors, alerting! (errors {"something": "wrong"})
2024/12/03 14:43:54 main.go:70: Sending SNS Message
2024/12/03 14:43:54 main.go:74: Sending 'Environment: dev
Message: Fleet cron 'mdm_apple_profile_manager' (last updated 2024-12-03 20:34:14 +0000 UTC) raised errors during its run:
{"something": "wrong"}.' to 'arn:aws:sns:us-east-1:000000000000:topic1'
2024/12/03 14:43:54 main.go:82: {
  MessageId: "5cd085ef-89f6-42c1-8470-d80a22b295f8"
@lukeheath lukeheath added #g-orchestration Orchestration product group and removed #g-mdm MDM product group labels Jan 3, 2025
@lukeheath
Member

@sgress454 I merged this and put it on the orchestration board for tracking. QA is just validating the alert works before closing the issue out. Thanks!

@lukeheath
Member

@sgress454 @rfairburn Just confirming this has been released to dogfood. I haven't seen any job failures reported; have we had any in dogfood?

Also, will this be naturally rolled out to all managed cloud instances at the next update? Any additional steps necessary to test this out?

@sgress454
Contributor

This was on my TODO list for this afternoon. It's very possible that Dogfood doesn't have any failures at this time, but I'm going to look at the db and check manually for any rows with errors recorded.

Also, will this be naturally rolled out to all managed cloud instances at the next update? Any additional steps necessary to test this out?

In order to utilize this feature, instances will have to update to the latest monitoring plugin as we did with Dogfood in #24425. We can roll this out to managed instances as quickly or slowly as we like, but it won't happen automatically. We could start with one of the instances known to have had failures in the past.

@sgress454
Contributor

I could also probably simulate (stimulate?) a failure on Dogfood in order to test this e2e.

@rfairburn
Contributor

We definitely should simulate a failure in Dogfood in order to validate this end to end before rolling to cloud.

@rfairburn
Contributor

Once we are confident in Dogfood, I would prefer to roll this out to the entire managed cloud at once for consistency's sake.

The number of unique differences per tenant is already too great as-is.

@sgress454
Contributor

We definitely should simulate a failure in Dogfood in order to validate this end to end before rolling to cloud.

Yup, the code worked, but the SNS notification failed due to a permissions issue; I missed updating the IAM policies for the Lambda role.

2025/01/07 22:48:51 main.go:82: AuthorizationError: User: arn:aws:sts::160035666661:assumed-role/fleet-dogfood-cron-monitoring-lambda/fleet-dogfood_cron_monitoring is not authorized to perform: SNS:Publish on resource: arn:aws:sns:us-east-2:160035666661:fleet-dogfood-p2-alerts because no identity-based policy allows the SNS:Publish action

See #25268

@sgress454
Contributor

Realized we never circled back to this -- we had a successful test of the alert on Dogfood, so I think the only thing required to deploy this to managed cloud customers would be to update their monitoring add-on version to the latest.
