Add visibility to cron failures in production #22368

mostlikelee · 2024-09-25T13:02:03Z

Goal

User story
As a Fleet engineer,
I want to be alerted when failures occur in Fleet server cron jobs
so that I can be proactive in resolving customer issues before they are reported.

Context

Requestor(s): @mostlikelee
Product designer: _________________________

Cron jobs tend to fail silently for different reasons in different environments, with varying impact. The following issues are examples of recent customer-impacting cron failures where we were not alerted:
#22366
#22364
#21292

Ideally, we would like to have visibility into failures on cloud-hosted AND self-hosted environments.

Changes

Product

Engineering

Feature guide changes: TODO
Database schema migrations: TODO
Load testing: TODO

ℹ️ Please read this issue carefully and understand it. Pay special attention to UI wireframes, especially "dev notes".

QA

Risk assessment

Requires load testing: TODO
Risk level: Low / High TODO
Risk description: TODO

Manual testing steps

Step 1
Step 2
Step 3

Testing notes

Confirmation

Engineer (@____): Added comment to user story confirming successful completion of QA.
QA (@____): Added comment to user story confirming successful completion of QA.

lukeheath · 2024-09-25T14:54:25Z

@mostlikelee Thanks for filing this. I am prioritizing to the drafting board for estimation and assigning to @sharon-fdm.

Where are you thinking we report failures to? #help-p2?

This is definitely something we need visibility into.

mostlikelee · 2024-09-25T15:15:03Z

#help-p2 seems reasonable, or possibly Datadog metrics, or a combination of both.

noahtalerman · 2024-10-04T13:36:16Z

Hey @lukeheath, this engineering-initiated story and #22367 didn't make the 3-week drafting => estimation timeline.

Do we want to carry them both over to the next design sprint? If so, I think we can leave them on the board.

lukeheath · 2024-10-04T15:49:40Z

@sharon-fdm It looks like this missed estimation. Would you please estimate async with the team or in a meeting today? That way we'll know if we have room for it in the next sprint.

sharon-fdm · 2024-10-07T16:13:59Z

@noahtalerman, @lukeheath, we think this should be done without giving each server access to our Slack. Instead, we already have one central location that collects data from servers: Heroku.

The idea is:

Servers will send cron failures as statistics.
Heroku will count the failures and, if any exist, use a Slack app to forward them to our Slack (#help-p2).

For estimation:

Change stats frequency from once per day to once per 3 hours and report cron failures
Build Slack app (2 points)
Heroku to accumulate the errors and use the Slack app to notify
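
The counting-and-forwarding step above could be sketched roughly as follows. This is only an illustration under assumptions, not Fleet's implementation: the `CronFailureReport` type, its fields, the `slackMessage` function, and the job names are all hypothetical. The only external fact relied on is that a Slack incoming webhook accepts a JSON body with a `text` field.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strings"
)

// CronFailureReport is a hypothetical shape for the cron failure data a
// server might include in its statistics payload.
type CronFailureReport struct {
	Job      string `json:"job"`
	Failures int    `json:"failures"`
}

// slackMessage sums failures per job and builds the JSON body for a Slack
// incoming webhook. It returns ok=false when there is nothing to report,
// so the caller can skip the notification entirely.
func slackMessage(reports []CronFailureReport) (body string, ok bool) {
	totals := map[string]int{}
	for _, r := range reports {
		if r.Failures > 0 {
			totals[r.Job] += r.Failures
		}
	}
	if len(totals) == 0 {
		return "", false
	}

	// Sort job names so the message is stable between runs.
	jobs := make([]string, 0, len(totals))
	for job := range totals {
		jobs = append(jobs, job)
	}
	sort.Strings(jobs)

	lines := make([]string, 0, len(jobs))
	for _, job := range jobs {
		lines = append(lines, fmt.Sprintf("%s: %d", job, totals[job]))
	}

	payload, err := json.Marshal(map[string]string{
		"text": "Cron failures reported since last check:\n" + strings.Join(lines, "\n"),
	})
	if err != nil {
		return "", false
	}
	return string(payload), true
}

func main() {
	// Job names are placeholders, not real Fleet cron names.
	body, ok := slackMessage([]CronFailureReport{
		{Job: "example_job_a", Failures: 2},
		{Job: "example_job_b", Failures: 0},
		{Job: "example_job_a", Failures: 1},
	})
	fmt.Println(ok, body)
}
```

Posting the returned body to a webhook URL is left out on purpose; the point is that the collector only notifies when at least one failure was counted.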

sharon-fdm · 2024-10-09T18:57:09Z

Timebox (1 point): @eashaw to check whether we already pick up cron job failures as part of the existing statistics.

lukeheath · 2024-10-10T00:01:33Z

@sharon-fdm Great idea! Using the existing statistics mechanism makes sense.

RachelElysia · 2024-10-16T18:22:41Z

Next steps:

Timebox BE: Confirm what is being sent to Heroku. Eric: Check whether statistics are still sent to Heroku when a cron job fails.
Decide how to define the cron failures to Heroku
Heroku will count the failures and, if any exist, use a Slack app to forward them to our Slack (#help-p2)
Change stats frequency from once per day to once per 3 hours and report cron failures
Build Slack app (2 points)
Heroku to accumulate the errors and use the Slack app to notify

Discovery estimation: 1 pt

lukeheath · 2024-11-05T23:19:20Z

I think we want to take a different approach here, since we only want to support our managed cloud instances with this feature. We can set up CloudWatch notifications like the ones we use for 500s, but watching for cron failures in the logs and triggering a notification. That won't require any code changes (except making sure all cron failures log a standardized message).

I'm going to close this issue in favor of #19930 but feel free to re-open if I'm not understanding the intent properly.

fleet-release · 2024-11-05T23:19:22Z

Cron job failures seen,
Clouds clear for engineers' gaze,
Issues fixed unseen.

mostlikelee added the story and ~engineering-initiated labels Sep 25, 2024

mostlikelee assigned lukeheath Sep 25, 2024

lukeheath added the :product and ~eng-priority labels Sep 25, 2024

lukeheath assigned sharon-fdm and unassigned lukeheath Sep 25, 2024

sharon-fdm added the #g-endpoint-ops label Oct 4, 2024

lukeheath closed this as completed Nov 5, 2024
