Add visibility to cron failures in production #22368

Closed
15 tasks
mostlikelee opened this issue Sep 25, 2024 · 10 comments
Assignees
Labels
~eng-priority Engineering-initiated story that was prioritized. ~engineering-initiated Engineering-initiated story, such as a bug, refactor, or contributor experience improvement. #g-endpoint-ops Endpoint ops product group :product Product Design department (shows up on 🦢 Drafting board) story A user story defining an entire feature

Comments

@mostlikelee
Contributor

mostlikelee commented Sep 25, 2024

Goal

User story
As a Fleet engineer,
I want to be alerted when failures occur in Fleet server cron jobs
so that I can be proactive in resolving customer issues before they are reported.

Context

  • Requestor(s): @mostlikelee
  • Product designer: _________________________

Cron jobs tend to fail silently, for different reasons in different environments and with varying impact. The following issues are examples of recent customer-impacting cron failures where we were not alerted:
#22366
#22364
#21292

Ideally, we would like to have visibility into failures in both cloud-hosted AND self-hosted environments.

Changes

Product

  • UI changes: TODO
  • CLI (fleetctl) usage changes: TODO
  • YAML changes: TODO
  • REST API changes: TODO
  • Fleet's agent (fleetd) changes: TODO
  • Activity changes: TODO
  • Permissions changes: TODO
  • Changes to paid features or tiers: TODO
  • Other reference documentation changes: TODO
  • Once shipped, requester has been notified

Engineering

  • Feature guide changes: TODO
  • Database schema migrations: TODO
  • Load testing: TODO

ℹ️  Please read this issue carefully and understand it. Pay special attention to UI wireframes, especially "dev notes".

QA

Risk assessment

  • Requires load testing: TODO
  • Risk level: Low / High TODO
  • Risk description: TODO

Manual testing steps

  1. Step 1
  2. Step 2
  3. Step 3

Testing notes

Confirmation

  1. Engineer (@____): Added comment to user story confirming successful completion of QA.
  2. QA (@____): Added comment to user story confirming successful completion of QA.
@mostlikelee mostlikelee added story A user story defining an entire feature ~engineering-initiated Engineering-initiated story, such as a bug, refactor, or contributor experience improvement. labels Sep 25, 2024
@lukeheath lukeheath added :product Product Design department (shows up on 🦢 Drafting board) ~eng-priority Engineering-initiated story that was prioritized. labels Sep 25, 2024
@lukeheath lukeheath assigned sharon-fdm and unassigned lukeheath Sep 25, 2024
@lukeheath
Member

lukeheath commented Sep 25, 2024

@mostlikelee Thanks for filing this. I am prioritizing to the drafting board for estimation and assigning to @sharon-fdm.

Where are you thinking we report failures to? #help-p2?

This is definitely something we need visibility into.

@mostlikelee
Contributor Author

#help-p2 seems reasonable, or possibly Datadog metrics, or a combination of both.

@noahtalerman
Member

noahtalerman commented Oct 4, 2024

Hey @lukeheath, this engineering-initiated story and #22367 didn't make the 3-week drafting => estimation timeline.

Do we want to carry both over to the next design sprint? If so, I think we can leave them on the board.

@lukeheath
Member

@sharon-fdm It looks like this missed estimation. Would you please estimate it async with the team or in a meeting today? That way we'll know if we have room for it in the next sprint.

@sharon-fdm sharon-fdm added the #g-endpoint-ops Endpoint ops product group label Oct 4, 2024
@sharon-fdm
Collaborator

@noahtalerman, @lukeheath, we think this should be done without giving each server access to our Slack. Instead, we can use the one central location that already collects data from servers: Heroku.

The idea is:

  • Servers will send cron failures as statistics.
  • Heroku will count the failures and, if any exist, will use a Slack app to forward them to our Slack (#help_p2)

For estimation:

  • Change the stats frequency from once per day to once every 3 hours, and report cron failures

  • Build slack app (2 points)

  • Heroku to accumulate the errors and use the Slack app to notify
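
For illustration only, here is a minimal sketch of the receiving side of this idea: tally cron failures across incoming statistics reports, and build a Slack message only when at least one failure exists. Everything named here is an assumption and not taken from the Fleet codebase: the `cron_failures` field, the `job` key, the function names, and the example job names are all hypothetical. The sketch builds the message body but does not send it.

```python
from collections import Counter


def count_cron_failures(reports):
    """Tally failures per cron job name across all received reports.

    Each report is assumed (hypothetically) to be a dict that may carry a
    "cron_failures" list of {"job": <name>} entries.
    """
    counts = Counter()
    for report in reports:
        for failure in report.get("cron_failures", []):
            counts[failure["job"]] += 1
    return counts


def build_slack_payload(counts, channel="#help_p2"):
    """Return a message body for Slack, or None when there is nothing to report."""
    if not counts:
        return None
    lines = [f"{job}: {n} failure(s)" for job, n in sorted(counts.items())]
    return {
        "channel": channel,
        "text": "Cron failures reported:\n" + "\n".join(lines),
    }
```

Returning `None` for an empty tally keeps the "only notify if failures exist" rule in one place, so the caller can skip the Slack call entirely when nothing failed.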

@sharon-fdm
Collaborator

Timebox (1 point): @eashaw to check whether we already pick up cron job failures as part of the existing statistics.

@lukeheath
Member

@sharon-fdm Great idea! Using the existing statistics mechanism makes sense.

@RachelElysia
Member

Next steps:

  1. Timebox BE: Confirm what is being sent to Heroku. Eric: Check whether statistics are still sent to Heroku when a cron job fails.
  2. Decide how to define the cron failures sent to Heroku
  3. Heroku will count the failures and, if any exist, will use a Slack app to forward them to our Slack (#help_p2)
  4. Change the stats frequency from once per day to once every 3 hours, and report cron failures
  5. Build Slack app (2 points)
  6. Heroku to accumulate the errors and use the Slack app to notify

Discovery estimation: 1 pt
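
As a starting point for step 2, one possible way to define the cron failures in the statistics report is sketched below. This is an assumption for discussion, not the existing statistics schema: `CronFailure`, `build_report`, `is_report_due`, and all field names are hypothetical. The only value taken from this thread is the proposed 3-hour reporting interval.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Proposed in this thread: report every 3 hours instead of once per day.
REPORT_INTERVAL = timedelta(hours=3)


@dataclass
class CronFailure:
    job: str             # hypothetical cron job name
    error: str           # short error description
    failed_at: datetime  # timezone-aware timestamp of the failure


def is_report_due(last_sent, now):
    """True once a full reporting interval has elapsed since the last report."""
    return now - last_sent >= REPORT_INTERVAL


def build_report(failures, since):
    """Serialize only the failures that happened after the previous report."""
    recent = [f for f in failures if f.failed_at > since]
    return {
        "cron_failures": [
            {"job": f.job, "error": f.error, "failed_at": f.failed_at.isoformat()}
            for f in recent
        ]
    }
```

Filtering on `since` means each failure is reported once, so the receiving side can count failures without deduplicating across reports.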

@lukeheath
Member

I think we want to take a different approach here, since we only want to support our managed cloud instances with this feature. We can set up CloudWatch notifications like the ones we use for 500s, but watching the logs for cron failures instead and triggering a notification on a match. That won't require any code changes, except making sure all cron failures log a standardized message.

I'm going to close this issue in favor of #19930 but feel free to re-open if I'm not understanding the intent properly.
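
To make the "standardized message" requirement concrete, here is a minimal sketch of what a fixed log format and a matcher for it could look like. The message format, the function names, and the example job name are all hypothetical; this thread does not specify a format. In the approach described above the matching would be configured in CloudWatch rather than run in Python, so this code only illustrates why a single fixed format makes the match reliable.

```python
import re

# Hypothetical standardized prefix; the real message format is not decided here.
CRON_FAILURE_PREFIX = "cron job failed"

CRON_FAILURE_RE = re.compile(
    r'cron job failed: job="(?P<job>[^"]*)" err="(?P<err>[^"]*)"'
)


def format_cron_failure(job, err):
    """Render a failure in the one format every cron job would log."""
    # Replace double quotes so the err field stays unambiguous to parse.
    safe_err = str(err).replace('"', "'")
    return f'{CRON_FAILURE_PREFIX}: job="{job}" err="{safe_err}"'


def find_cron_failures(log_lines):
    """Return (job, err) pairs for every line that carries the standard message."""
    hits = []
    for line in log_lines:
        match = CRON_FAILURE_RE.search(line)
        if match:
            hits.append((match.group("job"), match.group("err")))
    return hits
```

Because every failure shares one prefix, a single log pattern can cover all cron jobs, and new jobs are picked up without touching the alert configuration.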

@fleet-release
Contributor

Cron job failures seen,
Clouds clear for engineers' gaze,
Issues fixed unseen.

Development

No branches or pull requests

6 participants