Add visibility to cron failures in production #22368
Comments
@mostlikelee Thanks for filing this. I am prioritizing this to the drafting board for estimation and assigning it to @sharon-fdm. Where are you thinking we should report failures? #help-p2? This is definitely something we need visibility into.
#help-p2 seems reasonable, or possibly Datadog metrics, or a combination of both.
Hey @lukeheath, this engineering-initiated story and #22367 didn't make the 3-week drafting => estimation timeline. Do we want to carry both over to the next design sprint? If so, I think we can leave them on the board.
@sharon-fdm It looks like this missed estimation. Would you please estimate it async with the team or in a meeting today? That way we'll know whether we have room for it in the next sprint.
@noahtalerman, @lukeheath, we think the way to do this is not to give each server access to our Slack, but to use the one central location that already collects data from servers, which is Heroku. The idea is:
For estimation:
Timebox 1 point for @eashaw to check whether we already pick up cron job failures as part of the existing statistics.
@sharon-fdm Great idea! Using the existing statistics mechanism makes sense.
Next steps:
Discovery estimation: 1 pt
I think we want to take a different approach here, since we only want to support our managed cloud instances with this feature. We can set up CloudWatch notifications like the ones we use for 500s, but watch for cron failures in the logs instead and trigger a notification. That won't require any code changes (except making sure all cron failures log a standardized message). I'm going to close this issue in favor of #19930, but feel free to re-open it if I'm not understanding the intent properly.
Cron job failures seen, |
Goal
Context
Cron jobs tend to fail silently for different reasons in different environments, with varying impact. The following issues are examples of recent customer-impacting cron failures where we were not alerted:
#22366
#22364
#21292
Ideally, we would like to have visibility into failures on cloud-hosted AND self-hosted environments.
Changes
Product
Engineering
QA
Risk assessment
Manual testing steps
Testing notes
Confirmation