Job fails with Unexpected Error after ECONNRESET #859
Comments
Redeployment fixed the stuck execution.
If the job is defined with Momo then thinks there is currently a job running and does not try to run it again, as So the real question is why momo failed to update the
And I guess a service restart fixes this, since the scheduler is reregistered with a clean
Unfortunately, we've been unable to reproduce the issue. My suspicion is that for some reason the MongoDB operation itself failed, after which momo's behaviour was just correct. Perhaps we can close this and reopen it, should you experience something similar in the future? Then we can likely get more information via logs or something and go from there.
I guess the question is how we can make momo more stable. If there really was a mongo error that prevented us from updating the Would some kind of timeout for jobs be useful? (Probably the user of momo would have to define that as part of the job, since momo has no chance of guessing what a reasonable execution time is for a job.) After the timeout we assume the job is dead and clean up? Or - if I understand correctly, the application was just logging an error once and then continued to run as if nothing was wrong, right? - would it be better if momo stated more clearly that it got stuck in some error case? Spamming the log with errors so alarms have a good chance to trigger or whatever? :D Would that have been useful?
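To make the timeout idea a bit more concrete, here is a minimal sketch of how stale executions could be detected. Everything in it is hypothetical: `ExecutionRecord`, `findStaleExecutions`, and the per-job timeout map are made-up names for illustration and are not part of momo's API.

```typescript
// Hypothetical shape of a tracked execution; not taken from momo.
interface ExecutionRecord {
  jobName: string;
  startedAt: number; // epoch milliseconds
}

// Returns the executions that have been running longer than the
// user-supplied timeout for their job.
function findStaleExecutions(
  running: ExecutionRecord[],
  timeoutMsByJob: Map<string, number>,
  now: number = Date.now(),
): ExecutionRecord[] {
  return running.filter((execution) => {
    const timeoutMs = timeoutMsByJob.get(execution.jobName);
    // Jobs without a configured timeout are never treated as stale.
    return timeoutMs !== undefined && now - execution.startedAt > timeoutMs;
  });
}
```

A scheduler could run a check like this periodically and clean up whatever it returns. The timeout stays opt-in, so jobs without one behave as before.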
I was thinking about letting momo build a "job profile" - e.g. how long each job runs - and then killing any job that deviates too much from that. Or maybe just spamming logs. The timeout you suggest is also an option, but it is a bit of guesswork on the user's part. Maybe we can start with adding an error log when momo tries to schedule a job but max running is already reached? That's a weird case anyway, I suppose, so we should probably report it somehow.
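The error log for the "max running already reached" case could look roughly like this. It is only a sketch: `canStartExecution` and `LogFn` are hypothetical names, and the real check inside momo may be structured differently.

```typescript
// Hypothetical logger signature; not momo's actual logging interface.
type LogFn = (message: string, data: Record<string, unknown>) => void;

// Decides whether another execution may start and reports the
// skipped run instead of silently doing nothing.
function canStartExecution(
  jobName: string,
  running: number,
  maxRunning: number,
  logError: LogFn,
): boolean {
  if (running >= maxRunning) {
    logError("job not started because max running is already reached", {
      name: jobName,
      running,
      maxRunning,
    });
    return false;
  }
  return true;
}
```

With a log entry on every skipped run, a job scheduled every 15 minutes would produce a steady stream of errors, which gives alarms a good chance to trigger.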
We spent quite a lot of time looking into this. From our perspective this was caused by a mongo issue that was correctly reported in the logging. We don't feel it's momo's responsibility to mitigate such issues. Nevertheless, we'll look into improving our logging with our next release to make debugging this easier in the future.
Thank you for looking into this :)
We encountered the same issue with one of our services. Our monitoring system can easily detect the problem; however, it is unfortunate that it currently requires human intervention, such as an application restart, to resolve the issue. A potential solution could involve developing a watchdog to monitor job executions and force a restart of the scheduler if a job is not executing as expected. Nevertheless, it seems counterintuitive to build such infrastructure on top of Momo that should inherently handle this. Would it be possible to implement a retry mechanism for updates of job executions in case they fail? Alternatively, consider introducing an optional timeout parameter, after which a job would be considered non-running.
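For illustration, a retry around the failing update could be sketched as below. `withRetry` is a hypothetical helper, not something momo provides, and the attempt count and delays are arbitrary assumptions.

```typescript
// Retries an async operation with exponential backoff and rethrows
// the last error once all attempts have failed.
async function withRetry<T>(
  operation: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 100,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (attempt < maxAttempts) {
        // Wait baseDelayMs, then 2x, 4x, ... before the next attempt.
        const delayMs = baseDelayMs * 2 ** (attempt - 1);
        await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError;
}
```

The database update that records the end of an execution would be passed in as `operation`. A transient error such as `read ECONNRESET` would then only leave the job stuck if every attempt fails.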
Signed-off-by: Yani Kolev <yani.kolev@tngtech.com> #859
Hello,
In our validation environment, a job failed with the message "an unexpected error occurred while executing job" with payload
{ error: "read ECONNRESET", name: <redacted>, type: "executing job failed"}
Since then, the job has not been executed anymore despite being scheduled for every 15 minutes.
Currently, I cannot provide much more information as I first have to adjust the service's log level, but will update.