
Job fails with Unexpected Error after ECONNRESET #859

Open
schnuri-schnuri opened this issue Oct 9, 2024 · 9 comments

Comments

@schnuri-schnuri commented Oct 9, 2024

Hello,

In our validation environment, a job failed with the message "an unexpected error occurred while executing job" and the payload
{ error: "read ECONNRESET", name: <redacted>, type: "executing job failed" }

Since then, the job has not been executed again, despite being scheduled to run every 15 minutes.

Currently I cannot provide much more information, as I first have to adjust the service's log level, but I will update this issue.

@schnuri-schnuri (Author)

Redeployment fixed the stuck execution.

@YaniKolev (Contributor)

If the job is defined with maxRunning = 1 and, for some reason, the scheduler's executions field in the DB is not decremented (inc: -1) for that particular job after a failed run, this behaviour is actually correct.

Momo then thinks there is currently a job running and does not try to run it again, as maxRunning is reached. I was able to reproduce this.

So the real question is why momo failed to update the executions.
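
For illustration, here is a minimal sketch (not momo's actual code, and with made-up collection and field names) of the kind of bookkeeping described above, using the Node.js mongodb driver. If the decrement never reaches the DB, e.g. because the connection was reset, the stored count stays at maxRunning and the job is never started again:

```ts
import { MongoClient } from 'mongodb';

// Hypothetical helper: decrement the execution counter for a job after a run.
// If this update fails (ECONNRESET, timeout, ...), the counter stays stuck.
async function finishExecution(client: MongoClient, scheduleId: string, jobName: string): Promise<void> {
  await client
    .db('momo')
    .collection('schedules')
    .updateOne(
      { scheduleId },
      { $inc: { [`executions.${jobName}`]: -1 } }
    );
}
```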

@YaniKolev (Contributor)

And I guess a service restart fixes this, since the scheduler is reregistered with a clean executions field.

@YaniKolev (Contributor)

Unfortunately, we've been unable to reproduce the issue. My suspicion is that for some reason the MongoDB operation itself failed, after which momo's behaviour was just correct. Perhaps we can close this and reopen it, should you experience something similar in the future? Then we can likely get more information via logs or something and go from there.

@weissu42 (Member) commented Oct 18, 2024

I guess the question is how we can make momo more stable. If there really was a mongo error that prevented us from updating the executions, momo got stuck believing that the job is running forever and never needs to be started again. How could we have recovered from that?

Would some kind of timeout for jobs be useful? The user of momo would probably have to define it as part of the job, since momo has no chance of guessing what a reasonable execution time for a job is. After the timeout we would assume the job is dead and clean up.

Or - if I understand correctly, the application just logged an error once and then continued to run as if nothing was wrong, right? - would it be better if momo stated more clearly that it got stuck in some error case? Spamming the log with errors so alarms have a good chance to trigger, or whatever? :D Would that have been useful?
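
To make the timeout idea above concrete, a minimal sketch assuming a hypothetical per-job maxDurationMs supplied by the user; this is not an existing momo option:

```ts
// Hypothetical job shape: the user supplies maxDurationMs, since momo
// cannot guess a reasonable execution time on its own.
interface TimedJob {
  name: string;
  maxDurationMs: number;
  handle: () => Promise<void>;
}

// Race the handler against the timeout; if the timeout wins, the scheduler
// could treat the execution as dead and clean up its executions counter
// (the handler itself cannot be forcibly killed in Node.js).
async function runWithTimeout(job: TimedJob): Promise<void> {
  let timer: NodeJS.Timeout | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`job ${job.name} exceeded ${job.maxDurationMs} ms`)),
      job.maxDurationMs
    );
  });
  try {
    await Promise.race([job.handle(), timeout]);
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}
```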

@YaniKolev (Contributor)

I was thinking about letting momo build a "job profile" - e.g. how long each job usually runs - and then killing any job that deviates too much from that. Or maybe just spamming logs.

The timeout you suggest is also an option, but it is a bit of guesswork on the user's part. Maybe we can start with adding an error log when momo tries to schedule a job but maxRunning is already reached? That's a weird case anyway, I suppose, so we should probably report it somehow.
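
A rough sketch of what such a log could look like, with hypothetical names, just to show the intent of reporting the stuck state loudly instead of silently skipping the run:

```ts
// Hypothetical check before starting a run: if maxRunning is already reached,
// report it as an error instead of skipping the execution silently.
function maybeStartJob(
  running: number,
  maxRunning: number,
  jobName: string,
  logger: { error: (message: string) => void }
): boolean {
  if (maxRunning > 0 && running >= maxRunning) {
    logger.error(
      `job ${jobName} not started: ${running}/${maxRunning} executions already registered - if this persists, the counter may be stale`
    );
    return false;
  }
  return true;
}
```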

@YaniKolev (Contributor)

We spent quite a lot of time looking into this. From our perspective, this was caused by a mongo issue that was correctly reported in the logging.

We don't feel it's momo's responsibility to mitigate such issues. Nevertheless, we'll look into improving our logging with our next release to make debugging this easier in the future.

@schnuri-schnuri (Author)

Thank you for looking into this :)

@set3mu commented Jan 31, 2025

We encountered the same issue with one of our services. Our monitoring system can easily detect the problem; however, it is unfortunate that it currently requires human intervention, such as an application restart, to resolve the issue.

A potential solution could involve developing a watchdog that monitors job executions and forces a restart of the scheduler if a job is not executing as expected. Nevertheless, it seems counterintuitive to build such infrastructure on top of Momo, which should inherently handle this itself.

Would it be possible to implement a retry mechanism for updates of job executions in case they fail? Alternatively, consider introducing an optional timeout parameter, after which a job would be considered non-running.
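
For illustration, a generic retry helper of the kind the first suggestion implies; this is a sketch with hypothetical names, not momo's API:

```ts
// Retry an async operation a few times with linear backoff, so a transient
// error (e.g. ECONNRESET) does not leave a stale executions counter behind.
async function withRetry<T>(operation: () => Promise<T>, attempts = 3, delayMs = 500): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (attempt < attempts) {
        await new Promise((resolve) => setTimeout(resolve, delayMs * attempt));
      }
    }
  }
  throw lastError;
}

// Hypothetical usage: await withRetry(() => decrementExecutions(jobName));
```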

@YaniKolev reopened this Feb 7, 2025
YaniKolev added a commit that referenced this issue Feb 7, 2025