
Job fails with Unexpected Error after ECONNRESET #859

Open
schnuri-schnuri opened this issue Oct 9, 2024 · 9 comments

Comments

@schnuri-schnuri commented Oct 9, 2024

Hello,

In our validation environment, a job failed with the message "an unexpected error occurred while executing job" and the payload
{ error: "read ECONNRESET", name: <redacted>, type: "executing job failed" }

Since then, the job has not been executed again, despite being scheduled to run every 15 minutes.

Currently I cannot provide much more information, as I first have to adjust the service's log level, but I will update this issue.

@schnuri-schnuri (Author)

Redeployment fixed the stuck execution.

@YaniKolev (Contributor)

If the job is defined with maxRunning = 1 and, for some reason, the scheduler's executions field in the DB is not decremented (inc: -1) for that particular job after a failed run, this behaviour is actually correct.

Momo then thinks there is currently a job running and does not try to run it again, as maxRunning is reached. I was able to reproduce this.

So the real question is why momo failed to update the executions.
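
For illustration, here is a minimal sketch (not momo's actual code, and with made-up collection and field names) of the kind of bookkeeping described above, using the Node.js mongodb driver. If the decrement never reaches the DB, e.g. because the connection was reset, the stored count stays at maxRunning and the job is never started again:

```ts
import { MongoClient } from 'mongodb';

// Hypothetical helper: decrement the execution counter for a job after a run.
// If this update fails (ECONNRESET, timeout, ...), the counter stays stuck.
async function finishExecution(client: MongoClient, scheduleId: string, jobName: string): Promise<void> {
  await client
    .db('momo')
    .collection('schedules')
    .updateOne(
      { scheduleId },
      { $inc: { [`executions.${jobName}`]: -1 } }
    );
}
```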

@YaniKolev (Contributor)

And I guess a service restart fixes this, since the scheduler is reregistered with a clean executions field.

@YaniKolev (Contributor)

Unfortunately, we've been unable to reproduce the issue. My suspicion is that for some reason the MongoDB operation itself failed, after which momo's behaviour was just correct. Perhaps we can close this and reopen it, should you experience something similar in the future? Then we can likely get more information via logs or something and go from there.

@weissu42 (Member) commented Oct 18, 2024

I guess the question is how we can make momo more stable. If there really was a mongo error that prevented us from updating the executions, momo got stuck believing that the job is running forever and never needs to be started again. How could we have recovered from that?

Would some kind of timeout for jobs be useful? The user of momo would probably have to define it as part of the job, since momo has no chance of guessing what a reasonable execution time for a job is. After the timeout we would assume the job is dead and clean up.

Or - if I understand correctly, the application just logged an error once and then continued to run as if nothing was wrong, right? - would it be better if momo stated more clearly that it got stuck in some error case? Spamming the log with errors so alarms have a good chance to trigger, or whatever? :D Would that have been useful?
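
To make the timeout idea above concrete, a minimal sketch assuming a hypothetical per-job maxDurationMs supplied by the user; this is not an existing momo option:

```ts
// Hypothetical job shape: the user supplies maxDurationMs, since momo
// cannot guess a reasonable execution time on its own.
interface TimedJob {
  name: string;
  maxDurationMs: number;
  handle: () => Promise<void>;
}

// Race the handler against the timeout; if the timeout wins, the scheduler
// could treat the execution as dead and clean up its executions counter
// (the handler itself cannot be forcibly killed in Node.js).
async function runWithTimeout(job: TimedJob): Promise<void> {
  let timer: NodeJS.Timeout | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`job ${job.name} exceeded ${job.maxDurationMs} ms`)),
      job.maxDurationMs
    );
  });
  try {
    await Promise.race([job.handle(), timeout]);
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}
```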

@YaniKolev (Contributor)

I was thinking about letting momo build a "job profile" - e.g. how long each job usually runs - and then killing any job that deviates too much from that. Or maybe just spamming logs.

The timeout you suggest is also an option, but it is a bit of guesswork on the user's part. Maybe we can start with adding an error log when momo tries to schedule a job but maxRunning is already reached? That's a weird case anyway, I suppose, so we should probably report it somehow.
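
A rough sketch of what such a log could look like, with hypothetical names, just to show the intent of reporting the stuck state loudly instead of silently skipping the run:

```ts
// Hypothetical check before starting a run: if maxRunning is already reached,
// report it as an error instead of skipping the execution silently.
function maybeStartJob(
  running: number,
  maxRunning: number,
  jobName: string,
  logger: { error: (message: string) => void }
): boolean {
  if (maxRunning > 0 && running >= maxRunning) {
    logger.error(
      `job ${jobName} not started: ${running}/${maxRunning} executions already registered - if this persists, the counter may be stale`
    );
    return false;
  }
  return true;
}
```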

@YaniKolev (Contributor)

We spent quite a lot of time looking into this. From our perspective, this was caused by a mongo issue that was correctly reported in the logging.

We don't feel it's momo's responsibility to mitigate such issues. Nevertheless, we'll look into improving our logging with our next release to make debugging this easier in the future.

@schnuri-schnuri (Author)

Thank you for looking into this :)

@set3mu commented Jan 31, 2025

We encountered the same issue with one of our services. Our monitoring system can easily detect the problem; however, it is unfortunate that it currently requires human intervention, such as an application restart, to resolve the issue.

A potential solution could involve developing a watchdog that monitors job executions and forces a restart of the scheduler if a job is not executing as expected. Nevertheless, it seems counterintuitive to build such infrastructure on top of Momo, which should inherently handle this itself.

Would it be possible to implement a retry mechanism for updates of job executions in case they fail? Alternatively, consider introducing an optional timeout parameter, after which a job would be considered non-running.
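
For illustration, a generic retry helper of the kind the first suggestion implies; this is a sketch with hypothetical names, not momo's API:

```ts
// Retry an async operation a few times with linear backoff, so a transient
// error (e.g. ECONNRESET) does not leave a stale executions counter behind.
async function withRetry<T>(operation: () => Promise<T>, attempts = 3, delayMs = 500): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (attempt < attempts) {
        await new Promise((resolve) => setTimeout(resolve, delayMs * attempt));
      }
    }
  }
  throw lastError;
}

// Hypothetical usage: await withRetry(() => decrementExecutions(jobName));
```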

@YaniKolev reopened this Feb 7, 2025
YaniKolev added a commit that referenced this issue Feb 7, 2025