Node Crashes when Proving with too Little RAM #305
We already catch the absence of a return code gracefully in the proving job. That's why there is a message about it. However, we can never be certain of the reason for it. On unix, it just means that the process exited because of a signal, e.g. killed by the user, out-of-mem, etc. On windows it may mean something else. There isn't a standard way to detect whether a process was killed due to out-of-mem or by the user.

Given we can't be certain of the cause, retrying may well result in the same behavior and create an infinite loop. True, we could retry up to n times, but that really just complicates things and is likely to waste cpu/time. I don't think retrying is the correct solution here.

It might be that the proving process itself could catch the signal telling it that it is being killed due to out-of-mem and return an exit code that neptune-core could interpret. But again, this will differ by operating system, and I don't believe there's any way to catch a kill -9 on unix(es), for example. It's not clean.

Alternate proposal: the proving process could simply check available RAM when it starts and set the necessary environment if it needs to. Or neptune-core could do so before invoking the prover. Either seems cleaner. If proving fails, so be it... no retry.

Perhaps logging can be improved.
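For concreteness, here is a minimal sketch of that alternate proposal, assuming the `sysinfo` crate for reading available memory and a purely illustrative 16 GiB threshold; neither the crate choice, the threshold, nor the function names are from neptune-core:

```rust
use std::process::Command;
use sysinfo::System;

// Illustrative threshold only; as discussed below, the real RAM requirement is not known exactly.
const CACHED_TRACE_RAM_BYTES: u64 = 16 * 1024 * 1024 * 1024;

fn spawn_prover(prover_path: &str) -> std::io::Result<std::process::Child> {
    let mut sys = System::new();
    sys.refresh_memory();

    let mut cmd = Command::new(prover_path);
    // If available memory (bytes, in recent sysinfo releases) looks too small for the
    // cached LDE trace, ask Triton VM for the memory-efficient path up front instead
    // of letting the OS kill the prover mid-proof.
    if sys.available_memory() < CACHED_TRACE_RAM_BYTES {
        cmd.env("TVM_LDE_TRACE", "no_cache");
    }
    cmd.spawn()
}
```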
Aha. So it was not clear to me that the message "proving process did not return any exit code" was in fact generated by the proving job. The current response to this error case is a crash-and-dump-stack-trace, which I hope no one thinks of as a graceful catch. If I read your message correctly, you are explaining that the proof job returns an error code (as it should) but the invoker, probably the mine-loop or a function in there that is responsible for composing, fails to catch it gracefully, leading to a crash-and-dump-stack-trace. Would it be accurate to rephrase the todo as being about percolating the graceful catch?
That's fair. As long as the error messages are helpful, the user themselves will retry. No need to automate the process.
The problem is that we don't really know the correct formula for the requisite amount of RAM to generate a proof. It depends on a variety of factors, and we know only more or less qualitatively how they affect RAM consumption. The best we have at this point is a heuristic.
Right. But I'm actually happy it is exiting so the error gets noticed, rather than swallowing the error as it likely would have before. See #239.
Yes, that's correct. The proving job error struct has a specific error variant for that case, and the composing caller doesn't handle it, but rather turns it into a panic, which tokio would simply swallow if we weren't using panic = "abort" right now. Here is the handling code:
Yeah, open to suggestions for a better log message. Just keep in mind the fundamental problem: we do not get any code or information indicating why the process exited without an exit code, there can be different reasons, and it also varies by OS. So the message "proving process did not return any exit code" really conveys all we know at that point. If we were to change the message to e.g. "ran out of memory", then the log would be lying whenever the process was killed because the user did a kill -9 on it, or for some other reason. There may be something better we can do... I'm just trying to convey that it is not as simple as it may first appear.

If you are interested in the details, check out prover_job::VmProcessError, which has a variant for every possible error when invoking the external prover process. The calling method is ProverJob::prove_out_of_process().
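By way of illustration only, here is a sketch of what percolating the graceful catch could look like at the composing call site. VmProcessError is the name mentioned above, but the NoExitCode variant, the stub types, and the log wording are assumptions, not neptune-core's actual code:

```rust
// Illustrative stand-ins for neptune-core's real types in prover_job.
enum VmProcessError {
    NoExitCode,     // assumed name for the "exited without an exit code" variant
    Other(String),  // everything else, collapsed for brevity
}

type Proof = Vec<u8>; // placeholder

// Instead of turning the error into a panic, the composing caller could log and bail.
// Uses the `tracing` crate for logging.
fn handle_prover_result(result: Result<Proof, VmProcessError>) -> Option<Proof> {
    match result {
        Ok(proof) => Some(proof),
        Err(VmProcessError::NoExitCode) => {
            tracing::error!(
                "prover exited without an exit code; it was likely killed by a signal \
                 (possibly the OOM killer). Try TVM_LDE_TRACE=no_cache or add RAM."
            );
            None
        }
        Err(VmProcessError::Other(msg)) => {
            tracing::error!("proving failed: {msg}");
            None
        }
    }
}
```

Whether the caller bails, falls back, or surfaces the error higher up is exactly the "percolating" question; the point is only that the existing variant need not become a panic.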
Fair enough. Still, if we could determine a safe lower bound, that could be used as the limit, which seems like a reasonable solution for now...
Adding the mainnet tag, as we shouldn't crash the node. Most people do not have enough RAM to prove, and at least some will try anyway. They should get a decent error message.
A gentleman on Telegram reports that his node crashes from time to time. His machine was configured to mine (composing + guessing), but he mentioned that he hadn't succeeded in mining any blocks yet. This is a stack trace from the last crash before we (think we) resolved the issue:
My reading was that the out-of-domain proving process is consuming too much memory and is killed by the operating system, resulting in no return code. Running with the flag `TVM_LDE_TRACE="no_cache"` seems to have fixed the issue. He even mentioned that he mined some blocks after this.

I say "resolved" but I really mean "found a workaround to avoid crashing the node". Properly resolving this issue would be:

- (`triton-vm`) automatically switch to a memory-efficient code path when memory is insufficient (Issue 331).
- (`neptune-core`) catch the absence of a return code gracefully and, if not already set, set `TVM_LDE_TRACE="no_cache"` and retry.
- (`neptune-core`) inform the user with informative log messages.
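As a rough illustration of the second item, here is a sketch using nothing but the standard library; the function name, arguments, and log wording are illustrative, not neptune-core's actual code:

```rust
use std::process::Command;

// Sketch of resolution item 2: if the prover exits without an exit code (e.g. it was
// killed by the OOM killer), set TVM_LDE_TRACE=no_cache and retry exactly once.
fn prove_with_fallback(prover_path: &str, args: &[&str]) -> std::io::Result<bool> {
    let status = Command::new(prover_path).args(args).status()?;
    if status.success() {
        return Ok(true);
    }
    if status.code().is_none() {
        // No exit code: on unix this means the process was terminated by a signal.
        #[cfg(unix)]
        {
            use std::os::unix::process::ExitStatusExt;
            eprintln!(
                "prover killed by signal {:?}; retrying with TVM_LDE_TRACE=no_cache",
                status.signal()
            );
        }
        let retry = Command::new(prover_path)
            .args(args)
            .env("TVM_LDE_TRACE", "no_cache")
            .status()?;
        return Ok(retry.success());
    }
    Ok(false)
}
```

Retrying only once, and only after changing the environment, sidesteps the infinite-loop concern raised in the comments above.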