Node Crashes when Proving with too Little RAM #305

Closed
2 of 3 tasks
aszepieniec opened this issue Dec 27, 2024 · 5 comments

Labels
mainnet: issues that must be resolved before launching mainnet

Comments


aszepieniec commented Dec 27, 2024

A gentleman on Telegram reports that his node crashes from time to time. His machine was configured to mine (composing+guessing), but he mentioned that he hadn't succeeded in mining any blocks yet. This is the stack trace from the last crash before we (think we) resolved the issue:

thread 'tokio-runtime-worker' panicked at /root/neptune-core/src/mine_loop.rs:573:26:
unexpected error while composing: external proving process failed

Caused by:
    proving process did not return any exit code

Stack backtrace:
   0: anyhow::error::<impl core::convert::From<E> for anyhow::Error>::from
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: neptune_cash::mine_loop::mine::{{closure}}
   3: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
   4: tokio::runtime::task::harness::Harness<T,S>::poll
note: Some details are omitted, run with RUST_BACKTRACE=full for a verbose backtrace.
Aborted (core dumped)

My reading was that the external (out-of-process) prover is consuming too much memory and is killed by the operating system, resulting in no return code. Running with the environment variable TVM_LDE_TRACE="no_cache" set seems to have fixed the issue. He even mentioned that he mined some blocks after this.

I say "resolved" but I really mean "found a workaround to avoid crashing the node". Properly resolving this issue would be:

  • (In triton-vm) automatically switch to memory-efficient code path when memory is insufficient (Issue 331).
  • (In neptune-core) catch the absence of return code gracefully and, if not already set, set TVM_LDE_TRACE="no_cache" and retry.
  • (In neptune-core) inform user with informative log messages.
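
A rough sketch of the second item, assuming a hypothetical triton-vm-prover binary and claim-file argument (the real invocation lives in neptune-core's prover job): spawn the external prover, and if it terminates without an exit code, retry once with the memory-saving environment variable set.

    use std::process::Command;

    /// Run the external prover once; if it dies without an exit code
    /// (e.g. killed by the OOM killer), retry a single time with the
    /// memory-saving environment variable set. The binary name and the
    /// claim-file argument are placeholders.
    fn prove_with_fallback(claim_file: &str) -> std::io::Result<bool> {
        let first = Command::new("triton-vm-prover").arg(claim_file).status()?;
        if first.success() {
            return Ok(true);
        }
        if first.code().is_none() {
            // No exit code means the process was terminated by a signal.
            let retry = Command::new("triton-vm-prover")
                .arg(claim_file)
                .env("TVM_LDE_TRACE", "no_cache")
                .status()?;
            return Ok(retry.success());
        }
        Ok(false)
    }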

dan-da commented Dec 31, 2024

(In neptune-core) catch the absence of return code gracefully and, if not already set, set TVM_LDE_TRACE="no_cache" and retry.

We already catch the absence of a return code gracefully in the proving job; that's why there is a message about it. However, we can never be certain of the reason for it. On unix, a missing exit code just means that the process exited because of a signal: it could have been killed by the user, by the out-of-memory killer, etc. On Windows it may mean something else. There isn't a standard way to detect whether a process was killed due to out-of-memory or by the user.
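
Not a fix, but on unix the terminating signal can at least be surfaced in the log; a small sketch (the function name is illustrative, and as noted it cannot distinguish the OOM killer from a manual kill -9, since both deliver SIGKILL):

    fn describe_exit(status: std::process::ExitStatus) -> String {
        if let Some(code) = status.code() {
            return format!("prover exited with code {code}");
        }
        #[cfg(unix)]
        {
            use std::os::unix::process::ExitStatusExt;
            if let Some(sig) = status.signal() {
                // Signal 9 (SIGKILL) is what the Linux OOM killer sends,
                // but a user-issued `kill -9` looks exactly the same.
                return format!("prover was killed by signal {sig}");
            }
        }
        "prover exited without an exit code".to_string()
    }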

Given that we can't be certain of the cause, retrying may well result in the same behavior and create an infinite loop. True, we could retry up to n times, but that really just complicates things and is likely to waste CPU time. I don't think retrying is the correct solution here.

It might be that the proving process itself could catch the signal telling it that it is being killed due to out-of-memory and return an exit code that neptune-core could interpret. But again, this differs by operating system, and I don't believe there is any way to catch a kill -9 on unix(es), for example. It's not clean.

Alternate proposal: the proving process could simply check available RAM when it starts and set the necessary environment variable if it needs to. Or neptune-core could do so before invoking the prover. Either seems cleaner. If proving still fails, so be it... no retry.
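
As a sketch of that check (Linux-only, reading MemAvailable from /proc/meminfo; the 64 GiB threshold is a placeholder, not a measured requirement):

    use std::fs;

    /// Decide whether to enable the memory-saving code path before the
    /// prover is spawned. Returns false when the check cannot be performed.
    fn should_disable_lde_cache() -> bool {
        const THRESHOLD_KIB: u64 = 64 * 1024 * 1024; // 64 GiB, placeholder
        let Ok(meminfo) = fs::read_to_string("/proc/meminfo") else {
            return false; // not Linux, or /proc unavailable: keep defaults
        };
        meminfo
            .lines()
            .find_map(|line| {
                line.strip_prefix("MemAvailable:")
                    .and_then(|rest| rest.split_whitespace().next())
                    .and_then(|kib| kib.parse::<u64>().ok())
            })
            .map(|available_kib| available_kib < THRESHOLD_KIB)
            .unwrap_or(false)
    }

Whichever process performs the check would then set TVM_LDE_TRACE="no_cache" on the prover (e.g. via Command::env) when this returns true.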

(In neptune-core) inform user with informative log messages.

Perhaps the logging can be improved.

aszepieniec commented

We already catch the absence of return code gracefully in the proving job. That's why there is a message about it.

Aha. So it was not clear to me that the message "proving process did not return any exit code" was in fact generated by neptune-core and not by the operating system. In hindsight, the word "proving" is a dead giveaway.

The current response to this error case is a crash-and-dump-stack-trace, which I hope no one thinks of as a graceful catch. If I read your message correctly, you are explaining that the proof job returns an error code (as it should) but the invoker, probably the mine loop or some function in there responsible for composing, fails to catch it gracefully, leading to the crash-and-dump-stack-trace. Would it be accurate to rephrase the to-do as percolating the graceful catch upward?

I don't think retrying is the correct solution here.

That's fair. As long as the error messages are helpful, the user themselves will retry. No need to automate the process.

Alternate proposal: The proving process could simply check available RAM when it starts, and set the necessary environment if it needs to.

The problem is that we don't really know the correct formula for the requisite amount of RAM to generate a proof. It depends on a variety of factors, and we know only more or less qualitatively how they affect RAM consumption. The best we have at this point is a heuristic.


dan-da commented Jan 2, 2025

Aha. So it was not clear to me that the message "proving process did not return any exit code" was in fact generated by neptune-core and not by the operating system. In hindsight, the word "proving" is a dead giveaway.

The current response to this error case is a crash-and-dump-stack-trace, which I hope no-one thinks of as a graceful catch.

Right, but I'm actually happy it is exiting so the error gets noticed, rather than the error being swallowed as it likely would have been before. See #239.

If I read your message correctly, you are explaining that the proof job returns an error code (as it should) but the invoker, probably the mine-loop or a function in there that's responsible for composing, fails to catch it gracefully leading to a crash-and-dump-stack-trace. Would it be accurate to rephrase the todo as about percolating the graceful catch?

Yes, that's correct. The proving job error type has a specific variant for that case, and the composing caller doesn't handle it, but rather turns it into a panic, which tokio would simply swallow if we weren't using panic = "abort" right now. Here is the handling code:

            Ok(Err(e)) = &mut composer_task => {
                match e.downcast_ref::<prover_job::ProverJobError>() {
                    Some(prover_job::ProverJobError::ProofComplexityLimitExceeded{..} ) => {
                        pause_mine = true;
                        tracing::warn!("exceeded proof complexity limit.  mining paused.  details: {}", e.to_string())
                    },
                    // tbd: perhaps any composer error should pause mining rather than panic.
                    _ => panic!("unexpected error while composing: {:?}", e),
                }
            },
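
For what it's worth, the tbd comment in that snippet could be acted on roughly like this (a sketch of a possible change, not the actual handler), pausing mining and logging for any other composer error instead of panicking:

            Ok(Err(e)) = &mut composer_task => {
                match e.downcast_ref::<prover_job::ProverJobError>() {
                    Some(prover_job::ProverJobError::ProofComplexityLimitExceeded{..} ) => {
                        pause_mine = true;
                        tracing::warn!("exceeded proof complexity limit.  mining paused.  details: {}", e.to_string())
                    },
                    // sketch: pause mining and log instead of panicking, so a
                    // prover process killed by the OS no longer crashes the node.
                    _ => {
                        pause_mine = true;
                        tracing::error!("unexpected error while composing.  mining paused.  details: {}", e.to_string())
                    },
                }
            },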

I don't think retrying is the correct solution here.

That's fair. As long as the error messages are helpful, the user themselves will retry. No need to automate the process.

Yeah, open to suggestions for a better log message. Just keep in mind the fundamental problem: we do not get any code or other information indicating why the process exited without an exit code, there can be different reasons, and it also varies by OS. So the message "proving process did not return any exit code" really conveys all we know at that point.

If we were to change the message to, e.g., "ran out of memory", then the log would be lying whenever the process was killed because the user did a kill -9 on it, or for some other reason.

There may be something better we can do... I'm just trying to convey that it is not as simple as it may first appear.

If you are interested in details, you can check out prover_job::VmProcessError, which has a variant for every possible error when invoking the external prover process. The calling method is ProverJob::prove_out_of_process().

Alternate proposal: The proving process could simply check available RAM when it starts, and set the necessary environment if it needs to.

The problem is that we don't really know the correct formula for the requisite amount of RAM to generate proof. It depends on a variety of factors, and we know more or less qualitatively how they affect RAM consumption. The best we have at this point is a heuristic.

Fair enough. Still, if we could determine a safe lower bound, that could be used as the limit; it seems like a reasonable solution for now...


dan-da commented Jan 13, 2025

Adding the mainnet tag, as we shouldn't crash the node. Most people do not have enough RAM to prove, and at least some will try anyway. They should get a decent error message.

dan-da added the mainnet label (issues that must be resolved before launching mainnet) on Jan 13, 2025
aszepieniec commented

Done in ea6188e and 2f40300. See also #310.
