error: failed to send proof when proving blocks in real mode #389

Closed
atanmarko opened this issue Jul 12, 2024 · 17 comments
Assignees
Labels
bug Something isn't working crate: zero_bin Anything related to the zero-bin subcrates.

Comments

@atanmarko
Contributor

Proving blocks 20278565..20278585.json in real mode fails with the error "failed to send proof".

@atanmarko atanmarko self-assigned this Jul 12, 2024
@Nashtare Nashtare added bug Something isn't working crate: zero_bin Anything related to the zero-bin subcrates. labels Jul 12, 2024
@Nashtare Nashtare added this to the Testing and Validation milestone Jul 12, 2024
@atanmarko
Contributor Author

atanmarko commented Jul 30, 2024

@Nashtare Running the prover with either the native tracer in real mode or the zero tracer in test-only mode produces a kernel panic at block 20278567, apparently in transaction 120. I'm not sure whether this is related to the "failed to send proof" error that was observed.

2024-07-30T11:09:13.093232Z  WARN p_gen: evm_arithmetization::witness::transition: Kernel panic at panic     id="b20278567 - 120"
2024-07-30T11:09:13.096186Z  INFO p_gen: ops: txn proof (db52d67c7a21afa801b7dc3a581388ff49876f87dd5861b89a60b4f9bce25e96) took 362.877766ms id="b20278567 - 120"
2024-07-30T11:09:13.096319Z ERROR paladin::runtime: execution error: Fatal { err: KernelPanic in kernel at pc=panic, stack=[911], memory=[72, 56, 177, 6, 252, 233, 100, 123, 223, 30, 120, 119, 191, 115, 206, 139, 11, 173, 95, 151, 250, 207, 155, 221, 95, 181, 140, 111, 74, 13, 248, 187, 229, 36, 142, 43, 62, 104, 48, 207, 2, 149, 95, 200, 251, 110, 37, 116, 189, 137, 18, 152, 2, 238, 129, 133, 134, 239, 221, 195, 65, 10, 182, 29]

Stack backtrace:
   0: anyhow::error::<impl anyhow::Error>::msg
   1: evm_arithmetization::generation::state::State::run_cpu
   2: evm_arithmetization::prover::testing::simulate_execution
   3: <ops::TxProof as paladin::operation::Operation>::execute
   4: <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll
   5: tokio::runtime::task::raw::poll
   6: tokio::runtime::blocking::pool::Inner::run
   7: std::sys::backtrace::__rust_begin_short_backtrace
   8: core::ops::function::FnOnce::call_once{{vtable.shim}}
   9: std::sys::pal::unix::thread::Thread::new::thread_start
  10: <unknown>
  11: <unknown>, strategy: Terminate } routing_key=005d1ba83ee5473490506f3c4cb2a66c
2024-07-30T11:09:13.100555Z  WARN paladin::runtime: received IPC termination signal routing_key=005d1ba83ee5473490506f3c4cb2a66c
2024-07-30T11:09:13.100589Z  WARN paladin::runtime: task cancelled via IPC sigterm routing_key=005d1ba83ee5473490506f3c4cb2a66c
2024-07-30T11:09:13.102730Z  WARN paladin::runtime: received IPC termination signal routing_key=005d1ba83ee5473490506f3c4cb2a66c
2024-07-30T11:09:13.102750Z  WARN paladin::runtime: task cancelled via IPC sigterm routing_key=005d1ba83ee5473490506f3c4cb2a66c
2024-07-30T11:09:13.207399Z  INFO p_gen: evm_arithmetization::generation::state: CPU halted after 2469795 cycles     id="b20278567 - 103"

@atanmarko
Contributor Author

Attached is the block that fails
b20278567.json

@atanmarko
Contributor Author

CC @LindaGuiga

@atanmarko atanmarko assigned Nashtare and unassigned atanmarko Jul 30, 2024
@Nashtare
Collaborator

@atanmarko what's the status on this? I saw you assigned me back from when I was in Mountain View but I don't remember the context. I saw you pinged @LindaGuiga, did it get resolved?

@atanmarko
Contributor Author

@Nashtare When I ran into the kernel panic at block 20278567 mentioned above, I left this ticket for the crypto team to investigate further.

@Nashtare
Collaborator

Ok. Do you know if @LindaGuiga looked at it? Otherwise I'll pick it up

@Nashtare
Collaborator

FWIW, this seems to be addressed by #480. The entire block is passing fine now. I'll close this and reopen it if we actually encounter a "failed to send proof" issue.

@Nashtare Nashtare reopened this Aug 20, 2024
@Nashtare
Collaborator

Nashtare commented Aug 20, 2024

@atanmarko Reopening, as it still seems to occur on some occasions.
I have a fairly large payload (245 blocks at once) that I am sending to my leader in stdio mode, with the test-only feature activated. It goes through most of the blocks, but still ends up hitting a "failed to send proof" error.

RUST_LOG=info RUST_MIN_STACK=33554432 ./target/release/leader --runtime in-memory -b 15 -m 19 -n 8 --save-inputs-on-error stdio < 445..689.json &> 445..689.log

Beforehand, I set ulimit -n 65536.
I'm not sure whether I should just bump the file descriptor limit or whether this points to another underlying issue.

@Nashtare
Collaborator

The payload:
445..689.json.xz.zip

@BGluth
Contributor

BGluth commented Aug 20, 2024

but still ends up hitting a failed to send proof.

Is there anything else in the error? If this is all there is, it sounds like the rest of it is getting lost somewhere.

@atanmarko
Contributor Author

@BGluth This is the tokio channel failing to send, meaning something nasty happened in the async engine. I will look into this.

@BGluth
Contributor

BGluth commented Aug 20, 2024

Hmm... It would be strange if it was not something related to too many handles.

@Nashtare
Collaborator

Yeah, as Marko said, the error message is not very meaningful, even with the full trace:

Error: Failed to send proof

Stack backtrace:
   0: anyhow::error::<impl anyhow::Error>::msg
   1: <futures_util::future::future::flatten::Flatten<Fut,<Fut as core::future::future::Future>::Output> as core::future::future::Future>::poll
   2: <futures_util::future::future::Then<Fut1,Fut2,F> as core::future::future::Future>::poll
   3: <futures_util::stream::futures_unordered::FuturesUnordered<Fut> as futures_core::stream::Stream>::poll_next
   4: <futures_util::stream::try_stream::try_collect::TryCollect<St,C> as core::future::future::Future>::poll
   5: prover::ProverInput::prove::{{closure}}
   6: tokio::runtime::context::runtime::enter_runtime
   7: tokio::runtime::runtime::Runtime::block_on
   8: leader::main
   9: std::sys::backtrace::__rust_begin_short_backtrace
  10: std::rt::lang_start::{{closure}}
  11: std::rt::lang_start_internal
  12: main
  13: <unknown>
  14: __libc_start_main
  15: _start

@BGluth
Contributor

BGluth commented Aug 20, 2024

I think the rest of the error chain is getting truncated here:

if tx.send(proof).is_err() {
    anyhow::bail!("Failed to send proof");
}

Maybe we want to use with_context instead so we can keep the underlying error?

@Nashtare Nashtare removed their assignment Aug 26, 2024
@atanmarko atanmarko self-assigned this Aug 30, 2024
@atanmarko
Contributor Author

@BGluth @Nashtare I'm currently doing a refactor related to the FollowFrom implementation that will affect the dynamics of proving task execution. I will get to the bottom of this issue afterwards.

@atanmarko
Contributor Author

atanmarko commented Aug 30, 2024

Maybe we want to use with_context instead so we can keep the underlying error?

@BGluth @Nashtare The problem here is that the tokio oneshot channel's send function just returns the element if it fails, so no internal engine error details are available.

pub fn send(mut self, t: T) -> Result<(), T> {
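To illustrate the point, here is a minimal sketch of what a sender learns when the receiver is already gone. It uses `std::sync::mpsc` from the standard library as a stand-in for the tokio oneshot channel, so it runs without extra dependencies; the standard-library `send` wraps the unsent value in `SendError<T>` rather than returning it bare, but in both cases the value is the only payload. `Proof` and `try_send` are hypothetical names made up for this example, not types or functions from zero-bin.

```rust
use std::sync::mpsc;

/// Hypothetical stand-in for a proof; not the real zero-bin type.
#[derive(Debug, PartialEq)]
struct Proof(u64);

/// Sends `proof` and reports everything the caller can learn on failure.
fn try_send(tx: mpsc::Sender<Proof>, proof: Proof) -> Result<(), String> {
    // A failed send hands back only the unsent value. There is no
    // underlying error for `with_context` to attach to.
    tx.send(proof).map_err(|mpsc::SendError(unsent)| {
        format!("Failed to send proof {:?}: receiver dropped", unsent)
    })
}

fn main() {
    let (tx, rx) = mpsc::channel();
    // Dropping the receiver closes the channel before anything is sent.
    drop(rx);
    println!("{:?}", try_send(tx, Proof(20278567)));
}
```

The most a caller can add at the send site is the fact that the receiver was dropped; the reason it was dropped has to come from whatever owned the receiver.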

@atanmarko
Contributor Author

"Failed to send proof" happens because the next block fails to generate its proof and exits early, so the receiver is destroyed. Improved error messages are implemented in PR #582, and a new issue, #584, has been created for implementing graceful shutdown in case of an error.
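The failure sequence described above can be reproduced in a small, single-threaded sketch. This is an assumption-laden illustration, not the code from PR #582: `start_next_block` and `prove_pair` are hypothetical names, the proof is reduced to a `u64`, and `std::sync::mpsc` again stands in for the tokio oneshot channel. It shows one possible way to report the consumer's error instead of the bare send failure.

```rust
use std::sync::mpsc::{self, Receiver};

/// Hypothetical first phase of proving the next block. On failure it
/// returns early, and `rx` is dropped, which closes the channel.
fn start_next_block(rx: Receiver<u64>, fail: bool) -> Result<Receiver<u64>, String> {
    if fail {
        return Err("next block failed to generate proof".to_string());
    }
    Ok(rx)
}

/// Sends the current block's "proof" to the next block and reports the outcome.
fn prove_pair(fail_next: bool) -> Result<u64, String> {
    let (tx, rx) = mpsc::channel();
    let next = start_next_block(rx, fail_next);
    // If the receiver is already gone, this send fails.
    let sent = tx.send(7u64);
    match (next, sent) {
        // Prefer the consumer's error: it is the root cause of the failed send.
        (Err(root_cause), _) => Err(format!("Failed to send proof: {}", root_cause)),
        (Ok(_), Err(_)) => Err("Failed to send proof".to_string()),
        (Ok(rx), Ok(())) => rx.recv().map_err(|e| e.to_string()),
    }
}

fn main() {
    println!("{:?}", prove_pair(false));
    println!("{:?}", prove_pair(true));
}
```

In the real prover the two sides run as concurrent tasks, so the root cause would have to be collected from the failing task's result rather than from a local variable as it is here.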

Projects
Status: Done